After a long gap in my writing, I now want to shed some light on Lucene Analyzers.
So, what are Lucene Analyzers?
By the technical definition, an analyzer is a function or block of code that takes a stream of characters and breaks it into a number of tokens, which are then used to build an index of words in a search engine. A search library like Lucene takes character streams as input, breaks them into useful tokens, and puts these tokens into an index to facilitate search queries.
In general, tokens correspond to words (we are discussing this topic with reference to the English language only), but for special analyzers a token can consist of more than one word, including the spaces within it.
For example, the text “Dr. Amit Agarwal” can be treated as a single doctor token; this is an advanced form of token preparation and is outside the scope of this post.
In general, Lucene analyzers are designed in the following steps –
Actual text –> basic token preparation –> lowercase filtering –> stop-word filtering (removal of not-so-useful words, which make up 40-50% of the words in a text) –> filtering by custom logic –> final tokens for indexing in Lucene, which will be referenced during Lucene searches.
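To see this pipeline in action, here is a minimal sketch that pushes a piece of text through StandardAnalyzer and prints the tokens it produces. It assumes a recent Lucene release (5.x or later, where StandardAnalyzer has a no-argument constructor); the field name "body" and the sample sentence are placeholders of my own.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenPipelineDemo {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        // "body" is a placeholder field name; analyzers can behave per-field
        TokenStream stream = analyzer.tokenStream("body", "The Doctors BEGAN the operation.");
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();                    // must be called before consuming tokens
        while (stream.incrementToken()) {  // advance to the next token
            System.out.println(term.toString());
        }
        stream.end();
        stream.close();
        analyzer.close();
    }
}
```

Each printed line is one token after tokenization and lowercasing (and, depending on the Lucene version and constructor used, stop-word filtering) – the intermediate output of the chain described above.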
Different analyzers use different tokenizers, and on that basis the output token streams – the sequences of groups of text – will differ.
Stemmers are used to get the root of a word. For example, for words such as beginning, began, and begin, the root word is begin. This feature is used in analyzers to widen the reach of a search across the content. If the root word is stored in the index rather than the exact word, a single query term can match more than one entry in the index, and the probability of matching relevant phrases becomes higher. This concept, referred to as stemming, is therefore used in analyzer design.
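In Lucene, stemming is typically done by wiring a stem filter into the analysis chain. Below is a minimal sketch of a stemming analyzer, assuming Lucene 5.x-or-later class names (these classes have moved between packages across versions). Note that an algorithmic stemmer such as Porter handles regular forms like beginning –> begin; an irregular form like began would need dictionary-based lemmatization, which the stock stemmers do not attempt.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class StemmingAnalyzerDemo {
    // chain: tokenize -> lowercase -> Porter-stem each token
    public static Analyzer stemmingAnalyzer() {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer source = new StandardTokenizer();
                TokenStream filtered = new LowerCaseFilter(source);
                filtered = new PorterStemFilter(filtered);
                return new TokenStreamComponents(source, filtered);
            }
        };
    }
}
```

With this chain, beginning and begins are both indexed as begin, so a query in any of these forms can land on the same index entry.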
Stop words are frequently occurring words that carry little search value. For English these are words like “a”, “the”, “I”, etc.
In different analyzers, the token streams are cleaned of stop words to make the index more useful for search results.
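Analyzers that support stop-word filtering usually accept the list as a CharArraySet. Here is a minimal sketch, again assuming Lucene 5.x or later; the three-word list is only for illustration – Lucene ships a much fuller default English set.

```java
import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class StopWordDemo {
    public static void main(String[] args) {
        // a toy stop-word list; the boolean requests case-insensitive matching
        CharArraySet stopWords = new CharArraySet(Arrays.asList("a", "the", "i"), true);
        // tokens found in the set are dropped from the token stream at index time
        Analyzer analyzer = new StandardAnalyzer(stopWords);
        analyzer.close();
    }
}
```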
Some of the different analyzers are –
WhitespaceAnalyzer –
The WhitespaceAnalyzer breaks text into tokens at whitespace; all characters between whitespaces are indexed as-is. No stop words are used here, and letter cases are not changed.
SimpleAnalyzer uses a letter tokenizer and lowercase filtering to extract tokens from the content and put them into the Lucene index.
StopAnalyzer removes common English words that are not so useful for indexing. This is accomplished by supplying the analyzer with a list of stop words (a STOP_WORDS set).
StandardAnalyzer is a general-purpose analyzer. It converts tokens to lowercase, takes the help of a standard stop-word list while analyzing the text, and applies further grammar-based tokenization rules. The sketch below compares the output of these four analyzers on the same input.
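Here is that comparison as a small program that runs one sentence through each analyzer and prints the tokens. The class locations and the StopAnalyzer constructor match roughly Lucene 8.x (older releases spell these slightly differently), and the analyzers-common module is assumed to be on the classpath.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerComparison {
    static void printTokens(String label, Analyzer analyzer, String text) throws IOException {
        System.out.print(label + ": ");
        TokenStream stream = analyzer.tokenStream("body", text);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.print("[" + term + "] ");
        }
        stream.end();
        stream.close();
        System.out.println();
    }

    public static void main(String[] args) throws IOException {
        String text = "The Quick-Brown Fox, a friend of MINE!";
        printTokens("Whitespace", new WhitespaceAnalyzer(), text);
        printTokens("Simple    ", new SimpleAnalyzer(), text);
        printTokens("Stop      ", new StopAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET), text);
        printTokens("Standard  ", new StandardAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET), text);
    }
}
```

The output makes the differences visible: WhitespaceAnalyzer keeps “The” and “Quick-Brown” intact with their original case, SimpleAnalyzer splits on non-letters and lowercases everything, and the stop-word-aware analyzers additionally drop “the”, “a”, and “of”.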
There are other analyzers in Lucene which I have not described here. But the most important part of Lucene analyzers is that we can build our own custom analyzer to solve our application-specific problems. I will try to describe the design of a custom Lucene analyzer in my later posts.
That is all for today….