Lucene Analysers


After a long gap in my writing, I now want to shed some light on Lucene Analysers.

So, what are Lucene Analysers?

By the technical definition, an analyser is a function or block of code that takes a stream of characters and breaks it into a number of tokens, which are then used to build an index of words for a search engine. A search library like Lucene takes character streams as input, breaks them into useful tokens, and puts these tokens into an index to support search queries.

In general, the tokens handed to the analyser are individual words (we are discussing this topic with reference to the English language only), but for special analysers a token can contain more than one word, including the spaces between them.

For example, the text “Dr. Amit Agarwal” can be treated as a single doctor token, which is an advanced kind of token preparation and outside the scope of this post.

In general, Lucene analysers are designed around the following steps:

Actual text –> basic token preparation –> lowercase filtering –> stop-word filtering (removal of not-so-useful words, which make up around 40-50% of the words in a piece of content) –> filtering by custom logic –> final tokens for the Lucene index, which are then referenced when searching with Lucene.
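To make this concrete, here is a minimal sketch of inspecting an analyser's output, assuming a reasonably recent Lucene release (roughly 7.x or later, with lucene-core and the analysis-common module on the classpath); constructor details differ between versions, and the field name "content" and the sample sentence are just placeholders.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {

    // Prints the tokens an analyser produces for the given text, one per line.
    static void printTokens(Analyzer analyzer, String text) throws IOException {
        try (TokenStream stream = analyzer.tokenStream("content", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();                      // must be called before incrementToken()
            while (stream.incrementToken()) {
                System.out.println(term.toString());
            }
            stream.end();
        }
    }

    public static void main(String[] args) throws IOException {
        // StandardAnalyzer applies the pipeline described above: grammar-based
        // tokenisation, lowercasing and (depending on how it is constructed in
        // your Lucene version) stop-word removal.
        printTokens(new StandardAnalyzer(), "Dr. Amit Agarwal is beginning a new search project.");
    }
}
```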

Different analysers use different tokenizers, and on that basis the output token streams (sequences of groups of text) will differ.

Stemmers are used to get the root of the word in question. For example, for the words beginning, began, begin etc., the root word is begin. This feature is used in analysers to widen the scope of a search over the content served by the search API. If the root word is kept in the index rather than only the exact word, we have more than one way to match it, and the probability of a phrase match becomes higher. This concept, known as stemming, is therefore used in analyser design.
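As a small preview of the custom-analyser idea discussed later in this post, here is a hedged sketch of a stemming analyser assembled from Lucene's standard building blocks, again assuming a recent Lucene release (7.x or later, where createComponents takes only the field name and LowerCaseFilter lives in the core package); the class name PorterStemmingAnalyzer is my own illustrative choice.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Illustrative custom analyser: tokenise, lowercase, then reduce each token to
// its root form with the Porter stemmer (e.g. "beginning" becomes "begin").
public class PorterStemmingAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream chain = new LowerCaseFilter(source);
        chain = new PorterStemFilter(chain);
        return new TokenStreamComponents(source, chain);
    }
}
```

An analyser like this can then be passed to IndexWriterConfig at indexing time and to the query parser at search time, so both sides see the same stemmed tokens.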

Stop words are frequent but not very useful words in the English language, such as “a”, “the”, “I” etc.
In many analysers, the token streams are cleaned of stop words to make the index more useful for search results.

In Lucene, some of the different analysers are –

WhitespaceAnalyzer –
The WhitespaceAnalyzer splits text into tokens at whitespace. All characters between whitespaces are indexed. No stop words are removed and letter case is not changed.

SimpleAnalyzer –
The SimpleAnalyzer uses a letter tokenizer and lowercase filtering to extract tokens from the content and put them into the Lucene index.

StopAnalyzer –
The StopAnalyzer removes common English words that are not very useful for indexing. This is accomplished by providing the analyser with a stop-word list.

StandardAnalyzer –
The StandardAnalyzer is a general-purpose analyser. It converts tokens to lowercase, uses a standard stop-word list to analyse the text, and is governed by a few other rules as well.
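To see how these four analysers behave differently on the same input, one can feed a single sentence through each of them. Treat this as a sketch only, since constructor signatures (especially for StopAnalyzer) vary between Lucene versions.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CompareAnalyzers {

    public static void main(String[] args) throws Exception {
        String text = "The Quick Brown Fox jumped over the LAZY dogs.";

        Analyzer[] analyzers = {
            new WhitespaceAnalyzer(),                               // split on whitespace only, case kept
            new SimpleAnalyzer(),                                   // letters only, lowercased
            new StopAnalyzer(EnglishAnalyzer.getDefaultStopSet()),  // also drops English stop words
            new StandardAnalyzer()                                  // general-purpose, grammar-based
        };

        for (Analyzer analyzer : analyzers) {
            System.out.println("== " + analyzer.getClass().getSimpleName() + " ==");
            try (TokenStream stream = analyzer.tokenStream("content", text)) {
                CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
                stream.reset();
                while (stream.incrementToken()) {
                    System.out.print(term.toString() + " ");
                }
                stream.end();
            }
            System.out.println();
        }
    }
}
```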

There are other analysers in Lucene which I have not described here. But the most important part of the Lucene analyser design is that we can build our own custom analyser to solve application-specific problems. I will try to describe the design of a custom Lucene analyser in my later posts.

That is all for today….


Big Data in Enterprise Applications


Before digging deeper into Hadoop, HBase and their application areas, today I will discuss some preliminary concepts.

For searching big data in different enterprises, Hadoop is today a core part of the computing infrastructure of many content-based organisations, such as Yahoo, Facebook, LinkedIn, and Twitter.

Many more traditional businesses, such as media and telecom, are beginning to adopt this system too.

And many other sectors are waiting with their big data to be processed… and probably we will be there to pick up some bits-and-pieces…

I have discussed the evolution of Hadoop in my previous blog post.

Now some technical concepts….

A Hadoop cluster is a set of commodity machines networked together in one location.

Different users can submit computing jobs (data processing jobs) to Hadoop from individual clients, which can be their own desktop machines located remotely from the Hadoop cluster.

These days the volume of enterprise data is so high that building bigger and bigger servers is no longer necessarily the best solution to large-scale problems (and is also not cost-effective).

An alternative that has gained popularity is to tie many low-end/commodity machines together as a single functional distributed system, and this is where Hadoop comes into the picture.

A system based on the Hadoop Distributed File System (HDFS™) compares favourably with current I/O technology in terms of price/performance.

Here the basic philosophy is: let the data remain where it is and move the executable code to the machine hosting it. So just put your input data into HDFS, schedule the job, and at the end collect your desired output statistics.

SQL is by design targeted at structured data.

Many of Hadoop’s initial applications deal with unstructured data such as text.

From this perspective Hadoop provides a more general paradigm than SQL for these data.

Hadoop uses key/value pairs as its basic data unit, which is able to work with the less-structured data types. In Hadoop, data can originate in any form, but it transforms into (key/value) pairs for the processing functions to work on.

Text documents, images, and XML files are popular examples of less-structured data types.
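To get a feel for the key/value style, here is a sketch of the classic word-count job written against the standard Hadoop MapReduce Java API; the HDFS input and output paths are supplied as placeholder command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: (byte offset, line of text) -> (word, 1)
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total count)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```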

Now HBase: HBase is a distributed database developed as part of the Apache Software Foundation’s Apache Hadoop project. It runs on top of the Hadoop Distributed File System, providing a fault-tolerant way of storing large quantities of data for Hadoop.
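As a rough sketch of what storing a record in HBase looks like through its Java client API (HBase 1.x/2.x style), where the table name "articles", the column family "content" and the values are all hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutExample {

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("articles"))) {

            // Row key "doc-001", column family "content", qualifier "body" (all hypothetical)
            Put put = new Put(Bytes.toBytes("doc-001"));
            put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("body"),
                          Bytes.toBytes("Some article text stored in HBase on top of HDFS"));
            table.put(put);
        }
    }
}
```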

Apache Hive is a data warehouse infrastructure built on top of Hadoop for querying and summarising data processing results.

Now, in all of these systems the core processing framework is the MapReduce framework. Thanks to the Google people for giving the world such a simple (maybe not so simple) and robust framework for data processing and analytics.

So fellow readers, please get comfortable with terms like Hadoop, HBase, Hive, Pig and the MapReduce framework before diving into more and more technical concepts….

Happy reading…


Lucene Indexing Automation – Conceptual Idea


This time I want to describe an idea regarding Lucene indexing automation.

If you have followed some of my previous posts on Lucene – Open Source Search Engine, Lucene search – a workable example, and Lucene Indexing and Searching in Multiple Tables (Conceptual Representation), and have also gone through Lucenetutorial.com,

you will already have some idea of how Lucene search works.

Here, text-based searching has been made simple in great ways by our fellow and veteran J2EE stalwarts.

You should also Google the term and spend some quality time on it. This type of API is not an easy cup of tea for casual users; on the other hand, I am sure that by applying this kind of open-source technology in your application areas, you can at least earn your bread-and-butter for life.

Still, if you are confused about where Lucene fits in your application areas, please describe your confusion in the comments.

I will try to shed some light on each particular situation here, as per my knowledge.

Now, apart from the above, let us dig into the idea of Lucene index automation.

So what is the idea….

We already know there are two main parts to Lucene – indexing and searching.

Here we want to outline an idea for automating Lucene indexing.

In many small content management applications, we need to index the content every day so that it can be searched properly.

I have said in my earlier posts that indexing should not be part of the main business-logic handling code; it should be a separate process.

Of course, we can index every document when it is inserted, deleted or updated, to make it available instantly in search.

But in my view, if we instead index all the documents at some fixed time interval every day,

it will not hamper the main business much for small companies.

But why do we do so?

The indexing process always takes some time to tokenise the documents… and during indexing the index is write-locked…

So my point is: how can we make this indexing process run in the system automatically?

Either we can ask some person to start the process at a system-idle time every day, or we can make it a cron job, i.e., schedule the job.

The first option is not viable.

In J2EE, we have one more open-source product to help us with scheduling in applications – the Quartz scheduler.

By using Quartz scheduling, we can implement the Job interface to execute a job (in our case the Lucene indexing process) and schedule it to be executed at some pre-specified time interval.

In this way, we can have fully automated Lucene indexing.
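Here is a minimal sketch against the Quartz 2.x API. The actual indexing call is left as a comment because it depends on your own application code, and the cron expression below fires the job at 2 AM every day.

```java
import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

// Quartz job that wraps the nightly Lucene re-indexing run.
public class LuceneIndexingJob implements Job {

    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        // Hypothetical hook: call your own Lucene indexing code here, for example
        // something like new LuceneIndexer().rebuildIndex();
        System.out.println("Rebuilding the Lucene index ...");
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(LuceneIndexingJob.class)
                .withIdentity("luceneIndexingJob", "indexing")
                .build();

        // Cron expression: fire at 2 AM every day (an assumed low-traffic window).
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("nightlyIndexTrigger", "indexing")
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0 2 * * ?"))
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}
```

The 2 AM slot is just an example of a low-traffic window; pick whatever idle period suits your own system.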

Try it in your application area….

I have implemented it in applications and got the desired results….

Beyond the small sketch above, I have not put the full code here, on the assumption that readers will explore it themselves…

Also I will be happy to help….

Share your application ideas and thoughts also…
