Hadoop Understanding – A Simple Derivation


So today we want to build a basic understanding of how Apache Hadoop evolved.

I am starting this post with great thanks to Mr. Doug Cutting for his wonderful work in creating such a fine big data processing framework.

Today the industry is dealing with ever bigger datasets.

What does that mean?

The industry has been keeping data in electronic form for the last 10-20 years, storing it on ever bigger servers.

But the processing and storage capacity of any single machine depreciates as the years go by (even for server machines). In my view, a machine with 4 GB of RAM and a dual-core processor that served as an organisation's server 5-7 years back is now an ordinary desktop.

With new market entrants launching products every day, we now see new gadgets appearing almost every two or three days (in my view).

Now, what are these gadgets for?

They are more and more socially featured (special thanks to Google Talk, Gmail, Facebook, Twitter and many more in that line). So almost every gadget now ships with social media features, and tomorrow there will be more.

Also, thanks to the marketing folks, people now believe that engaging with and sharing every moment of their lives is the most happening thing.

So here is the point –

The volume of data of every kind – official, unofficial, structured, unstructured, sensible, nonsense – is increasing.

And management gurus expect more and more refined information from this data, to get their strategic decisions – branding, promotion, budgeting, research & development and so on – right and keep their investors happy at the end of the day.

So the demand for complex and refined reports keeps growing.

So the hardware companies are benefiting by supplying monster-configuration servers, and the same goes for data analytics consulting firms selling installation and customisation of out-of-the-box business intelligence software.

But these are mostly for structured data.

Unfortunately, the most useful information is hidden in unstructured data (worldwide, roughly 80% of electronic data is unstructured).

Example – we can get the sales figures for a Canon camera from a structured database – a centrally processed RDBMS.

But from conversations on social media we can learn how many users have tried a new Canon camera model and found an exceptional feature that matters to them. Through that referral process, the camera may gain more market share than its competitors in the next quarter – and strategic decisions can be derived from that.

Here we have a real need for intensive data processing over all the available input sources, plus some programming to achieve the result. BI solutions unfortunately cannot be so dynamic that every complex business requirement is solved with them out of the box.

Management also realises that the power of a single server, or even a few, is not enough to process all this. At the end of the day, there is a cost factor in turning that mass of data into processed, refined information.

Some of them realised this early on – and so did the tech people…

To process structured and unstructured data in a fail-safe and cost-effective way, the tech world came up with the Apache Hadoop stack.

And that answers the questions of what Hadoop is and why Hadoop…

Generally speaking, Hadoop runs on any Linux machine (I started playing with it on Ubuntu), and the interesting thing is that it runs on commodity hardware (big love for big data solutions from the management side of organisations).

To process data, Hadoop runs the MapReduce framework – a two-phase programming model: in the first phase (map) it processes the raw data into key-value pairs, and in the second phase (reduce) it accumulates and computes those pairs to generate statistics.
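To make the two phases concrete, here is the classic word-count example written against the Hadoop MapReduce Java API. This is only a rough sketch from my side (not part of the original post); the input and output paths are placeholders passed as arguments.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: accumulate the counts emitted for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}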

But it is not all smooth… some skills are required here… quality grey matter… again a grey area… ha ha…

Now, joking aside, I am serious – and it seems that you are too…

Read about the Hadoop stack – Hadoop, HBase, Hive, Pig etc. – from the world of Google… In the next post on Hadoop I will discuss the Hadoop architecture…

And keep commenting….


Lucene based Image Search – A Conceptual Idea


As we already have some idea about Lucene's search capabilities, we can mix in image-related information search via Lucene.

So how are we going to do it?

I am trying to give a conceptual idea here.

We have the open source Tesseract OCR engine to extract text from images.

So we are ready to play with our open source weapons –

The steps for building the Search Engine may be  –

1> Process JPG, JPEG or TIFF images with Tesseract to extract contents from the images.

This may be a batch program to process the images.

2> As we have seen from sample cases, roughly 80% of the extracted content is correct (this was tried on JPG and TIFF images).

Words from a picture of a hand-written cheque were not recognised, and some fonts were not recognised either.

But 80% of the text from an image (where letters are prominent, recognition is very good) is enough to test a proof of concept for image search.

The more content we recover, the higher the probability of matching the right image.

But the question is how?

Yes, here is the trick…. 

3> Take the extracted content and name it after the picture, or follow the application's naming convention.

4> Build the Lucene index, adding the content as fields in documents via the Lucene IndexWriter.

5> Link the documents to the actual pictures.

6> We can also take the image metadata and store it in the Lucene index.

7> And here is the image search engine, ready to show as an application proof of concept. A rough sketch of the OCR and indexing steps is given below.
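To make steps 1 and 4 concrete, here is a minimal sketch, assuming the tesseract command-line tool is installed and on the PATH and the Lucene 3.x jars (as used in the later posts) are on the classpath. The "images" and "imageindex" directory names are just placeholders.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ImageIndexer {

    // Run the tesseract command line on one image and return the recognised text.
    static String ocr(File image) throws Exception {
        File outBase = File.createTempFile("ocr", "");
        Process p = new ProcessBuilder("tesseract", image.getAbsolutePath(),
                outBase.getAbsolutePath()).inheritIO().start();
        p.waitFor();
        File txt = new File(outBase.getAbsolutePath() + ".txt"); // tesseract appends .txt
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(txt), "UTF-8"));
        StringBuilder text = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            text.append(line).append(' ');
        }
        reader.close();
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("imageindex")),
                new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);

        for (File image : new File("images").listFiles()) {      // batch over the image repository
            Document doc = new Document();
            doc.add(new Field("contents", ocr(image), Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("filepath", image.getAbsolutePath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);                             // links the OCR text back to the picture
        }
        writer.close();
    }
}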

We can use this concept to build a more powerful search engine, and apply the search capability in application areas where faxed documents are handled in electronic format every day.

I hope this may help someone in an application.

Comment, if you have better ideas related to this….

Also… all other comments are welcome…


Lucene Indexing and Searching in Multiple Tables (Conceptual Representation)


Nowadays it is common in a content-managed system that many contents of different categories of information are stored in different tables in the database.

An example site might have News, Articles, User Pages (Blogs), etc.

Problem Scenario –

Now we want to search for some term or a specific set of words. Our expectation is to look for results from all the tables and get them on the search results page, scored from higher to lower.

So how to solve this  –

Steps to solve –

1> Choose the fields of each table that we want indexed for Lucene queries.

2> Per the above, one table may contribute 2 fields for search, another table 1 field, and so on.

So here is the trickiest part.

3> Fetch the records from the particular table/tables.

4> Build a Lucene document per row, with each chosen field of the table stored in the Lucene index (i.e. analyzed).

5> We can put our own key into the Lucene document to serve as its document id.

6> We can take a unique identifier for each row (that may be the primary key of the row combined with the table name).

This will vary case by case as per the business logic.

7> Now we can add the Lucene document to the Lucene index writer.

8> This indexing should run as a separate process at a regular interval.

9> Indexing should not be mixed up with the regular application, as it may need to scale from simple to distributed as requirements grow. (That may mean introducing Apache Hadoop with all of its map-reduce programming, which will again be tricky and should not be mixed with general business-logic programming.)

10> Now our Lucene index is ready, built with the standard analyzer and ready to be searched by the user. A rough sketch of steps 4 to 7 follows below.
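Here is that rough sketch of steps 4 to 7. The table names, column names and JDBC connection details are hypothetical, and the Lucene 3.x API from the other posts is assumed; it is only meant to show one Lucene document per row with a table-plus-primary-key identifier.

import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class MultiTableIndexer {

    // Index one table: every row becomes one Lucene document.
    static void indexTable(Connection con, IndexWriter writer, String table,
                           String pkColumn, String... textColumns) throws Exception {
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("SELECT * FROM " + table);
        while (rs.next()) {
            Document doc = new Document();
            // Unique identifier: table name plus primary key, so a hit can be traced back to its row.
            String uid = table + "-" + rs.getString(pkColumn);
            doc.add(new Field("uid", uid, Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("table", table, Field.Store.YES, Field.Index.NOT_ANALYZED));
            for (String column : textColumns) {
                String value = rs.getString(column);
                if (value != null) {
                    // Every searchable column goes into one common "content" field,
                    // so a single field name covers all the tables at query time.
                    doc.add(new Field("content", value, Field.Store.YES, Field.Index.ANALYZED));
                }
            }
            writer.addDocument(doc);
        }
        rs.close();
        st.close();
    }

    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/cms", "user", "password");   // hypothetical connection
        IndexWriter writer = new IndexWriter(FSDirectory.open(new File("cmsindex")),
                new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED);
        indexTable(con, writer, "news", "news_id", "title", "body");
        indexTable(con, writer, "articles", "article_id", "heading", "content");
        indexTable(con, writer, "user_pages", "page_id", "page_text");
        writer.close();
        con.close();
    }
}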

Here I have not discussed the analyzing part of lucene which I will discuss in later posts, as I had said earlier.

The above may not be an optimal solution, but it is definitely one that makes clients happy (I speak from experience).

Comments are welcome…

Also, a question for the reader: how should we approach indexing multiple tables across multiple databases?

Do not hesitate to answer my query… We can have a discussion here in the blog comments… maybe there is a simple idea we can investigate. I have an idea, but I would like your participation please…

So keep reading and happy programming….


Lucene and Hibernate – Text searching within ORM wrapper


So we already have one kind of search application using Lucene queries, from the post Lucene search – a workable example.

Let us also assume we have some Hibernate knowledge from previous work. If we do not have that much, we can google for a Hibernate tutorial, because this post is not the place to elaborate on Hibernate ORM ideas.

Now, as we have the fantastic object-relational mapping tool Hibernate and the search library Lucene, let us dig into how the two can be used together…

No… we do not have to reinvent any wheel; just look for Hibernate Search (the Hibernate–Lucene integration) on the Hibernate site. The hard part, i.e. the integration of these two libraries, is already done. We will use it to get the advantages of ORM on top of text searching.

So what are we going to do now??…

1> We will make an Eclipse based Project.

2> Add the Hibernate Search 3.4.1 jar and its dependency jars to the project library.

3> Make a simple POJO entity class for Hibernate.

4> Add the Hibernate-to-table mapping annotations, like @Entity and @Table, and others as required.

5> Put the Hibernate Search (Lucene-specific) annotations in the POJO. This is a highly important step. Some of the important annotations are:
A> @Indexed – marks the entity for indexing; the name given here is used by Lucene to create the index directory.
B> @AnalyzerDefs and @AnalyzerDef – these tell Lucene which analyzer to use to analyze the contents of every object.

For example, if we use phonetic analyzer for our application search, the example code snippet is below –

@AnalyzerDef(name="phonetic",
tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
filters = { @TokenFilterDef(factory = PhoneticFilterFactory.class,
params = {
@Parameter(name = "encoder", value = "Metaphone"),
@Parameter(name = "inject", value = "false")
})
})

The library for the phonetic analyzer ships with the Hibernate Search jars, so we do not have to worry about it.

We also have to apply an analyzer by referencing a definition name; the value must match the name of an @AnalyzerDef (here a second definition, "doublephonetic", declared similarly to the one above) –

@Analyzer(definition="doublephonetic")

Now we have to give the Lucene document a unique id, which we do with the @EmbeddedId and @DocumentId annotations.

Sample of this is –

@EmbeddedId @DocumentId

For the fields to be marked as Lucene search fields and indexed, we use the @Field annotation.

Sample of this is –

@Field(index=Index.TOKENIZED , store=Store.YES)
private String symptomdescription;

Parameter “index” will tell lucene whether this field is to be tokenised by analyzer or not.
Parameter “store” will tell lucene whether this field will be stored in the lucene directory or not.
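Putting the annotations from step 5 together, a minimal sketch of such an entity might look like the one below. The Symptom class, table name and index name are hypothetical, and a simple @Id is used instead of the @EmbeddedId composite key for brevity.

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;
import org.apache.solr.analysis.PhoneticFilterFactory;
import org.apache.solr.analysis.StandardTokenizerFactory;
import org.hibernate.search.annotations.Analyzer;
import org.hibernate.search.annotations.AnalyzerDef;
import org.hibernate.search.annotations.DocumentId;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Index;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.Parameter;
import org.hibernate.search.annotations.Store;
import org.hibernate.search.annotations.TokenFilterDef;
import org.hibernate.search.annotations.TokenizerDef;

@Entity
@Table(name = "symptom")                    // hypothetical table
@Indexed(index = "symptomIndex")            // name under which Lucene creates the index directory
@AnalyzerDef(name = "phonetic",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = { @TokenFilterDef(factory = PhoneticFilterFactory.class,
        params = {
            @Parameter(name = "encoder", value = "Metaphone"),
            @Parameter(name = "inject", value = "false")
        })
    })
@Analyzer(definition = "phonetic")          // must match the name of an @AnalyzerDef
public class Symptom {

    @Id
    @DocumentId                             // unique id of the Lucene document
    private Long id;

    @Field(index = Index.TOKENIZED, store = Store.YES)  // analyzed and stored in the index
    private String symptomdescription;

    // getters and setters omitted for brevity
}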

6> The next step is to trigger Lucene indexing from within Hibernate.

As we have already put all the mapping information in the POJO, we now need only three key lines to make the index work.

Session session = SessionFactoryUtil.getFactory().getCurrentSession();

Opening a session through Hibernate.

FullTextSession fullTextSession = Search.getFullTextSession(session);

Opening a full-text session for Lucene search.

fullTextSession.createIndexer().startAndWait();

Indexing the existing records and preparing the index for any further indexing.

7> Now we look at the searching part through hibernate lucene search –

After opening the session and wrapping it in a FullTextSession, we have to start a transaction –

fSession.beginTransaction();

Assumption – fSession is the FullTextSession variable.

Prepare the searchfield string array –

String[] searchFields = {"Application Search fields"}; // targeted fields

Create a parser to parse a full text query

SearchFactory sf = fSession.getSearchFactory();

Creating the QueryParser Object –

Map<String, Float> boostPerField = new HashMap<String, Float>(); // optional per-field boosts
QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_31, searchFields, sf.getAnalyzer("phonetic"), boostPerField);

Create the Lucene Query Objects –

Query luceneQuery = parser.parse(searchQuery); // searchQuery is the user's search string

And get the actual result object list –

org.hibernate.search.FullTextQuery hibQuery =
fSession.createFullTextQuery(luceneQuery, <>.class); // replace <> with your entity class
List results = hibQuery.list();

Lastly, commit the transaction –

fSession.getTransaction().commit();

That is it.
As we get back an object list, we can cast it to our required POJO list and work on it as per application requirements. A consolidated sketch of the search flow is below.
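Putting the search snippets together, a hypothetical helper method might look like the one below. It assumes the Symptom entity and the "phonetic" analyzer definition sketched earlier; the field names are placeholders.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;
import org.hibernate.search.FullTextSession;
import org.hibernate.search.SearchFactory;

public class SymptomSearch {

    // Runs a phonetic full-text search and returns the matching entities, best score first.
    @SuppressWarnings("unchecked")
    public List<Symptom> search(FullTextSession fSession, String searchQuery) throws Exception {
        fSession.beginTransaction();

        String[] searchFields = { "symptomdescription" };            // targeted fields
        Map<String, Float> boostPerField = new HashMap<String, Float>();
        SearchFactory sf = fSession.getSearchFactory();

        QueryParser parser = new MultiFieldQueryParser(
                Version.LUCENE_31, searchFields, sf.getAnalyzer("phonetic"), boostPerField);
        Query luceneQuery = parser.parse(searchQuery);

        org.hibernate.search.FullTextQuery hibQuery =
                fSession.createFullTextQuery(luceneQuery, Symptom.class);
        List<Symptom> results = hibQuery.list();

        fSession.getTransaction().commit();
        return results;
    }
}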

I have deployed and tested the application in Tomcat 7 and JBoss 7.1.0Beta1. Both are working fine.

Screenshots of the sample application –

[Screenshot: Search Screen]

Here the “Get Actual Result” button returns the search result.
And the “Get Score Result” button returns the score.
Also I have provided Phonetic and Double Phonetic analyzer options in the UI.
The screens below show the results.

Actual Search Result –

[Screenshot: Search Result]

Above is the search result screen with the term-highlighting option. I have shown all the fields where I populated data. I have not shown the highlighting code here, because my aim was to get Hibernate Search working with basic code; I will explain highlighting in later posts.

Scoring Result Screen : –

[Screenshot: Lucene Scoring]

This is the score result screen. It confirms that results are ranked from higher to lower score.

So, last but not least, my suggestion is: gain as much knowledge as possible by using Hibernate Search in various complex business cases. And check for updates in my next posts on searching techniques with Lucene.

Mail me at piyas.de@gmail.com for the source code – I will be happy to share it and discuss your complex search areas.


Lucene search – a workable example


As we already have some idea about search techniques from Lucene – Open Source Search Engine Library, or from searching Google, we can now go straight to a search application using Lucene.

I will try to explain the problem and solution with bits and pieces of code and explanation.

First : The Problem –

We have thousands of XML files in a repository (file system) and want to find information from those XML files through a web interface.

The approach we have taken –

1> Create a “Dynamic Web Project” in Eclipse – any release. I have used Eclipse Indigo for this.

2> Add the Lucene-specific jars to the project lib. (I will not go into the “how to” portion of this; please google it.)

I have added the Lucene 3.0.2 libraries and all of their dependency jars.

3>Configure Server runtime for the project. I have configured Tomcat 7 for this application.

4>Now to work with Lucene, there are two main parts of it.
A> Lucene Indexing
B> Lucene Searching

A> Lucene Indexing –
Walk through the XML file repository/directory on the file system from a Java program.
Hint: use the java.io.File API for this.

Open the Lucene index directory at the provided path –

indexdir = FSDirectory.open(new File(indexFilePath));

Here indexdir is a variable of type org.apache.lucene.store.Directory and FSDirectory is org.apache.lucene.store.FSDirectory.

Open the Lucene Index Writer –

writer = new IndexWriter(indexdir,new StandardAnalyzer( Version.LUCENE_30),true, IndexWriter.MaxFieldLength.UNLIMITED);

writer is an object of type org.apache.lucene.index.IndexWriter.

I am not going to explain the analyser work here, which I will cover in detail in later posts.

I am trying to make the application workable first to show the output.

Now for each file in the directory put the code –

SAXXMLDocument handler = new SAXXMLDocument();
Document doc = handler.getDocument( new FileInputStream(new File(dir.getAbsolutePath() +"\\"+children[i])));
doc.add(new Field("filepath",dir.getAbsolutePath()+"\\"+children[i],Field.Store.YES, Field.Index.NOT_ANALYZED));

So what is happening here?

As we get an XML file as input, we open it as a FileInputStream and hand it to the SAX-based handler, which builds an org.apache.lucene.document.Document from the parsed content. So the Lucene document is prepared now.

Next we add a field to the Lucene document, telling the Lucene index whether to store the field's value and whether that value should be analyzed by Lucene or not.

We can add as many fields as we need to the Lucene document.

Next we will attach this prepared document to the Lucene index writer.

writer.addDocument(doc);

So above is a brief lucene indexing example.

B> Lucene Searching –


Now lucene query is an important part here in our application.

Going end-to-end –

Open Lucene Directory

Directory dir = FSDirectory.open(new File(indexDir));

Open Lucene Index Searcher with the Directory –

IndexSearcher is = new IndexSearcher(dir);

Specifying search fields –

String[] xmlFields = {"search field name"};

The above can be a string array of field names.

Open the analyzer –
Code –

StandardAnalyzer anal = new StandardAnalyzer( Version.LUCENE_30);

Open the query parser with the analyzer –

QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_30, xmlFields, anal);

Make the actual query

query = parser.parse("queryString made with the search fields");

Return every Lucene document in the search result in an application-specific data structure (yes… that is one of my favourite areas – the collections framework). A small sketch of this last step is below.
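As a small sketch of that last step (the field name and result structure are just examples; it uses the same Lucene 3.0 API as above and needs org.apache.lucene.search.TopDocs, org.apache.lucene.search.ScoreDoc and the java.util collection classes):

// Execute the query and collect the stored fields of the top hits.
TopDocs topDocs = is.search(query, 100);                 // at most 100 hits
List<Map<String, String>> results = new ArrayList<Map<String, String>>();
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document hit = is.doc(scoreDoc.doc);                 // load the stored document
    Map<String, String> row = new HashMap<String, String>();
    row.put("filepath", hit.get("filepath"));            // the stored file path field
    row.put("score", String.valueOf(scoreDoc.score));
    results.add(row);
}
is.close();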

5> After encapsulating the above functionality in a Java class, call the relevant methods from a servlet, which in turn can have anything from a simple to a complex interface.

References/ External Links –

Apache Lucene

Lucene Book from Manning – a great book to start with….

As always…

I have explained and hinted at Lucene indexing, searching and querying, with the goal of creating a basic Lucene-based search application.

In the next few days I will come back with an in-depth explanation of Lucene core components, such as Lucene analyzers, indexing tools and many more.

So watch this site for updates and happy coding….


Lucene – Open Source Search Engine Library


Nowadays search is becoming a more and more important feature in applications. After all, the Web is all about information, and it is all about getting the right information at the right time into the right hands.

Today I will try to shed some light on open source search technologies in the J2EE world for beginners. I will also go into different functional implementations of search techniques in separate posts, which I plan to publish within the next few weeks.

One thing worth mentioning here: search algorithms are complex in nature. Big thanks to Mr. Doug Cutting and the Apache Software Foundation for giving us a highly important search library – Apache Lucene – with a standard set of APIs that make life far easier (at least for me and fellow J2EE developers…).

Lucene is a search engine library – a well-known fact in the application development world.

Our question is: where is the need for such a library?

Yes, it is of course for search. But the world already has database search – the well-known SQL (Structured Query Language).

And searching is quite fast there, even for a table with 1,000,000 records, once the search field is indexed. So then…?

We need to think of the flood of content on the Web or in an electronic library, where millions of unstructured documents have to be stored as raw content in the database.

I have gone through a number of such structured and unstructured data handling scenarios, and I am sorry to say that relational databases do not perform well there. (Maybe I lack knowledge of database optimisation, but I leave that part of the work to my fellow DBA friends and hardware guys…)

So, is there a low-cost solution?

Yes, we can put Lucene there and do a little extra logical (magic word) work to get out of this situation.

In Lucene terms, and from an application development perspective –

We can

1> Store the files in the file system.

2> While storing the files, just add a document to the Lucene index (putting the content into a Lucene document field).

3> While removing files, remove the entry from the Lucene index.

4> Analyze the document with the Lucene standard analyzer.

5> Update the Lucene document with an extra field such as the file path.

6> Start searching with the Lucene standard analyzer and

7> finally get the result at a far better speed than a relational database search over millions of documents.

So how does this magic happen? Because Lucene does only a text-based search in its own index and returns us the result.

8> Now, since we have the link to the file in the file system, we can browse to it… which may be one of our application goals. A tiny sketch of steps 2 and 3 follows below.
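A tiny illustrative sketch of steps 2 and 3, using the same Lucene 3.0 API as the later posts; fileText and filePath are assumed variables holding the file's raw text and its path, and "docindex" is a placeholder directory.

// Step 2 – while storing a file, add a document carrying its content and path.
IndexWriter writer = new IndexWriter(FSDirectory.open(new File("docindex")),
        new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field("contents", fileText, Field.Store.NO, Field.Index.ANALYZED));      // raw text of the file
doc.add(new Field("filepath", filePath, Field.Store.YES, Field.Index.NOT_ANALYZED)); // link back to the file
writer.addDocument(doc);

// Step 3 – while removing the file, remove its entry from the index by its path term.
writer.deleteDocuments(new Term("filepath", filePath));
writer.close();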

The above use case may not cover every complex real-world problem, and database search requirements will still exist alongside it.

So, beyond the tiny sketch above, in this first post I am not putting in any detailed code on using Lucene. I just want to give an idea of open source search technologies – foremost among them Lucene.

I will post more in-depth, scenario-based articles on search techniques and Lucene in the near future.

So for now, just google Lucene and try to pick up as many ideas about search techniques as you can.

And just to mention….wait for my next posts…related to search techniques….


Web Application Testing with Selenium Part -2


I previously wrote a getting-started post on Selenium RC testing, and now I am writing this post on setting up Eclipse with Selenium RC. I will try to explain it with helpful pictures.

Selenium RC with Eclipse
Eclipse is a multi-language software development platform comprising an IDE and a plug-in system to extend it. It is written primarily in Java and is used to develop applications in this language.

The following lines describe the configuration of Selenium RC with eclipse-jee-indigo-win32 (Indigo release).

Configuring Selenium-RC with Eclipse

It should not be too different for higher versions of Eclipse
• Launch Eclipse.

• Select File > New > Other.

[Screenshot: Step 1]

• Java > Java Project > Next

[Screenshot: Step 2]
• Provide a name for this project [Test Automation], select JDK under the ‘Use a project specific JRE’ option (JDK 1.6 selected in this example) > click Next

[Screenshot: Step 3]

Keep the ‘Java Settings’ intact in the next window. Project-specific libraries can be added here.

[Screenshot: Step 4]

Click Finish > click Yes in the ‘Open Associated Perspective’ pop-up window.

[Screenshot: Step 5]

This would create Project ‘Test Automation’ in Package Explorer/Navigator pane.

[Screenshot: Step 6]

Right click on src folder and click on New > Folder

[Screenshot: Step 7]

Name this folder as com and click on Finish button.

This creates the com package inside the src folder.

Following the same steps, create a core folder inside com.

[Screenshot: Step 8]

The NewTest class can be kept inside the core package.

Create one more package inside src folder named testscripts. This is a place holder for test scripts.

Please note that this is about the organization of the project, and it entirely depends on the individual's choice / the organization's standards. The test scripts package can be further segregated depending on project requirements.

Create a folder called lib inside the Test Automation project: right-click on the project name > New > Folder. This is a placeholder for the project's jar files (i.e. the Selenium client driver, Selenium server etc.).

[Screenshot: Step 9]

This creates the lib folder in the project directory.
Right-click on the lib folder > Build Path > Configure Build Path

[Screenshot: Step 10]
[Screenshot: Step 11]

Under the Libraries tab, click Add External JARs and navigate to the directory where the jar files are saved. Select the jar files to be added and click the Open button.

[Screenshot: Step 12]

After adding the jar files, click the OK button. The added libraries will appear in the Package Explorer.

Test Suite Creation

A JUnit report can be generated from Selenium RC test execution using an Ant script. I have used a “build.xml” for Ant to execute the Selenium cases and generate a JUnit report for that execution. The following are the steps to generate a JUnit report for a Selenium run.

Setting up a Java Project for test execution using Ant


Preconditions:

1. Install Ant from Apache.

2. Install the Java JDK (not just the JRE) and add its bin folder to the “PATH” environment variable of your system. To check that the JDK is mapped, verify that “javac” is recognized by typing “javac” at your command prompt.

3. Download Selenium RC.

4. Test cases are already written as Selenium JUnit classes (a minimal example is sketched after the steps below).

Steps:

1. Create a Java project in Java IDE. I am using Eclipse. [Already mentioned how to create it]

2. Now write the Selenium JUnit classes in the new “com.testscripts” package.

3. Create a new folder called “lib” under the said Java Project. [Already mentioned how to create it]

4. Copy ant-junit.jar, selenium-server-standalone-2.21.0.jar, selenium-java-2.21.0.jar, selenium-java-2.21.0-srcs.jar and junit-4.8.2.jar to the “lib” folder of the Java project.

5. Now create a folder called “build” under Java Project and copy the “build.xml” file attached with this example to the build folder you created.

6. Now open the build.xml file for edit.

7. Search for “run-single-test” in build.xml. Go to the “<test” element under the “run-single-test” target and replace the value of its “name=” attribute with the Selenium JUnit class name. For example, if the class is TestSelenium.java under the package “test” in “src”, replace the value with “test.TestSelenium”, then save build.xml.

8. Now start the Selenium RC server.

9. Now go to the build folder of the Java Project using command prompt and run the command “ant run-single-test”.

10. This executes the Selenium cases against the Selenium server and generates an HTML report in the “report/html” folder of the Java project.
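For reference, a minimal Selenium RC JUnit class of the kind mentioned in step 2 might look like the following; the site URL and class name are placeholders, and it assumes the Selenium server from step 8 is running locally on port 4444.

package com.testscripts;

import static org.junit.Assert.assertTrue;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import com.thoughtworks.selenium.DefaultSelenium;
import com.thoughtworks.selenium.Selenium;

public class TestHomePage {

    private Selenium selenium;

    @Before
    public void setUp() {
        // Connect to the locally running Selenium RC server and open a Firefox session.
        selenium = new DefaultSelenium("localhost", 4444, "*firefox", "http://www.example.com/");
        selenium.start();
    }

    @Test
    public void testHomePageHasTitle() {
        selenium.open("/");                               // open() waits for the page to load
        assertTrue(selenium.getTitle().length() > 0);     // the page should have some title
    }

    @After
    public void tearDown() {
        selenium.stop();                                  // close the browser session
    }
}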

Ok.

That is it for now. More testing scenarios will be explained in the next posts.

References/External Links  :

http://seleniumhq.org/

http://jroller.com/selenium/

Thanks :

To my wife Ketaki Nandi (also a J2EE professional…) for helping me prepare the content.
