Here is a really interesting example of how to build an inverted index using Pig. From what I have seen in Hadoop, to create a Lucene index you normally start from a text file and use MapReduce jobs to build it. Pig, however, lets you load data not just from text files but also from SQL databases, HBase, or other data sources.
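To make the idea concrete: an inverted index maps each term to the documents that contain it. Below is a minimal sketch in plain Python of the map and reduce steps involved. It is not the Pig script from the linked example and it does not use Lucene or Hadoop; the function names (`map_phase`, `reduce_phase`, `build_inverted_index`), the whitespace tokenizer, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit (term, doc_id) pairs, roughly what a mapper would output."""
    # Naive tokenizer (lowercase + whitespace split); a real analyzer does far more.
    for token in text.lower().split():
        yield token, doc_id


def reduce_phase(pairs):
    """Group document ids by term, roughly what a reducer would do."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    # Sort the postings so the result is deterministic.
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


def build_inverted_index(docs):
    """Run both phases in memory over a {doc_id: text} mapping."""
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    )
    return reduce_phase(pairs)


if __name__ == "__main__":
    # Hypothetical sample documents.
    docs = {
        "d1": "pig builds an index",
        "d2": "lucene builds an index too",
    }
    index = build_inverted_index(docs)
    print(index["builds"])
    print(index["pig"])
```

On a cluster the two phases would run as distributed tasks and the grouping would happen in the shuffle step; here everything runs in a single process purely to show the data flow.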
After going through the example in detail, what comes to mind is whether it would be possible to create a Lucene index using Pig and MapReduce jobs that retrieve data from a distributed HBase data store… I am wondering whether there would be problems with Lucene analyzers (or anything else), for example.
I have read that Pig is not especially fast at accessing data. However, for indexing, this would probably be more than compensated for by the MapReduce jobs.
How fast would it be? I still have lots of research and tests to do…
Hi Marc,
Did you give this a shot? I’m in a research setting working with tons of data. This means that I try many different indexing/weighting strategies, each of which takes a long time to run. I’m already using Pig to precompute some stats for me, so I thought: why not have it build the index as well?
The main advantage for me is that Pig gives me a quick way to manipulate my input data before I feed it to the index, all of that over a cluster.
So I thought that somebody must have thought of this before me. Did you get anywhere with your idea, or did you drop it for some reason?
Cheers,
Pablo
Hi Pablo,
I just tested the example in the post I pointed at and it worked great. But in the end I thought that trying to build a Lucene index using Pig wasn’t worth it. I mean, I think it’s much easier to write your own MapReduce job to do that.
There are some examples out there.
I succeeded in building a Lucene index from data retrieved from HBase using MapReduce jobs.