Here is a really interesting example of how to build an inverted index using Pig. As far as I have seen in Hadoop, to create a Lucene index you must start from a text file and use MapReduce jobs to build it. Pig, however, lets you load data not just from text files but also from SQL databases, HBase, or other data sources.
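To make the idea concrete, here is a minimal sketch of what an inverted index is, written in plain Python rather than Pig. It is not the linked example, just an illustration of the map and reduce steps behind it. The function names, the whitespace tokenizer, and the two sample documents are my own assumptions; a real Lucene analyzer does far more than lowercasing and splitting.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in the document."""
    # Assumption: lowercase + whitespace split stands in for a real analyzer.
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_phase(pairs):
    """Group document ids by term; the grouped result is the inverted index."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    # Sort the ids so the output is deterministic.
    return {term: sorted(ids) for term, ids in index.items()}


def build_inverted_index(docs):
    """Run the map step over every document, then reduce all emitted pairs."""
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    )
    return reduce_phase(pairs)


# Hypothetical sample data, only for illustration.
docs = {
    "d1": "Pig runs on Hadoop",
    "d2": "Lucene index on Hadoop",
}
index = build_inverted_index(docs)
```

With these two documents, `index["hadoop"]` lists both `d1` and `d2`, while `index["pig"]` lists only `d1`. In Pig or Hadoop the grouping step would be distributed across the cluster instead of running in a single dictionary.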
After going through the Pig example in detail, what I now wonder is whether it would be possible to create a Lucene index using Pig and MapReduce jobs, retrieving the data from a distributed HBase data store… I am wondering, for example, whether there would be problems with Lucene analyzers (or anything else).
I have read that Pig is not especially fast at accessing data. However, for indexing, this would probably be more than compensated for by the MapReduce jobs.
How fast would it be? I still have lots of research and tests to do…