Life, code and stuff
Posted by Marc Sturlese

09 Mar 09 Lucene 2.4.1 available from today

A new official release of Lucene is now available! Lucene 2.4.1 is a bug-fix release.

More new features will arrive with the Lucene 2.9 release (currently available as a development version).

Here are all the improvements in Lucene 2.4.1, as listed on the official Lucene site:

  • Fixed silent data-loss case whereby binary fields are truncated to 0 bytes during merging if the segments being merged are non-congruent (same field name maps to different field numbers).
  • Don’t throw incorrect IllegalStateException from IndexWriter.close() if you’ve hit an OOM when autoCommit is true.
  • If IndexReader.flush() is called twice when there were pending deletions, it could lead to later false AssertionError during IndexReader.open.
  • Fix false AlreadyClosedException from IndexReader.open (masking an actual IOException) that takes String or File path.
  • Fix multiple-valued NOT_ANALYZED fields double-counting token offsets.
  • Ensure IndexReader.reopen() does not result in incorrectly closing the shared FSDirectory. This bug would only happen if you use IndexReader.open with a File or String argument.
  • Fix possible overflow bugs during binary searches.
  • Fix CachingWrapperFilter to not throw exception if both bits() and getDocIdSet() methods are called.
  • Fix int overflow bug during segment merging.
  • Fix int overflow bug when flushing segment.
  • Fix deadlock in IndexWriter.addIndexes(IndexReader[]).
  • Fix NearSpansOrdered payload bugs: payloads were returned from the first possible match rather than the correct, shortest match; payloads could be returned even if the max slop was exceeded; and the wrong payload could be returned in certain situations.
  • Add Analyzer.close() to free internal ThreadLocal resources.
  • Fix IndexWriter.addIndexes(IndexReader[]) to properly rollback IndexWriter’s internal state on hitting an exception.
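To illustrate the kind of problem behind the "overflow bugs during binary searches" item above, here is a minimal sketch in plain Java. It is not the actual Lucene patch: the class and method names (SafeBinarySearch, naiveMid, safeMid) are made up for this example, and I am assuming the classic midpoint overflow, where low + high exceeds Integer.MAX_VALUE and wraps around to a negative number.

```java
// Illustrative sketch only; NOT the actual Lucene 2.4.1 fix.
// All names here are hypothetical.
final class SafeBinarySearch {

    // Overflow-prone midpoint: low + high can wrap around to a negative int.
    static int naiveMid(int low, int high) {
        return (low + high) / 2;
    }

    // Overflow-safe midpoint: the unsigned shift reads the wrapped sum
    // as an unsigned value, so the result is correct for non-negative inputs.
    static int safeMid(int low, int high) {
        return (low + high) >>> 1;
    }

    // Standard binary search over a sorted array using the safe midpoint.
    // Returns the index of key, or -1 if key is not present.
    static int search(int[] sorted, int key) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            if (sorted[mid] < key) {
                low = mid + 1;
            } else if (sorted[mid] > key) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int low = 1_500_000_000;
        int high = 2_000_000_000;
        System.out.println(naiveMid(low, high)); // negative: the sum overflowed
        System.out.println(safeMid(low, high));  // 1750000000
    }
}
```

With small arrays both midpoints give the same result, which is why this kind of bug tends to show up only on very large indexes.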


Posted by Marc Sturlese

02 Mar 09 Index scalability using Pig

Here is a really interesting example of how to build an inverted index using Pig. As far as I have seen with Hadoop, to create a Lucene index you must start from a text file and use MapReduce jobs to build it. Pig, however, allows you to retrieve data not just from a text file but also from SQL databases, HBase or other data sources.
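For readers who have not met the term, an inverted index maps each term to the documents that contain it. The following is a minimal in-memory sketch in plain Java, just to make the idea concrete. It is not Pig, Hadoop or Lucene code: the class name TinyInvertedIndex and its methods are hypothetical, and the tokenization (lowercasing and splitting on non-word characters) is a deliberately naive assumption standing in for a real analyzer.

```java
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative sketch only; all names are hypothetical.
final class TinyInvertedIndex {

    // term -> sorted set of ids of the documents containing that term
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Splits the text into terms (roughly the "map" side of a MapReduce job)
    // and groups document ids under each term (roughly the "reduce" side).
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Returns the ids of the documents containing the term (empty if none).
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Pig runs on Hadoop");
        index.add(2, "Lucene builds an index");
        index.add(3, "Hadoop builds a Lucene index");
        System.out.println(index.lookup("hadoop")); // [1, 3]
        System.out.println(index.lookup("lucene")); // [2, 3]
    }
}
```

In a distributed setting the grouping step is what the MapReduce framework does for you: every (term, docId) pair emitted by the mappers is routed to the reducer responsible for that term.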

After going through the example in detail, I wonder whether it would be possible to create a Lucene index using Pig and MapReduce jobs that retrieve data from a distributed HBase data store… I am also wondering whether there would be problems with Lucene analyzers (or anything else), for example.

I have read that Pig is not especially fast at accessing data. However, for indexing, this would probably be more than compensated for by the MapReduce jobs.

How fast would it be? I still have lots of research and tests to do…
