Lucene FieldCache.StringIndex and multiValued fields

Lately I’ve been doing some tests with Lucene multiValued fields and FieldCache.
I loaded a FieldCache.StringIndex for a multiValued field and saw some weird behavior that I think is worth mentioning.

FieldCache.StringIndex loads an int[] (order) and a String[] (lookup). The String[] contains all the terms in a field. The int[] contains, for each document, an index into the lookup array.
Curiously, loading this structure worked fine for some multiValued fields in the index. For others, it threw a RuntimeException I hadn’t seen before, saying there were more terms than documents in the field ‘x’.
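
To make the order/lookup layout concrete, here is a minimal plain-Java sketch of the same idea for a single-valued field. It is not Lucene code: the class name StringIndexSketch and its termFor method are made up for illustration, and reserving slot 0 of lookup for documents without a term is an assumption about how Lucene 2.9/3.0 lays out the array.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Plain-Java illustration only; this is not the Lucene implementation.
class StringIndexSketch {
    final String[] lookup; // sorted unique terms; slot 0 is null, meaning "no term" (assumption)
    final int[] order;     // order[doc] = position in lookup of that document's term

    StringIndexSketch(String[] valuePerDoc) {
        // Collect the unique terms in sorted order.
        TreeSet<String> unique = new TreeSet<String>();
        for (String v : valuePerDoc) {
            if (v != null) unique.add(v);
        }
        lookup = new String[unique.size() + 1];
        int i = 1;
        for (String term : unique) lookup[i++] = term;

        order = new int[valuePerDoc.length];
        for (int doc = 0; doc < valuePerDoc.length; doc++) {
            String v = valuePerDoc[doc];
            // lookup[1..] is sorted, so a binary search finds the term's slot.
            order[doc] = (v == null) ? 0 : Arrays.binarySearch(lookup, 1, lookup.length, v);
        }
    }

    // Resolves a document back to its term through the two arrays.
    String termFor(int doc) {
        return lookup[order[doc]];
    }
}
```

Note that order has exactly one slot per document, which is why the structure only makes sense when each document has a single term in the field.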

FieldCache is a structure meant to be used on fields with a single token per document. All the trouble starts because my tests do not respect that.
FieldCache cannot handle more than one value per field. When loading a FieldCache.StringIndex, it runs a check to ensure there is no more than one term per field (it checks whether the number of unique terms is greater than the number of docs). In my tests I am creating false negatives of this check and seeing unexpected behavior.

So, let’s say I have an index with 100 docs and a multiValued field with 2 values per document. If no value is repeated anywhere in the index, there are 200 unique terms for 100 docs and I get the exception. That’s due to the check done by the StringIndex. If there are just two different values and every document has both of them, no exception is thrown (a false negative of the check). In other words, the exception is thrown only when the number of unique terms exceeds the number of docs. That explains why loading a FieldCache.StringIndex on a field with more than one term per document can either end with a nasty exception or act as if nothing is wrong.
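
The two scenarios above can be reproduced with a small plain-Java simulation of the check. This is only a sketch under stated assumptions: MultiValuedLoadSketch and load are hypothetical names, the loop imitates a loader that walks the unique terms in sorted order, and sizing the lookup array at number of docs + 1 is an assumption about the Lucene version I tested, not a copy of its code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the "more terms than documents" check; not Lucene code.
class MultiValuedLoadSketch {
    static int[] load(String field, List<List<String>> docs) {
        // Invert the documents: term -> ids of the docs containing it, terms sorted.
        TreeMap<String, List<Integer>> postings = new TreeMap<>();
        for (int doc = 0; doc < docs.size(); doc++) {
            for (String term : docs.get(doc)) {
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(doc);
            }
        }

        int maxDoc = docs.size();
        String[] lookup = new String[maxDoc + 1]; // slot 0 stands for "no term" (assumption)
        int[] order = new int[maxDoc];
        int t = 1;
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
            // The check: it only notices a problem when unique terms outnumber docs.
            if (t >= lookup.length) {
                throw new RuntimeException(
                    "there are more terms than documents in field \"" + field + "\"");
            }
            lookup[t] = e.getKey();
            // A later term silently overwrites the slot written by an earlier one.
            for (int doc : e.getValue()) order[doc] = t;
            t++;
        }
        return order;
    }
}
```

With 100 docs that all hold the same two values, load returns normally and every document ends up pointing at the last term in sorted order, so the other value is silently lost. With 100 docs and 200 distinct values, the same call throws.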

There have been some fixes in later Lucene versions (trunk and the 3x, 3.0 and 2.9 branches). The behavior now is that once the number of terms exceeds the total number of documents, the array stops growing, so at least no RuntimeException is thrown.

More information can be found in the JIRA issue LUCENE-2142.


2 Responses to “Lucene FieldCache.StringIndex and multiValued fields”

  1. Amit Nithian says:

    This is a good post. I have run into a similar issue before where getting all multiple values for a multi-valued field was difficult. I ended up extending the FieldCacheImpl of Solr 1.2 and had this working.. of course the API changed and so porting this to 1.5 is a bit harder but am working on it. I have some applications where I need access to all the values to make some decisions so this feature is necessary.

  2. Marc Sturlese says:

    Hey Amit,
    I’m not sure, but maybe since Solr 1.4 you can do something similar using the UnInvertedField class, used to facet on multiValued fields.
    http://bit.ly/ig3aFC
