<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Marc Sturlese &#187; Lucene</title>
	<atom:link href="http://www.marcsturlese.com/tag/lucene/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.marcsturlese.com</link>
	<description>Life, code and stuff</description>
	<lastBuildDate>Sun, 27 Jun 2010 14:45:42 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Lucene FieldCache.StringIndex and multiValued fields</title>
		<link>http://www.marcsturlese.com/2010/06/27/lucene-fieldcache-stringindex-and-multivalued-fields/</link>
		<comments>http://www.marcsturlese.com/2010/06/27/lucene-fieldcache-stringindex-and-multivalued-fields/#comments</comments>
		<pubDate>Sun, 27 Jun 2010 14:42:11 +0000</pubDate>
		<dc:creator>Marc Sturlese</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[FieldCache]]></category>

		<guid isPermaLink="false">http://www.marcsturlese.com/?p=169</guid>
		<description><![CDATA[Lately I&#8217;ve been doing some tests with Lucene multiValued fields and FieldCache. I loaded a FieldCache.StringIndex for a multiValued field and saw some odd behavior that I think is worth mentioning. FieldCache.StringIndex loads an int[] (order) and a String[] (lookup). The String[] contains all the terms of a field. The int[] array contains [...]]]></description>
			<content:encoded><![CDATA[<p>Lately I&#8217;ve been doing some tests with <strong>Lucene</strong> MultiValued fields and <strong>FieldCache</strong>.<br />
I loaded a <strong>FieldCache.StringIndex</strong> for a multiValued field and saw some odd behavior that I think is worth mentioning.</p>
<p><strong>FieldCache.StringIndex</strong> loads an int[] (order) and a String[] (lookup). The String[] contains all the terms of a field. The int[] contains, for each document, an index into the lookup array.<br />
Curiously, loading this structure worked fine for some multiValued fields in the index. For others, however, it threw a RuntimeException I hadn&#8217;t seen before, saying there were more terms than documents in the field &#8216;x&#8217;.</p>
<p><strong>FieldCache</strong> is a structure meant to be used on fields with a single token per document. All the trouble starts because my tests do not respect that.<br />
<strong>FieldCache</strong> cannot handle more than one value per field. When loading a <strong>FieldCache.StringIndex</strong>, it runs a check intended to ensure there is no more than one term per field: it checks whether the number of unique terms is greater than the number of docs. My tests produce false negatives of this check, which leads to the unexpected behavior.</p>
<p>So, let&#8217;s say I have an index with 100 docs and a multiValued field with 2 values per document. If no value is repeated anywhere in the index, there are 200 unique terms for 100 docs and I get the exception. That&#8217;s due to the check done by the StringIndex. If there are just two different values and every document has both of them, no exception is thrown (a false negative of the check). In other words, the exception is thrown only when the number of unique terms exceeds the number of docs. That explains why loading a <strong>FieldCache.StringIndex</strong> on a field with more than one term per document can either end with a nasty exception or act as if nothing is wrong.</p>
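<p>The two cases above can be reproduced with a small simulation. This is only a sketch in plain Java, not Lucene&#8217;s code: the class <code>StringIndexSketch</code>, its <code>load</code> method and the exception message are hypothetical, and the only thing it assumes from the description above is that the loader gives up when the number of unique terms exceeds the number of docs.</p>

```java
import java.util.Arrays;

// Plain-Java simulation; no Lucene classes are used and all names are hypothetical.
class StringIndexSketch {

    // order[doc] points into lookup; lookup[0] is reserved for "no term".
    static final class Index {
        final int[] order;
        final String[] lookup;

        Index(int[] order, String[] lookup) {
            this.order = order;
            this.lookup = lookup;
        }
    }

    // docs[doc] holds the field values of one document.
    static Index load(String[][] docs) {
        int maxDoc = docs.length;

        // Collect every value and sort, mimicking an ordered term enumeration.
        int total = 0;
        for (String[] values : docs) {
            total += values.length;
        }
        String[] all = new String[total];
        int n = 0;
        for (String[] values : docs) {
            for (String value : values) {
                all[n++] = value;
            }
        }
        Arrays.sort(all);

        int[] order = new int[maxDoc];
        String[] lookup = new String[maxDoc + 1];
        int t = 1;
        String previous = null;
        for (String term : all) {
            if (term.equals(previous)) {
                continue; // this unique term was already handled
            }
            previous = term;
            // The only guard: more unique terms than documents.
            if (t >= lookup.length) {
                throw new RuntimeException("there are more terms than documents");
            }
            lookup[t] = term;
            for (int doc = 0; doc < maxDoc; doc++) {
                if (Arrays.asList(docs[doc]).contains(term)) {
                    order[doc] = t; // a later term silently overwrites an earlier one
                }
            }
            t++;
        }
        return new Index(order, lookup);
    }

    public static void main(String[] args) {
        // 100 docs, all with the same two values: the check passes.
        String[][] same = new String[100][];
        for (int i = 0; i < same.length; i++) {
            same[i] = new String[] {"a", "b"};
        }
        Index index = load(same);
        System.out.println("doc 0 resolves to " + index.lookup[index.order[0]]);

        // 100 docs, 200 distinct values: the check trips.
        String[][] distinct = new String[100][];
        for (int i = 0; i < distinct.length; i++) {
            distinct[i] = new String[] {"x" + i, "y" + i};
        }
        try {
            load(distinct);
        } catch (RuntimeException e) {
            System.out.println("exception: " + e.getMessage());
        }
    }
}
```

<p>In the first case every document ends up pointing at a single term even though it has two values, which is the silent misbehavior behind the false negative.</p>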
<p>There have been some fixes in the latest <strong>Lucene</strong> versions (trunk, 3x, 3.0, 2.9 branches). The behavior now is that once the number of terms exceeds the total number of documents, the array does not grow anymore, so at least no RuntimeException is thrown.</p>
<p>More info can be found in the JIRA issue <a title="Lucene FieldCache.StringIndex" href="https://issues.apache.org/jira/browse/LUCENE-2142" target="_blank"><strong>LUCENE-2142</strong></a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.marcsturlese.com/2010/06/27/lucene-fieldcache-stringindex-and-multivalued-fields/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Apache Lucene EuroCon 2010</title>
		<link>http://www.marcsturlese.com/2010/05/24/apache-lucene-eurocon-2010/</link>
		<comments>http://www.marcsturlese.com/2010/05/24/apache-lucene-eurocon-2010/#comments</comments>
		<pubDate>Mon, 24 May 2010 12:07:43 +0000</pubDate>
		<dc:creator>Marc Sturlese</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[EuroCon]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://www.marcsturlese.com/?p=159</guid>
		<description><![CDATA[Yesterday I came back from the Lucene EuroCon 2010, which took place in Prague. There were many interesting talks there. Some of the slides are already on SlideShare. Can&#8217;t wait for the others to be uploaded. I gave a talk on Thursday about our usage of Solr at Trovit. Covered an [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday I came back from the <strong>Lucene EuroCon</strong> 2010, which took place in Prague.<br />
There were many interesting talks there. Some of the slides are already on SlideShare. Can&#8217;t wait for the others to be uploaded.</p>
<p>I gave a talk on Thursday about our usage of <strong>Solr</strong> at Trovit. It covered an overview of our architecture, several of our out-of-the-box and custom features, and some of the future directions we have in mind.</p>
<p>&#8220;Munching and Crunching: <strong>Lucene</strong> Index post-processing&#8221; was definitely my favourite talk. Andrzej Bialecki covered topics I had never even thought about. Among other things, there was a pretty complete explanation of index splitting, pruning and multi-tiered search.<br />
People tend to think all data processing must be done at indexing time. Andrzej showed us that a lot of good stuff can be done once the index is already built.</p>
<p>In an hour, Yonik explained the main features coming in the new Solr releases in &#8220;<strong>Solr</strong> 1.5 and Beyond&#8221;. The Extended DisMax query parser, a quick introduction to <strong>SolrCloud</strong>, Spatial Search, Realtime and Field Collapsing were covered.</p>
<p>Grant Ingersoll spoke about <strong>Lucene</strong> / <strong>Solr</strong> relevance: &#8220;Practical Relevance: Tips and Tricks for Understanding and Improving Search Quality&#8221;.<br />
It was very interesting to hear about the most commonly used techniques for relevance testing:<br />
A/B testing, log analysis, empirical tests, asking users, or using related projects such as Open Relevance or TREC.</p>
<p>Mark Miller talked about <strong>SolrCloud</strong>. It promises to make life much easier for admins of distributed <strong>Solr</strong> installations.</p>
<p>There were really good topics in the MeetUp as well. &#8220;How We Scaled <strong>Solr</strong> to 3+ Billion Documents&#8221; by Jason Rutherglen was the one I was looking forward to the most. I always like to hear about big <strong>Solr</strong> deployments and <strong>Hadoop</strong> usage related to <strong>Lucene</strong> and <strong>Solr</strong> indexing. I think this one is the biggest I know of.</p>
<p>So, these days have been really useful. Many new ideas and a lot of things to test.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.marcsturlese.com/2010/05/24/apache-lucene-eurocon-2010/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Lucene 2.9.2 and 3.0.1 released</title>
		<link>http://www.marcsturlese.com/2010/02/27/lucene-2-9-2-and-3-0-1-released/</link>
		<comments>http://www.marcsturlese.com/2010/02/27/lucene-2-9-2-and-3-0-1-released/#comments</comments>
		<pubDate>Sat, 27 Feb 2010 14:52:21 +0000</pubDate>
		<dc:creator>Marc Sturlese</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Open source]]></category>

		<guid isPermaLink="false">http://www.marcsturlese.com/?p=154</guid>
		<description><![CDATA[Lucene 2.9.2 and 3.0.1 have been released. Both are mainly bug-fix releases over the previous ones. The main difference between versions 2 and 3 is that version 3 has no support for Java 1.4 and has a cleaner API, as deprecated stuff has been removed. This means if you want to upgrade [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Lucene</strong> 2.9.2 and 3.0.1 have been released. Both are mainly bug-fix releases over the previous ones.<br />
The main difference between versions 2 and 3 is that version 3 has no support for Java 1.4 and has a cleaner API, as deprecated stuff has been removed. This means that if you want to upgrade your <strong>Lucene</strong> JARs to v.3 you must use at least Java 1.5 and have no deprecation warnings in your code.<br />
More details of both releases can be found in the <a title="Lucene official announcement" href="http://www.search-lucene.com/m?id=000501cab6bc$7d2cdf00$77869d00$@de||[ANNOUNCE]%20Release%20of%20Lucene%20Java%203.0.1%20and%202.9.2" target="_blank">official announcement</a>:</p>
<blockquote><p><em>Hello <strong>Lucene</strong> users,</em></p>
<p><em>On behalf of the <strong>Lucene</strong> development community I would like to announce the release of <strong>Lucene</strong> Java versions 3.0.1 and 2.9.2:</em></p>
<p><em>Both releases fix bugs in the previous versions:</em></p>
<p><em>- 2.9.2 is a bugfix release for the <strong>Lucene</strong> Java 2.x series, based on Java 1.4<br />
- 3.0.1 has the same bug fix level but is for the <strong>Lucene</strong> Java 3.x series, based on Java 5.</em></p>
<p><em>New users of <strong>Lucene</strong> are advised to use version 3.0.1 for new developments, because it has a clean, type-safe API.</em></p>
<p><em>Important improvements in these releases include:</em></p>
<p><em>- An increased maximum number of unique terms in each index segment.<br />
- Fixed experimental CustomScoreQuery to respect per-segment search. This introduced an API change!<br />
- Important fixes to IndexWriter: a commit() thread-safety issue, lost document deletes in near real-time indexing.<br />
- Bugfixes for Contrib&#8217;s Analyzers package.<br />
- Restoration of some public methods that were lost during deprecation removal.<br />
- The new Attribute-based TokenStream API now works correctly with different class loaders.</em></p>
<p><em>Both releases are fully compatible with the corresponding previous versions. We strongly recommend upgrading to 2.9.2 if you are using 2.9.1 or 2.9.0; and to 3.0.1 if you are using 3.0.0.</em></p>
<p><em>See core changes at<br />
<a title="apache lucene" href="http://lucene.apache.org/java/3_0_1/changes/Changes.html" target="_blank">http://lucene.apache.org/java/3_0_1/changes/Changes.html</a><br />
<a title="apache lucene" href="http://lucene.apache.org/java/2_9_2/changes/Changes.html" target="_blank">http://lucene.apache.org/java/2_9_2/changes/Changes.html</a></em></p>
<p><em>and contrib changes at<br />
<a title="apache lucene" href="http://lucene.apache.org/java/3_0_1/changes/Contrib-Changes.html" target="_blank">http://lucene.apache.org/java/3_0_1/changes/Contrib-Changes.html</a><br />
<a title="apache lucene" href="http://lucene.apache.org/java/2_9_2/changes/Contrib-Changes.html" target="_blank">http://lucene.apache.org/java/2_9_2/changes/Contrib-Changes.html</a></em></p>
<p><em>Binary and source distributions are available at<br />
<a title="apache lucene" href="http://www.apache.org/dyn/closer.cgi/lucene/java/" target="_blank">http://www.apache.org/dyn/closer.cgi/lucene/java/</a></em></p>
<p><em><strong>Lucene</strong> artifacts are also available in the Maven2 repository at<br />
<a title="apache lucene" href="http://repo1.maven.org/maven2/org/apache/lucene/" target="_blank">http://repo1.maven.org/maven2/org/apache/lucene/</a></em></p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.marcsturlese.com/2010/02/27/lucene-2-9-2-and-3-0-1-released/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ElasticSearch</title>
		<link>http://www.marcsturlese.com/2010/02/12/elasticsearch/</link>
		<comments>http://www.marcsturlese.com/2010/02/12/elasticsearch/#comments</comments>
		<pubDate>Fri, 12 Feb 2010 00:33:20 +0000</pubDate>
		<dc:creator>Marc Sturlese</dc:creator>
				<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[ElasticSearch]]></category>

		<guid isPermaLink="false">http://www.marcsturlese.com/?p=145</guid>
		<description><![CDATA[It has been a long time since my last post. I have been very busy so, unfortunately, I have not had the time to write about everything I would like to. This week I discovered via Twitter a really interesting open source search project, ElasticSearch for the cloud. ElasticSearch has been created by Shay Banon. It&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>It has been a long time since my last post. I have been very busy so, unfortunately, I have not had the time to write about everything I would like to.</p>
<p>This week I discovered via Twitter a really interesting open source search project, <strong><a title="ElasticSearch" href="http://www.elasticsearch.com/" target="_blank">ElasticSearch for the cloud</a></strong>. <strong>ElasticSearch</strong> has been created by Shay Banon. It&#8217;s a RESTful search engine built on top of <strong><a title="Lucene" href="http://lucene.apache.org/java/docs/" target="_blank">Lucene</a></strong> and very well prepared for high scalability. It includes shard merging, replication and many more features.</p>
<p>Lately I have been working a lot on search scalability, and what I like the most about <strong>ElasticSearch</strong> so far is that it allows 4 different types of distributed requests.</p>
<p>The simplest one (Query and fetch) is just one request per relevant shard. Once all the requests are done, the results are merged and&#8230; that&#8217;s it!<br />
In this type of search, all fields of every returned document are sent to the merger.</p>
<p>In another search type (Query then fetch, which is not that simple), a first request is made across all shards without asking for the document content. Once the results are merged, you only need to ask for the full document data of the most relevant documents, the ones you want to show.<br />
If you have to search across lots of shards, that&#8217;s definitely the way to go (the merger only receives the fields of the important documents, which means less data is sent across the network).</p>
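<p>The two phases of Query then fetch can be sketched with a toy, in-memory model. This is not ElasticSearch code: the class <code>QueryThenFetchSketch</code> and its methods are hypothetical names, shards are plain arrays, and for brevity every shard reports all of its scores rather than only its best hits.</p>

```java
import java.util.Arrays;

// Toy in-memory model; names are hypothetical, not ElasticSearch APIs.
class QueryThenFetchSketch {

    static final class Hit {
        final int shard;
        final int doc;
        final double score;

        Hit(int shard, int doc, double score) {
            this.shard = shard;
            this.doc = doc;
            this.score = score;
        }
    }

    // Phase 1 (query): every shard reports only doc ids and scores.
    static Hit[] queryPhase(double[][] shardScores) {
        int total = 0;
        for (double[] scores : shardScores) {
            total += scores.length;
        }
        Hit[] hits = new Hit[total];
        int n = 0;
        for (int shard = 0; shard < shardScores.length; shard++) {
            for (int doc = 0; doc < shardScores[shard].length; doc++) {
                hits[n++] = new Hit(shard, doc, shardScores[shard][doc]);
            }
        }
        return hits;
    }

    // Merge: keep the n best hits across all shards.
    static Hit[] topN(Hit[] hits, int n) {
        Hit[] sorted = hits.clone();
        Arrays.sort(sorted, (a, b) -> Double.compare(b.score, a.score));
        return Arrays.copyOf(sorted, Math.min(n, sorted.length));
    }

    // Phase 2 (fetch): ask only the owning shards for the winning documents.
    static String[] fetchPhase(String[][] shardDocs, Hit[] winners) {
        String[] contents = new String[winners.length];
        for (int i = 0; i < winners.length; i++) {
            contents[i] = shardDocs[winners[i].shard][winners[i].doc];
        }
        return contents;
    }

    public static void main(String[] args) {
        double[][] shardScores = {{0.2, 0.9}, {0.5, 0.1}};
        String[][] shardDocs = {{"a0", "a1"}, {"b0", "b1"}};
        Hit[] winners = topN(queryPhase(shardScores), 2);
        System.out.println(Arrays.toString(fetchPhase(shardDocs, winners)));
    }
}
```

<p>Only the two winning documents travel to the merger in the second phase, which is where the network saving comes from.</p>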
<p>Both options present a typical problem in distributed search: relevance is calculated relative to each shard, so it is not absolute across all of them.<br />
To solve this, in <strong>ElasticSearch</strong> both search options can be supplemented with an initial request, which queries for the term frequency information needed to compute an &#8220;absolute relevance&#8221;.<br />
This is not free: you pay with an extra round trip (even though it can be cached), so it is good if you can avoid it. A good way to do that is at indexing time, when you decide which shard a document is added to. Choosing the shard randomly will more or less ensure that term frequencies don&#8217;t differ much among shards.</p>
<p>I still have not had the chance to dig into the source, but I have already downloaded it from the <a title="ElasticSearch" href="http://github.com/elasticsearch/elasticsearch" target="_blank">git repository</a>.<br />
Anyone who wants to share experiences with <strong>ElasticSearch</strong> is more than welcome.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.marcsturlese.com/2010/02/12/elasticsearch/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Lucene TrieRangeQuery</title>
		<link>http://www.marcsturlese.com/2009/04/08/lucene-trierangequery/</link>
		<comments>http://www.marcsturlese.com/2009/04/08/lucene-trierangequery/#comments</comments>
		<pubDate>Wed, 08 Apr 2009 21:50:43 +0000</pubDate>
		<dc:creator>Marc Sturlese</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[ApacheCon]]></category>
		<category><![CDATA[TrieRangeQuery]]></category>

		<guid isPermaLink="false">http://www.marcsturlese.com/?p=97</guid>
		<description><![CDATA[Lucene TrieRangeQuery is a cool contrib in Lucene (I think it is not yet in the official release) created by Uwe Schindler. I had heard about it before but really learned about it at the LuceneMeetUp at ApacheCon EU. Uwe gave a great speech about it. As I found it a really useful feature, I will try to explain the [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Lucene TrieRangeQuery</strong> is a cool contrib in <strong>Lucene</strong> (I think it is not yet in the official release) created by Uwe Schindler. I had heard about it before but really learned about it at the LuceneMeetUp at ApacheCon EU. Uwe gave a great speech about it. As I found it a really useful feature, I will try to explain the basics.</p>
<p><strong>TrieRangeQuery</strong> mainly solves some RangeQuery problems:</p>
<ul>
<li>A typical RangeQuery can end in a <strong>TooManyClausesException</strong> if our ranges are too large.</li>
</ul>
<ul>
<li>A typical RangeQuery or even a ConstantScoreRangeQuery is slow if it has to classify using large ranges or if the index is huge.</li>
</ul>
<p>To explain it in an easy way, what <strong>TrieRangeQuery</strong> does is search the data values while skipping the less relevant &#8220;digits&#8221;, depending on a precision parameter.</p>
<p>Let&#8217;s say, for example, we need to classify thousands of 6-digit numbers. This could be a slow process using ConstantScoreRangeQuery on a huge index, but not with <strong>TrieRangeQuery</strong>. Ranges are divided recursively according to a precision parameter (set at index time). Numbers from the middle of the range are matched using the lowest precision, while numbers near the extremes use a higher precision. This makes the query run much faster.</p>
<p>Depending on the precisionStep parameter given at index time, we will be able to search with more or less precision. The more precision levels we choose, the more space the Lucene document will occupy, because the field has to be indexed once for each of the different precisions.</p>
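<p>The trade-off between precision levels and index size can be illustrated with a little arithmetic. The sketch below is plain Java and does not use <strong>TrieUtils</strong>; it assumes, as a simplification, that a 64-bit value is indexed once per precision level by dropping precisionStep more low bits each time. The class and method names are hypothetical.</p>

```java
// Simplified illustration; not the TrieUtils encoding, all names are hypothetical.
class TriePrecisionSketch {

    // Number of terms indexed for one 64-bit value: one per precision level.
    static int termsPerValue(int precisionStep) {
        int count = 0;
        for (int shift = 0; shift < 64; shift += precisionStep) {
            count++;
        }
        return count;
    }

    // The value with its lowest "shift" bits dropped, i.e. one lower-precision term.
    static long prefix(long value, int shift) {
        return value >>> shift;
    }

    public static void main(String[] args) {
        int precisionStep = 8;
        long value = 123456L;

        // Every level is a coarser view of the same number.
        for (int shift = 0; shift < 64; shift += precisionStep) {
            System.out.println("shift=" + shift + " prefix=" + prefix(value, shift));
        }

        // A smaller step means more levels, so more terms per document.
        System.out.println("terms with step 8: " + termsPerValue(8));
        System.out.println("terms with step 4: " + termsPerValue(4));

        // Two nearby values share one low-precision term, so a range query can
        // match the whole block with a single term instead of one per value.
        System.out.println(prefix(123456L, 8) == prefix(123500L, 8));
    }
}
```

<p>With a step of 8 each value produces 8 terms, with a step of 4 it produces 16, which is the extra space the document occupies in exchange for finer-grained range splitting.</p>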
<p>We need to index data in a special way to be able to search it using <strong>Lucene TrieRangeQuery</strong>: we must index our fields using <strong>TrieUtils</strong>. We can index numbers directly. It supports Java signed int, long, float and double. There&#8217;s no loss of precision for doubles or floats. There&#8217;s no rounding when they are created; instead a long/int representation is used for cents.<br />
Indexing numbers with <strong>TrieUtils</strong> lets us forget about manual padding.<br />
We can index dates as well (from Java timestamp values).</p>
<p>As seen, <strong>Lucene TrieRangeQuery</strong> is definitely a step forward for the <strong>scalability</strong> of <strong>Lucene</strong> queries.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.marcsturlese.com/2009/04/08/lucene-trierangequery/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ApacheCon Europe 2009</title>
		<link>http://www.marcsturlese.com/2009/04/01/apachecon-europe-2009/</link>
		<comments>http://www.marcsturlese.com/2009/04/01/apachecon-europe-2009/#comments</comments>
		<pubDate>Wed, 01 Apr 2009 21:55:54 +0000</pubDate>
		<dc:creator>Marc Sturlese</dc:creator>
				<category><![CDATA[Random]]></category>
		<category><![CDATA[ApacheCon]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Pig]]></category>

		<guid isPermaLink="false">http://www.marcsturlese.com/?p=80</guid>
		<description><![CDATA[Last week I had the chance to go to ApacheCon Europe 2009. The event took place at the Mövenpick Hotel, Amsterdam. I had a really good time there. It was good to share use cases and experiences in person with people I had only spoken with in forums. I spent the first two days [...]]]></description>
			<content:encoded><![CDATA[<p>Last week I had the chance to go to <a title="ApacheCon Europe 2009" href="http://www.eu.apachecon.com/c/aceu2009/"><strong>ApacheCon Europe 2009</strong></a>. The event took place at the Mövenpick Hotel, Amsterdam. I had a really good time there.</p>
<p>It was good to share use cases and experiences in person with people I had only spoken with in forums.<br />
I spent the first two days in the <strong>hackathon</strong> doing some research and tests on different ASF projects. I took a special interest in <a title="Pig" href="http://hadoop.apache.org/pig/" target="_blank"><strong>Pig</strong></a>.</p>
<p>There were really interesting talks. I found the <a title="Lucene mahout" href="http://lucene.apache.org/mahout/" target="_blank"><strong>Mahout</strong></a> project especially great. I first came across it at <strong>ApacheCon</strong> 2008 in New Orleans, where I barely heard about it, but I paid more attention this time and it looks full of possibilities. It is used for machine learning and runs on top of <a title="Lucene" href="http://hadoop.apache.org/" target="_blank"><strong>Hadoop</strong></a>.<br />
It was also good to get some info about Servlet 3.0 and to learn about the servlet doFilter method and some other stuff.<br />
<a title="HBase" href="http://hadoop.apache.org/hbase/" target="_blank"><strong>HBase</strong></a> is another project I was interested in. It looks like a good fit for a &#8220;data warehouse&#8221;, but it seems really difficult (at least at first impression) to work with the stored data.</p>
<p>The meetups were very good too. There was a presentation about the new <a title="Lucene" href="http://lucene.apache.org/java/docs/" target="_blank"><strong>Lucene</strong></a> contrib <strong>TrieRangeQuery</strong>. It is still not available in the official release, but you can use it by grabbing a nightly build. In the next few days I will try to write in more detail about this and the other projects presented.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.marcsturlese.com/2009/04/01/apachecon-europe-2009/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Lucene 2.4.1 available from today</title>
		<link>http://www.marcsturlese.com/2009/03/09/lucene-241-available-from-today/</link>
		<comments>http://www.marcsturlese.com/2009/03/09/lucene-241-available-from-today/#comments</comments>
		<pubDate>Mon, 09 Mar 2009 22:19:34 +0000</pubDate>
		<dc:creator>Marc Sturlese</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>

		<guid isPermaLink="false">http://www.marcsturlese.com/?p=76</guid>
		<description><![CDATA[A new official release of Lucene is now available! Lucene 2.4.1 is a bug-fix version. We will be able to see more new features in the Lucene 2.9 release (available as a development version). Here I list all the improvements in Lucene 2.4.1, which I took from the official Lucene site: Fixed silent data-loss case [...]]]></description>
			<content:encoded><![CDATA[<p>A new official release of <strong>Lucene</strong> is now <a title="Lucene 2.4.1" href="http://www.apache.org/dyn/closer.cgi/lucene/java/" target="_blank">available</a>! <strong>Lucene 2.4.1</strong> is a bug-fix version.</p>
<p>We will be able to see more <a title="Lucene 2.9 issues" href="https://issues.apache.org/jira/browse/LUCENE/fixforversion/12312682;jsessionid=2E3B285A668328BAB28882BA53ECBB82" target="_blank">new features</a> in the <strong>Lucene 2.9</strong> release (available as a <a title="Lucene 2.9-dev" href="http://svn.apache.org/repos/asf/lucene/solr/trunk/lib/">development version</a>).</p>
<p>Here I list all the improvements in <strong>Lucene 2.4.1</strong>, which I took from the <strong><a title="Lucene 2.4.1 changes" href="http://lucene.apache.org/java/2_4_1/changes/Changes.html" target="_blank">official Lucene site</a></strong>:</p>
<ul>
<li>Fixed silent data-loss case whereby binary fields are truncated to 0 bytes during merging if the segments being merged are non-congruent (same field name maps to different field numbers).</li>
<li>Don&#8217;t throw incorrect <strong>IllegalStateException</strong> from <strong>IndexWriter.close()</strong> if you&#8217;ve hit an OOM when <strong>autoCommit</strong> is true.</li>
<li> If<strong> IndexReader.flush() </strong>is called twice when there were pending deletions, it could lead to later false AssertionError during <strong>IndexReader.open</strong>.</li>
<li>Fix false <strong>AlreadyClosedException</strong> from <strong>IndexReader.open</strong> (masking an actual IOException) that takes String or File path.</li>
<li>Multiple-valued <strong>NOT_ANALYZED</strong> fields can double-count token offsets.</li>
<li>Ensure <strong>IndexReader.reopen() </strong>does not result in incorrectly closing the shared FSDirectory.  This bug would only happen if you use <strong>IndexReader.open</strong> with a File or String argument.</li>
<li>Fix possible overflow bugs during binary searches.</li>
<li>Fix <strong>CachingWrapperFilter </strong>to not throw exception if both <strong>bits()</strong> and <strong>getDocIdSet() </strong>methods are called.</li>
<li>Fix int overflow bug during segment merging.</li>
<li>Fix int overflow bug when flushing segment.</li>
<li>Fix deadlock in <strong>IndexWriter.addIndexes(IndexReader[])</strong>.</li>
<li><strong>NearSpansOrdered</strong> returns <strong>payloads</strong> from first possible match rather than the correct, shortest match; Payloads could be returned even if the max slop was exceeded; The wrong payload could be returned in certain situations.</li>
<li>Add <strong>Analyzer.close()</strong> to free internal <strong>ThreadLocal</strong> resources.</li>
<li>Fix <strong>IndexWriter.addIndexes(IndexReader[]) </strong>to properly rollback IndexWriter&#8217;s internal state on hitting an exception.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.marcsturlese.com/2009/03/09/lucene-241-available-from-today/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Index scalability using Pig</title>
		<link>http://www.marcsturlese.com/2009/03/02/index-scalability-using-pig/</link>
		<comments>http://www.marcsturlese.com/2009/03/02/index-scalability-using-pig/#comments</comments>
		<pubDate>Mon, 02 Mar 2009 22:37:41 +0000</pubDate>
		<dc:creator>Marc Sturlese</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Pig]]></category>

		<guid isPermaLink="false">http://www.marcsturlese.com/?p=70</guid>
		<description><![CDATA[Here is a really interesting example of how to build an inverted index using Pig. As I have seen in Hadoop, to create a Lucene index you must start from a text file and use MapReduce jobs to build it. Pig however, allows you to retrieve data not just from a text file but from [...]]]></description>
			<content:encoded><![CDATA[<p>Here is a really <a title="Build index with Hadoop Pig" href="http://squarecog.wordpress.com/2009/01/17/building-an-inverted-index-with-hadoop-and-pig/" target="_blank">interesting example</a> of how to build an inverted index using <strong>Pig</strong>. From what I have seen, in <strong>Hadoop</strong> you must start from a text file and use <strong>MapReduce</strong> jobs to create a <strong>Lucene index</strong>. <strong>Pig</strong>, however, allows you to retrieve data not just from a text file but also from <strong>SQL databases, HBase</strong> or other data sources.</p>
<p>After checking the example in detail, what now comes to my mind is whether it would be possible to create a <strong>Lucene</strong> index using <strong>Pig</strong> and <strong>MapReduce</strong> jobs that retrieve data from a distributed <strong>HBase</strong> data store&#8230; I am wondering whether there would be problems with <strong>Lucene</strong> analyzers (or anything else), for example.</p>
<p>I have read that <strong>Pig</strong> is not especially fast at accessing data. However, for indexing this would probably be more than compensated for by the <strong>MapReduce</strong> jobs.</p>
<p>How fast would it be? I still have lots of research and tests to do&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.marcsturlese.com/2009/03/02/index-scalability-using-pig/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Solr and Hadoop integration against scalability problems</title>
		<link>http://www.marcsturlese.com/2009/02/06/solr-and-hadoop-integration-against-scalability-problems/</link>
		<comments>http://www.marcsturlese.com/2009/02/06/solr-and-hadoop-integration-against-scalability-problems/#comments</comments>
		<pubDate>Thu, 05 Feb 2009 23:01:27 +0000</pubDate>
		<dc:creator>Marc Sturlese</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.marcsturlese.com/?p=62</guid>
		<description><![CDATA[Recently I read an article explaining how Rackspace solved the problem of dealing with their huge volume of log data. They have implemented the best Hadoop and Solr integration I have seen so far; it really looks amazing. I don&#8217;t know Hadoop in detail, but running Solr instances from a Tomcat server stored in HDFS (Hadoop&#8217;s distributed file [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I read an <a href="http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data">article</a> explaining how Rackspace solved the problem of dealing with their huge volume of log data. They have implemented the best <a title="Hadoop" href="http://hadoop.apache.org/core/" target="_blank"><strong>Hadoop</strong></a> and <a title="Solr" href="http://lucene.apache.org/solr/" target="_blank"><strong>Solr</strong></a> integration I have seen so far; it really looks amazing.<br />
I don&#8217;t know <strong>Hadoop</strong> in detail, but running <strong>Solr</strong> instances from a Tomcat server stored in <strong>HDFS</strong> (Hadoop&#8217;s distributed file system) sounds like a pretty good job!<br />
The whole process is well described in the article; I just want to mention the basic steps they followed:</p>
<ul>
<li>Store huge amounts of log data in the <strong>HDFS</strong>.</li>
<li><strong>MapReduce</strong> is used to create <strong>Lucene</strong> indexes from the stored data using <strong>Solr</strong>.</li>
<li>Once built, indexes are compressed in <strong>Hadoop nodes</strong>.</li>
<li>These indexes are merged using <strong>Solr</strong> webapps, deployed in Tomcat servers which are stored in <strong>Hadoop nodes</strong> too (that is for me the most impressive part). These <strong>Solr</strong> instances serve fast search requests as well.</li>
</ul>
<p>This kind of architecture could probably be used to solve scalability problems in other fields, not just log processing: search engines, for example. The amount of data to deal with there might be smaller, but many more features would probably be needed.</p>
<p style="text-align: center;"><img class="aligncenter" title="Hadoop open source" src="http://www.marcsturlese.com/wp-content/images/hadoop-logo.jpg" alt="Hadoop open source" width="250" height="59" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.marcsturlese.com/2009/02/06/solr-and-hadoop-integration-against-scalability-problems/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Opening lucene 2.9-dev indexes with Luke Lucene Index Toolbox</title>
		<link>http://www.marcsturlese.com/2009/01/13/lucene29_luke_index_toolbox/</link>
		<comments>http://www.marcsturlese.com/2009/01/13/lucene29_luke_index_toolbox/#comments</comments>
		<pubDate>Tue, 13 Jan 2009 22:47:20 +0000</pubDate>
		<dc:creator>Marc Sturlese</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Luke]]></category>

		<guid isPermaLink="false">http://www.marcsturlese.com/?p=1</guid>
		<description><![CDATA[Lately I have started using the development version of Lucene (2.9-dev). When I wanted to open an index using Luke to check some content it just did not work: I got a &#8220;lucene invalid index&#8221; error. After a while I realized it was totally normal. The cause of the error is that the latest Luke&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>Lately I have started using the development version of <strong>Lucene (2.9-dev)</strong>. When I wanted to open an index using <strong>Luke</strong> to check some content it just did not work: I got a &#8220;lucene invalid index&#8221; error. After a while I realized it was totally normal. The cause of the error is that the latest Luke release uses the Lucene 2.4 libraries. If the index was created using the 2.9-dev libs, Luke will think that the index is malformed. We just need to update the libs to make it work.</p>
<p>First of all I downloaded the <a title="Luke Lucene Tool Box" href="http://www.getopt.org/luke/" target="_blank">Luke source code</a>. I opened the lib folder and replaced the 2.4 libraries with the ones from the Lucene development release. Once that was done, I compiled the source with ant and&#8230; that&#8217;s it, I could check my indexed data using my own compiled version of <strong>Luke</strong>!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.marcsturlese.com/2009/01/13/lucene29_luke_index_toolbox/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
