<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Marc Sturlese &#187; Solr</title>
	<atom:link href="http://www.marcsturlese.com/category/solr/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.marcsturlese.com</link>
	<description>Life, code and stuff</description>
	<lastBuildDate>Sun, 27 Jun 2010 14:45:42 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Apache Lucene EuroCon 2010</title>
		<link>http://www.marcsturlese.com/2010/05/24/apache-lucene-eurocon-2010/</link>
		<comments>http://www.marcsturlese.com/2010/05/24/apache-lucene-eurocon-2010/#comments</comments>
		<pubDate>Mon, 24 May 2010 12:07:43 +0000</pubDate>
		<dc:creator>Marc Sturlese</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[EuroCon]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://www.marcsturlese.com/?p=159</guid>
		<description><![CDATA[Yesterday I came back from Lucene EuroCon 2010, which took place in Prague. There were many interesting talks over these days. Some of the slides are already on SlideShare. I can&#8217;t wait for the others to be uploaded. I gave a talk on Thursday about our usage of Solr at Trovit. Covered an [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday I came back from <strong>Lucene EuroCon</strong> 2010, which took place in Prague.<br />
There were many interesting talks over these days. Some of the slides are already on SlideShare. I can&#8217;t wait for the others to be uploaded.</p>
<p>I gave a talk on Thursday about our usage of <strong>Solr</strong> at Trovit. It covered an overview of our architecture, several of our out-of-the-box and custom features, and some of the future directions we have in mind.</p>
<p>&#8220;Munching and Crunching: <strong>Lucene</strong> Index post-processing&#8221; was definitely my favourite talk. Andrzej Bialecki covered topics I had never even thought about. Among other things, there was a pretty complete explanation of index splitting, pruning and multi-tiered search.<br />
People tend to think all data processing must be done at indexing time. Andrzej showed us that a lot of good stuff can be done once the index is already built.</p>
<p>In one hour, Yonik explained the main features coming in new Solr releases: &#8220;<strong>Solr</strong> 1.5 and Beyond&#8221;. The Extended DisMax query parser, a quick introduction to <strong>SolrCloud</strong>, Spatial Search, Realtime and Field Collapsing were all covered.</p>
<p>Grant Ingersoll spoke about <strong>Lucene</strong> / <strong>Solr</strong> relevance: &#8220;Practical Relevance: Tips and Tricks for Understanding and Improving Search Quality&#8221;.<br />
It was very interesting to hear about the most commonly used techniques for relevance testing:<br />
A/B testing, log analysis, empirical tests, asking users, or using related projects such as Open Relevance or TREC.</p>
<p>Mark Miller talked about <strong>SolrCloud</strong>. It promises to make life much easier for admins of distributed <strong>Solr</strong> installations.</p>
<p>There were really good topics in the MeetUp as well. &#8220;How We Scaled <strong>Solr</strong> to 3+ Billion Documents&#8221; by Jason Rutherglen was the one I was looking forward to the most. I always like to hear about big <strong>Solr</strong> deployments and <strong>Hadoop</strong> usage related to <strong>Lucene</strong> and <strong>Solr</strong> indexing. I think this is the biggest deployment I know of.</p>
<p>So, these days have been really useful. Many new ideas and a lot of stuff to test.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.marcsturlese.com/2010/05/24/apache-lucene-eurocon-2010/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Solr and Hadoop integration against scalability problems</title>
		<link>http://www.marcsturlese.com/2009/02/06/solr-and-hadoop-integration-against-scalability-problems/</link>
		<comments>http://www.marcsturlese.com/2009/02/06/solr-and-hadoop-integration-against-scalability-problems/#comments</comments>
		<pubDate>Thu, 05 Feb 2009 23:01:27 +0000</pubDate>
		<dc:creator>Marc Sturlese</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.marcsturlese.com/?p=62</guid>
		<description><![CDATA[Recently I read an article explaining how Rackspace solved the problem of dealing with their huge volume of log data. They have implemented the best Hadoop and Solr integration I have seen so far; it really looks amazing. I don&#8217;t know Hadoop in detail, but running Solr instances from a Tomcat server stored in HDFS (Hadoop&#8217;s distributed file [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I read an <a href="http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data">article</a> explaining how Rackspace solved the problem of dealing with their huge volume of log data. They have implemented the best <a title="Hadoop" href="http://hadoop.apache.org/core/" target="_blank"><strong>Hadoop</strong></a> and <a title="Solr" href="http://lucene.apache.org/solr/" target="_blank"><strong>Solr</strong></a> integration I have seen so far; it really looks amazing.<br />
I don&#8217;t know <strong>Hadoop</strong> in detail, but running <strong>Solr</strong> instances from a Tomcat server stored in <strong>HDFS</strong> (Hadoop&#8217;s distributed file system) sounds like a pretty good job!<br />
The whole process is well described in the article; I just want to mention the basic steps they followed:</p>
<ul>
<li>Huge amounts of log data are stored in <strong>HDFS</strong>.</li>
<li><strong>MapReduce</strong> is used to create <strong>Lucene</strong> indexes from the stored data using <strong>Solr</strong>.</li>
<li>Once built, the indexes are compressed in <strong>Hadoop nodes</strong>.</li>
<li>These indexes are merged using <strong>Solr</strong> webapps deployed in Tomcat servers, which are stored in <strong>Hadoop nodes</strong> too (for me, that is the most impressive part). These <strong>Solr</strong> instances serve fast search requests as well.</li>
</ul>
<p>This kind of architecture could probably be used to solve scalability problems in other fields, not just log processing: search engines, for example. The amount of data to deal with there might be smaller, but many more features would probably be needed.</p>
<p style="text-align: center;"><img class="aligncenter" title="Hadoop open source" src="http://www.marcsturlese.com/wp-content/images/hadoop-logo.jpg" alt="Hadoop open source" width="250" height="59" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.marcsturlese.com/2009/02/06/solr-and-hadoop-integration-against-scalability-problems/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
