<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Marc Sturlese &#187; Hadoop</title>
	<atom:link href="http://www.marcsturlese.com/tag/hadoop/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.marcsturlese.com</link>
	<description>Life, code and stuff</description>
	<lastBuildDate>Sun, 27 Jun 2010 14:45:42 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Apache Lucene EuroCon 2010</title>
		<link>http://www.marcsturlese.com/2010/05/24/apache-lucene-eurocon-2010/</link>
		<comments>http://www.marcsturlese.com/2010/05/24/apache-lucene-eurocon-2010/#comments</comments>
		<pubDate>Mon, 24 May 2010 12:07:43 +0000</pubDate>
		<dc:creator>Marc Sturlese</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[EuroCon]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[Hadoop]]></category>

		<guid isPermaLink="false">http://www.marcsturlese.com/?p=159</guid>
		<description><![CDATA[Yesterday I came back from Lucene EuroCon 2010, which took place in Prague. There were many interesting talks over those days. Some of the slides are already on SlideShare; I can&#8217;t wait for the others to be uploaded. I gave a talk on Thursday about how we use Solr at Trovit. Covered an [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday I came back from <strong>Lucene EuroCon</strong> 2010, which took place in Prague.<br />
There were many interesting talks over those days. Some of the slides are already on SlideShare; I can&#8217;t wait for the others to be uploaded.</p>
<p>I gave a talk on Thursday about how we use <strong>Solr</strong> at Trovit. It covered an overview of our architecture, several of our out-of-the-box and custom features, and some of the future directions we have in mind.</p>
<p>&#8220;Munching and Crunching: <strong>Lucene</strong> Index post-processing&#8221; was definitely my favourite talk. Andrzej Bialecki covered topics I had never even thought about. Among other things, there was a fairly complete explanation of index splitting, pruning and multi-tiered search.<br />
People tend to think all data processing must be done at indexing time. Andrzej showed us that a lot of useful work can be done once the index is already built.</p>
<p>In one hour, Yonik explained the main features coming in new Solr releases in &#8220;<strong>Solr</strong> 1.5 and Beyond&#8221;. The Extended DisMax query parser, a quick introduction to <strong>SolrCloud</strong>, Spatial Search, Realtime Search and Field Collapsing were covered.</p>
<p>Grant Ingersoll spoke about <strong>Lucene</strong> / <strong>Solr</strong> relevance: &#8220;Practical Relevance: Tips and Tricks for Understanding and Improving Search Quality&#8221;.<br />
It was very interesting to hear about the most commonly used techniques for relevance testing:<br />
A/B testing, log analysis, empirical tests, asking users, or using related projects such as Open Relevance or TREC.</p>
<p>Mark Miller talked about <strong>SolrCloud</strong>. It promises to make life much easier for admins of distributed <strong>Solr</strong> installations.</p>
<p>There were really good topics at the MeetUp as well. &#8220;How We Scaled <strong>Solr</strong> to 3+ Billion Documents&#8221; by Jason Rutherglen was the one I was looking forward to the most. I always like to hear about big <strong>Solr</strong> deployments and <strong>Hadoop</strong> usage related to <strong>Lucene</strong> and <strong>Solr</strong> indexing. I think this is the biggest one I know of.</p>
<p>So, these days have been really useful. Many new ideas, and lots of things to test.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.marcsturlese.com/2010/05/24/apache-lucene-eurocon-2010/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>CloudCamp Barcelona 2009</title>
		<link>http://www.marcsturlese.com/2009/06/18/cloudcamp-barcelona-2009/</link>
		<comments>http://www.marcsturlese.com/2009/06/18/cloudcamp-barcelona-2009/#comments</comments>
		<pubDate>Wed, 17 Jun 2009 23:43:49 +0000</pubDate>
		<dc:creator>Marc Sturlese</dc:creator>
				<category><![CDATA[Cloud computing]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Abicloud]]></category>
		<category><![CDATA[CloudCamp]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Sqoop]]></category>

		<guid isPermaLink="false">http://www.marcsturlese.com/?p=134</guid>
		<description><![CDATA[Last Monday the first CloudCamp ever held in Barcelona took place. Although I was expecting more technical content, it was good to be there and listen to what people had to say. The first part of the event consisted of quick presentations from different companies related to cloud computing. Basically, were [...]]]></description>
			<content:encoded><![CDATA[<p>Last Monday the first <a title="CloudCamp" href="http://www.cloudcamp.com/?page_id=902" target="_blank"><strong>CloudCamp</strong></a> ever held in Barcelona took place. Although I was expecting more technical content, it was good to be there and listen to what people had to say.<br />
The first part of the event consisted of quick presentations from different companies related to cloud computing. Basically, they explained the cloud options and advantages they offer. The one I enjoyed the most was Abiquo&#8217;s presentation of their new software, <a title="Abicloud" href="http://www.abiquo.com/en/products/abicloud" target="_blank">Abicloud</a>. Through a really nice GUI developed with Flex, Abicloud, among other things, lets you set up virtual machines and automatically configure an Apache server, a MySQL database&#8230; with just a few drag &amp; drop actions. You can use your own machines, servers from an ISP, or even combine both. You can elastically increase or decrease the number of virtual machines. This can be very convenient for sites with high traffic peaks or for testing environments.<br />
I am not going to say more about it, as a five-minute presentation only let me get the main idea. I can&#8217;t wait to have some free time to start playing with it. I will just add that Abicloud is completely open source.</p>
<p>After the quick talks, the following topics were discussed:</p>
<ul>
<li> What guarantees do I have with <strong>Cloud Computing</strong>?</li>
<li> What legal issues are there with your data?</li>
<li> Are standards important? If so, which ones?</li>
<li> What is the benefit for a company with only a few dozen servers?</li>
<li> Best platform for starting a cloud hosting company?</li>
<li> Is cloud computing green? If so, how?</li>
</ul>
<p>In the end, people split into groups depending on which topic they wanted to explore in more depth. I attended &#8220;How to develop applications that are going to run in the cloud&#8221;. There I had an interesting quick chat about application scalability and how to dump MySQL databases to <strong>HDFS</strong> using Cloudera&#8217;s tool <strong><a title="Hadoop's Sqoop" href="http://www.cloudera.com/hadoop-sqoop#getting_sqoop" target="_blank">Sqoop</a></strong>.</p>
<p><img class="aligncenter size-medium wp-image-140" title="cloudcamp" src="http://www.marcsturlese.com/wp-content/images/cloudcamp-300x72.jpg" alt="cloudcamp" width="260" height="62" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.marcsturlese.com/2009/06/18/cloudcamp-barcelona-2009/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ApacheCon Europe 2009</title>
		<link>http://www.marcsturlese.com/2009/04/01/apachecon-europe-2009/</link>
		<comments>http://www.marcsturlese.com/2009/04/01/apachecon-europe-2009/#comments</comments>
		<pubDate>Wed, 01 Apr 2009 21:55:54 +0000</pubDate>
		<dc:creator>Marc Sturlese</dc:creator>
				<category><![CDATA[Random]]></category>
		<category><![CDATA[ApacheCon]]></category>
		<category><![CDATA[Events]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Mahout]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Pig]]></category>

		<guid isPermaLink="false">http://www.marcsturlese.com/?p=80</guid>
		<description><![CDATA[Last week I had the chance to go to ApacheCon Europe 2009. The event took place at the Mövenpick Hotel in Amsterdam. I had a really good time there. It was good to share use cases and experiences in person with people I had only spoken with in forums. I spent the first two days [...]]]></description>
			<content:encoded><![CDATA[<p>Last week I had the chance to go to <a title="ApacheCon Europe 2009" href="http://www.eu.apachecon.com/c/aceu2009/"><strong>ApacheCon Europe 2009</strong></a>. The event took place at the Mövenpick Hotel in Amsterdam. I had a really good time there.</p>
<p>It was good to share use cases and experiences in person with people I had only spoken with in forums.<br />
I spent the first two days at the <strong>hackathon</strong> doing some research and testing different ASF projects. I took a special interest in <a title="Pig" href="http://hadoop.apache.org/pig/" target="_blank"><strong>Pig</strong></a>.</p>
<p>There were really interesting talks. I found the <a title="Lucene mahout" href="http://lucene.apache.org/mahout/" target="_blank"><strong>Mahout</strong></a> project especially great. I had discovered it at <strong>ApacheCon</strong> 2008 in New Orleans, where I barely heard about it, but I paid more attention this time and it looks full of possibilities. It is used for machine learning and runs on top of <a title="Lucene" href="http://hadoop.apache.org/" target="_blank"><strong>Hadoop</strong></a>.<br />
It was also good to get some information about Servlet 3.0 and to learn about the servlet API&#8217;s doFilter method, among other things.<br />
<a title="HBase" href="http://hadoop.apache.org/hbase/" target="_blank"><strong>HBase</strong></a> is another project I was interested in. It looks well suited for use as a &#8220;data warehouse&#8221;, but working with the stored data seems really difficult (at least at first impression).</p>
<p>The meetups were very good too. There was a presentation about the new <a title="Lucene" href="http://lucene.apache.org/java/docs/" target="_blank"><strong>Lucene</strong></a> contrib <strong>TrieRangeQuery</strong>. It is not yet available in an official release, but you can use it by grabbing a nightly build. In the next few days I will try to write in more detail about this and the other projects presented.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.marcsturlese.com/2009/04/01/apachecon-europe-2009/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Index scalability using Pig</title>
		<link>http://www.marcsturlese.com/2009/03/02/index-scalability-using-pig/</link>
		<comments>http://www.marcsturlese.com/2009/03/02/index-scalability-using-pig/#comments</comments>
		<pubDate>Mon, 02 Mar 2009 22:37:41 +0000</pubDate>
		<dc:creator>Marc Sturlese</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Pig]]></category>

		<guid isPermaLink="false">http://www.marcsturlese.com/?p=70</guid>
		<description><![CDATA[Here is a really interesting example of how to build an inverted index using Pig. As far as I have seen, to create a Lucene index with Hadoop you must start from a text file and use MapReduce jobs to build it. Pig, however, allows you to retrieve data not just from a text file but from [...]]]></description>
			<content:encoded><![CDATA[<p>Here is a really <a title="Build index with Hadoop Pig" href="http://squarecog.wordpress.com/2009/01/17/building-an-inverted-index-with-hadoop-and-pig/" target="_blank">interesting example</a> of how to build an inverted index using <strong>Pig</strong>. As far as I have seen, to create a <strong>Lucene index</strong> with <strong>Hadoop</strong> you must start from a text file and use <strong>MapReduce</strong> jobs to build it. <strong>Pig</strong>, however, allows you to retrieve data not just from a text file but also from <strong>SQL databases, HBase</strong> or other data sources.</p>
<p>After checking the example in detail, what now comes to mind is whether it would be possible to create a <strong>Lucene</strong> index using <strong>Pig</strong> and <strong>MapReduce</strong> jobs that retrieve data from a distributed <strong>HBase</strong> data store&#8230; I am wondering whether there would be problems with <strong>Lucene</strong> analyzers (or anything else), for example.</p>
<p>I have read that <strong>Pig</strong> is not especially fast at accessing data. However, for indexing, this would probably be more than compensated for by the <strong>MapReduce</strong> jobs.</p>
<p>How fast would it be? I still have lots of research and tests to do&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.marcsturlese.com/2009/03/02/index-scalability-using-pig/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Solr and Hadoop integration against scalability problems</title>
		<link>http://www.marcsturlese.com/2009/02/06/solr-and-hadoop-integration-against-scalability-problems/</link>
		<comments>http://www.marcsturlese.com/2009/02/06/solr-and-hadoop-integration-against-scalability-problems/#comments</comments>
		<pubDate>Thu, 05 Feb 2009 23:01:27 +0000</pubDate>
		<dc:creator>Marc Sturlese</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.marcsturlese.com/?p=62</guid>
		<description><![CDATA[Recently I read an article explaining how Rackspace solved the problem of dealing with their huge volume of log data. They have implemented the best Hadoop and Solr integration I have seen so far; it really looks amazing. I don&#8217;t know Hadoop in detail, but running Solr instances from a Tomcat server stored in HDFS (Hadoop&#8217;s distributed file [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I read an <a href="http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data">article</a> explaining how Rackspace solved the problem of dealing with their huge volume of log data. They have implemented the best <a title="Hadoop" href="http://hadoop.apache.org/core/" target="_blank"><strong>Hadoop</strong></a> and <a title="Solr" href="http://lucene.apache.org/solr/" target="_blank"><strong>Solr</strong></a> integration I have seen so far; it really looks amazing.<br />
I don&#8217;t know <strong>Hadoop</strong> in detail, but running <strong>Solr</strong> instances from a Tomcat server stored in <strong>HDFS</strong> (Hadoop&#8217;s distributed file system) sounds like a pretty good job!<br />
The whole process is well described in the article; I just want to mention the basic steps they followed:</p>
<ul>
<li>Huge amounts of log data are stored in <strong>HDFS</strong>.</li>
<li><strong>MapReduce</strong> is used to create <strong>Lucene</strong> indexes from the stored data using <strong>Solr</strong>.</li>
<li>Once built, the indexes are compressed on <strong>Hadoop nodes</strong>.</li>
<li>These indexes are merged using <strong>Solr</strong> webapps deployed in Tomcat servers, which are stored on <strong>Hadoop nodes</strong> too (for me, that is the most impressive part). These <strong>Solr</strong> instances allow fast search requests as well.</li>
</ul>
<p>This kind of architecture could probably be used to solve scalability problems in other fields, not just log processing: search engines, for example. There the amount of data to deal with might be smaller, but many more features would probably be needed.</p>
<p style="text-align: center;"><img class="aligncenter" title="Hadoop open source" src="http://www.marcsturlese.com/wp-content/images/hadoop-logo.jpg" alt="Hadoop open source" width="250" height="59" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.marcsturlese.com/2009/02/06/solr-and-hadoop-integration-against-scalability-problems/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
