<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Marc Sturlese &#187; Hadoop</title>
	<atom:link href="http://www.marcsturlese.com/category/hadoop/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.marcsturlese.com</link>
	<description>Life, code and stuff</description>
	<lastBuildDate>Sun, 27 Jun 2010 14:45:42 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Index scalability using Pig</title>
		<link>http://www.marcsturlese.com/2009/03/02/index-scalability-using-pig/</link>
		<comments>http://www.marcsturlese.com/2009/03/02/index-scalability-using-pig/#comments</comments>
		<pubDate>Mon, 02 Mar 2009 22:37:41 +0000</pubDate>
		<dc:creator>Marc Sturlese</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Pig]]></category>

		<guid isPermaLink="false">http://www.marcsturlese.com/?p=70</guid>
		<description><![CDATA[Here is a really interesting example of how to build an inverted index using Pig. From what I have seen, to create a Lucene index with Hadoop you must start from a text file and use MapReduce jobs to build it. Pig, however, allows you to retrieve data not just from a text file but from [...]]]></description>
			<content:encoded><![CDATA[<p>Here is a really <a title="Build index with Hadoop Pig" href="http://squarecog.wordpress.com/2009/01/17/building-an-inverted-index-with-hadoop-and-pig/" target="_blank">interesting example</a> of how to build an inverted index using <strong>Pig</strong>. From what I have seen, to create a <strong>Lucene index</strong> with <strong>Hadoop</strong> you must start from a text file and use <strong>MapReduce</strong> jobs to build it. <strong>Pig</strong>, however, allows you to retrieve data not just from a text file but from <strong>SQL databases, HBase</strong> or other data sources.</p>
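<p>To make the idea of an inverted index concrete, here is a minimal sketch in plain Python. It is not the Pig script from the linked example and it does not use the Hadoop or Lucene APIs; the function names (<code>map_phase</code>, <code>reduce_phase</code>, <code>build_inverted_index</code>) and the sample documents are assumptions made up for illustration. It only mimics the map and reduce steps in a single process, with naive whitespace tokenization standing in for a real analyzer.</p>

```python
from collections import defaultdict


def map_phase(doc_id, text):
    # Emit one (term, doc_id) pair per distinct lowercased token.
    # Splitting on whitespace is a stand-in for a real analyzer.
    return [(term, doc_id) for term in set(text.lower().split())]


def reduce_phase(pairs):
    # Group document ids by term to form sorted postings lists.
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def build_inverted_index(docs):
    # Run the map step over every document, then reduce all emitted pairs.
    pairs = []
    for doc_id, text in docs.items():
        pairs.extend(map_phase(doc_id, text))
    return reduce_phase(pairs)


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Pig runs on Hadoop",
    2: "Lucene builds the index",
    3: "Hadoop feeds Lucene",
}
index = build_inverted_index(docs)
print(index["hadoop"])  # [1, 3]
```

<p>In a real cluster the map step would run in parallel over input splits and the framework would do the grouping by term; the sketch keeps everything in memory only to show the shape of the data.</p>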
<p>After going through the example in detail, what comes to mind is whether it would be possible to create a <strong>Lucene</strong> index using <strong>Pig</strong> and <strong>MapReduce</strong> jobs that retrieve data from a distributed <strong>HBase</strong> data store&#8230; I wonder, for example, whether there would be problems with the <strong>Lucene</strong> analyzers (or anything else).</p>
<p>I have read that <strong>Pig</strong> is not especially fast at accessing data. However, for indexing, this would probably be more than compensated for by the <strong>MapReduce</strong> jobs.</p>
<p>How fast would it be? I still have lots of research and tests to do&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.marcsturlese.com/2009/03/02/index-scalability-using-pig/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Solr and Hadoop integration against scalability problems</title>
		<link>http://www.marcsturlese.com/2009/02/06/solr-and-hadoop-integration-against-scalability-problems/</link>
		<comments>http://www.marcsturlese.com/2009/02/06/solr-and-hadoop-integration-against-scalability-problems/#comments</comments>
		<pubDate>Thu, 05 Feb 2009 23:01:27 +0000</pubDate>
		<dc:creator>Marc Sturlese</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://www.marcsturlese.com/?p=62</guid>
		<description><![CDATA[Recently I read an article explaining how Rackspace solved the problem of dealing with their huge volume of log data. They have implemented the best Hadoop and Solr integration I have seen so far; it really looks amazing. I don&#8217;t know Hadoop in detail, but running Solr instances from a Tomcat server stored in HDFS (Hadoop&#8217;s distributed file [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I read an <a href="http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data">article</a> explaining how Rackspace solved the problem of dealing with their huge volume of log data. They have implemented the best <a title="Hadoop" href="http://hadoop.apache.org/core/" target="_blank"><strong>Hadoop</strong></a> and <a title="Solr" href="http://lucene.apache.org/solr/" target="_blank"><strong>Solr</strong></a> integration I have seen so far; it really looks amazing.<br />
I don&#8217;t know <strong>Hadoop</strong> in detail, but running <strong>Solr</strong> instances from a Tomcat server stored in <strong>HDFS</strong> (Hadoop&#8217;s distributed file system) sounds like a pretty good job!<br />
The whole process is well described in the article; I just want to mention the basic steps they followed:</p>
<ul>
<li>Huge amounts of log data are stored in <strong>HDFS</strong>.</li>
<li><strong>MapReduce</strong> is used to create <strong>Lucene</strong> indexes from the stored data using <strong>Solr</strong>.</li>
<li>Once built, indexes are compressed in <strong>Hadoop nodes</strong>.</li>
<li>These indexes are merged using <strong>Solr</strong> webapps deployed in Tomcat servers, which are stored in <strong>Hadoop nodes</strong> too (that is for me the most impressive part). These <strong>Solr</strong> instances allow fast search requests as well.</li>
</ul>
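<p>The merge and search steps above can be sketched in a few lines of plain Python. This is not how Rackspace or Solr actually do it, and it uses no Lucene or Solr API: the helper names (<code>merge_indexes</code>, <code>search</code>) and the per-node sample data are hypothetical, and each partial index is modelled as a simple dictionary from term to a list of document ids.</p>

```python
def merge_indexes(partial_indexes):
    # Union the postings lists of every partial index, term by term.
    merged = {}
    for partial in partial_indexes:
        for term, doc_ids in partial.items():
            merged.setdefault(term, set()).update(doc_ids)
    return {term: sorted(ids) for term, ids in merged.items()}


def search(index, query):
    # Return the ids of documents that contain every query term.
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical partial indexes, as if built on two different nodes.
node_a = {"error": [1, 2], "timeout": [2]}
node_b = {"error": [7], "login": [7, 8]}

merged = merge_indexes([node_a, node_b])
print(search(merged, "error timeout"))  # [2]
```

<p>A real merge works on index segments on disk rather than on in-memory dictionaries, but the principle is the same: postings for the same term coming from different nodes end up in one searchable index.</p>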
<p>This kind of architecture could probably be used to solve scalability problems in other fields, not just log processing. Search engines, for example. There the amount of data to deal with might be smaller, but many more features would probably be needed.</p>
<p style="text-align: center;"><img class="aligncenter" title="Hadoop open source" src="http://www.marcsturlese.com/wp-content/images/hadoop-logo.jpg" alt="Hadoop open source" width="250" height="59" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.marcsturlese.com/2009/02/06/solr-and-hadoop-integration-against-scalability-problems/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
