Life, code and stuff
Posted by Marc Sturlese

06 Feb 09 Solr and Hadoop integration against scalability problems

Recently I read an article explaining how Rackspace solved their problem of dealing with huge amounts of log data. They have implemented the best Hadoop and Solr integration I have seen so far; it really looks amazing.
I don’t know Hadoop in detail, but running Solr instances from a Tomcat server stored in HDFS (Hadoop’s distributed file system) sounds like a pretty good job!
The whole process is well described in the article; I just want to mention the basic steps they followed:

  • Huge amounts of log data are stored in HDFS.
  • MapReduce is used to create Lucene indexes from the stored data using Solr.
  • Once built, the indexes are compressed on the Hadoop nodes.
  • These indexes are merged using Solr webapps deployed in Tomcat servers, which are stored on the Hadoop nodes too (for me, the most impressive part). These Solr instances also serve fast search requests.
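
To make the flow of those steps a bit more concrete, here is a toy sketch in plain Python. It is not Rackspace's code and it uses neither the Hadoop nor the Solr API: the function names (`map_phase`, `build_index`, `merge_indexes`, `search`) and the sample log lines are made up for illustration, and a dictionary-based inverted index stands in for a real Lucene index. The compression step is left out.

```python
from collections import defaultdict

def map_phase(log_lines, num_shards):
    """Spread log lines across shards, as a MapReduce job would spread work."""
    shards = defaultdict(list)
    for doc_id, line in enumerate(log_lines):
        shards[doc_id % num_shards].append((doc_id, line))
    return shards

def build_index(docs):
    """Build a tiny inverted index (term -> set of doc ids) for one shard."""
    index = defaultdict(set)
    for doc_id, line in docs:
        for term in line.lower().split():
            index[term].add(doc_id)
    return index

def merge_indexes(indexes):
    """Merge the per-shard indexes into a single searchable index."""
    merged = defaultdict(set)
    for index in indexes:
        for term, doc_ids in index.items():
            merged[term] |= doc_ids
    return merged

def search(index, term):
    """Return the sorted ids of the documents containing the term."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical log lines, only for illustration.
LOGS = [
    "ERROR disk full on node1",
    "INFO backup finished on node2",
    "ERROR timeout on node2",
]

shards = map_phase(LOGS, num_shards=2)
merged = merge_indexes(build_index(docs) for docs in shards.values())
print(search(merged, "error"))
```

In the real setup each shard index would be built on a different Hadoop node and the merged index would be served by Solr; the sketch only shows the shape of the pipeline: partition, index per partition, merge, search.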

This kind of architecture could probably be used to solve scalability problems in other fields, not just log processing. Search engines, for example. There the amount of data to deal with might be smaller, but many more features would probably be needed.
