Life, code and stuff
Posted by Marc Sturlese

06 Feb 09 Solr and Hadoop integration against scalability problems

Recently I read an article explaining how Rackspace solved the problem of handling their huge volume of log data. They have implemented the best Hadoop and Solr integration I have seen so far; it really looks amazing.
I don’t know Hadoop in detail, but running Solr instances from a Tomcat server stored in HDFS (Hadoop’s distributed file system) sounds like a pretty good job!
The whole process is well described in the article; I just want to mention the basic steps they followed:

  • Store huge amounts of log data in the HDFS.
  • MapReduce is used to create Lucene indexes from the stored data using Solr.
  • Once built, the indexes are compressed in the Hadoop nodes.
  • These indexes are merged using Solr webapps deployed in Tomcat servers, which are stored in the Hadoop nodes too (for me, that is the most impressive part). These Solr instances also serve fast search requests.
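
To make the steps above a bit more concrete, here is a toy, in-memory sketch of the idea: split the log lines into shards (the "map" side), build a small inverted index per shard (the "reduce" side), and search across all of them. It is only an illustration under my own assumptions, not Rackspace's code; it uses neither Hadoop nor Lucene, and every function name is made up for the example.

```python
from collections import defaultdict


def map_phase(log_lines, num_shards):
    """Assign each (doc_id, line) pair to a shard, like a simple partitioner."""
    shards = defaultdict(list)
    for doc_id, line in enumerate(log_lines):
        shards[doc_id % num_shards].append((doc_id, line))
    return shards


def reduce_phase(docs):
    """Build a tiny inverted index (term -> sorted doc ids) for one shard."""
    index = defaultdict(set)
    for doc_id, line in docs:
        for term in line.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def build_shard_indexes(log_lines, num_shards=2):
    """Run the map step, then build one index per shard."""
    shards = map_phase(log_lines, num_shards)
    return {shard_id: reduce_phase(docs) for shard_id, docs in shards.items()}


def search(shard_indexes, term):
    """Query every shard and merge the matching doc ids."""
    hits = []
    for index in shard_indexes.values():
        hits.extend(index.get(term.lower(), []))
    return sorted(hits)


if __name__ == "__main__":
    logs = ["ERROR disk full", "login ok", "error timeout"]
    indexes = build_shard_indexes(logs, num_shards=2)
    print(search(indexes, "error"))
```

In the real setup each shard would be a Lucene index built by a MapReduce task, and the merged search would be handled by Solr rather than a Python loop.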

This kind of architecture could probably be used to solve scalability problems in other fields too, not just log handling. Search engines, for example. There the amount of data to deal with might be smaller, but many more features would probably be needed.

Hadoop open source


Reader's Comments

  1.

    I know this is an old post as of today, but I just wanted to add a comment regarding the Tomcat instance. Reading the article, and having just done some POC work with Hadoop, I think that the Tomcat instances are simply running on the same machines that are nodes in the Hadoop cluster. I don’t know the details, but I think this would allow each Solr instance to leverage data locality in the cluster by working with the index(es) that exist on that particular node. If replication were configured such that all nodes in the Hadoop cluster received a copy of every index, then all of the Solr instances would be working with the same set of indexes. However, I’m sure it is much more complicated than that, especially since it is not typical to have data replicated to all nodes in a Hadoop cluster of any significant size.

  2.

    Thankfully I have learned a bit more about Hadoop and Solr since I wrote the post.
    What you say makes sense. However, I think there is no need to have the full set of indexes on every node.
    I imagine the scenario as:
    An index shard is built on each datanode and stored in HDFS. This datanode also hosts a Solr instance on its local file system. The shard has to be copied to the local file system so that Solr can search it.
    With a Solr instance on another server, you could search across all the shards on the different datanodes using Solr distributed search.
    What do you think?
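
As a rough sketch of the distributed search idea above: Solr accepts a `shards` request parameter listing the instances to query, and the receiving instance merges the results. The snippet below only builds such a request URL. The host names (`search01`, `node1`, `node2`), the port and the `/solr/select` path are assumptions for illustration and would depend on the actual deployment.

```python
from urllib.parse import urlencode


def distributed_query_url(coordinator, shard_hosts, query, rows=10):
    """Build a Solr select URL that fans the query out to several shards.

    coordinator and shard_hosts are 'host:port/solr' strings (hypothetical
    values in this sketch); Solr expects the shards as a comma-separated list.
    """
    params = {
        "q": query,
        "rows": rows,
        "shards": ",".join(shard_hosts),
    }
    return f"http://{coordinator}/select?{urlencode(params)}"


if __name__ == "__main__":
    url = distributed_query_url(
        "search01:8983/solr",
        ["node1:8983/solr", "node2:8983/solr"],
        "error",
    )
    print(url)
```

The coordinating instance does not need to hold any index data itself; it only forwards the query to each shard and merges the responses.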

  3.

    Thanks for your post.
    I just have a question:
    As far as I know, HDFS is not suitable for frequent operations, so how do you handle updates to your index when you need real-time search?
    Looking forward to your answer!

  4.

    Hi Yun,
    As you mentioned, HDFS is not a good place to build the Lucene/Solr index, as indexing requires lots of operations. If you look at open source examples of MapReduce indexing, you’ll see that the index is built on the local file system and then uploaded to HDFS. When the whole process is done, the index/shards have to be downloaded to the local file system again.
    MapReduce and HDFS are very good for building huge indexes quickly. But if you need real-time updates, I think it’s better to apply the updates directly to the index/shards placed on the local file system.
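
A minimal sketch of the upload/download steps described above, assuming the standard `hadoop fs -put` and `hadoop fs -get` shell commands are available on the node. The paths are hypothetical, and by default the helper only prints the command instead of running it.

```python
import subprocess


def upload_index_cmd(local_index_dir, hdfs_dir):
    """Command to copy a locally built index shard into HDFS."""
    return ["hadoop", "fs", "-put", local_index_dir, hdfs_dir]


def download_index_cmd(hdfs_index_dir, local_dir):
    """Command to copy a finished shard from HDFS back to local disk."""
    return ["hadoop", "fs", "-get", hdfs_index_dir, local_dir]


def run(cmd, dry_run=True):
    """Return the command line; execute it only when dry_run is False."""
    if not dry_run:
        subprocess.run(cmd, check=True)
    return " ".join(cmd)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(run(upload_index_cmd("/tmp/shard-0", "/indexes/shard-0")))
    print(run(download_index_cmd("/indexes/shard-0", "/var/solr/data/shard-0")))
```

Once the shard is back on the local file system, Solr can open it like any other index, and real-time updates can go straight to that local copy.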
