Recently I read an article explaining how Rackspace solved the problem of dealing with their huge volume of log data. They have implemented the best Hadoop and Solr integration I have seen so far; it really looks amazing.
I don’t know Hadoop in detail, but running Solr instances from a Tomcat server stored in HDFS (Hadoop’s distributed file system) sounds like a pretty good job!
The whole process is well described in the article; I just want to mention the basic steps they followed:
This kind of architecture could probably be used to solve scalability problems in other fields, not just log processing. Search engines, for example. There the amount of data to deal with might be smaller, but many more features would probably be needed.
