Lucene 2.9.2 and 3.0.1 have been released. Both are mainly bug-fix releases for the previous versions.
The main difference between the 2.x and 3.x series is that version 3 drops support for Java 1.4 and has a cleaner API, since the deprecated code has been removed. This means that if you want to upgrade your Lucene JARs to version 3, you must use at least Java 1.5 and have no deprecation warnings in your code.
More details of both releases can be found in the official announcement:
Hello Lucene users,
On behalf of the Lucene development community I would like to announce the release of Lucene Java versions 3.0.1 and 2.9.2:
Both releases fix bugs in the previous versions:
- 2.9.2 is a bugfix release for the Lucene Java 2.x series, based on Java 1.4
- 3.0.1 has the same bug fix level but is for the Lucene Java 3.x series, based on Java 5.
New users of Lucene are advised to use version 3.0.1 for new developments, because it has a clean, type-safe API.
Important improvements in these releases include:
- An increased maximum number of unique terms in each index segment.
- Fixed experimental CustomScoreQuery to respect per-segment search. This introduced an API change!
- Important fixes to IndexWriter: a commit() thread-safety issue, lost document deletes in near real-time indexing.
- Bugfixes for Contrib’s Analyzers package.
- Restoration of some public methods that were lost during deprecation removal.
- The new Attribute-based TokenStream API now works correctly with different class loaders.
Both releases are fully compatible with the corresponding previous versions. We strongly recommend upgrading to 2.9.2 if you are using 2.9.1 or 2.9.0; and to 3.0.1 if you are using 3.0.0.
See core changes at
http://lucene.apache.org/java/3_0_1/changes/Changes.html
http://lucene.apache.org/java/2_9_2/changes/Changes.html
and contrib changes at
http://lucene.apache.org/java/3_0_1/changes/Contrib-Changes.html
http://lucene.apache.org/java/2_9_2/changes/Contrib-Changes.html
Binary and source distributions are available at
http://www.apache.org/dyn/closer.cgi/lucene/java/
Lucene artifacts are also available in the Maven2 repository at
http://repo1.maven.org/maven2/org/apache/lucene/
Tags: Java, Lucene, Open source
It has been a long time since my last post. I have been very busy so, unfortunately, I have not had the time to write about everything I wanted to.
This week I discovered via Twitter a really interesting open source search project for the cloud, ElasticSearch. ElasticSearch was created by Shay Banon. It’s a RESTful search engine built on top of Lucene and very well prepared for high scalability. It includes shard merging, replication and many more features.
Lately I have been working a lot on search scalability, and what I like most about ElasticSearch so far is that it allows 4 different types of distributed requests.
The simplest one (Query and fetch) is just one request per relevant shard. Once all the requests are done, results are merged and… that’s it!
In this type of search, all fields of a document are returned to the merger for all the returned documents.
In another search type (Query then fetch, this one is not that simple), a first request is sent to all shards. At this point you don’t ask for the document content. Once the results are merged, you only need to ask for the full document data of the most relevant documents, the ones you want to show.
If you have to search across lots of shards, that’s definitely the way to go (the merger will only receive the fields of the important documents, which means less data is sent across the network).
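To make the two phases more concrete, here is a minimal sketch in plain Java. It is not ElasticSearch code: the class, the `Hit` type and the in-memory "shard stores" are all hypothetical names I made up for the illustration, and the scores are invented sample values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of "query then fetch"; none of these names come from ElasticSearch. */
final class QueryThenFetchSketch {

    /** A lightweight hit: only what the merger needs to rank documents. */
    static final class Hit {
        final int shard;
        final String docId;
        final float score;

        Hit(int shard, String docId, float score) {
            this.shard = shard;
            this.docId = docId;
            this.score = score;
        }
    }

    /** Phase 1: merge the per-shard hit lists and keep the k best by score. */
    static List<Hit> mergeTopK(List<List<Hit>> perShardHits, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perShardHits) {
            all.addAll(hits);
        }
        all.sort((a, b) -> Float.compare(b.score, a.score)); // highest score first
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    /** Phase 2: ask the owning shard for the full content of the winners only. */
    static List<String> fetch(List<Hit> winners, List<Map<String, String>> shardStores) {
        List<String> docs = new ArrayList<>();
        for (Hit hit : winners) {
            docs.add(shardStores.get(hit.shard).get(hit.docId));
        }
        return docs;
    }

    /** Invented sample data: two shards with two hits each. */
    static List<List<Hit>> sampleHits() {
        List<List<Hit>> hits = new ArrayList<>();
        hits.add(Arrays.asList(new Hit(0, "a", 0.9f), new Hit(0, "b", 0.4f)));
        hits.add(Arrays.asList(new Hit(1, "c", 0.7f), new Hit(1, "d", 0.1f)));
        return hits;
    }

    /** Invented sample data: the full documents held by each shard. */
    static List<Map<String, String>> sampleStores() {
        Map<String, String> shard0 = new HashMap<>();
        shard0.put("a", "full content of a");
        shard0.put("b", "full content of b");
        Map<String, String> shard1 = new HashMap<>();
        shard1.put("c", "full content of c");
        shard1.put("d", "full content of d");
        return Arrays.asList(shard0, shard1);
    }

    public static void main(String[] args) {
        List<Hit> winners = mergeTopK(sampleHits(), 2);
        System.out.println(fetch(winners, sampleStores()));
    }
}
```

With "Query and fetch" the four full documents would travel to the merger; here only the two winners do.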
Both options present a typical problem in distributed search: the relevance is calculated relative to each shard; it is not absolute across all of them.
To solve this, in ElasticSearch, both search options can be supplemented with an initial request. This one queries for the term frequency information needed to allow an “absolute relevance”.
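A small sketch of why that initial request matters, in plain Java. This is only my illustration, not ElasticSearch code: the class and method names are hypothetical, the numbers are invented, and the idf formula is the classic Lucene-style `1 + ln(numDocs / (docFreq + 1))`, which I am assuming here just to have something concrete.

```java
/** Hypothetical sketch: per-shard vs. global inverse document frequency. */
final class GlobalIdfSketch {

    /** Classic Lucene-style idf (an assumption for this example). */
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    /** What the initial request gathers: statistics summed over all shards. */
    static double globalIdf(long[] docFreqPerShard, long[] numDocsPerShard) {
        long docFreq = 0;
        long numDocs = 0;
        for (int i = 0; i < docFreqPerShard.length; i++) {
            docFreq += docFreqPerShard[i];
            numDocs += numDocsPerShard[i];
        }
        return idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        // Invented numbers: the term is common in shard 0 and rare in shard 1.
        long[] docFreq = {500, 5};
        long[] numDocs = {1000, 1000};
        System.out.println("shard 0 idf: " + idf(docFreq[0], numDocs[0]));
        System.out.println("shard 1 idf: " + idf(docFreq[1], numDocs[1]));
        System.out.println("global idf:  " + globalIdf(docFreq, numDocs));
    }
}
```

The same term gets a very different weight on each shard, so scores coming from different shards are not directly comparable unless every shard scores with the global value.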
This does not come for free: you pay with an extra round trip (even though it can be cached). It’s good if you can avoid it. A good way to do that is at indexing time, when you decide which shard a document is added to. Choosing it randomly will more or less ensure that term frequencies won’t differ much among shards.
I still have not had the chance to dig into the source, but I have already downloaded it from the git repository.
Anyone who wants to share experiences with ElasticSearch is more than welcome.
Tags: Cloud computing, ElasticSearch, Java, Lucene, Open source
Last Monday the first CloudCamp ever held in Barcelona took place. Although I was expecting more technical content, it was good to be there and listen to what people had to say.
The first part of the event consisted of some quick talks from different companies related to cloud computing. Basically, they explained the cloud options and advantages they offer. The one I enjoyed the most was Abiquo’s presentation of their new software, Abicloud. Through a really nice GUI developed with Flex, Abicloud allows you, among other things, to set up virtual machines and automatically configure an Apache server, a MySQL database… with just a few drag & drop actions. You can use your own machines, servers from an ISP or even combine both. You can elastically increase or decrease the number of virtual machines. This can be very convenient for sites with high traffic peaks or for testing environments.
I am not going to say more about it, as a five-minute presentation only gave me the main idea. I can’t wait to have some free time to start playing with it. I will just add that Abicloud is completely open source.
After the quick talks, the following topics were discussed:
In the end people split into groups depending on which topic they wanted to go deeper into. I attended “How to develop applications that are going to run in the cloud”. There I had an interesting quick chat about application scalability and how to dump MySQL databases to HDFS using Cloudera’s tool Sqoop.

Tags: Abicloud, Cloud computing, CloudCamp, Events, Hadoop, MySQL, Open source, Sqoop
A new release of JMeter was launched last week. JMeter 2.3.3 is a powerful Java application designed for functional testing and performance measurement of web applications, and it lets you run heavy server stress tests.
I have been practicing with it and I really liked how easily you can set up a test plan and start stressing your machines to check response times when lots of threads are making requests.
You just need to create a .jmx file which will contain all the information needed to make the requests: host name, port number, protocol, method, URL path, URL variables… You can actually tell JMeter to read the URL variables from an external .dat file. That allows you to give different values to the variables for each request.
The .jmx can be written manually but it’s much easier to create it via JMeter’s GUI.
You have to tell JMeter the number of threads that will execute requests and the number of requests per thread. You can also leave the threads making requests indefinitely.
Once a test is launched you can see in real time the number of samples that have been executed and the Deviation, Throughput, Average and Median of the requests made by the threads (think of a thread as a user making a request via a browser).
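To get a feeling for what those four figures mean, here is a minimal sketch in plain Java that computes them from a list of response times. It is only my own illustration with the textbook definitions (population standard deviation, samples per second); the class name is hypothetical and JMeter's exact formulas may differ.

```java
import java.util.Arrays;

/** Hypothetical sketch of the summary figures shown during a test run. */
final class SampleStats {

    /** Mean response time in milliseconds. */
    static double average(long[] ms) {
        double sum = 0;
        for (long t : ms) {
            sum += t;
        }
        return sum / ms.length;
    }

    /** Middle value of the sorted response times (mean of the two middle ones if even). */
    static double median(long[] ms) {
        long[] sorted = ms.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Population standard deviation of the response times. */
    static double deviation(long[] ms) {
        double avg = average(ms);
        double squares = 0;
        for (long t : ms) {
            squares += (t - avg) * (t - avg);
        }
        return Math.sqrt(squares / ms.length);
    }

    /** Samples completed per second over the elapsed wall-clock time. */
    static double throughput(int samples, long elapsedMs) {
        return samples * 1000.0 / elapsedMs;
    }

    public static void main(String[] args) {
        long[] responseTimes = {100, 200, 300, 400}; // invented sample values
        System.out.println("average    = " + average(responseTimes));
        System.out.println("median     = " + median(responseTimes));
        System.out.println("deviation  = " + deviation(responseTimes));
        System.out.println("throughput = " + throughput(responseTimes.length, 2000) + " req/s");
    }
}
```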
This is just how to build a basic test plan; the application is much more complete than this and has many more interesting features.
Tags: Java, JMeter, Open source
jmap and jhat are a couple of really useful tools to analyze the memory consumption of a Java program. Both are included in the JDK 1.6, so there is no need to install anything extra.
jmap allows you to create a dump of the Java heap at any moment in the life of your running application. It will contain all the objects and classes in the heap at that moment. Creating the heap dump is as easy as:
jmap -dump:file=my_stack.bin 4365
Where my_stack.bin is the name of the file where you want the dump written and 4365 is the pid of the Java application process.
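If you control the application, it can also report its own pid so you don't have to look it up. A minimal sketch, assuming Java 9 or newer (`ProcessHandle` does not exist in the 1.6 JVM this post is about); the class name is hypothetical:

```java
/** Hypothetical helper: prints the jmap command line for the current JVM. */
final class CurrentPid {

    /** Returns the operating system pid of this JVM (Java 9+ API). */
    static long currentPid() {
        return ProcessHandle.current().pid();
    }

    public static void main(String[] args) {
        System.out.println("jmap -dump:file=my_stack.bin " + currentPid());
    }
}
```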
If you are running a servlet application under a Java server and it ends with a:
java.lang.OutOfMemoryError: Java heap space
You can trigger a dump of the Java heap at the moment of the OutOfMemoryError by passing these parameters to the server’s JVM:
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/sturlese/stack_test/
This will create a .hprof file (named after the pid of the process) containing the dump in the specified path.
The HeapDumpPath parameter is optional. If we don’t specify it, the dump will be created in the JVM’s working directory (for Tomcat, the folder it was launched from).
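To double check that the server really picked up these parameters, the running JVM can be asked for their current values. This is a minimal sketch that assumes a HotSpot-based JVM on Java 7 or newer, since it relies on the HotSpot-specific `com.sun.management.HotSpotDiagnosticMXBean`; the class name is hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: reads the heap dump options of the running HotSpot JVM. */
final class HeapDumpFlagCheck {

    /** Returns the current value of a VM option as a string, e.g. "true" or "false". */
    static String optionValue(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        System.out.println("HeapDumpOnOutOfMemoryError = "
                + optionValue("HeapDumpOnOutOfMemoryError"));
        System.out.println("HeapDumpPath = " + optionValue("HeapDumpPath"));
    }
}
```

Run inside the server (for example from a test servlet), it should print `true` for the first option once the parameters above are in place.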
Now we have the dump of the Java heap. To analyze it we will use jhat. Once we launch jhat specifying the dump to analyze, it will start an HTTP server (on port 7000 by default) and will let you browse all the classes and objects. You will be able to check how many instances of each class were alive at the moment the heap dump was created. To launch jhat:
jhat my_stack.bin
It’s easy to get an OutOfMemoryError when jhat opens the heap dump. Loading the dump file can consume a lot of memory if your app was using a lot at the moment it was taken. If you run into this problem, give the JVM as much memory as you can:
jhat -J-mx2000m my_stack.bin
Now is the moment to point your web browser to http://localhost:7000 and start analyzing the heap!
Tags: Java, jhat, jmap, Open source
Today I needed to check some old Java source for which I had only kept the class files.
Finding a Java decompiler for my Ubuntu was not as easy a job as I thought. I couldn’t find one in the repositories, and everything I found on the web was out of date.
JAD Java Decompiler is definitely not new, but it is really easy to use and did a pretty good job for me. The problem was that almost all links pointed me to http://www.kpdus.com, but the software is not available there anymore.
In the end I found it, not just for Ubuntu but for other platforms as well.
I leave here the JAD version for Ubuntu (and other Linux distributions) that worked for me.
Tags: Decompiler, JAD, Java, Linux, Open source, Ubuntu
Lucene TrieRangeQuery is a cool contrib in Lucene (I think it is not yet in the official release) created by Uwe Schindler. I had heard about it before, but I really learned about it at the Lucene meetup at ApacheCon EU, where Uwe gave a great speech about it. As I found it a really useful feature, I will try to explain the basics.
TrieRangeQuery mainly sorts out some RangeQuery problems:
To explain it in an easy way, what TrieRangeQuery does is search the data values skipping the less relevant “digits”, depending on a precision parameter.
Let’s say, for example, we need to search over thousands of 6-figure numbers. This could be a slow process using ConstantScoreRangeQuery in a huge index, but not with TrieRangeQuery. Ranges are divided recursively according to a precision parameter (set at index time). Numbers from the middle of the range are matched using the lowest precision, while numbers at the extremes use a higher precision. This makes the query run much faster.
Depending on the precisionStep parameter given at index time, we will be able to search with more or less precision. The more precision levels we choose, the more space the Lucene document will occupy, because the field has to be indexed several times, once per precision.
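The idea is easier to see with decimal digits. The following is only a sketch of the principle in plain Java, not the Lucene implementation: the real code works on bits with a configurable precisionStep, while here I assume a fixed step of one decimal digit and 6-figure non-negative numbers. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical decimal illustration of the trie idea (Lucene itself works on bits). */
final class DecimalTrieSketch {

    static final long MAX_UNIT = 100000L; // coarsest precision for 6-figure numbers

    /** Index side: one term per precision, e.g. 423567 -> 0:423567, 1:42356, ... 5:4. */
    static List<String> indexTerms(long value) {
        List<String> terms = new ArrayList<>();
        int level = 0;
        for (long unit = 1; unit <= MAX_UNIT; unit *= 10, level++) {
            terms.add(level + ":" + (value / unit));
        }
        return terms;
    }

    /** Query side: splits [min, max] into sub-ranges; each long[] is {low, high, unit}. */
    static List<long[]> splitRange(long min, long max) {
        if (min < 0 || max >= MAX_UNIT * 10) {
            throw new IllegalArgumentException("only 6-figure non-negative numbers");
        }
        List<long[]> ranges = new ArrayList<>();
        if (min > max) {
            return ranges;
        }
        long unit = 1;
        while (true) {
            long next = unit * 10;
            boolean hasLower = min % next != 0;       // lower edge not aligned to next level
            boolean hasUpper = (max + 1) % next != 0; // upper edge not aligned to next level
            long nextMin = hasLower ? (min / next + 1) * next : min;
            long nextMax = hasUpper ? (max + 1) / next * next - 1 : max;
            if (unit == MAX_UNIT || nextMin > nextMax) {
                ranges.add(new long[] {min, max, unit}); // nothing left for a coarser level
                return ranges;
            }
            if (hasLower) {
                ranges.add(new long[] {min, nextMin - 1, unit}); // precise lower edge
            }
            if (hasUpper) {
                ranges.add(new long[] {nextMax + 1, max, unit}); // precise upper edge
            }
            min = nextMin; // the middle part moves on to the coarser precision
            max = nextMax;
            unit = next;
        }
    }

    /** Number of index terms the split ranges have to visit. */
    static long termCount(List<long[]> ranges) {
        long count = 0;
        for (long[] r : ranges) {
            count += (r[1] - r[0] + 1) / r[2];
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(indexTerms(423567));
        for (long[] r : splitRange(423, 6789)) {
            System.out.println(r[0] + ".." + r[1] + " at precision unit " + r[2]);
        }
    }
}
```

For the range 423 to 6789 the sketch visits 40 terms (for example, the whole block 1000..5999 is covered by just five "thousands" terms) instead of the 6367 distinct values a plain range would have to enumerate. The price is paid at index time: every number is stored six times, once per precision.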
We need to index data in a special way to be able to search it using Lucene TrieRangeQuery: we must index our fields using TrieUtils. We can index numbers directly. It supports Java signed int, long, float and double. There’s no loss of precision for doubles or floats. No rounding is applied when they are created; instead a long/int representation is used for cents.
Indexing numbers with TrieUtils lets us forget about manual padding.
We can index dates as well (from Java timestamps).
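As a reminder of what that manual padding looked like: terms are compared as strings, so plain numbers sort in the wrong order unless they are padded to a fixed width. A minimal sketch in plain Java (the class name and the width of 6 are just assumptions for the example):

```java
import java.util.Locale;

/** Hypothetical sketch of the manual zero padding needed for string-based range queries. */
final class PaddingSketch {

    /** Pads a non-negative number to 6 digits so that string order matches numeric order. */
    static String pad(long value) {
        return String.format(Locale.ROOT, "%06d", value);
    }

    public static void main(String[] args) {
        // Unpadded: "9" sorts after "10" because '9' > '1'.
        System.out.println("9".compareTo("10") > 0);
        // Padded: "000009" sorts before "000010", as expected.
        System.out.println(pad(9).compareTo(pad(10)) < 0);
    }
}
```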
As seen, Lucene TrieRangeQuery is a real step forward for the scalability of Lucene queries.
Tags: ApacheCon, Java, Lucene, Open source, TrieRangeQuery
Last week I had the chance to go to ApacheCon Europe 2009. The event took place at the Mövenpick Hotel, Amsterdam. I had a really good time there.
It was good to share use cases and experiences in person with people I had only spoken with in forums.
I spent the first two days in the hackathon doing some research and testing different ASF projects. I took a special interest in Pig.
There were really interesting talks. I found the Mahout project especially great. I had discovered it at ApacheCon 2008 in New Orleans, where I barely heard about it, but I paid more attention this time and it looks full of possibilities. It is used for machine learning and runs on Hadoop.
It was also good to get some info about Servlet 3.0 and learn about the servlet doFilter function and some other things.
HBase is another project I was interested in. It looks good to be used as a “data warehouse”, but it seems really difficult (at least at first impression) to deal with the stored data.
The meetups were very good too. There was a presentation about the new Lucene contrib TrieRangeQuery. It is still not available in the official release, but you can use it by grabbing a nightly build. In the next few days I will try to write in more detail about this and other presented projects.
Tags: ApacheCon, Events, Hadoop, HBase, Lucene, Mahout, Open source, Pig
A new official release of Lucene is now available! Lucene 2.4.1 is a bug fix version.
We will be able to see more new features in the Lucene 2.9 release (available as a development version).
Here I list all the improvements in Lucene 2.4.1, which I read on the official Lucene site:
Here is a really interesting example of how to build an inverted index using Pig. As far as I have seen with Hadoop, to create a Lucene index you must start from a text file and use MapReduce jobs to build it. Pig, however, allows you to retrieve data not just from a text file but from SQL databases, HBase or other data sources.
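For anyone not familiar with the term, an inverted index simply maps each word to the documents that contain it. The following is a minimal in-memory sketch in plain Java, just to show the data structure; it is not Pig Latin or a MapReduce job, and the class name and the naive tokenizer are my own assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical in-memory inverted index: word -> ids of the documents containing it. */
final class InvertedIndexSketch {

    /** Builds the index; the document id is its position in the list. */
    static SortedMap<String, SortedSet<Integer>> build(List<String> docs) {
        SortedMap<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Naive tokenizer: lowercase and split on non-word characters.
            for (String token : docs.get(docId).toLowerCase(Locale.ROOT).split("\\W+")) {
                if (token.isEmpty()) {
                    continue;
                }
                SortedSet<Integer> postings = index.get(token);
                if (postings == null) {
                    postings = new TreeSet<>();
                    index.put(token, postings);
                }
                postings.add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("Pig runs on Hadoop", "Lucene and Hadoop");
        System.out.println(build(docs));
    }
}
```

In a MapReduce setting, the inner loop roughly corresponds to the map step (emit word and document id) and the grouping into posting sets to the reduce step.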
After checking the example in detail, what comes to my mind now is whether it would be possible to create a Lucene index using Pig and MapReduce jobs retrieving data from a distributed HBase data store… I am wondering if there would be problems with Lucene analyzers (or any others), for example.
I have read that Pig is not especially fast at accessing data. However, in indexing cases, this would probably be more than compensated for by the MapReduce jobs.
How fast would it be? I still have lots of research and tests to do…