write down,forget

Lucandra: when Lucene meets Cassandra


Reposted from: http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/

GitHub: http://github.com/tjake/Lucandra/blob/master/README

An introduction to Lucandra:

 
Lucandra in production: http://sparse.ly

What is Cassandra?

Cassandra is a scalable and easy to administer column-oriented data store, modeled after Google’s BigTable, but built by the designers of Amazon’s Dynamo. One of the big differentiators of Cassandra is that it does not rely on a global file system as HBase and BigTable do. Rather, Cassandra uses a decentralized peer-to-peer “gossip” protocol, which means two things:
  1. It has no single point of failure, and
  2. Adding nodes to the cluster is as simple as pointing it to any one live node

Cassandra also has built-in multi-master writes, replication, rack awareness, and can handle downed nodes gracefully. Cassandra has a thriving community and is currently being used at companies like Facebook, Digg and Twitter to name a few.

Enter Lucandra

Lucandra is a Cassandra backend for Lucene. Since Cassandra’s original use within Facebook was for search, integrating Lucene with Cassandra seemed like a “no brainer”. Lucene’s core design makes it fairly simple to strip away and plug in custom Analyzer, Writer, Reader, etc. implementations. Rather than trying to build a Lucene Directory implementation on top of Cassandra, as some backends do for their stores (DbDirectory, for example), our approach was to implement an IndexReader and IndexWriter directly on top of Cassandra.

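Here’s how Terms and Documents are stored in Cassandra. A Term is stored under a composite key made up of the index name, field and term, with the document id as the column name and the position vector as the column value. A Document is stored under a key made up of the index name and document id, with one column per field.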

 

      Term Key                    ColumnName   Value
      "indexName/field/term" => { documentId , positionVector }

      Document Key                ColumnName   Value
      "indexName/documentId" => { fieldName  , value }
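To make the two layouts concrete, here is a minimal in-memory sketch in Java. It is not Lucandra's code: the class and method names are hypothetical, nested TreeMaps stand in for Cassandra rows and columns, the "/" separator simply follows the notation above, and tokenization is plain whitespace splitting rather than a Lucene Analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for the term and document rows described above.
// All names here are hypothetical; this is not Lucandra's implementation.
class LucandraLayoutSketch {
    // Term rows: "indexName/field/term" -> (documentId -> positions)
    final TreeMap<String, TreeMap<String, List<Integer>>> termRows = new TreeMap<>();
    // Document rows: "indexName/documentId" -> (fieldName -> value)
    final TreeMap<String, TreeMap<String, String>> documentRows = new TreeMap<>();

    static String termKey(String index, String field, String term) {
        return index + "/" + field + "/" + term;
    }

    static String documentKey(String index, String docId) {
        return index + "/" + docId;
    }

    // Store one field of one document and index its terms with positions.
    void addField(String index, String docId, String field, String text) {
        documentRows.computeIfAbsent(documentKey(index, docId), k -> new TreeMap<>())
                    .put(field, text);
        // Assumption: lowercase + whitespace split instead of a real Analyzer.
        String[] tokens = text.trim().toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            termRows.computeIfAbsent(termKey(index, field, tokens[pos]), k -> new TreeMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Postings for one term: documentId -> position vector.
    Map<String, List<Integer>> postings(String index, String field, String term) {
        return termRows.getOrDefault(termKey(index, field, term), new TreeMap<>());
    }

    public static void main(String[] args) {
        LucandraLayoutSketch store = new LucandraLayoutSketch();
        store.addField("bookmarks", "doc1", "title", "Lucene meets Cassandra");
        store.addField("bookmarks", "doc2", "title", "Cassandra at scale");
        System.out.println(store.postings("bookmarks", "title", "cassandra"));
    }
}
```

Looking up a term is a single row read: the row key identifies the term, and each column in that row is one matching document.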
Cassandra allows us to pull ranges of keys and groups of columns, so we can tune read performance and minimize network I/O for each query. Also, since writes are indexed and replicated by Cassandra, we don’t need to worry about optimizing the indexes or reopening the index to see new writes. This means we get a soft real-time distributed search engine.
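The idea of pulling a range of keys and a bounded group of columns can be sketched with a sorted map. This is only an illustration under stated assumptions: the class and method names are hypothetical, a TreeMap stands in for Cassandra's key space, and a real cluster would generally need an order-preserving partitioner for key ranges to be meaningful.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory model: sorted row keys, each with sorted columns.
class KeyRangeSketch {
    final TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();

    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Rows whose key lies in [prefix, prefix + '\uffff'), which covers keys
    // that start with prefix (ignoring keys that continue with '\uffff').
    SortedMap<String, TreeMap<String, String>> keyRange(String prefix) {
        return rows.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    // At most `limit` column names from one row, in column order.
    List<String> firstColumns(String rowKey, int limit) {
        List<String> out = new ArrayList<>();
        TreeMap<String, String> row = rows.get(rowKey);
        if (row == null) {
            return out;
        }
        for (String column : row.keySet()) {
            if (out.size() == limit) {
                break;
            }
            out.add(column);
        }
        return out;
    }

    public static void main(String[] args) {
        KeyRangeSketch store = new KeyRangeSketch();
        store.put("bookmarks/title/cassandra", "doc1", "[2]");
        store.put("bookmarks/title/cassandra", "doc2", "[0]");
        store.put("bookmarks/title/cassandra", "doc3", "[5]");
        store.put("bookmarks/title/cast", "doc4", "[1]");
        store.put("bookmarks/title/lucene", "doc1", "[0]");
        store.put("bookmarks/url/cassandra", "doc2", "[3]");
        // A prefix query such as title:cas* becomes one key range.
        System.out.println(store.keyRange("bookmarks/title/cas").keySet());
        // Only the first two matching documents are fetched for this term.
        System.out.println(store.firstColumns("bookmarks/title/cassandra", 2));
    }
}
```

Because the index name and field lead the key, a prefix or range query over terms maps to one contiguous slice of keys, and a column limit caps how much of each row crosses the network.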
There is a performance impact on Lucandra searches when compared to native Lucene searches. In our testing, Lucandra’s IndexReader is ~10% slower than the default IndexReader. However, this is still quite acceptable to us given what you get in return.
Writes in Lucandra are slow compared to regular Lucene, since every term is effectively written under its own key. Luckily, this will be fixed in the next version of Cassandra, which will allow batched writes for keys.
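To see why per-key writes are costly and what batching would buy, consider this rough Java sketch. Everything in it is an assumption for illustration: CountingStore is a hypothetical stand-in that only counts round trips, and writeOne and writeBatch are not Cassandra or Lucandra calls.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WriteBatchSketch {
    // Hypothetical storage client; it records writes and counts round trips.
    static class CountingStore {
        int roundTrips = 0;
        final Map<String, Map<String, List<Integer>>> rows = new TreeMap<>();

        // One round trip per row key.
        void writeOne(String rowKey, Map<String, List<Integer>> columns) {
            roundTrips++;
            rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).putAll(columns);
        }

        // One round trip for any number of row keys.
        void writeBatch(Map<String, Map<String, List<Integer>>> batch) {
            roundTrips++;
            for (Map.Entry<String, Map<String, List<Integer>>> e : batch.entrySet()) {
                rows.computeIfAbsent(e.getKey(), k -> new TreeMap<>()).putAll(e.getValue());
            }
        }
    }

    // Expand one document field into per-term writes:
    // "index/field/term" -> (docId -> positions).
    static Map<String, Map<String, List<Integer>>> mutationsFor(
            String index, String docId, String field, String text) {
        Map<String, Map<String, List<Integer>>> byKey = new LinkedHashMap<>();
        // Assumption: lowercase + whitespace split instead of a real Analyzer.
        String[] tokens = text.trim().toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            String rowKey = index + "/" + field + "/" + tokens[pos];
            byKey.computeIfAbsent(rowKey, k -> new TreeMap<>())
                 .computeIfAbsent(docId, k -> new ArrayList<>())
                 .add(pos);
        }
        return byKey;
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<Integer>>> mutations =
                mutationsFor("bookmarks", "doc1", "title", "to be or not to be");

        CountingStore perKey = new CountingStore();
        for (Map.Entry<String, Map<String, List<Integer>>> e : mutations.entrySet()) {
            perKey.writeOne(e.getKey(), e.getValue());
        }

        CountingStore batched = new CountingStore();
        batched.writeBatch(mutations);

        System.out.println("per-key round trips: " + perKey.roundTrips);
        System.out.println("batched round trips: " + batched.roundTrips);
    }
}
```

The stored rows end up identical in both cases; only the number of round trips differs, and it grows with the number of distinct terms in the document when writes go out one key at a time.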
One other major caveat is that there is no term scoring in the current code. This simply hasn’t been needed yet. Adding it is relatively trivial, via another column.
To see Lucandra in action you can try out the Twitter search app http://sparse.ly that is built on Lucandra. This service uses the Lucandra store exclusively and does not use any sort of relational or other type of database.

Lucandra in Action

Using Lucandra is extremely simple and switching a regular Lucene search application to use Lucandra is a matter of just several lines of code. Let’s have a look.

First we need to create the connection to Cassandra:


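import lucandra.CassandraUtils;
import lucandra.IndexReader;
import lucandra.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
...
// Cassandra.Client is Cassandra's Thrift client class.
Cassandra.Client client = CassandraUtils.createConnection();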

Next, we create Lucandra’s IndexWriter and IndexReader, and Lucene’s own IndexSearcher.


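// "bookmarks" is the index name; both classes share the same Cassandra client.
IndexWriter indexWriter = new IndexWriter("bookmarks", client);
IndexReader indexReader = new IndexReader("bookmarks", client);
// Lucene's own IndexSearcher runs on top of Lucandra's IndexReader.
IndexSearcher indexSearcher = new IndexSearcher(indexReader);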

From here on, you work with IndexWriter and IndexSearcher just like you would in vanilla Lucene. Look at the BookmarksDemo for the complete class.

What’s next? Solandra!

Now that we have Lucandra, we can use it with anything built on Lucene. For example, we can integrate Lucandra with Solr and simplify our Solr administration. In fact, this has already been attempted, and we plan to support it in our code soon.
