Wednesday, February 10, 2010

Indexing millions of documents with Solr and SolrNet

When working with Solr, it's not uncommon to see indexes with hundreds of thousands of even millions of documents. Say you build those millions of documents from a RDBMS, which is a common case.

Solr has a tool to do this: the Data Import Handler. It's configurable with XML, like so many things in Java. Problem is, when you need to do some complex processing, it quickly turns into executable XML, and nobody likes that. More importantly, the process is not testable: you can't run a unit test that doesn't involve the actual database and the actual Solr instance. So I prefer to import data to Solr with code. More precisely: .NET code, using SolrNet.

Since adding documents one by one would be terribly inefficient (1000000 documents would mean 1000000 HTTP requests), SolrNet has a specific method to add documents in batch: Add(IEnumerable<T> documents). Let's try adding a huge amount of documents with this.

Setup

To keep this post focused, I'll abstract away the database. So, first thing I'll do is set up some fake documents [1]:

string text = new string('x', 1000); 
IEnumerable<Dictionary<string, object>> docs = Enumerable.Range(0, 150000)
    .Select(i => new Dictionary<string, object> { 
        {"id", i.ToString()}, 
        {"title", text} 
    }); 

This sets up 150000 documents, each one with a size of about 1 KB, lazily. They don't exist anywhere yet, until we start enumerating docs.

Tests

After setting up SolrNet we call:

solr.Add(docs);

and shortly after executing it the process grows its memory usage to some gigabytes and then crashes with an OutOfMemoryException. Holy crap![2]

Reducing the amount of documents to 100000 completed the process successfully, but it took 32s (3125 docs/s) and the peak memory usage was 850MB.  This clearly isn't working!

What happened is that SolrNet tried to fit all the documents in a single HTTP request. Not very smart, eh? But that's out of SolrNet's scope, at least for now. What we need to do is feed it with manageable chunks of documents. So we grab a partition function like this one, courtesy of Jon Skeet[3]. Armed with this function we partition the 100000 docs into chunks of 1000 docs:

foreach (var group in Partition(docs, 1000))
   solr.Add(group);

This completes in 34s which is slightly worse than without grouping, but memory usage is pretty constant at 50MB. Now we're getting somewhere!

But wait! What if we parallelize these groups? The Task Parallel Library (TPL)[4] makes it very easy to do so:

Parallel.ForEach(Partition(docs, 1000), group => solr.Add(group));

This one took 21.2s to complete on my dual-core CPU but peak memory usage was 140MB since it has to keep several groups in memory simultaneously. This is pretty much what SolrJ (the Java Solr client) does with its StreamingUpdateSolrServer, except the Java folks had to manually queue and manage the threads, while we can just leverage the TPL in a single line of code.

Playing a bit with the group size I ended up with these charts of memory size and throughput:

solr-mem

solr-throughput

 

Memory size seems to increase linearly with group size, while throughput shows an asymptotic growth.

By now I bet you must be saying: "Hey, wait a minute! The title of the post promised millions of documents but you only show us a mere 100000! Where's the rest of it?!?". Well, I did benchmark a million documents as well, and with group size = 1000, in parallel, it took 3:57 minutes. For these tests I used 100000 documents instead to keep times down.

Conclusion and final notes

In this experiment I left a lot of variables fixed: document size, network throughput and latency (I used a local Solr instance so there is no network), CPU (since I ran Solr on the same box as the tests, they competed for CPU)... With a quad-core CPU I would expect this to consume more memory but it would also be faster. Bigger documents would also increase memory usage and make the whole process more network-sensitive. Is memory more important to you than throughput? Then you would use the non-parallel approach. So I prefer to leave these things out of SolrNet's scope for now. It depends too much on the structure of your particular data and setup to just pick some default values. And I don't want to take a dependency on the TPL yet.

Some general advice:

  • Keep your process as linear (O(n)) and as lazy as possible.
  • While increasing the group size can increase the throughput (and memory), also keep in mind that with big groups you'll start to see timeouts from Solr.
  • When fetching data from the database, always do it with a forward-only enumerator, like a IDataReader or a LINQ2SQL enumerable. Loading the whole resultset in a List or DataTable will simply kill your memory and performance.
  • It can also make sense to fetch the data from the database in several groups (I just assumed a single IEnumerable as an origin to keep it simple) and parallelize on that.

Footnotes:

  1. Dictionary documents for SolrNet is implemented in trunk, it will be included in the next release
  2. I know that even though this approach isn't scalable at all, it shouldn't throw OOM with only 150000 docs.
  3. I chose that particular Partition() function because it's one-pass. If you write a partitioning function with LINQ's groupby you'll traverse your IEnumerable (at least) twice. If you use a forward-only enumerable (e.g. LINQ2SQL), which I recommend, you only get to enumerate the result once.
  4. You can get the latest version of the TPL for .NET 3.5 SP1 from the Reactive Extensions.

15 comments:

J said...

I'm building a large project with SolrNet and this should help a lot. Great post, thanks!

eric said...

I think your point about DIH can lead to "executable xml" is very key. DIH is great for integrating a database or other data source that doesn't require any massaging. But the more massaging you need, the better going with executable code seems to be! And I love you can use SolrNet and I can use rsolr!

John said...
This comment has been removed by a blog administrator.
David Craft said...

Great post.. This should help me with my solr re-indexer that i've nearly finished writing.

KevM said...

Nice grouped use of task partitioning and the parallel task library.

Just wanted to say thanks for the SolrNet info and library. I basically implemented my own, less awesome, version of Solr for a database indexing project.

Terrance said...

I have been doing professional SOLR integration at a large financial company and I can say for sure the DIH is perfectly fine when you take the following approach.

What you are doing is pushing data into SOLR via ETL.

Assuming you are using an RDMS might as well just use DIH over a database and be done with it.

If you do have messaging to do, I imagine you do it everywhere, including the UI and other various places. You may even have a service layer that handles this.

Point SOLRs DIH at a RESTful service using the same amount of code you would of wrote using the bulk operation you described. This time uou get a RESTful service out of it, and you get to use DIH out of it as well.

mausch said...

@Terrance: interesting, but if you shift the main logic to a RESTful service you get basically what I describe in my article, an ETL process with DIH only acting as a light adapter. Anyway I would never expose this as a RESTful service by default unless I had a very good reason, this should remain internal to the application.

Satish Chandra Mishra said...

I am trying to do the following...

1. On the facet field of column A, I want to display the distinct count of column B matching the records with the value of column
A against facet field names. Simple facet field gives me the record count from the document instead of facet field A.
2. On the facet pivot table of columns A and C, I want to display the distinct counts of column B with matching values of facet pivot columns.

To do so, I am trying to create multiple entities with detailed and distinct field details under the same entities under the single document


entity 1 -- detailed records including all fields.
entity 2 -- Disctint records for Column A and column B


when I am trying to load these entities, it overwrites the first loaded entity.

Alternatively I am trying to create multiple entities under different documents but it doesn't the data either. It loads 0 records with no errors.

I am new to Solr, Please advise, what should be my approach and steps.

Mauricio Scheffer said...

@Satish Chandra Mishra : please ask all questions about Solr in the Solr mailing list: http://lucene.apache.org/solr/discussion.html

Anonymous said...

Hi,
Could you advise me how to add xml formatted data through solrnet.
To replace solr.Add(docs);
Thanks,
laszlo

Anonymous said...

Hi,
Could you advise me how to add xml formatted data through solrnet.
To replace solr.Add(docs);
Thanks,
laszlo

Mauricio Scheffer said...

laszlo: Post all questions about SolrNet to the mailing list https://groups.google.com/forum/#!forum/solrnet

nopAccelerate said...

Hi, thanks for your excellent article. We have been looking for something that could improve indexing speed. I hope this helps. We'll try this as you advised. Thanks again.

Krunal
nopAccelerate.com

Kanja said...

Thanks for the post Mauricio.
For this line, how would I access the docs directly from the database.

Parallel.ForEach(Partition(docs, 1000), group => solr.Add(group));

Do I get the records from the database and store it in docs first before executing the Parallel comman above?

Mauricio Scheffer said...

@Kanja: yes, but the important thing is to fetch the records from the database lazily