When working with Solr, it's not uncommon to see indexes with hundreds of thousands or even millions of documents. Say you build those millions of documents from an RDBMS, which is a common case.
Solr has a tool to do this: the Data Import Handler. It's configurable with XML, like so many things in Java. Problem is, when you need to do some complex processing, it quickly turns into executable XML, and nobody likes that. More importantly, the process is not testable: you can't run a unit test that doesn't involve the actual database and the actual Solr instance. So I prefer to import data to Solr with code. More precisely: .NET code, using SolrNet.
Since adding documents one by one would be terribly inefficient (1000000 documents would mean 1000000 HTTP requests), SolrNet has a specific method to add documents in batches: Add(IEnumerable<T> documents). Let's try adding a huge number of documents with it.
Setup
To keep this post focused, I'll abstract away the database. So the first thing I'll do is set up some fake documents[1]:
string text = new string('x', 1000);
IEnumerable<Dictionary<string, object>> docs = Enumerable.Range(0, 150000)
    .Select(i => new Dictionary<string, object> {
        {"id", i.ToString()},
        {"title", text}
    });
This sets up 150000 documents, each about 1 KB in size, lazily: they don't exist anywhere until we start enumerating docs.
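As for the SolrNet side, the wiring looks roughly like this (a sketch only: the URL is a placeholder, and the exact initialization for dictionary documents may differ; see footnote 1):

// Rough sketch of the SolrNet wiring; the Solr URL is a placeholder.
Startup.Init<Dictionary<string, object>>("http://localhost:8983/solr");
var solr = ServiceLocator.Current.GetInstance<ISolrOperations<Dictionary<string, object>>>();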
Tests
After setting up SolrNet we call:
solr.Add(docs);
and shortly after executing it, the process's memory usage grows to several gigabytes and then it crashes with an OutOfMemoryException. Holy crap![2]
Reducing the number of documents to 100000 let the process complete successfully, but it took 32s (3125 docs/s) and peak memory usage was 850MB. This clearly isn't working!
What happened is that SolrNet tried to fit all the documents into a single HTTP request. Not very smart, eh? But that's out of SolrNet's scope, at least for now. What we need to do is feed it manageable chunks of documents. So we grab a partition function like this one, courtesy of Jon Skeet[3].
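A one-pass Partition() could look roughly like this (a sketch under my own assumptions, not necessarily the implementation linked above):

static IEnumerable<IEnumerable<T>> Partition<T>(IEnumerable<T> source, int size) {
    // Accumulate items into a buffer and yield it whenever it fills up.
    var batch = new List<T>(size);
    foreach (var item in source) {
        batch.Add(item);
        if (batch.Count == size) {
            yield return batch;
            batch = new List<T>(size);
        }
    }
    // Yield whatever is left over as a final, smaller group.
    if (batch.Count > 0)
        yield return batch;
}

Armed with this function, we partition the 100000 docs into chunks of 1000: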
foreach (var group in Partition(docs, 1000))
    solr.Add(group);
This completes in 34s, which is slightly slower than without grouping, but memory usage stays pretty constant at around 50MB. Now we're getting somewhere!
But wait! What if we parallelize these groups? The Task Parallel Library (TPL)[4] makes it very easy to do so:
Parallel.ForEach(Partition(docs, 1000), group => solr.Add(group));
This one took 21.2s to complete on my dual-core CPU but peak memory usage was 140MB since it has to keep several groups in memory simultaneously. This is pretty much what SolrJ (the Java Solr client) does with its StreamingUpdateSolrServer, except the Java folks had to manually queue and manage the threads, while we can just leverage the TPL in a single line of code.
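If memory matters more to you than raw speed, the TPL can also cap how many groups are in flight at once (I didn't do this in the benchmarks; the degree of parallelism below is just an arbitrary example):

// Bound memory usage by limiting the number of concurrent Add() calls.
var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
Parallel.ForEach(Partition(docs, 1000), options, group => solr.Add(group));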
Playing a bit with the group size, I ended up with these charts of memory usage and throughput:
Memory usage seems to increase linearly with group size, while throughput grows asymptotically.
By now I bet you must be saying: "Hey, wait a minute! The title of the post promised millions of documents, but you only show us a mere 100000! Where's the rest of it?!?". Well, I did benchmark a million documents as well: with group size = 1000, in parallel, it took 3:57 minutes. For the rest of the tests I used 100000 documents to keep times down.
Conclusion and final notes
In this experiment I left a lot of variables fixed: document size, network throughput and latency (I used a local Solr instance, so there was no network), CPU (since Solr ran on the same box as the tests, they competed for CPU)... With a quad-core CPU I would expect this to consume more memory but also to run faster. Bigger documents would also increase memory usage and make the whole process more network-sensitive. If memory matters more to you than throughput, you'd use the non-parallel approach. So I prefer to leave these things out of SolrNet's scope for now: it depends too much on the structure of your particular data and setup to just pick some default values. And I don't want to take a dependency on the TPL yet.
Some general advice:
- Keep your process as linear (O(n)) and as lazy as possible.
- While increasing the group size can increase throughput (and memory usage), keep in mind that with big groups you'll also start to see timeouts from Solr.
- When fetching data from the database, always do it with a forward-only enumerator, like an IDataReader or a LINQ2SQL enumerable, as in the sketch after this list. Loading the whole resultset into a List or DataTable will simply kill your memory and performance.
- It can also make sense to fetch the data from the database in several groups (I just assumed a single IEnumerable as the source to keep things simple) and parallelize on that.
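For example, reading the documents with a forward-only IDataReader could look roughly like this (a sketch, not code from the post; the table and column names are made up):

// Hypothetical example: stream rows lazily with a forward-only IDataReader.
// The table and column names ("documents", "id", "title") are assumptions.
IEnumerable<Dictionary<string, object>> ReadDocs(IDbConnection conn) {
    using (var cmd = conn.CreateCommand()) {
        cmd.CommandText = "SELECT id, title FROM documents";
        using (var reader = cmd.ExecuteReader()) {
            while (reader.Read())
                yield return new Dictionary<string, object> {
                    {"id", reader["id"].ToString()},
                    {"title", reader["title"]}
                };
        }
    }
}

Feeding the result of ReadDocs() through Partition() keeps only one group of rows in memory at a time.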
Footnotes:
- Dictionary documents for SolrNet are implemented in trunk and will be included in the next release.
- I know that, even though this approach isn't scalable at all, it shouldn't throw an OutOfMemoryException with only 150000 docs.
- I chose that particular Partition() function because it's one-pass. If you write a partitioning function with LINQ's GroupBy you'll traverse your IEnumerable (at least) twice. If you use a forward-only enumerable (e.g. LINQ2SQL), which I recommend, you only get to enumerate the result once.
- You can get the latest version of the TPL for .NET 3.5 SP1 from the Reactive Extensions.