Bug squash: February 2010

A couple of months ago I got myself a Motorola Droid (actually a Milestone in my country).

It's a pretty cool device, with an operating system (Android) that has a lot of potential, but also some warts. There are literally thousands of issues registered in their issue tracker, of which I have starred about 30.

Anyway, I don't mean to write a full review of Android or the Moto Droid, but one thing that bothered me is that I couldn't find any free and easy streaming solution. Some solutions are awfully complicated, and others are way too proprietary for my taste. I just want to select a video file on my desktop computer and have it played on the phone.

So I hacked together a little web app that does just that. The web app sits on the desktop box and presents a filesystem browser. You browse it on your phone, and when you select a video, it opens in the phone's media player. It can also stream MP3s, display images and text files, and serve any other content that the phone can handle.

On a more technical note, video is streamed using VLC to create an RTSP endpoint. The video is transcoded on the fly to H.264 so that Android can understand it, which means that you can stream any video format that VLC supports (even with subtitles!). Too bad I couldn't get on-demand video to work yet (VLC does support it), which is why there's that "Kill VLC" button: when you quit Android's media player, the video keeps running on the server. Also, without on-demand video there's no way to pause or seek from the client. Transcoding settings are configurable, so you can fiddle with the bitrate and video resolution (currently set to baseline profile, 500 kbit/s, 640x360).

You can get it here:

The only requirement to run it is .NET 3.5 SP1.

When working with Solr, it's not uncommon to see indexes with hundreds of thousands or even millions of documents. Say you build those millions of documents from an RDBMS, which is a common case.

Solr has a tool to do this: the Data Import Handler. It's configurable with XML, like so many things in Java. Problem is, when you need to do some complex processing, it quickly turns into executable XML, and nobody likes that. More importantly, the process is not testable: you can't run a unit test that doesn't involve the actual database and the actual Solr instance. So I prefer to import data to Solr with code. More precisely: .NET code, using SolrNet.

Since adding documents one by one would be terribly inefficient (1000000 documents would mean 1000000 HTTP requests), SolrNet has a specific method to add documents in batch: Add(IEnumerable<T> documents). Let's try adding a huge number of documents with this.

Setup

To keep this post focused, I'll abstract away the database. So, first thing I'll do is set up some fake documents ^[1]:

string text = new string('x', 1000); 
IEnumerable<Dictionary<string, object>> docs = Enumerable.Range(0, 150000)
    .Select(i => new Dictionary<string, object> { 
        {"id", i.ToString()}, 
        {"title", text} 
    });

This sets up 150000 documents, each one with a size of about 1 KB, lazily. They don't exist anywhere yet, until we start enumerating docs.

Tests

After setting up SolrNet we call:

solr.Add(docs);

and shortly after executing it the process grows its memory usage to some gigabytes and then crashes with an OutOfMemoryException. Holy crap!^[2]

With the number of documents reduced to 100000, the process completed successfully, but it took 32s (3125 docs/s) and peak memory usage was 850MB. This clearly isn't working!

What happened is that SolrNet tried to fit all the documents in a single HTTP request. Not very smart, eh? But that's out of SolrNet's scope, at least for now. What we need to do is feed it with manageable chunks of documents. So we grab a partition function like this one, courtesy of Jon Skeet^[3]. Armed with this function we partition the 100000 docs into chunks of 1000 docs:

foreach (var group in Partition(docs, 1000))
   solr.Add(group);

This completes in 34s (about 2940 docs/s), which is slightly worse than without grouping, but memory usage is pretty constant at 50MB. Now we're getting somewhere!
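For reference, a one-pass partition function can be sketched roughly like this. To be clear, this is not Jon Skeet's implementation, just a minimal sketch of the same idea; the class name EnumerableBatching is made up for illustration. It walks the source exactly once and only ever holds one chunk in memory.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, not part of SolrNet.
public static class EnumerableBatching
{
    // Splits source into consecutive lists of at most `size` items.
    // The source is enumerated exactly once, so this is safe to use
    // with forward-only sequences.
    public static IEnumerable<List<T>> Partition<T>(IEnumerable<T> source, int size)
    {
        // Validate eagerly; the iterator below only runs when enumerated.
        if (source == null) throw new ArgumentNullException("source");
        if (size <= 0) throw new ArgumentOutOfRangeException("size");
        return PartitionIterator(source, size);
    }

    private static IEnumerable<List<T>> PartitionIterator<T>(IEnumerable<T> source, int size)
    {
        var chunk = new List<T>(size);
        foreach (var item in source)
        {
            chunk.Add(item);
            if (chunk.Count == size)
            {
                yield return chunk;
                // Start a fresh list so the caller can keep the previous one.
                chunk = new List<T>(size);
            }
        }
        // Emit the last, possibly smaller, chunk.
        if (chunk.Count > 0)
            yield return chunk;
    }
}
```

Each chunk is a fresh List, so it stays valid even if the consumer holds on to it while the next one is being built, which matters once the chunks are processed in parallel.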

But wait! What if we parallelize these groups? The Task Parallel Library (TPL)^[4] makes it very easy to do so:

Parallel.ForEach(Partition(docs, 1000), group => solr.Add(group));

This one took 21.2s to complete (about 4700 docs/s) on my dual-core CPU, but peak memory usage was 140MB, since it has to keep several groups in memory simultaneously. This is pretty much what SolrJ (the Java Solr client) does with its StreamingUpdateSolrServer, except the Java folks had to manually queue and manage the threads, while we can just leverage the TPL in a single line of code.

Playing a bit with the group size I ended up with these charts of memory size and throughput:

Memory size seems to increase linearly with group size, while throughput shows an asymptotic growth.

By now I bet you must be saying: "Hey, wait a minute! The title of the post promised millions of documents but you only show us a mere 100000! Where's the rest of it?!?". Well, I did benchmark a million documents as well, and with group size = 1000, in parallel, it took 3:57 minutes. For these tests I used 100000 documents instead to keep times down.

Conclusion and final notes

In this experiment I left a lot of variables fixed: document size, network throughput and latency (I used a local Solr instance so there is no network), CPU (since I ran Solr on the same box as the tests, they competed for CPU)... With a quad-core CPU I would expect this to consume more memory but it would also be faster. Bigger documents would also increase memory usage and make the whole process more network-sensitive. Is memory more important to you than throughput? Then you would use the non-parallel approach. So I prefer to leave these things out of SolrNet's scope for now. It depends too much on the structure of your particular data and setup to just pick some default values. And I don't want to take a dependency on the TPL yet.

Some general advice:

Keep your process as linear (O(n)) and as lazy as possible.
While increasing the group size can increase the throughput (and memory), also keep in mind that with big groups you'll start to see timeouts from Solr.
When fetching data from the database, always do it with a forward-only enumerator, like an IDataReader or a LINQ2SQL enumerable. Loading the whole result set into a List or DataTable will simply kill your memory and performance.
It can also make sense to fetch the data from the database in several groups (I just assumed a single IEnumerable as an origin to keep it simple) and parallelize on that.
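To illustrate the forward-only advice, here's a minimal sketch of turning an IDataReader into a lazy sequence of dictionary documents. The class and method names (LazyRows, ReadDocuments) are made up for this example, and mapping column names straight to field names is an assumption; your schema will probably need some translation.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical helper class, not part of SolrNet.
public static class LazyRows
{
    // Yields one dictionary per row. Rows are read only as the caller
    // enumerates, so the whole result set is never held in memory.
    // The caller owns the reader and is responsible for disposing it.
    public static IEnumerable<Dictionary<string, object>> ReadDocuments(IDataReader reader)
    {
        while (reader.Read())
        {
            var doc = new Dictionary<string, object>();
            for (int i = 0; i < reader.FieldCount; i++)
            {
                // Assumption: NULL columns are simply left out of the document.
                if (!reader.IsDBNull(i))
                    doc[reader.GetName(i)] = reader.GetValue(i);
            }
            yield return doc;
        }
    }
}
```

The resulting sequence can only be enumerated once, which is exactly why a one-pass Partition() matters: you feed this enumerable to Partition() and add each group as it comes out.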

Footnotes:

[1] Dictionary documents for SolrNet are implemented in trunk and will be included in the next release.
[2] I know that even though this approach isn't scalable at all, it shouldn't throw OOM with only 150000 docs.
[3] I chose that particular Partition() function because it's one-pass. If you write a partitioning function with LINQ's GroupBy you'll traverse your IEnumerable (at least) twice. If you use a forward-only enumerable (e.g. LINQ2SQL), which I recommend, you only get to enumerate the result once.
[4] You can get the latest version of the TPL for .NET 3.5 SP1 from the Reactive Extensions.


Monday, February 22, 2010

Mobile remote media browser

Wednesday, February 10, 2010

Indexing millions of documents with Solr and SolrNet
