Wednesday, February 10, 2010

Indexing millions of documents with Solr and SolrNet

When working with Solr, it's not uncommon to see indexes with hundreds of thousands of even millions of documents. Say you build those millions of documents from a RDBMS, which is a common case.

Solr has a tool to do this: the Data Import Handler. It's configurable with XML, like so many things in Java. Problem is, when you need to do some complex processing, it quickly turns into executable XML, and nobody likes that. More importantly, the process is not testable: you can't run a unit test that doesn't involve the actual database and the actual Solr instance. So I prefer to import data to Solr with code. More precisely: .NET code, using SolrNet.

Since adding documents one by one would be terribly inefficient (1000000 documents would mean 1000000 HTTP requests), SolrNet has a specific method to add documents in batch: Add(IEnumerable<T> documents). Let's try adding a huge amount of documents with this.

Setup

To keep this post focused, I'll abstract away the database. So, first thing I'll do is set up some fake documents [1]:

string text = new string('x', 1000); 
IEnumerable<Dictionary<string, object>> docs = Enumerable.Range(0, 150000)
    .Select(i => new Dictionary<string, object> { 
        {"id", i.ToString()}, 
        {"title", text} 
    }); 

This sets up 150000 documents, each one with a size of about 1 KB, lazily. They don't exist anywhere yet, until we start enumerating docs.

Tests

After setting up SolrNet we call:

solr.Add(docs);

and shortly after executing it the process grows its memory usage to some gigabytes and then crashes with an OutOfMemoryException. Holy crap![2]

Reducing the amount of documents to 100000 completed the process successfully, but it took 32s (3125 docs/s) and the peak memory usage was 850MB.  This clearly isn't working!

What happened is that SolrNet tried to fit all the documents in a single HTTP request. Not very smart, eh? But that's out of SolrNet's scope, at least for now. What we need to do is feed it with manageable chunks of documents. So we grab a partition function like this one, courtesy of Jon Skeet[3]. Armed with this function we partition the 100000 docs into chunks of 1000 docs:

foreach (var group in Partition(docs, 1000))
   solr.Add(group);

This completes in 34s which is slightly worse than without grouping, but memory usage is pretty constant at 50MB. Now we're getting somewhere!

But wait! What if we parallelize these groups? The Task Parallel Library (TPL)[4] makes it very easy to do so:

Parallel.ForEach(Partition(docs, 1000), group => solr.Add(group));

This one took 21.2s to complete on my dual-core CPU but peak memory usage was 140MB since it has to keep several groups in memory simultaneously. This is pretty much what SolrJ (the Java Solr client) does with its StreamingUpdateSolrServer, except the Java folks had to manually queue and manage the threads, while we can just leverage the TPL in a single line of code.

Playing a bit with the group size I ended up with these charts of memory size and throughput:

solr-mem

solr-throughput

 

Memory size seems to increase linearly with group size, while throughput shows an asymptotic growth.

By now I bet you must be saying: "Hey, wait a minute! The title of the post promised millions of documents but you only show us a mere 100000! Where's the rest of it?!?". Well, I did benchmark a million documents as well, and with group size = 1000, in parallel, it took 3:57 minutes. For these tests I used 100000 documents instead to keep times down.

Conclusion and final notes

In this experiment I left a lot of variables fixed: document size, network throughput and latency (I used a local Solr instance so there is no network), CPU (since I ran Solr on the same box as the tests, they competed for CPU)... With a quad-core CPU I would expect this to consume more memory but it would also be faster. Bigger documents would also increase memory usage and make the whole process more network-sensitive. Is memory more important to you than throughput? Then you would use the non-parallel approach. So I prefer to leave these things out of SolrNet's scope for now. It depends too much on the structure of your particular data and setup to just pick some default values. And I don't want to take a dependency on the TPL yet.

Some general advice:

  • Keep your process as linear (O(n)) and as lazy as possible.
  • While increasing the group size can increase the throughput (and memory), also keep in mind that with big groups you'll start to see timeouts from Solr.
  • When fetching data from the database, always do it with a forward-only enumerator, like a IDataReader or a LINQ2SQL enumerable. Loading the whole resultset in a List or DataTable will simply kill your memory and performance.
  • It can also make sense to fetch the data from the database in several groups (I just assumed a single IEnumerable as an origin to keep it simple) and parallelize on that.

Footnotes:

  1. Dictionary documents for SolrNet is implemented in trunk, it will be included in the next release
  2. I know that even though this approach isn't scalable at all, it shouldn't throw OOM with only 150000 docs.
  3. I chose that particular Partition() function because it's one-pass. If you write a partitioning function with LINQ's groupby you'll traverse your IEnumerable (at least) twice. If you use a forward-only enumerable (e.g. LINQ2SQL), which I recommend, you only get to enumerate the result once.
  4. You can get the latest version of the TPL for .NET 3.5 SP1 from the Reactive Extensions.

Submit this story to DotNetKicks Shout it

Thursday, January 7, 2010

Translating Haskell to F# and other considerations

In my last post I showed one way to group consecutive integers in a list. When I realized I needed such a function I first went googling for it but all I found was this Haskell code:

-- | Group consecutive integers:
--
-- > groupConsec [1,2,3,5,6,8,9] == [[1,2,3],[5,6],[8,9]]
groupConsec :: [Int] -> [[Int]]
groupConsec = groupConsec' []
    where
      groupConsec' x   []    = x
      groupConsec' [] (y:ys) = groupConsec' [[y]] ys
      groupConsec' xs (y:ys) = if y - head (last xs) == length (last xs)
                               then groupConsec' (init xs ++ [last xs ++ [y]]) ys
                               else groupConsec' (     xs ++ [           [y]]) ys

As I was on a C# project it was easier to start from scratch than trying to port this. So I wrote what I showed on the last post and moved on.

But I thought it would be interesting to port it to F#, to see how much different it would be.

First, let's translate it literally, by making as little changes as possible to the original code:

let groupConsec =
    let head = List.head
    let last l = List.nth l (l.Length-1)
    let length = List.length
    let init l = List.ofSeq (Seq.take ((List.length l)-1) l)
    let rec groupConsec' a b =
        match a,b with
        | x, [] -> x
        | [], y::ys -> groupConsec' [[y]] ys
        | xs, y::ys -> if y - head (last xs) = length (last xs)
                       then groupConsec' (init xs @ [last xs @ [y]]) ys
                       else groupConsec' (     xs @ [          [y]]) ys
                       
    groupConsec' []

At the beginning there's a couple of functions that aren't built into F#, but the rest of it is syntactically pretty much the same.

BUT, syntactic similarity does not imply that they're semantically the same. There's a fundamental difference from the original code: Haskell is a lazy language, while F#, coming from the ML-family, does eager evaluation. This doesn't necessarily mean that lazy is always better than eager or eager better than lazy. Lazy languages seem to have their own issues.

So let's just replace all lists in the code with LazyLists and see what happens:

let lazyGroupConsec =
    let head = LazyList.head
    /// From Haskell prelude: http://www.haskell.org/onlinereport/standard-prelude.html
    let rec last =
            function
            | LazyList.Nil -> failwith "last: empty list"
            | LazyList.Cons(x,xs) -> 
            match xs with
            | LazyList.Nil -> x
            | _ -> last xs
            
    let length = LazyList.length
    let rec init = 
            function
            | LazyList.Nil -> failwith "init: empty list"
            | LazyList.Cons(x,xs) -> 
            match xs with
            | LazyList.Nil -> LazyList.empty()
            | _ -> LazyList.cons x (init xs)
            
    let (@) x y = LazyList.append x y
    let lazyList x = LazyList.cons x (LazyList.empty())
                
    let rec groupConsec' a b =
        match a,b with
        | x, LazyList.Nil -> x
        | LazyList.Nil, LazyList.Cons(y,ys) -> groupConsec' (lazyList (LazyList.ofList [y])) ys
        | xs, LazyList.Cons(y,ys) -> if y - head (last xs) = length (last xs)
                                     then groupConsec' (init xs @ lazyList (last xs @ (LazyList.ofList [y]))) ys
                                     else groupConsec' (     xs @ lazyList (          (LazyList.ofList [y]))) ys

    groupConsec' (LazyList.empty())

Ok, now the function is lazy, but the code took a serious hit in readability...

Let's take now a whole different approach: porting from the C# code, but using recursion instead of the loop to avoid the mutable list:

let lazyGroupConsec2 : int seq -> int list seq =
    let lastOf x = List.nth x ((List.length x) - 1)
    let rec groupConsec' group groups input =
            if Seq.isEmpty input
            then Seq.append groups [group]
            else
                let hd = Seq.head input
                let tl = Seq.skip 1 input
                if List.length group = 0 || hd - (lastOf group) <= 1
                then groupConsec' (group @ [hd]) groups tl
                else groupConsec' [hd] (Seq.append groups [group]) tl
    groupConsec' [] []

This one looks pretty neat, right? Too bad it needs a type annotation or it gets a value restriction error.

Now I'm too lazy (no pun intended) to properly analyze the complexity of each of these functions, but at least let's run a simple benchmark over a range of integers with some holes: let range = {1..2000} |> Seq.filter (fun i -> i % 5 <> 0)

Function Time
groupConsec 00:00.1383291
lazyGroupConsec 00:00.9355807
lazyGroupConsec2 02:03.2433208

In case it's not clear, the first, non-lazy one took about 138ms and the last one took over 2 minutes. The reason is that seq pretty much sucks when used recursively, as Brian explains in detail in this stackoverflow answer.

Let's try using a LazyList instead for the inner recursive function:

let lazyGroupConsec3 l =
    let lastOf x = List.nth x ((List.length x) - 1)
    let (+) x y = LazyList.append x (LazyList.ofList [y])
    let rec groupConsec' group groups input =
            if LazyList.isEmpty input
            then groups + group
            else
                let hd = LazyList.head input
                let tl = LazyList.tail input
                if List.length group = 0 || hd - (lastOf group) <= 1
                then groupConsec' (group @ [hd]) groups tl
                else groupConsec' [hd] (groups + group) tl
    groupConsec' [] (LazyList.empty()) (LazyList.ofSeq l)

This one took 95ms to process the same range of integers.

Conclusions:

  • Porting code from Haskell to F# might be tempting but beware of their fundamental differences.
  • Because of these differences, the best solution in Haskell might not be the best solution in F#.
  • You really need to know how seq, list and lazylist work or you'll have a big performance problem.
  • LazyLists are so cool I wish they were more tightly integrated to the language.

Homework:

  • Properly analyze big-O complexity, or at least measure it empirically.
  • Check that all recursive functions here get properly optimized for tail calls.

Submit this story to DotNetKicks Shout it

Grouping consecutive integers in C#

Quick post: an extension method to group consecutive integers:

/// <summary>
/// Groups consecutive integers. Requires asc-ordered list
/// </summary>
/// <param name="list">ASC-ordered list of integers</param>
/// <returns></returns>
public static IEnumerable<IEnumerable<int>> GroupConsecutive(this IEnumerable<int> list) {
    var group = new List<int>();
    foreach (var i in list) {
        if (group.Count == 0 || i - group[group.Count - 1] <= 1)
            group.Add(i);
        else {
            yield return group;
            group = new List<int> {i};
        }
    }
    yield return group;
}

Sample usage:

var groups = new[] {1, 2, 4, 5, 6, 9}.GroupConsecutive();
foreach (var g in groups)
    Console.WriteLine("group: {0}", string.Join(",", g.Select(x => x.ToString()).ToArray()));

which prints out:

group: 1,2
group: 4,5,6
group: 9

Submit this story to DotNetKicks Shout it

Tuesday, December 29, 2009

SolrNet 0.2.3 released

I just released SolrNet 0.2.3. There aren't any significant changes from beta1, I just filled in some missing documentation bits and a couple of tests.

For the next release I'll focus on performance improvements and making multi-core access easier.

Downloads:

Submit this story to DotNetKicks Shout it

Wednesday, December 9, 2009

Boo web console

Here's another embeddable web console: a Boo interpreter. You enter some code, hit the Execute button and it compiles and runs the code. You get access to all of the assemblies in your app.

I use this to run short ad-hoc "queries" against a running web app, like:

  • Is this thing in the cache?
    print (context.Cache["pepe"] == null)
  • Do I have all httpmodules correctly registered?
    import System.Web
    for s in context.ApplicationInstance.Container.ResolveAll(typeof(IHttpModule)):
      print s
  • Checking performance counters:
    import System.Diagnostics
    pc = PerformanceCounterCategory("ASP.NET")
    for counter in pc.GetCounters():
      print counter.CounterName, counter.NextValue()
  • Or even running some NHibernate query (although I'd prefer using the NHibernate web console):
    import NHibernate 
    sf = context.ApplicationInstance.Container.Resolve(typeof(ISessionFactory)) 
    using s = sf.OpenSession(): 
      users = s.CreateQuery("from User order by newid()").SetMaxResults(5).List()
      for u in users: 
        print u.Id 

Why Boo? Because it's pretty much the perfect .net scripting language to embed: small (this console with boo merged in is about 1.9MB), type inference, duck typing, low-ceremony syntax.

Setup:

  • Put BooWebConsole.dll in your app's bin directory
  • Add it to your web.config's <httpHandlers> section:
    <add verb="*" path="boo/*" validate="false" type="BooWebConsole.ControllerFactory, BooWebConsole"/>
  • Visit /boo/index.ashx

Notes:

  • Unlike booish, this console is stateless. Each http request creates a new interpreter, it doesn't remember anything you executed in previous requests.
  • Since the code is compiled and run within the main AppDomain, this will leak memory. Not a big deal in a dev environment though. But putting this in a production site is pretty much suicidal.
  • In case it isn't obvious: this is not meant to replace proper testing.

Code is here.

Submit this story to DotNetKicks Shout it