Thursday, March 20, 2014

Better dictionary types

Say you need to call a function like:

int DoSomething(Dictionary<string, string> data)

Do you know what kind of data you should feed to that function? Evidently, the keys and values of the dictionary are strings, but what should the comparer for the keys be? Should keys be case-sensitive or case-insensitive? Does it even matter for this function?

For example, if DoSomething was processing HTTP headers, it would need to receive a case-insensitive dictionary, as HTTP header names are case-insensitive. Yet the type doesn’t enforce it, it doesn’t even give us a hint.

How do other typed languages deal with this? Let’s take a look at Haskell first.

Haskell’s Data.Map requires the key type to have an instance for the Ord typeclass. Since you can’t have more than one typeclass instance per type, there is no possible ambiguity about how keys are compared. This property of having at most one instance per typeclass per type is called “coherence” and it’s a good property to have in a typeclass system as it keeps things simple, both for the programmer and the compiler. If you wanted a case-insensitive Map, you’d use Data.CaseInsensitive and your concrete Map type would reflect its case-insensitive behavior, e.g.

import qualified Data.Map as M
import qualified Data.CaseInsensitive as CI ( mk )

main = do
  let m = M.fromList [(CI.mk "one", 1)]
  print $ M.lookup (CI.mk "One") m

Here the type of m is Map (CI String) Integer. You can’t confuse it with the case-sensitive Map String Integer because the compiler simply won’t let you. They’re different types!

.NET doesn’t have typeclasses but we could achieve something similar in this case if we could redesign System.Collections.Generic.Dictionary and remove the constructors that admit an instance of IEqualityComparer<TKey> . That means it would always use the default comparer for the key type. And if we wanted a case-insensitive dictionary, we’d just wrap our string keys in a type implementing case-insensitive equality, e.g.:

sealed class CaseInsensitiveString {
    public readonly string Value;

    public CaseInsensitiveString(string value) {
        Value = value;
    }

    public override bool Equals(object obj) {
        if (ReferenceEquals(null, obj)) return false;
        if (ReferenceEquals(this, obj)) return true;
        if (obj.GetType() != GetType()) return false;
        return StringComparer.InvariantCultureIgnoreCase.Equals(Value, ((CaseInsensitiveString)obj).Value);
    }

    public override int GetHashCode() {
        return (Value != null ? StringComparer.InvariantCultureIgnoreCase.GetHashCode(Value) : 0);
    }

    public override string ToString() {
        return Value;
    }

    public static implicit operator CaseInsensitiveString (string a) {
        return new CaseInsensitiveString(a);
    }
}

var data = new Dictionary<CaseInsensitiveString, string> {
    {"one", "1"},
};

Console.WriteLine(data["One"]);

But since Dictionary has constructors that allow overriding the equality comparer, this is not enough. In other words, the type Dictionary<CaseInsensitiveString, string> does not guarantee that the dictionary is case-insensitive. We can easily work around this by creating a new type that limits Dictionary’s constructors:

sealed class Dictionary2<TKey, TValue>: Dictionary<TKey, TValue> {
    public Dictionary2() {}
    public Dictionary2(int capacity) : base(capacity) {}
    public Dictionary2(IDictionary<TKey, TValue> dictionary) : base(dictionary) {}
}

And now we can guarantee that a Dictionary2<CaseInsensitiveString, string> will do key equality as defined by CaseInsensitiveString and thus it will be case-insensitive.

There is one downside to this solution: we need to wrap all the keys when we want a different comparer. This means more allocations, less performance. Haskell can avoid this penalty by making the wrapper a newtype (not the particular case of Data.CaseInsensitive though!), which we don’t have in .NET. Can we do better?

The main problem here was that the type doesn’t uniquely determine an equality comparer. If we don’t make that a part of the key type, then couldn’t we make the comparer part of the dictionary type?

This is precisely what ocaml-core does. The Map type is determined by types of the map’s keys and values, and the comparison function used to order the keys. The book Real World OCaml explains how including the comparator in the type is important because certain operations that work on multiple maps require that they have the same comparison function. As we’ve seen, System.Collections.Generic.Dictionary can’t enforce that.

Following that same design principle, now instead of forbidding all constructors that accept an equality comparer, we do the opposite: forbid all constructors that don’t take a comparer, thus making it always explicit, and include the comparer as an additional type parameter:

sealed class Dictionary<TKey, TValue, TEqualityComparer> : Dictionary<TKey, TValue> where TEqualityComparer : IEqualityComparer<TKey> {
    public Dictionary(TEqualityComparer comparer) : base(comparer) {}
    public Dictionary(int capacity, TEqualityComparer comparer) : base(capacity, comparer) {}
    public Dictionary(IDictionary<TKey, TValue> dictionary, TEqualityComparer comparer) : base(dictionary, comparer) {}
}

A small helper to aid with type inference in C#:

static class Dict<TKey, TValue> {
    public static Dictionary<TKey, TValue, TEqualityComparer> Create<TEqualityComparer>(TEqualityComparer comparer) where TEqualityComparer: IEqualityComparer<TKey> {
        return new Dictionary<TKey, TValue, TEqualityComparer>(comparer);
    }
}

Another small helper class to ease the definition of comparer types based on the ones we already have:

class DelegatingEqualityComparer<T>: IEqualityComparer<T> {
    private readonly IEqualityComparer<T> comparer;

    public DelegatingEqualityComparer(IEqualityComparer<T> comparer) {
        this.comparer = comparer;
    }

    public bool Equals(T x, T y) {
        return comparer.Equals(x, y);
    }

    public int GetHashCode(T obj) {
        return comparer.GetHashCode(obj);
    }
}

Now we can easily create new comparer types like this:

sealed class StringComparerInvariantCultureIgnoreCase: DelegatingEqualityComparer<string> {
    private StringComparerInvariantCultureIgnoreCase() : base(StringComparer.InvariantCultureIgnoreCase) {}

    public static readonly StringComparerInvariantCultureIgnoreCase Instance = new StringComparerInvariantCultureIgnoreCase();
}

Finally we can use this new Dictionary type like this:

var data = Dict<string, string>.Create(StringComparerInvariantCultureIgnoreCase.Instance);
data.Add("one", "1");

Console.WriteLine(data["One"]);

Back to our original function, if we wanted to enforce a case-insensitive dictionary we can now use this new Dictionary type and change the signature to:

int DoSomething(Dictionary<string, string, StringComparerInvariantCultureIgnoreCase> data)

Epilogue

Types are a terrific tool to reason about our code, but only if we use them correctly. Throwing around types in an impure, partial language like C# or F# does not mean you're using types in a meaningful way. Consider what your types allow and what they don’t allow. Make illegal states unrepresentable. With precise types it becomes easier to reason about your code. Invariants enforced through the type system means the compiler makes it impossible to create invalid programs.

When you find yourself in need of inspiration for your types, see what other typed languages do, especially OCaml and Haskell. Their type systems are much more powerful than .NET’s, but often you can extract some of the underlying design principles and adapt them to less powerful type systems.

Friday, February 14, 2014

Generating immutable instances in C# with FsCheck

If you do any functional programming in C# you’ll probably have lots of classes that look like this:

class Person {
    private readonly string name;
    private readonly DateTime dateOfBirth;

    public Person(string name, DateTime dateOfBirth) {
        this.name = name;
        this.dateOfBirth = dateOfBirth;
    }

    public string Name {
        get { return name; }
    }

    public DateTime DateOfBirth {
        get { return dateOfBirth; }
    }

    // equality members...
}

I.e. immutable classes, similar to F# records.
Now let’s say we want to test some property involving this Person class. For example, let’s say we have a serializer:

interface IPersonSerializer {
    string Serialize(Person p);
    Person Deserialize(string source);
}

Never mind the unsafety of this serializer, we want to test that roundtrip serialization works as expected. So we grab FsCheck and write:

Spec.ForAny((Person p) => serializer.Deserialize(serializer.Serialize(p)).Equals(p))
    .QuickCheckThrowOnFailure();

Only to be greeted with:

System.Exception: Geneflect: type not handled Person

Ok, so we write the generator explicitly:

var personGen =
    from name in Any.OfType<string>()
    from dob in Any.OfType<DateTime>()
    select new Person(name, dob);

Spec.For(personGen, p => serializer.Deserialize(serializer.Serialize(p)).Equals(p))
    .QuickCheckThrowOnFailure();

And all is fine.

But the generator code is trivially derivable from the class definition. And when you have lots of immutable classes, this boilerplate becomes really annoying. Couldn’t we automatize that somehow?

As it turns out, FsCheck already does this for F# records, using reflection. With a bit of code we can extend the reflection-based generator (built into FsCheck) to make it generate instances of immutable classes. With this change, we can go back to writing Spec.ForAny and FsCheck will automatically derive the generator for our Person class!

The restrictions on the classes to make them generable by FsCheck are:

  • Must be a concrete class. No interfaces or abstract classes; FsCheck can’t guess a concrete implementation.
  • It has to have only one public constructor. Otherwise, which one would FsCheck choose?
  • All public fields and properties must be readonly. Otherwise, it creates the ambiguity of which ones to set and which ones not.
  • Must not be recursively defined. FsCheck doesn’t generate recursively defined F# records by reflection either. I assume this is to keep the implementation simple.
  • Must not have type parameters. This restriction could probably be relaxed, but since I haven’t needed it so far, I decided to play it safe. Left as an exercise for the reader :-)

This is available in FsCheck as of version 0.9.2.

Happy FsChecking in C# and VB.NET !

Monday, October 21, 2013

Towards a NuGet dependency monitor with OData and F#

Every now and then I bump into the Hackage Dependency Monitor and wish there was something similar for NuGet. Node.js also has something like this with David. As a maintainer of several packages, it would come in handy if something would tell me when a dependency gets outdated, instead of having to find out through a bug report or reading the new release announcement by coincidence.

Since NuGet exposes its repository through an OData feed, and .NET has some good facilities to query these, I thought it’d be fun to try to implement the core of a dependency monitor.

Exploration

I started out with LINQPad, which makes exploration as easy as it gets. Click “add connection”, select the OData driver, and enter the feed URL https://nuget.org/api/v2 . That’s it. Open the connection and it shows the available tables (only "Packages" in this case) and their schema:

For example, we can find out how many projects are hosted on Github:

Packages 
.Where(x => x.IsLatestVersion && x.IsAbsoluteLatestVersion && x.ProjectUrl.Contains("github")) 
.Count()

Here are the figures for the most popular project hosting services (for the packages that have the ProjectUrl property defined):

Github: 5378
Codeplex: 2658
Google: 483
Bitbucket: 415

We can also see that dependencies are stored as a string… what does the content look like? Let’s find out:

This means that we’ll have to parse this dependency string and run another query to fetch and analyze dependencies. So let’s switch to the F# REPL to write some code more comfortably.

Some helpers

NuGet has some functions that will come in handy to parse these dependency versions:

#r "System.Xml.Linq"
#r @"g:\prg\SolrNet\lib\NuGet.exe"

open NuGet

let split (c: char) (x: string) = x.Split c

let parseDependencies : string -> seq<string * IVersionSpec> = 
    split '|' >> Seq.map (split ':' >> fun x -> x.[0], VersionUtility.ParseVersionSpec x.[1])

To query the NuGet feed we’ll use the OData type provider.

Unfortunately, WCF Data Services doesn’t support Contains(), which we need to get all the dependencies for a package in a single query. As an example, running this on LINQPad:

Packages.Where(x => new[] {"GDataDB", "FSharpx.Core"}.Contains(x.Id))

throws NotSupportedException: the method ‘Contains’ is not supported.

The type provider uses the same query translator underneath so it has the same limitation. OData v3 supports the Any() operator, but it seems that the NuGet feed is OData v2, so that doesn’t work either. Anyway, we can implement this with some quotation manipulation:

open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns

let inList (memberr: Expr<'a -> 'b>) (values: 'b list) : Expr<'a -> bool> =
    match memberr with
    | Lambda (_, PropertyGet _) -> 
        match values with
        | [] -> <@ fun _ -> true @>
        | _ -> values |> Seq.map (fun v -> <@ fun a -> (%memberr) a = v @>) |> Seq.reduce (fun a b -> <@ fun x -> (%a) x || (%b) x @>)
    | _ -> failwith "Expression has to be a member"

You can compare this with the equivalent C# code, it really shows how expression splicing makes the F# code much clearer.

Also, just for kicks, let’s run the query asynchronously. This doesn’t make much difference now as we’ll run this in the REPL, but it would be useful if this were to be used in a server.

#r "System.Data.Services.Client"
#r "FSharp.Data.TypeProviders"

open System.Linq
open System.Data.Services.Client

let execQueryAsync (q: _ DataServiceQuery) = 
    Async.FromBeginEnd(q.BeginExecute, q.EndExecute)

Main code

Ok, enough with the helper functions. Here’s the main function:

type nuget = Microsoft.FSharp.Data.TypeProviders.ODataService<"https://nuget.org/api/v2">
type Package = nuget.ServiceTypes.V2FeedPackage
let ctx = nuget.GetDataContext()

let checkDependencies packageId =
    async {
        let packagesQuery = 
            query {
                for p in ctx.Packages do
                where (p.Id = packageId && p.IsAbsoluteLatestVersion && p.IsLatestVersion)
                select p
            } :?> DataServiceQuery<Package>
        let! packages = execQueryAsync packagesQuery

        let package = Seq.exactlyOne packages

        let deps = parseDependencies package.Dependencies |> Seq.toList
        let depIds = List.map fst deps
        let depsQuery = 
            query {
                for p in ctx.Packages do
                where (((%(inList <@ fun (x: Package) -> x.Id @> depIds)) p) && p.IsAbsoluteLatestVersion && p.IsLatestVersion)
                select p
            } :?> DataServiceQuery<Package>
        let! depPackages = execQueryAsync depsQuery
        let depPackagesList = Seq.toList depPackages
        let satisfies =
            deps |> List.map (fun (depId, version) -> 
                                let depPackage = depPackagesList |> Seq.find (fun p -> p.Id = depId)
                                let semVersion = SemanticVersion.Parse depPackage.Version
                                depId, version, version.Satisfies semVersion)
        return satisfies
    }

This returns a list of tuples where the first element is the package ID of the dependency, the required version for that dependency, and a boolean saying if the dependency is outdated (false) or not (true).

Let’s try an example:

checkDependencies "GDataDB" |> Async.RunSynchronously

Gives:

[("Google.GData.Client", [2.1.0.0] {IsMaxInclusive = true;
                                      IsMinInclusive = true;
                                      MaxVersion = 2.1.0.0;
                                      MinVersion = 2.1.0.0;}, false);
   ("Google.GData.Extensions", [2.1.0.0] {IsMaxInclusive = true;
                                          IsMinInclusive = true;
                                          MaxVersion = 2.1.0.0;
                                          MinVersion = 2.1.0.0;}, false);
   ("Google.GData.Documents", [2.1.0.0] {IsMaxInclusive = true;
                                         IsMinInclusive = true;
                                         MaxVersion = 2.1.0.0;
                                         MinVersion = 2.1.0.0;}, false);
   ("Google.GData.Spreadsheets", [2.1.0.0] {IsMaxInclusive = true;
                                            IsMinInclusive = true;
                                            MaxVersion = 2.1.0.0;
                                            MinVersion = 2.1.0.0;}, false)]

Uh-oh, I better update those dependencies!

Conclusion

So this looks simple enough, right? However, I consider this just a spike, it’s not really robust. What happens if the package doesn’t exist? Exception. Ill-defined dependencies? Exception.

Still, it shouldn’t be too hard to make this more robust, then put it in a web server and done! Easier said than done ;-)

Also, it seems that the NuGet v3 API won't be based on OData so this whole experiment might need to be rewritten soon.

Anyway, here's the entire code for this post.

Appendix: WCF Data Services criticism

I generally dislike expression-based translators (i.e. IQueryable) because they’re usually eminently partial, i.e. you have to guess what’s supported, read the docs very carefully, or run your code in a test and see what happens. Otherwise you get exceptions everywhere. The compiler can’t do much and it’s never quite clear what will execute on the client and what will be translated and executed on the server. This hurts your ability to reason about the code, which in turn means more programming by coincidence.

WCF Data Services takes this to pathological levels. While exploring the NuGet feed in LINQPad I found many simple expressions that should have worked but didn’t. A few examples:

Packages.Where(x => new[] {"GDataDB", "FSharpx.Core"}.Contains(x.Id))

This is the one I mentioned earlier. I don’t see why the expression translator couldn’t do what I did and compile this to a chain of OR’ed expressions.

Even the simplest projection fails:

Packages.Select(x => x.Id)

Throwing NotSupportedException: Individual properties can only be selected from a single resource or as part of a type. Specify a key predicate to restrict the entity set to a single instance or project the property into a named or anonymous type.

I have no idea what the first part of that error means, but projecting to an anonymous type works:

Packages.Select(x => new {x.Id})

The official explanation for this is that the OData protocol doesn’t support it, but again, it would seem that this is the job of the expression translator and the protocol or server side of things has little to do with it.

By the way, if you happen to add a Take() operator you get a totally different exception:

Packages.Select(x => x.Id).Take(20)

Throws an InvalidCastException: Unable to cast object of type 'System.Data.Services.Client.NavigationPropertySingletonExpression' to type 'System.Data.Services.Client.ResourceSetExpression'.

Or add a condition and you get yet another different error:

Packages
.Where(x => x.IsLatestVersion && x.IsAbsoluteLatestVersion)
.Select(x => x.Id)
.Take(10)

NotSupportedException: Can only specify query options (orderby, where, take, skip) after last navigation.

Which is incorrect, since replacing the projection in the above expression with an anonymous type works fine.

In this other example, the library doesn’t process negated conditions correctly, which causes a cryptic server-side exception:

Packages
.Where(x => !x.ProjectUrl.Contains("google"))
.Select(x => new {x.Id})

“An error occurred while processing this request. InnerException: Rewriting child expression from type 'System.Nullable`1[System.Boolean]' to type 'System.Boolean' is not allowed, because it would change the meaning of the operation. If this is intentional, override 'VisitUnary' and change it to allow this rewrite.”

The generated URL from this expression is: https://nuget.org/api/v2/Packages()?$filter=not substringof('google',ProjectUrl)&$select=Id which apparently isn’t supported server-side.

But change the condition from using the “not” operator to “== false” and everything is magically fixed:

Packages
.Where(x => x.ProjectUrl.Contains("google") == false)
.Select(x => new {x.Id})

Generated URL: https://nuget.org/api/v2/Packages()?$filter=substringof('google',ProjectUrl) eq false&$top=10&$select=Id

These two expressions are logically equivalent, but one works and the other one fails:

Packages
.Where(x => x.IsLatestVersion)
.Select(x => new {x.IsLatestVersion})
Packages
.Select(x => new {x.IsLatestVersion})
.Where(x => x.IsLatestVersion)

The second one fails with: “NotSupportedException: The filter query option cannot be specified after the select query option.”.

These are all very simple expressions and I found all of these issues in about one hour of experimentation in LINQPad (for reference, LINQPad v4.47.02 using WCF Data Services 5.5), so be very careful and test every single call if you have to use WCF Data Services. And keep in mind that if you use the OData F# type provider, you’re also using WCF Data Services, so the same warning applies.

Wednesday, August 28, 2013

Objects and functional programming

In a recent question on Stackoverflow, someone asked “when to use interfaces and when to use higher-order functions?”. To summarize, it’s about deciding whether to design for an interface or a function to be passed. It’s expressed in F#, but the same question and arguments can be applied to similar languages like C#, VB.NET or Java.

Functional programming is about programming with mathematical functions, which means no side-effects. Some people say that “functional programming” as a paradigm or concept isn’t useful, and all that really matters is being able to reason about your code. The best way I know to reason about my code is to avoid side-effects or isolate them as much as possible.

In any case, none of this says anything about objects, classes or interfaces. You can represent functions however you like. You can write pure code with objects or without them. OOP is effectively orthogonal to functional programming. In this post I'll use the terms 'objects', 'classes', 'interfaces' somewhat interchangeably, the differences don't matter in this context. Hopefully my point still gets across.

Higher-order functions are of course a very useful tool to raise the level of abstraction. However, many perhaps don’t realize that any function receiving some object as argument is effectively a higher-order function. To quote William Cook in “On Understanding Data Abstraction, Revisited”:

Object interfaces are essentially higher-order types, in the same sense that passing functions as values is higher-order. Any time an object is passed as a value, or returned as a value, the object-oriented program is passing functions as values and returning functions as values. The fact that the functions are collected into records and called methods is irrelevant. As a result, the typical object-oriented program makes far more use of higher-order values than many functional programs.

So in principle there is little difference between passing an interface and passing a function. The only difference here is that an interface is named and has named functions, while a function is anonymous. The cost of the interface is the additional boilerplate, which also means having to keep track of one more thing.

Even more, since objects typically have many functions, you could say that you’re not just passing functions as values, but passing modules as values. To put it clearly: objects are first-class modules.

As an aside, the term “first-class value” doesn’t have a precise definition, but I find it useful to wield it with the definition given in MSDN or the Wikipedia.

Objects are also parametrizable modules because constructors can take parameters. If a constructor takes some other object as parameter, then you could say that you’re parameterizing a module by another module.

In contrast to this, F# modules (more generally, static classes in .NET) are not first-class modules. They can’t be passed as arguments, you can’t do something like creating a list of modules. And they can’t be parameterized either.

So why do we even bother with modules if they’re not first-class? Because it’s easier to pick just one function out of a module to use or to compose. Object composition is more coarse-grained. As Joe Armstrong famously said: “You wanted a banana but you got a gorilla holding the banana”.

Back to the Stackoverflow question, what’s the difference between:

module UserService =
   let getAll memoize f =
       memoize(fun _ -> f)

   let tryGetByID id f memoize =
       memoize(fun _ -> f id)

   let add evict f name keyToEvict  =
       let result = f name
       evict keyToEvict
       result

and

type UserService(cacheProvider: ICacheProvider, db: ITable<User>) =
   member x.GetAll() = 
       cacheProvider.memoize(fun _ -> db |> List.ofSeq)

   member x.TryGetByID id = 
       cacheProvider.memoize(fun _ -> db |> Query.tryFirst <@ fun z -> z.ID = ID @>)

   member x.Add name = 
       let result = db.Add name
       cacheProvider.evict <@ x.GetAll() @> []
       result

The first one has some more parameters passed in from the caller, but you can imagine what it would look like. I probably wouldn’t arrange things like either of them, but for one, both lack side-effects. To the first one, you can pass pure functions. To the second one, you can pass implementations of ICacheProvider and ITable with pure functions.

However if you take a good look at the second one, you’ll see that every method uses both cacheProvider and db. So in this case it’s not so bad to pass a couple of gorillas. And it gives the reader a lot more information about what’s being composed, as opposed to a signature like

add : evict:('a -> unit) -> f:('b -> 'c) -> name:'b -> keyToEvict:'a -> 'c

To summarize: The beauty of functional programming lies in being able to reason about your code. One of the easiest ways to achieve this is to write code without side-effects. Classes, interfaces, objects are not opposed to this. In object-capable languages, objects can be a useful tool. Here I talked about objects as modules, but they can model other things too, like records or algebraic data types. They can be easily overused though, especially by programmers new to functional programming. Consider carefully if you want to be juggling gorillas rather than bananas!

Saturday, August 3, 2013

Book review: Apache Solr for Indexing Data How-to

A few days ago I kindly received a copy of the book “Apache Solr for Indexing Data How-to” by Alexandre Rafalovitch for review. Here are my impressions about it.

Solr, by now a nine-year old project, is a powerful piece of software, with lots of high-level features and facilities for text-centric data. And it builds on Lucene, itself an 11-year-old stand-alone project.

At 80 pages, “Apache Solr for Indexing Data How-to” doesn’t try to cover all the features. Instead, it focuses on indexing, that is, getting data from some source (Relational database, text files, etc) into Solr. This is of course a major part of using Solr.

When starting out with Solr, most people first follow the official tutorial, but then feel lost when faced with real-world requirements. The official wiki docs have greatly improved in the last few years but there’s still a large gap between the tutorial and the docs. The reference guide is also great but for a novice it may seem daunting at first. You can see this in many questions on Stackoverflow. This book helps close that gap a bit, at least the part about getting your data into Solr.

You can read it like a cookbook, as a guidance for specific indexing scenarios. As a good “how-to” book, each section starts with a short introduction, then a step-by-step guidance on how to get to the goal, and a “how it works” section explaining everything. An additional section adds tips and further references about each subject.

Of course you can also read it like a regular book. It starts with the most basic scenario, picking up where the tutorial leaves off, and then dives into more complex scenarios. All examples are on github so you can follow on a concrete instance of Solr while reading. The book is written for Solr 4.3. As of now Solr 4.4 is already out and 4.5 is coming soon, but don’t worry, the dev team seems to follow Semantic Versioning so there aren’t any breaking changes.

One problem with this kind of books is that often they can’t focus just on the main topic (in this case, indexing) without at least touching on other topics. Indexing is related to the Solr schema, which in turn is a function of the search needs of your application. This book dabbles in faceting and searching when the scenario demands it, but otherwise acknowledges its limited scope and refers the reader to other books or the reference documentation when appropriate, so you never feel lost.

Another issue is the simplification of some scenarios in order to focus on operative topics and avoid scope creep. For example, the section on indexing data from a relational database uses an example where the database has only one table, no foreign keys. In most real-world scenarios you’ll have lots of related database tables which you’ll have to denormalize and flatten depending on your search needs.

Overall, I think “Apache Solr for Indexing Data How-to” is great for a novice in Solr. It’s a simple, concrete guide to indexing which is one the first things you do with Solr. Just don’t expect it to be all-comprehensive: it doesn’t cover all scenarios and you should read it along the docs to truly understand the concepts at work. It’s designed to help you move forward when, as a beginner, everything looks too complex and you have no idea what to do.

The tutorial will get you started, but this book will get you going.