Thursday, March 20, 2014

Better dictionary types

Say you need to call a function like:

int DoSomething(Dictionary<string, string> data)

Do you know what kind of data you should feed to that function? Evidently, the keys and values of the dictionary are strings, but what should the comparer for the keys be? Should keys be case-sensitive or case-insensitive? Does it even matter for this function?

For example, if DoSomething was processing HTTP headers, it would need to receive a case-insensitive dictionary, as HTTP header names are case-insensitive. Yet the type doesn’t enforce it, it doesn’t even give us a hint.

How do other typed languages deal with this? Let’s take a look at Haskell first.

Haskell’s Data.Map requires the key type to have an instance for the Ord typeclass. Since you can’t have more than one typeclass instance per type, there is no possible ambiguity about how keys are compared. This property of having at most one instance per typeclass per type is called “coherence” and it’s a good property to have in a typeclass system as it keeps things simple, both for the programmer and the compiler. If you wanted a case-insensitive Map, you’d use Data.CaseInsensitive and your concrete Map type would reflect its case-insensitive behavior, e.g.

import qualified Data.Map as M
import qualified Data.CaseInsensitive as CI ( mk )

main = do
  let m = M.fromList [(CI.mk "one", 1)]
  print $ M.lookup (CI.mk "One") m

Here the type of m is Map (CI String) Integer. You can’t confuse it with the case-sensitive Map String Integer because the compiler simply won’t let you. They’re different types!

.NET doesn’t have typeclasses but we could achieve something similar in this case if we could redesign System.Collections.Generic.Dictionary and remove the constructors that admit an instance of IEqualityComparer<TKey> . That means it would always use the default comparer for the key type. And if we wanted a case-insensitive dictionary, we’d just wrap our string keys in a type implementing case-insensitive equality, e.g.:

sealed class CaseInsensitiveString {
    public readonly string Value;

    public CaseInsensitiveString(string value) {
        Value = value;
    }

    public override bool Equals(object obj) {
        if (ReferenceEquals(null, obj)) return false;
        if (ReferenceEquals(this, obj)) return true;
        if (obj.GetType() != GetType()) return false;
        return StringComparer.InvariantCultureIgnoreCase.Equals(Value, ((CaseInsensitiveString)obj).Value);
    }

    public override int GetHashCode() {
        return (Value != null ? StringComparer.InvariantCultureIgnoreCase.GetHashCode(Value) : 0);
    }

    public override string ToString() {
        return Value;
    }

    public static implicit operator CaseInsensitiveString (string a) {
        return new CaseInsensitiveString(a);
    }
}

var data = new Dictionary<CaseInsensitiveString, string> {
    {"one", "1"},
};

Console.WriteLine(data["One"]);

But since Dictionary has constructors that allow overriding the equality comparer, this is not enough. In other words, the type Dictionary<CaseInsensitiveString, string> does not guarantee that the dictionary is case-insensitive. We can easily work around this by creating a new type that limits Dictionary’s constructors:

sealed class Dictionary2<TKey, TValue>: Dictionary<TKey, TValue> {
    public Dictionary2() {}
    public Dictionary2(int capacity) : base(capacity) {}
    public Dictionary2(IDictionary<TKey, TValue> dictionary) : base(dictionary) {}
}

And now we can guarantee that a Dictionary2<CaseInsensitiveString, string> will do key equality as defined by CaseInsensitiveString and thus it will be case-insensitive.

There is one downside to this solution: we need to wrap all the keys when we want a different comparer. This means more allocations, less performance. Haskell can avoid this penalty by making the wrapper a newtype (not the particular case of Data.CaseInsensitive though!), which we don’t have in .NET. Can we do better?

The main problem here was that the type doesn’t uniquely determine an equality comparer. If we don’t make that a part of the key type, then couldn’t we make the comparer part of the dictionary type?

This is precisely what ocaml-core does. The Map type is determined by types of the map’s keys and values, and the comparison function used to order the keys. The book Real World OCaml explains how including the comparator in the type is important because certain operations that work on multiple maps require that they have the same comparison function. As we’ve seen, System.Collections.Generic.Dictionary can’t enforce that.

Following that same design principle, now instead of forbidding all constructors that accept an equality comparer, we do the opposite: forbid all constructors that don’t take a comparer, thus making it always explicit, and include the comparer as an additional type parameter:

sealed class Dictionary<TKey, TValue, TEqualityComparer> : Dictionary<TKey, TValue> where TEqualityComparer : IEqualityComparer<TKey> {
    public Dictionary(TEqualityComparer comparer) : base(comparer) {}
    public Dictionary(int capacity, TEqualityComparer comparer) : base(capacity, comparer) {}
    public Dictionary(IDictionary<TKey, TValue> dictionary, TEqualityComparer comparer) : base(dictionary, comparer) {}
}

A small helper to aid with type inference in C#:

static class Dict<TKey, TValue> {
    public static Dictionary<TKey, TValue, TEqualityComparer> Create<TEqualityComparer>(TEqualityComparer comparer) where TEqualityComparer: IEqualityComparer<TKey> {
        return new Dictionary<TKey, TValue, TEqualityComparer>(comparer);
    }
}

Another small helper class to ease the definition of comparer types based on the ones we already have:

class DelegatingEqualityComparer<T>: IEqualityComparer<T> {
    private readonly IEqualityComparer<T> comparer;

    public DelegatingEqualityComparer(IEqualityComparer<T> comparer) {
        this.comparer = comparer;
    }

    public bool Equals(T x, T y) {
        return comparer.Equals(x, y);
    }

    public int GetHashCode(T obj) {
        return comparer.GetHashCode(obj);
    }
}

Now we can easily create new comparer types like this:

sealed class StringComparerInvariantCultureIgnoreCase: DelegatingEqualityComparer<string> {
    private StringComparerInvariantCultureIgnoreCase() : base(StringComparer.InvariantCultureIgnoreCase) {}

    public static readonly StringComparerInvariantCultureIgnoreCase Instance = new StringComparerInvariantCultureIgnoreCase();
}

Finally we can use this new Dictionary type like this:

var data = Dict<string, string>.Create(StringComparerInvariantCultureIgnoreCase.Instance);
data.Add("one", "1");

Console.WriteLine(data["One"]);

Back to our original function, if we wanted to enforce a case-insensitive dictionary we can now use this new Dictionary type and change the signature to:

int DoSomething(Dictionary<string, string, StringComparerInvariantCultureIgnoreCase> data)

Epilogue

Types are a terrific tool to reason about our code, but only if we use them correctly. Throwing around types in an impure, partial language like C# or F# does not mean you're using types in a meaningful way. Consider what your types allow and what they don’t allow. Make illegal states unrepresentable. With precise types it becomes easier to reason about your code. Invariants enforced through the type system means the compiler makes it impossible to create invalid programs.

When you find yourself in need of inspiration for your types, see what other typed languages do, especially OCaml and Haskell. Their type systems are much more powerful than .NET’s, but often you can extract some of the underlying design principles and adapt them to less powerful type systems.

Addendum

Judging by some comments I've read around the web, it seems the goal of this post wasn't clear. It's definitely not about dictionaries concretely. It's about the thought process to recognize flaws in type design and how to fix them so that some interesting invariant can be enforced through the type system. The dictionary types here are merely used to illustrate this process.