Bug squash

Friday, May 23, 2014

Mapping JSON to objects with Fleece

In the last post I introduced Fleece and briefly explained how to use it to map objects to JSON. Sometimes I say “serialize” instead of “map”, but since the actual serialization is done by System.Json I think the right term to use here is “map” (as in mapping an object to a tree of JsonValues), or maybe "encoding" / "decoding".

Fleece can also do the opposite operation: map JSON to objects. There’s already an excellent F# library to deserialize JSON to typed objects: the JSON type provider from FSharp.Data (previously implemented in FSharpx), and so it’s impossible to avoid comparisons.

Some drawbacks of the JSON type provider

Whenever you need to deserialize JSON, I recommend you to try the JSON type provider first. When the conditions are right, nothing beats its simplicity.

But the conditions aren’t always right, and so the JSON type provider is sometimes not the best tool to use to deserialize JSON. Some of its drawbacks are:

1. Not total: throws exceptions when parsing fails. Exceptions hurt composability and your ability to reason about the code. This is mostly just an annoyance, as we can easily work around it with a small higher-order function:

let inline protect f x = 
    try
        Choice1Of2 (f x)
    with e -> Choice2Of2 e

(This is already part of FSharpx by the way).

2. Another annoyance is that the current implementation of the JSON type provider outputs erased types, so the types inferred from the JSON sample are only available in F#, not other languages. So you if you ever need to consume these types from C# or VB.NET you'll have to copy the implicit types explicitly and write the code to map them. This defeats the “no-code” benefit of using a type provider.

3. You’ll also usually want to write explicit types if you want to perform some additional validations. Think for example a NonEmptyList. In fact the type provider mechanism can’t generate records or discriminated unions, so if you want precise typing you have no choice but to write your types and then map them from the output of the type provider, again defeating the “no-code” benefit of using a type provider.

4. If the JSON input is “dynamic”, i.e. its structure is not exactly always the same, it varies depending on some request parameters, etc, then the type provider becomes useless because you can’t rely on a single sample (or a manageable number of samples) to infer the types. In this case you want to start working with the types, not with JSON samples. That’s why FSharp.Data also exposes and documents its underlying JSON parser/reader.

5. The parser generated by the type provider is monolithic, so you can’t “customize” a parser, or introduce validations while parsing/mapping, etc.

I don’t mean to make this post about criticizing the JSON type provider so I won’t go into detail about each of these points. As an exercise to illustrate these points try to use the JSON type provider to build a generic parser for the JSON output from Solr that can be consumed from any .NET language.

The concrete case that motivated me to write Fleece was the Edmunds API, which is very irregular (or “dynamic” depending on the point of view), plus I wanted specific types and needed to consume these API bindings from C#.

In a future post I might explore how to combine Fleece and the JSON type provider to take advantage of the benefits of each one where they are strong.

Fleece: the FromJSON typeclass

Back to Fleece: just as serialization is overloaded with the ToJSON fake typeclass, deserialization is built around a FromJSON typeclass.

The signature of the overloaded fromJSON function is:

fromJSON : JsonValue -> 'a ParseResult

where 'a is the type to decode (it must be overloaded in the FromJSON typeclass) and ParseResult a simple alias to Choice<'a, string>, i.e. you get either the decoded value or an error.

There’s also a convenience function parseJSON: string -> 'a ParseResult that takes a raw JSON string as input.

Let’s start with a simple example. We have this tree of people and their children:

let personJson = """
{
   "name": "John",
   "age": 44,
   "children": [{
       "name": "Katy",
       "age": 5,
       "children": []
   }, {
       "name": "Johnny",
       "age": 7,
       "children": []
   }]
}
"""

We can represent this with the following recursive type:

type Person = {
   Name: string 
   Age: int 
   Children: Person list 
}

Here’s one way to define FromJSON for the Person type:

type Person with 
   static member FromJSON (_: Person) = 
       function 
       | JObject o -> 
           let name = o .@ "name" 
           let age = o .@ "age" 
           let children = o .@ "children" 
           match name, age, children with 
           | Success name, Success age, Success children -> 
               Success { 
                   Person.Name = name 
                   Age = age 
                   Children = children 
               } 
           | x -> Failure (sprintf "Error parsing person: %A" x) 
       | x -> Failure (sprintf "Expected person, found %A" x)

Note the unused parameter of type Person: this is needed to make overloads unique and get the compiler to choose the right overloads.

Other than that, this is a function JsonValue -> Person ParseResult.

JObject is an active pattern identifying a JSON object (as opposed to a string, number, null, etc).

Success and Failure are simple aliases for the constructors Choice1Of2 and Choice2Of2 respectively, giving them more meaningful names. They’re also available as active patterns so we can use them in pattern matching.

The .@ operator tries to get a mapped value from a JSON object by key. That is, you can only call it for types that have a suitable FromJSON defined. Otherwise you get a compile-time error.

There’s also an operator .@? intended for optional keys in a JSON object, i.e. it returns Success None when the key isn't found, whereas .@ returns Failure "key xxx not found"

If you don’t like operators you can use the equivalent named functions jget / jgetopt.

That’s it, now we can parse JSON into Person:

let john : Person ParseResult = parseJSON personJson

Just as with serialization, deserialization in Fleece is total and ad-hoc polymorphic, and we get full compile-time checking. The same arguments about not breaking parametricity apply here.

Now, pattern matching each parsed value for Success/Failure doesn't sound like fun. Since these parsers return Choice<’value, ‘error> we can code monadically instead, so we can focus on the happy path, as Erik Meijer says. We can use the Choice.choose computation expression in FSharpx.Core, or the generic monad computation expression in FSharpPlus. Since Fleece already has a dependency on FSharpPlus, let’s use that:

type Person with 
   static member FromJSON (_: Person) = 
       function 
       | JObject o -> 
           monad { 
               let! name = o .@ "name" 
               let! age = o .@ "age" 
               let! children = o .@ "children" 
               return { 
                   Person.Name = name 
                   Age = age 
                   Children = children 
               } 
           } 
       | x -> Failure (sprintf "Expected person, found %A" x)

This reads much better. The ‘monad’ computation expression gets the compiler to infer the concrete type for the monad, in this case Choice<'a, 'e>.

We can write it even more compactly using applicative functors, though we need a curried constructor for Person:

type Person with 
   static member Create name age children = { Person.Name = name; Age = age; Children = children } 
   static member FromJSON (_: Person) = 
       function 
       | JObject o -> Person.Create <!> (o .@ "name") <*> (o .@ "age") <*> (o .@ "children") 
       | x -> Failure (sprintf "Expected person, found %A" x)

FSharpPlus already comes with overloaded applicative operators for the common applicatives (Choice, Option, etc).

We could go further and wrap that pattern match into a function, but let's just leave it at that.

What’s different about decoding JSON with Fleece is in the .@ operator. Just as with the .= operator does for encoding and the ToJSON typeclass, .@ only works for types that have a suitable FromJSON defined. Otherwise you get a compile-time error.

Not only that, but you also get decoder composition “for free”. Note how in the previous example o .@ "children" is inferring a Person list, which composes the decoder for 'a list with the very same decoder we’re defining for Person.
Fleece includes many decoders for common types, so if you want to decode, say, a Choice<(int * string) list, Choice<decimal option, string>> you don’t really need to do anything, and it’s all statically type-safe, pure and total, not breaking parametricity.

Roundtrips

When you need to both serialize and deserialize a type to JSON, it’s useful to make it roundtrip-safe, i.e. if you call toJSON and then fromJSON you get the original value again.

You can encode this property like this:

let inline roundtrip p = 
    let actual = p |> toJSON |> fromJSON
    actual = Success p

Use FsCheck to check this property, which will run it through a large number of instances for the type you want to check. Fleece does this for primitive types.

Also note the “inline” in this definition, which makes it possible to write generic code without having to specify any particular type. If you hover the mouse in Visual Studio over “roundtrip” it says val roundtrip: ('a -> bool) (requires member ToJSON and member FromJSON and equality), which means that the compiler is inferring the necessary constraints that the type must satisfy.

In the next post we’ll do some deeper analysis of the pros and cons of this typeclass-based approach to JSON encoding.

Tuesday, May 13, 2014

Mapping objects to JSON with Fleece

In the last post I briefly explained some of the consequences of breaking parametricity, in particular for JSON serialization. I used JSON serialization here as a particular case only to introduce Fleece, but breaking parametricity anywhere has similar consequences.

So how can we serialize JSON without breaking parametricity?

The simplest thing we could do is to write multiple monomorphic Serialize functions, one for each type we want to serialize, e.g:

class Person {
    public readonly int Id;
    public readonly string Name;

    public Person(int id, string name) {
        Id = id;
        Name = name;
    }
}

string Serialize(Person p) {
    return @"{""Id"": " + p.Id + @", ""Name"": """ + p.Name + @"""}";
}

You’ll notice that this code doesn’t escape the Name value and therefore will generate broken JSON in general. Why doesn’t this happen with the Id? Types! The lowly int type has restricted the values it can have and therefore we can statically assure that it doesn’t need any escaping. Cool, huh?

So how do we solve the escaping problem for the Name field? With more types, of course! Instead of handling JSON as strings, we create some types to model more precisely what JSON can do, and let the compiler check that for us. System.Json does just that, so let’s use it. Strictly speaking it’s still a too loose but it’s decent enough. Our code becomes:

JsonObject Serialize(Person p) {
    return new JsonObject {
        {"Id", p.Id},
        {"Name", p.Name},
    };
}

Now we don’t have to worry about encoding issues, JsonObject takes care of that. Also the function now returns a JsonObject instead of a string, which allows us to safely compose the result into larger JSON objects. When we’re done composing, we call ToString() to get the serialized JSON.
One way to look at this is that we’re defining a tiny language here, with JsonValue.ToString() being one possible interpreter.
I’m sorry if these explanations sound a bit patronizing, but I really want to emphasize how types make our job easier.

Let’s raise the abstraction bar a bit and write a function that serializes an IEnumerable<T>. Since we don’t want to break parametricity, we must ensure that this will work for any type T such that T can itself be serialized. But we don’t know how to turn any arbitrary type T into a JsonValue, and we’ve ruled out runtime type inspection, so we have to pass the conversion explicitly:

JsonArray Serialize<T>(IEnumerable<T> list, Func<T, JsonValue> serializerT) {
    return new JsonArray(list.Select(serializerT));
}

That works, but it’s not very nice. We have to compose these serializer functions manually! Every time we want to serialize an IEnumerable<T> we’ll have to look up where the function to serialize T is. Is there any way to avoid that?

We could pass the conversion by using interfaces. If we have:

public interface IToJson {
    JsonValue ToJson();
}

We could make Person implement IToJson simply by moving the Serialize(Person p) function above to the definition of Person:

class Person: IToJson {
    public readonly int Id;
    public readonly string Name;

    public Person2(int id, string name) {
        Id = id;
        Name = name;
    }

    public JsonValue ToJson() {
        return new JsonObject {
            {"Id", Id},
            {"Name", Name},
        };
    }
}

Then serializing a list can be restricted to types implementing IToJson:

JsonValue ToJson<T>(this IEnumerable<T> list) where T: IToJson {
    return new JsonArray(list.Select(x => x.ToJson()));
}

But this only works for types we control. We can’t make IEnumerable<T> implement IToJson. We can wrap it:

class JsonList<T>: IToJson where T: IToJson {
    public readonly IEnumerable<T> List;

    public JsonList(IEnumerable<T> list) {
        List = list;
    }

    public JsonValue ToJson() {
        return new JsonArray(List.Select(x => x.ToJson()));
    }
}

But this is inconvenient. Suppose we have a class Company with a list of Person as employees. Here’s how serialization would look like:

class Company: IToJson {
    public readonly string Name;
    public readonly List<Person> Employees;

    public Company(string name, List<Person> employees) {
        Name = name;
        Employees = employees;
    }

    public JsonValue ToJson() {
        return new JsonObject {
            {"Name", Name},
            {"Employees", new JsonList<Person>(Employees).ToJson()}
        };
    }
}

That’s not much better than manually inlining the code for JsonList.ToJson(). Ideally, we just want to call a simple function ToJson and have the compiler somehow figure out if there’s a suitable function declared for the type of the argument. Overloading would be great if the generic ToJson function for lists could somehow recursively look into what overloads are defined if there’s a match for T, at compile time.

As far as I know C# can’t do that but it turns out that F# can.

Enter Fleece

With Fleece, converting Person and Company to JSON goes like this, without implementing any IToJson interface:

type Person with
   static member ToJSON (x: Person) =
       jobj [ 
           "Id" .= x.Id
           "Name" .= x.Name
       ]

type Company with
    static member ToJSON (x: Company) =
        jobj [
            "Name" .= x.Name
            "Employees" .= x.Employees
        ]

let company = { Company.Name = "Double Fine"; Employees = [{ Employee.Id = 1; Name = "Tim Schafer"}] }

let jsonAsString = (toJSON company).ToString()

Fleece here is statically ensuring that the type of the argument of toJSON has a definition of a static member ToJSON. It also checks statically that every type to the right of a .= expression has a suitable definition of ToJSON, and chooses it automatically. If there isn’t a suitable definition, you get a compile-time error.

Moreover, it "composes" (probably not the right term to use here) the serializers automatically at compile-time. In previous examples, we had to compose the list serializer with the Person serializer “manually” to serialize a concrete list of persons. Fleece defines a parametric serializer for lists, and then the compiler composes this serializer with the serializer for Person we have defined here.

What makes this possible in F# are inlines and static member constraints. Here’s how the definition of ToJSON for a list looks like:

type ToJSONClass with
    static member inline ToJSON (x: 'a list) =
        JArray (listAsReadOnly (List.map toJSON x))

If you hover in Visual Studio over the ToJSON keyword in this definition, it says ToJSON: 'a list -> JsonValue (requires member ToJSON). This means that F# is inferring that the type 'a must have a ToJSON member in order to call toJSON on a list of 'a.

This is equivalent to Haskell’s:

instance (ToJSON a) => ToJSON [a] where
   toJSON = Array . V.fromList . map toJSON

Haskell has dedicated syntax for this which makes the constraint explicit compared to F#.

This has long been used in F# for generic math. F# also has an ad-hoc, limited form of this in the EqualityConditionalOn attribute, where the equality of a type depends on a type argument having “equality”. So for example, this in F#:

type MyBox<[<EqualityConditionalOn>] 'a> = ...

Roughly corresponds to Haskell’s:

instance (Eq a) => Eq (MyBox a) where ...

With the difference that F# can sometimes derive equality automatically depending on the equality of type arguments.

The bottom line here is that the toJSON function is overloaded only for supported types, instead of being parametrically polymorphic, so it does not break parametricity. This overloading is also called ad-hoc polymorphism. Haskell achieves ad-hoc polymorphism thanks to typeclasses, and this technique is based on that. In fact, Fleece is based on Aeson, the de-facto standard JSON library for Haskell.

Anton Tayanovskyy had also prototyped a similar typeclass-based JSON library for F# some time ago.

Fleece only scratches the surface of what’s possible with inline-encoded typeclasses (a.k.a. “type methods”). I recommend reading Gustavo León’s blog to learn more about this technique and checking out the FsControl and FSharpPlus projects.

Refactor safety

Now, in my last post I claimed that this approach would save code by saving tests. But here we see that we have to write serializer code for each type that we want to serialize, unlike reflection-based libraries! That’s quite a bit of boilerplate! I could argue that it’s a small price to pay to get code you can reason about, but the truth is this is pretty trivial code that could be easily generated at compile-time.
In fact, Haskell can do that, either with Template Haskell or the more recent GHC.Generics, in which case you only need to make your type derive Generic and declare the typeclass instance. The actual serialization code is filled in by the compiler.

But there is a bigger problem with both reflection- and GHC.Generics-derived serialization: they tie JSON output to identifiers in your code. So whenever you rename some identifier (for example a field name in a record) in a type that is used to model some JSON output, you’re implicitly changing your JSON schema. You’re making a breaking change in the output of the program in what’s normally a safe operation. To quote a tweet:

any reason, other than a fondness of imminent schema and versioning issues, to use reflection-based JSON serializer in F#?
— Lev Gorodinski (@eulerfx) April 1, 2014

Or more bluntly:

OH on #scalaz: “If I walk up to your code and rename all the fields [or anything!] and the program changes, I think you should go to jail.”
— Sukant Hajra (@shajra) January 9, 2014

Still, it might be useful to start prototyping with reflection-based serializer while being aware of these issues, then switch to explicit serialization once the initial prototype stage is done. Some haskellers do this with GHC.Generics-derived serialization (thanks to Jonathan Fischoff for confirming this on #haskell IRC).

Even in C#, without these F# fake typeclases, you should be very wary of breaking parametricity and the cost on maintenance it implies. In my opinion, the boilerplate is worth it to avoid breaking parametricity.

In the next post we’ll see how to use Fleece to map in the opposite direction: JSON to objects. We’ll also see some of the drawbacks of these fake typeclasses.

Tuesday, May 6, 2014

On parametric polymorphism and JSON serialization

A couple of months ago I wrote Fleece, a JSON mapper for F#. What does that mean? It provides a library of functions to help map a JSON value tree onto .NET typed instances. And some more functions to do the inverse operation: map typed .NET values onto a JSON tree.
Fleece delegates the actual JSON parsing and serialization to System.Json.

But what’s the purpose of Fleece? Why another JSON library? Why F#? Is Fleece merely another case of Not-Invented-Here Syndrome? After all, anyone can serialize things easily with something like ServiceStack.Text, for example:

class Person {
    public int Id { get; set; }
    public string Name { get; set; }
}

var john = ServiceStack.Text.JsonSerializer.SerializeToString(new Person {Id = 1, Name = "John"});

Right?

However there’s a problem here. How do we know that the code above works for this definition of Person? Well, of course we can just run it and get the expected result stored in the john variable. But can you be sure that it will work for this Person type, without running it? It seems obvious that it will work, otherwise the library wouldn’t be useful, would it?

And yet, if we slightly change the definition of Person:

class Person {
    public readonly int Id;
    public readonly string Name;

    public Person(int id, string name) {
        Id = id;
        Name = name;
    }
}

var john = ServiceStack.Text.JsonSerializer.SerializeToString(new Person(id: 1, name: "John"));

It will compile, but john will contain the string "{}", i.e. an empty object. Definitely not what anyone would want! Yes, set the magic IncludePublicFields flag and it works, but why do we have to guess this? Would it make any difference if it throwed an exception instead of generating an empty JSON object? We spend a lot of time compiling things, can’t the compiler check this for us?

Even worse, SerializeToString will happily accept any instance of any type, even if it doesn’t make any sense:

var json = ServiceStack.Text.JsonSerializer.SerializeToString(new Func<int, int>(x => x + 1));
Console.WriteLine(json); // empty string

By the way, don’t think this is to bash ServiceStack in particular. Most JSON libraries for .NET have this problem:

static void NewtonsoftJson() {
    var json = Newtonsoft.Json.JsonConvert.SerializeObject(new Func<int, int>(x => x + 1));
    Console.WriteLine(json);

    // {"Delegate":{},"method0":{"Name":"<Newtonsoft>b__3","AssemblyName":"SerializationLies, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null","ClassName":"SerializationLies.Program","Signature":"Int32 <Newtonsoft>b__3(Int32)","Signature2":"System.Int32 <Newtonsoft>b__3(System.Int32)","MemberType":8,"GenericArguments":null}}
}

DataContractJsonSerializer at least acknowledges its partiality and throws:

static void DataContract() {
    using (var ms = new MemoryStream()) {
        new DataContractJsonSerializer(typeof(Func<int, int>)).WriteObject(ms, new Func<int, int>(x => x + 1));
        Console.WriteLine(Encoding.ASCII.GetString(ms.ToArray()));
    }

/* throws:             
Unhandled Exception: System.Runtime.Serialization.SerializationException: DataContractJsonSerializer does not support the setting of the FullTypeName of the object to be serialized to a value other than the default FullTypeName. 
Attempted to serialize object with full type name 'System.DelegateSerializationHolder' and default full type name 
'System.Func`2[[System.Int32, mscorlib, Version=4.0.0.0,Culture=neutral, PublicKeyToken=b77a5c561934e089],[System.Int32, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089]]'.
*/
}

And by extension, pretty much all web frameworks (probably the most common producers of JSON) have this problem too:

ASP.NET MVC 4

public class HomeController : Controller {
    public ActionResult Index() {
        return Json(new Func<int, int>(x => x + 1), JsonRequestBehavior.AllowGet);
    }
}

Throws System.InvalidOperationException: A circular reference was detected while serializing an object of type 'System.Reflection.RuntimeModule'.

And therefore also Figment, since I based it on ASP.NET MVC:

get "/" (json ((+) 1))

In ASP.NET Web API your controllers can return any object and the serializers will try to serialize it according to the result of content negotiation. In the case of JSON, the default serializer is Newtonsoft so you end up with the same result I showed above for Newtonsoft.

In NancyFX:

public class Home : NancyModule {
    public Home() {
        Get["/home"] = _ => new Func<int, int>(x => x + 1);
    }
}

Visit /home.json and get an InvalidOperationException: Circular reference detected.

In ServiceStack:

[Route("/home")]
public class Home { }
public class HomeService : Service {
    public object Any(Home h) {
        return new Func<int, int>(x => x + 1);
    }
}

Visit /home?format=json and you get:

{"Method":{"__type":"System.Reflection.RuntimeMethodInfo, mscorlib","Name":"b__0","DeclaringType":"ServiceStackTest.HomeService, ServiceStackTest","ReflectedType":"ServiceStackTest.HomeService, ServiceStackTest","MemberType":8,"MetadataToken":100663310,"Module":{"__type":"System.Reflection.RuntimeModule, mscorlib","MDStreamVersion":131072,"FullyQualifiedName":"G:\\Windows\\Microsoft.NET\\Framework\\v4.0.30319\\Temporary ASP.NET Files\\root\\c11d5664\\d0efee07\\assembly\\dl3\\51c16a64\\5cdba327_3150cf01\\ServiceStackTest.dll","ModuleVersionId":"125112c7a82d4c2099718b901637e950","MetadataToken":1,"ScopeName":"ServiceStackTest.dll","Name":"ServiceStackTest.dll","Assembly":{"__type":"System.Reflection.RuntimeAssembly, mscorlib","CodeBase":"file:///g:/prg/ServiceStackTest/ServiceStackTest/bin/ServiceStackTest.DLL","FullName":"ServiceStackTest, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null","DefinedTypes":["ServiceStackTest.Global, ServiceStackTest","ServiceStackTest.AppHost, ServiceStackTest","Service{"ResponseStatus":{"ErrorCode":"InvalidCastException","Message":"Unable to cast object of type 'System.Security.Policy.Zone' to type 'System.Security.Policy.Url'.","StackTrace":"   at ServiceStack.Text.Common.WriteType`2.WriteProperties(TextWriter writer, Object value)\r\n   at ServiceStack.Text.Common.WriteListsOfElements`1.WriteIEnumerable(TextWriter writer, Object oValueCollection)\r\n   at ServiceStack.Text.Common.WriteType`2.WriteProperties(TextWriter writer, Object value)\r\n   at ServiceStack.Text.Common.WriteType`2.WriteAbstractProperties(TextWriter writer, Object value)\r\n   at ServiceStack.Text.Common.WriteType`2.WriteProperties(TextWriter writer, Object value)\r\n   at ServiceStack.Text.Common.WriteType`2.WriteAbstractProperties(TextWriter writer, Object value)\r\n   at ServiceStack.Text.Common.WriteType`2.WriteProperties(TextWriter writer, Object value)\r\n   at ServiceStack.Text.Common.WriteType`2.WriteAbstractProperties(TextWriter writer, Object value)\r\n   at ServiceStack.Text.Common.WriteType`2.WriteProperties(TextWriter writer, Object value)\r\n   at ServiceStack.Text.JsonSerializer.SerializeToStream(Object value, Type type, Stream stream)\r\n   at ServiceStack.Text.JsonSerializer.SerializeToStream[T](T value, Stream stream)\r\n   at ServiceStack.Serialization.JsonDataContractSerializer.SerializeToStream[T](T obj, Stream stream)\r\n   at ServiceStack.Host.ContentTypes.b__5(IRequest r, Object o, Stream s)\r\n   at ServiceStack.Host.ContentTypes.<>c__DisplayClass2.b__1(IRequest httpReq, Object dto, IResponse httpRes)\r\n   at ServiceStack.HttpResponseExtensionsInternal.WriteToResponse(IResponse response, Object result, ResponseSerializerDelegate defaultAction, IRequest request, Byte[] bodyPrefix, Byte[] bodySuffix)"}}

You get the point.

I don’t think anyone really wants this, but what are the alternatives?

Many would say “just write a test for it”, but that would mean writing a test for every type we serialize, hardly a good use of our time and very easy to forget. Since we’re working in a statically-typed language, can’t we make the compiler work for us?

The first step is understanding why this is really wrong. When you write a function signature like this in C#:

string Serialize<T>(T obj)

when interpreted under the Curry-Howard isomorphism this is actually saying: “I propose a function named ‘Serialize’ which turns any value of any type into a string”. Which is a blatant lie when implemented, since we’ve seen that you can’t get a meaningful string out of many types. The only logical implementation for such a function, without breaking parametricity, is a constant string.

Well, in .NET we could also implement it by calling ToString() on the argument, but you’ll notice that Object.ToString() has essentially the same signature as our Serialize above, and therefore the same arguments apply.

And you have to give up side effects too, otherwise you could implement this function simply by ignoring the argument and reading a string from a file. Runtime type inspection (i.e. reflection) too, as it breaks parametricity.

These restrictions and the property of parametricity are important because they enable code you can reason about, and very precisely. They will help you refactor towards more general code. You can even derive theorems just from their signatures.

I won't explain Curry-Howard or parametricity here, but if you're not familiar with these concepts I highly recommend following the links above. I found them to be very important concepts, especially in statically-typed languages.

You may think that all this reasoning and theorems is only for academics and doesn’t apply to your job, but we’ll see how by following these restrictions we can get the compiler to check what otherwise would have taken a test for each serialized type. This means less code and less runtime errors, a very real benefit! The more you abide by these restrictions, the more the compiler and types will help you.

The opposite of this programming by reasoning is programming by coincidence. You start "trying out things" to somehow hit those magical lines of code that do what you wanted to do... at least for the inputs you have considered.

So to sum up: use of unconstrained type parameters (generics) in concrete methods/classes for things that don’t truly represent “for all types” is a logic flaw.

Now that I briefly and badly explained the problem, what not to do and why, in the next post we'll see what we can do.

Thursday, March 20, 2014

Better dictionary types

Say you need to call a function like:

int DoSomething(Dictionary<string, string> data)

Do you know what kind of data you should feed to that function? Evidently, the keys and values of the dictionary are strings, but what should the comparer for the keys be? Should keys be case-sensitive or case-insensitive? Does it even matter for this function?

For example, if DoSomething was processing HTTP headers, it would need to receive a case-insensitive dictionary, as HTTP header names are case-insensitive. Yet the type doesn’t enforce it, it doesn’t even give us a hint.

How do other typed languages deal with this? Let’s take a look at Haskell first.

Haskell’s Data.Map requires the key type to have an instance for the Ord typeclass. Since you can’t have more than one typeclass instance per type, there is no possible ambiguity about how keys are compared. This property of having at most one instance per typeclass per type is called “coherence” and it’s a good property to have in a typeclass system as it keeps things simple, both for the programmer and the compiler. If you wanted a case-insensitive Map, you’d use Data.CaseInsensitive and your concrete Map type would reflect its case-insensitive behavior, e.g.

import qualified Data.Map as M
import qualified Data.CaseInsensitive as CI ( mk )

main = do
  let m = M.fromList [(CI.mk "one", 1)]
  print $ M.lookup (CI.mk "One") m

Here the type of m is Map (CI String) Integer. You can’t confuse it with the case-sensitive Map String Integer because the compiler simply won’t let you. They’re different types!

.NET doesn’t have typeclasses but we could achieve something similar in this case if we could redesign System.Collections.Generic.Dictionary and remove the constructors that admit an instance of IEqualityComparer<TKey> . That means it would always use the default comparer for the key type. And if we wanted a case-insensitive dictionary, we’d just wrap our string keys in a type implementing case-insensitive equality, e.g.:

sealed class CaseInsensitiveString {
    public readonly string Value;

    public CaseInsensitiveString(string value) {
        Value = value;
    }

    public override bool Equals(object obj) {
        if (ReferenceEquals(null, obj)) return false;
        if (ReferenceEquals(this, obj)) return true;
        if (obj.GetType() != GetType()) return false;
        return StringComparer.InvariantCultureIgnoreCase.Equals(Value, ((CaseInsensitiveString)obj).Value);
    }

    public override int GetHashCode() {
        return (Value != null ? StringComparer.InvariantCultureIgnoreCase.GetHashCode(Value) : 0);
    }

    public override string ToString() {
        return Value;
    }

    public static implicit operator CaseInsensitiveString (string a) {
        return new CaseInsensitiveString(a);
    }
}

var data = new Dictionary<CaseInsensitiveString, string> {
    {"one", "1"},
};

Console.WriteLine(data["One"]);

But since Dictionary has constructors that allow overriding the equality comparer, this is not enough. In other words, the type Dictionary<CaseInsensitiveString, string> does not guarantee that the dictionary is case-insensitive. We can easily work around this by creating a new type that limits Dictionary’s constructors:

sealed class Dictionary2<TKey, TValue>: Dictionary<TKey, TValue> {
    public Dictionary2() {}
    public Dictionary2(int capacity) : base(capacity) {}
    public Dictionary2(IDictionary<TKey, TValue> dictionary) : base(dictionary) {}
}

And now we can guarantee that a Dictionary2<CaseInsensitiveString, string> will do key equality as defined by CaseInsensitiveString and thus it will be case-insensitive.

There is one downside to this solution: we need to wrap all the keys when we want a different comparer. This means more allocations, less performance. Haskell can avoid this penalty by making the wrapper a newtype (not the particular case of Data.CaseInsensitive though!), which we don’t have in .NET. Can we do better?

The main problem here was that the type doesn’t uniquely determine an equality comparer. If we don’t make that a part of the key type, then couldn’t we make the comparer part of the dictionary type?

This is precisely what ocaml-core does. The Map type is determined by types of the map’s keys and values, and the comparison function used to order the keys. The book Real World OCaml explains how including the comparator in the type is important because certain operations that work on multiple maps require that they have the same comparison function. As we’ve seen, System.Collections.Generic.Dictionary can’t enforce that.

Following that same design principle, now instead of forbidding all constructors that accept an equality comparer, we do the opposite: forbid all constructors that don’t take a comparer, thus making it always explicit, and include the comparer as an additional type parameter:

sealed class Dictionary<TKey, TValue, TEqualityComparer> : Dictionary<TKey, TValue> where TEqualityComparer : IEqualityComparer<TKey> {
    public Dictionary(TEqualityComparer comparer) : base(comparer) {}
    public Dictionary(int capacity, TEqualityComparer comparer) : base(capacity, comparer) {}
    public Dictionary(IDictionary<TKey, TValue> dictionary, TEqualityComparer comparer) : base(dictionary, comparer) {}
}

A small helper to aid with type inference in C#:

static class Dict<TKey, TValue> {
    public static Dictionary<TKey, TValue, TEqualityComparer> Create<TEqualityComparer>(TEqualityComparer comparer) where TEqualityComparer: IEqualityComparer<TKey> {
        return new Dictionary<TKey, TValue, TEqualityComparer>(comparer);
    }
}

Another small helper class to ease the definition of comparer types based on the ones we already have:

class DelegatingEqualityComparer<T>: IEqualityComparer<T> {
    private readonly IEqualityComparer<T> comparer;

    public DelegatingEqualityComparer(IEqualityComparer<T> comparer) {
        this.comparer = comparer;
    }

    public bool Equals(T x, T y) {
        return comparer.Equals(x, y);
    }

    public int GetHashCode(T obj) {
        return comparer.GetHashCode(obj);
    }
}

Now we can easily create new comparer types like this:

sealed class StringComparerInvariantCultureIgnoreCase: DelegatingEqualityComparer<string> {
    private StringComparerInvariantCultureIgnoreCase() : base(StringComparer.InvariantCultureIgnoreCase) {}

    public static readonly StringComparerInvariantCultureIgnoreCase Instance = new StringComparerInvariantCultureIgnoreCase();
}

Finally we can use this new Dictionary type like this:

var data = Dict<string, string>.Create(StringComparerInvariantCultureIgnoreCase.Instance);
data.Add("one", "1");

Console.WriteLine(data["One"]);

Back to our original function, if we wanted to enforce a case-insensitive dictionary we can now use this new Dictionary type and change the signature to:

int DoSomething(Dictionary<string, string, StringComparerInvariantCultureIgnoreCase> data)

Epilogue

Types are a terrific tool to reason about our code, but only if we use them correctly. Throwing around types in an impure, partial language like C# or F# does not mean you're using types in a meaningful way. Consider what your types allow and what they don’t allow. Make illegal states unrepresentable. With precise types it becomes easier to reason about your code. Invariants enforced through the type system means the compiler makes it impossible to create invalid programs.

When you find yourself in need of inspiration for your types, see what other typed languages do, especially OCaml and Haskell. Their type systems are much more powerful than .NET’s, but often you can extract some of the underlying design principles and adapt them to less powerful type systems.

Addendum

Judging by some comments I've read around the web, it seems the goal of this post wasn't clear. It's definitely not about dictionaries concretely. It's about the thought process to recognize flaws in type design and how to fix them so that some interesting invariant can be enforced through the type system. The dictionary types here are merely used to illustrate this process.

Friday, February 14, 2014

Generating immutable instances in C# with FsCheck

If you do any functional programming in C# you’ll probably have lots of classes that look like this:

class Person {
    private readonly string name;
    private readonly DateTime dateOfBirth;

    public Person(string name, DateTime dateOfBirth) {
        this.name = name;
        this.dateOfBirth = dateOfBirth;
    }

    public string Name {
        get { return name; }
    }

    public DateTime DateOfBirth {
        get { return dateOfBirth; }
    }

    // equality members...
}

I.e. immutable classes, similar to F# records.
Now let’s say we want to test some property involving this Person class. For example, let’s say we have a serializer:

interface IPersonSerializer {
    string Serialize(Person p);
    Person Deserialize(string source);
}

Never mind the unsafety of this serializer, we want to test that roundtrip serialization works as expected. So we grab FsCheck and write:

Spec.ForAny((Person p) => serializer.Deserialize(serializer.Serialize(p)).Equals(p))
    .QuickCheckThrowOnFailure();

Only to be greeted with:

System.Exception: Geneflect: type not handled Person

Ok, so we write the generator explicitly:

var personGen =
    from name in Any.OfType<string>()
    from dob in Any.OfType<DateTime>()
    select new Person(name, dob);

Spec.For(personGen, p => serializer.Deserialize(serializer.Serialize(p)).Equals(p))
    .QuickCheckThrowOnFailure();

And all is fine.

But the generator code is trivially derivable from the class definition. And when you have lots of immutable classes, this boilerplate becomes really annoying. Couldn’t we automatize that somehow?

As it turns out, FsCheck already does this for F# records, using reflection. With a bit of code we can extend the reflection-based generator (built into FsCheck) to make it generate instances of immutable classes. With this change, we can go back to writing Spec.ForAny and FsCheck will automatically derive the generator for our Person class!

The restrictions on the classes to make them generable by FsCheck are:

Must be a concrete class. No interfaces or abstract classes; FsCheck can’t guess a concrete implementation.
It has to have only one public constructor. Otherwise, which one would FsCheck choose?
All public fields and properties must be readonly. Otherwise, it creates the ambiguity of which ones to set and which ones not.
Must not be recursively defined. FsCheck doesn’t generate recursively defined F# records by reflection either. I assume this is to keep the implementation simple.
Must not have type parameters. This restriction could probably be relaxed, but since I haven’t needed it so far, I decided to play it safe. Left as an exercise for the reader :-)

This is available in FsCheck as of version 0.9.2.

Happy FsChecking in C# and VB.NET !