Monday, October 21, 2013

Towards a NuGet dependency monitor with OData and F#

Every now and then I bump into the Hackage Dependency Monitor and wish there was something similar for NuGet. Node.js also has something like this with David. As a maintainer of several packages, it would come in handy if something would tell me when a dependency gets outdated, instead of having to find out through a bug report or reading the new release announcement by coincidence.

Since NuGet exposes its repository through an OData feed, and .NET has some good facilities to query these, I thought it’d be fun to try to implement the core of a dependency monitor.

Exploration

I started out with LINQPad, which makes exploration as easy as it gets. Click “add connection”, select the OData driver, and enter the feed URL https://nuget.org/api/v2 . That’s it. Open the connection and it shows the available tables (only "Packages" in this case) and their schema:

For example, we can find out how many projects are hosted on Github:

Packages 
.Where(x => x.IsLatestVersion && x.IsAbsoluteLatestVersion && x.ProjectUrl.Contains("github")) 
.Count()

Here are the figures for the most popular project hosting services (for the packages that have the ProjectUrl property defined):

Github: 5378
Codeplex: 2658
Google: 483
Bitbucket: 415

We can also see that dependencies are stored as a string… what does the content look like? Let’s find out:

This means that we’ll have to parse this dependency string and run another query to fetch and analyze dependencies. So let’s switch to the F# REPL to write some code more comfortably.

Some helpers

NuGet has some functions that will come in handy to parse these dependency versions:

#r "System.Xml.Linq"
#r @"g:\prg\SolrNet\lib\NuGet.exe"

open NuGet

let split (c: char) (x: string) = x.Split c

let parseDependencies : string -> seq<string * IVersionSpec> = 
    split '|' >> Seq.map (split ':' >> fun x -> x.[0], VersionUtility.ParseVersionSpec x.[1])

To query the NuGet feed we’ll use the OData type provider.

Unfortunately, WCF Data Services doesn’t support Contains(), which we need to get all the dependencies for a package in a single query. As an example, running this on LINQPad:

Packages.Where(x => new[] {"GDataDB", "FSharpx.Core"}.Contains(x.Id))

throws NotSupportedException: the method ‘Contains’ is not supported.

The type provider uses the same query translator underneath so it has the same limitation. OData v3 supports the Any() operator, but it seems that the NuGet feed is OData v2, so that doesn’t work either. Anyway, we can implement this with some quotation manipulation:

open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns

let inList (memberr: Expr<'a -> 'b>) (values: 'b list) : Expr<'a -> bool> =
    match memberr with
    | Lambda (_, PropertyGet _) -> 
        match values with
        | [] -> <@ fun _ -> true @>
        | _ -> values |> Seq.map (fun v -> <@ fun a -> (%memberr) a = v @>) |> Seq.reduce (fun a b -> <@ fun x -> (%a) x || (%b) x @>)
    | _ -> failwith "Expression has to be a member"

You can compare this with the equivalent C# code, it really shows how expression splicing makes the F# code much clearer.

Also, just for kicks, let’s run the query asynchronously. This doesn’t make much difference now as we’ll run this in the REPL, but it would be useful if this were to be used in a server.

#r "System.Data.Services.Client"
#r "FSharp.Data.TypeProviders"

open System.Linq
open System.Data.Services.Client

let execQueryAsync (q: _ DataServiceQuery) = 
    Async.FromBeginEnd(q.BeginExecute, q.EndExecute)

Main code

Ok, enough with the helper functions. Here’s the main function:

type nuget = Microsoft.FSharp.Data.TypeProviders.ODataService<"https://nuget.org/api/v2">
type Package = nuget.ServiceTypes.V2FeedPackage
let ctx = nuget.GetDataContext()

let checkDependencies packageId =
    async {
        let packagesQuery = 
            query {
                for p in ctx.Packages do
                where (p.Id = packageId && p.IsAbsoluteLatestVersion && p.IsLatestVersion)
                select p
            } :?> DataServiceQuery<Package>
        let! packages = execQueryAsync packagesQuery

        let package = Seq.exactlyOne packages

        let deps = parseDependencies package.Dependencies |> Seq.toList
        let depIds = List.map fst deps
        let depsQuery = 
            query {
                for p in ctx.Packages do
                where (((%(inList <@ fun (x: Package) -> x.Id @> depIds)) p) && p.IsAbsoluteLatestVersion && p.IsLatestVersion)
                select p
            } :?> DataServiceQuery<Package>
        let! depPackages = execQueryAsync depsQuery
        let depPackagesList = Seq.toList depPackages
        let satisfies =
            deps |> List.map (fun (depId, version) -> 
                                let depPackage = depPackagesList |> Seq.find (fun p -> p.Id = depId)
                                let semVersion = SemanticVersion.Parse depPackage.Version
                                depId, version, version.Satisfies semVersion)
        return satisfies
    }

This returns a list of tuples where the first element is the package ID of the dependency, the required version for that dependency, and a boolean saying if the dependency is outdated (false) or not (true).

Let’s try an example:

checkDependencies "GDataDB" |> Async.RunSynchronously

Gives:

[("Google.GData.Client", [2.1.0.0] {IsMaxInclusive = true;
                                      IsMinInclusive = true;
                                      MaxVersion = 2.1.0.0;
                                      MinVersion = 2.1.0.0;}, false);
   ("Google.GData.Extensions", [2.1.0.0] {IsMaxInclusive = true;
                                          IsMinInclusive = true;
                                          MaxVersion = 2.1.0.0;
                                          MinVersion = 2.1.0.0;}, false);
   ("Google.GData.Documents", [2.1.0.0] {IsMaxInclusive = true;
                                         IsMinInclusive = true;
                                         MaxVersion = 2.1.0.0;
                                         MinVersion = 2.1.0.0;}, false);
   ("Google.GData.Spreadsheets", [2.1.0.0] {IsMaxInclusive = true;
                                            IsMinInclusive = true;
                                            MaxVersion = 2.1.0.0;
                                            MinVersion = 2.1.0.0;}, false)]

Uh-oh, I better update those dependencies!

Conclusion

So this looks simple enough, right? However, I consider this just a spike, it’s not really robust. What happens if the package doesn’t exist? Exception. Ill-defined dependencies? Exception.

Still, it shouldn’t be too hard to make this more robust, then put it in a web server and done! Easier said than done ;-)

Also, it seems that the NuGet v3 API won't be based on OData so this whole experiment might need to be rewritten soon.

Anyway, here's the entire code for this post.

Appendix: WCF Data Services criticism

I generally dislike expression-based translators (i.e. IQueryable) because they’re usually eminently partial, i.e. you have to guess what’s supported, read the docs very carefully, or run your code in a test and see what happens. Otherwise you get exceptions everywhere. The compiler can’t do much and it’s never quite clear what will execute on the client and what will be translated and executed on the server. This hurts your ability to reason about the code, which in turn means more programming by coincidence.

WCF Data Services takes this to pathological levels. While exploring the NuGet feed in LINQPad I found many simple expressions that should have worked but didn’t. A few examples:

Packages.Where(x => new[] {"GDataDB", "FSharpx.Core"}.Contains(x.Id))

This is the one I mentioned earlier. I don’t see why the expression translator couldn’t do what I did and compile this to a chain of OR’ed expressions.

Even the simplest projection fails:

Packages.Select(x => x.Id)

Throwing NotSupportedException: Individual properties can only be selected from a single resource or as part of a type. Specify a key predicate to restrict the entity set to a single instance or project the property into a named or anonymous type.

I have no idea what the first part of that error means, but projecting to an anonymous type works:

Packages.Select(x => new {x.Id})

The official explanation for this is that the OData protocol doesn’t support it, but again, it would seem that this is the job of the expression translator and the protocol or server side of things has little to do with it.

By the way, if you happen to add a Take() operator you get a totally different exception:

Packages.Select(x => x.Id).Take(20)

Throws an InvalidCastException: Unable to cast object of type 'System.Data.Services.Client.NavigationPropertySingletonExpression' to type 'System.Data.Services.Client.ResourceSetExpression'.

Or add a condition and you get yet another different error:

Packages
.Where(x => x.IsLatestVersion && x.IsAbsoluteLatestVersion)
.Select(x => x.Id)
.Take(10)

NotSupportedException: Can only specify query options (orderby, where, take, skip) after last navigation.

Which is incorrect, since replacing the projection in the above expression with an anonymous type works fine.

In this other example, the library doesn’t process negated conditions correctly, which causes a cryptic server-side exception:

Packages
.Where(x => !x.ProjectUrl.Contains("google"))
.Select(x => new {x.Id})

“An error occurred while processing this request. InnerException: Rewriting child expression from type 'System.Nullable`1[System.Boolean]' to type 'System.Boolean' is not allowed, because it would change the meaning of the operation. If this is intentional, override 'VisitUnary' and change it to allow this rewrite.”

The generated URL from this expression is: https://nuget.org/api/v2/Packages()?$filter=not substringof('google',ProjectUrl)&$select=Id which apparently isn’t supported server-side.

But change the condition from using the “not” operator to “== false” and everything is magically fixed:

Packages
.Where(x => x.ProjectUrl.Contains("google") == false)
.Select(x => new {x.Id})

Generated URL: https://nuget.org/api/v2/Packages()?$filter=substringof('google',ProjectUrl) eq false&$top=10&$select=Id

These two expressions are logically equivalent, but one works and the other one fails:

Packages
.Where(x => x.IsLatestVersion)
.Select(x => new {x.IsLatestVersion})
Packages
.Select(x => new {x.IsLatestVersion})
.Where(x => x.IsLatestVersion)

The second one fails with: “NotSupportedException: The filter query option cannot be specified after the select query option.”.

These are all very simple expressions and I found all of these issues in about one hour of experimentation in LINQPad (for reference, LINQPad v4.47.02 using WCF Data Services 5.5), so be very careful and test every single call if you have to use WCF Data Services. And keep in mind that if you use the OData F# type provider, you’re also using WCF Data Services, so the same warning applies.

3 comments:

jackfoxy said...

You really put your finger on problems with IQueryable, especially the inability of users to determine what executes where. Still, it's useful to have a common framework for IO APIs given the general lack of documentation for so many systems we want to interact with, even if every implementation is different. Is there a better alternative? Would a datalog framework be any better, for instance?

Mauricio Scheffer said...

Thanks for your comment, Jack. I think the problem is that you can lift any function into an Expression. There should be some way to limit that per-provider. I think type providers should be able to do this. I don't remember seeing any implementation doing it, though.

I'm not really sure anymore it's so useful to have a common API for various data models. LINQ was modeled from relational databases, but it doesn't really fit other kinds of databases. For example, I wrote APIs for Urchin and Solr, and they're so different from relational databases that the standard LINQ syntax wouldn't apply in a lot of cases and it would only be able to express only the most trivial queries in these databases.

Since databases are so different in capabilities, models, properties, etc, why try to force them all into a single API? That seems as bad as implementing a .NET interface and having half of the methods throwing NotSupportedException, i.e. the abstraction is too coarse.

I'd rather embrace the differences and model the API of each database for what it really is.

Datalog sounds interesting, I don't know much about it. But it seems it wouldn't be able to model several features of Solr, like faceting. It seems to be most useful when the database itself is designed to support it.

object said...

Thanks for a great article. And btw it provoked me to do a comparison between WCF Data Services and cross-platform library Simple.OData.Client that I happen to maintain. I wrote it partly because of the same frustration I experienced while working with WCF DS. You only touched a surface of what doesn't properly work in WCF DS. If you tried to modify data you would come to another set of obstacles. Here's an article with comparison: http://www.codeproject.com/Articles/686240/12-reasons-to-consume-OData-feeds-using-Simple-ODa