Saturday, February 26, 2011

Password strength, entropy, and compression

For some reason a couple of days ago I got hooked up with password strength stuff.

First, a disclaimer: I'm no security or math expert. There's some handwaving here, and it's been too many years since I studied this, but I hope I didn't make any gross conceptual errors. If I did, please tell me.

Now, if you look around you'll find lots of password strength checkers, but most of them work with a set of rules, each adding a fixed number of 'points' for password length, using numbers, symbols, etc. That's good enough for many cases, but we can do better.

The better way involves calculating the entropy of the password. In a nutshell, information entropy is a measure of unpredictability/uncertainty. 

For example, random bytes are unpredictable. Truly random bytes (i.e. not pseudo-random) would make terrific passwords if it weren't because nobody can remember them. "902c5cce62e11db18264" (10 random bytes represented in hex, courtesy of random.org) is a pretty good password. Its entropy is high. But it's also very hard to remember, so it's not very practical.

On the other hand, there's plain text. Text is much more memorable, but it's also more predictable, so plain text passwords are usually easily guessable. Unless they're long enough, but then they aren't memorable.

So there's a tension between password strength and usefulness, which prompts some to claim that passwords are broken, and gurus to refute that. Most people pick trivial memorable passwords, while security experts recommend creating strong passwords and writing them down. One way to do that would be using password managers.

Anyway, back to password strength checkers. One of the best I found is this one in Javascript. It checks a dictionary for common words (in English), and also calculates entropy using a table of letter pair frequencies in English. The character set is semi-dynamic depending on the input password. It's pretty sophisticated.

A cheaper way (for my little brain that is, not necessarily for the computer) to estimate entropy indirectly is by using compression. Entropy and compression rate are directly related: the higher the entropy in a piece of data, the harder it is to compress it, so the resulting compressed data will be longer. (EDIT: turns out this is theoretically incorrect, see discussion below)

This of course relies on having built-in compression in your platform, which Javascript on browsers has not, but in .NET we can easily write (F#):

open System.IO
open System.IO.Compression
let compressedLength (s: string) =
    use buffer = new MemoryStream()
    use comp = new DeflateStream(buffer, CompressionMode.Compress)
    use w = new StreamWriter(comp)
    w.Write(s)
    w.Flush()
    buffer.Length

Let's see how this function behaves for various potential passwords:

["a"; "god"; "123456"; "password"; "password1"; "password12"; "password1!"; "q!1[Ba5"; "902c5cce62e11db18264"; "iloveyou"; "ILoveYou"; "qwerty123456"; "q1w2e3r4t5y6"; "this is a good password"; "12345678"; "111111"; "letmein"; "qwerty"]
|> List.map (fun p -> p,compressedLength p) |> List.sortBy snd
Password Compressed length
a 98
god 100
111111 100
123456 104
password 104
iloveyou 104
letmein 104
qwerty 104
password1 106
password12 106
q!1[Ba5 106
ILoveYou 106
12345678 106
password1! 108
qwerty123456 110
q1w2e3r4t5y6 110
this is a good password 116
902c5cce62e11db18264 118

Due to header overhead, even a single character compresses to 98 bytes. Still, the results reflect quite nicely most of the common recommendations for passwords: make them as long as possible; use letters, numbers and symbols; mix upper and lower case; don't repeat characters.

I wouldn't accept a password that 'scores' less than 106. Requiring 108+ is safer but might piss off some users.

Note that this is highly platform-dependent. For example, on Python:

import zlib
p = ["a", "god", "123456", "password", "password1", "password12", "password1!", "q!1[Ba5", "902c5cce62e11db18264", "iloveyou", "ILoveYou", "qwerty123456", "q1w2e3r4t5y6", "this is a good password", "12345678", "111111", "letmein", "qwerty"]
l = [(w,len(zlib.compress(w,1))) for w in p]
sorted(l, key=lambda a:a[1])


Password Compressed length
a 9
god 11
111111 11
123456 14
qwerty 14
q!1[Ba5 15
letmein 15
password 16
iloveyou 16
ILoveYou 16
12345678 16
password1 17
password12 18
password1! 18
qwerty123456 20
q1w2e3r4t5y6 20
902c5cce62e11db18264 28
this is a good password 29

And there could be differences even between .NET versions. It would be safer to compare against some reference values instead of comparing with a 'magic' value.

Of course, entropy is but one dimension of password strength that only considers brute-force attacks, you still have to couple this with a dictionary check to filter out common passwords. And some even go as far as saying that password entropy is rarely relevant any more!

Thursday, February 10, 2011

Running Solr on .NET

Many people are reluctant to use Solr in .NET applications because it requires installing a Java runtime. It might complicate deployment for an otherwise all-Microsoft application, which is one of the reasons why the Stackoverflow team chose to use Lucene.NET instead. Personally, I think the advantages of Solr, like:

  • Much less code to develop and maintain compared to an in-house implementation of features like faceting, caching, suggestions, etc
  • Interface simpler than Lucene, yet flexible and powerful.
  • Web console.
  • Easy scaling.

(just to name a few), more than make up for the deployment hassle, at least for my applications. In other words, I think the problem of deploying Solr is easier than the problem of coding faceting, caching, suggestions, etc over Lucene(.net) efficiently. I'd rather not reinvent this wheel.

But I understand that things would be simpler if Solr just ran somehow directly on .NET. I see three ways to make that happen:

  1. Manually port the source code to .NET. Many well-known projects ported from Java took this path: log4net, Quartz.NET, NHibernate, Lucene.NET, GitSharp to name a few.
  2. Automatically port the source code to .NET. NGit and db4o use this approach, using Sharpen.
  3. Automatically convert Java bytecode to .NET IL.

All these options have their pros and cons:

Option 1 gives you the most freedom but requires the most effort, both for the initial port and for keeping up with the original code.

Option 2 still requires manual tweaks (Lluis mentions modifications both to Sharpen itself and its generated code), and generates java-ish code, whereas in the first approach you're free to translate things as you see fit. It may be convenient to write a usability layer on top of the generated code.

Option 3 is the most automatic of them all. Using IKVM you can just get the JARs, run them through ikvmc and get .NET DLLs as a result. It requires the least effort, and almost no coding at all (you still might have to wrap some Java-specific APIs). However it also requires having the whole IKVM runtime and OpenJDK at runtime which can be cumbersome and may have some negative performance impact.

In the particular case of Solr, the IKVM+OpenJDK dependency isn't really a big deal, because it's a server, not a library, and it's easily xcopy-able and not as heavy as a full Java runtime (for a standard Windows install, that is). Sure, Solr can also be embedded in Java apps, but not even the Solr team recommends it for general usage.

I actually started experimenting with Solr and IKVM back in June 2009. I ran into ClassLoader issues, and since I didn't really need it, I gave up. I've seen many people express interest in having Solr running on .NET, and I did point them to my incomplete solution, yet nobody did anything (I'm actually a bit ticked off by this).

Anyway, a few days ago I accidentally bumped into this article by Kevin Miller where he explains how to use IKVM to run Tika on .NET. I'm not (yet) interested in Tika itself, but the article also contained the solution for my ClassLoader problems.

So I got back to working on this and actually got it to a usable state. It has four modes of operations:

  1. Embedded, just like you would embed it in a Java app
  2. Server, using the embedded Jetty (launched from command-line)
  3. Server, as Windows Service (also using the embedded Jetty)
  4. Server, as a IHttpHandler, so it can be hooked up to IIS as a regular ASP.NET app.

The ASP.NET mode still needs some work: queries work, but updates don't. There's probably something wrong in the Servlet-ASP.NET adapters.

Honestly, I haven't tested it much. I still don't need it, so I don't intend to put much time on it. I haven't done any performance tests. I don't expect this to be as fast as the original Java Solr of course, since there's the whole IKVM runtime and OpenJDK in the middle, but I do believe it will have acceptable performance.

Consider this a call for contributors: SolrIKVM is currently quite usable, but I wouldn't call it production-quality yet. If you really want Solr on .NET, fork the project, try it, test it, fix the issues. Make it happen. Feel free to ask me any questions about it.

By the way, huge kudos to Jeroen Frijters, it's pretty awesome that IKVM can run Jetty and Solr like this.

Project is on github.

Tuesday, February 8, 2011

Validation in formlets

Last time, we broke up formlets into primitive applicative functors, then composed these primitives to produce formlets.

So far, if someone entered an unexpected value in our little implementation of formlets (e.g. "abc" in an int Formlet), we'd get a nasty exception and there wasn't much we could do about it because we couldn't trace the exception to the formlet that was producing it. What we want, instead of an exception, is something that accumulates errors without interrupting the computation flow. Applicative functors, being oblivious, are perfect for that. More concretely, we want to:

  1. Know if the formlet was able to collect the value or not, and 
  2. if it couldn't, get the formlet rendered with the values the user entered and error messages, so we can re-display this second form, giving the user a chance to correct his mistakes.

The first item is a job for the option type. We'll return Some v if the formlet was successful and None if it wasn't.

For the second, we already have a XmlWriter applicative, so we could just reuse it to build this "error form".

Recall the type of formlets as we defined it last time:

type 'a Formlet = 'a Environ XmlWriter NameGen

Adding the two concerns we just mentioned, the signature becomes:

type 'a Formlet = 'a Error XmlWriter Environ XmlWriter NameGen

where Error is the Maybe monad in applicative form (remember that every monad is also an applicative functor).

This is what I was talking about when I mentioned formlets being extensible: you can extend formlet features by composing additional applicatives.

By the way, here's the same formlet type expressed using C# bracket style (which F# also supports):

type Formlet<'a> = NameGen<XmlWriter<Environ<XmlWriter<Error<'a>>>>>

See now why I prefer ML syntax? Bracket style looks quite messy once you start nesting many types.

Anyway, it's not enough to compose these applicatives, we also need some way to define how to validate things and how to show error messages. Let's define a validator type:

/// Validator type. 
/// Fst determines if value is valid 
/// Snd builds an error message 
type 'a Validator = ('a -> bool) * ('a -> xml_item list -> xml_item list)

and a function that attaches a validator to a formlet:

satisfies : 'a Validator -> 'a Formlet -> 'a Formlet

(I'm not going to bore you with implementation details in this post)
So we can now write:

let isInt = Int32.TryParse >> fst 
let intErrorMsg a xml = xml @ [ Tag("span", ["class","error"], [Text(sprintf "'%s' is not a valid number" a)]) ] 
let inputInt = input |> satisfies (isInt, intErrorMsg) |> lift int

When rendered the first time, this formlet doesn't show anything different:

printfn "%s" (render inputInt)

<input name="input_0" value=""/>

If we enter an invalid value, validation kicks in:

let env = EnvDict.fromValueSeq ["input_0","abc"] 
match run inputInt env with 
| _, Some v -> printfn "Formlet successful, value %d" v 
| errorForm, None -> printfn "Formlet unsuccessful, error form: \n%A" (XmlWriter.render errorForm)

This will print:

Formlet unsuccessful, error form: 
<div> 
    <input name="input_0" value="abc" /> 
    <span class="error">'abc' is not a valid number</span> 
</div>

We might want to wrap all that XML/HTML manipulation in order to make defining validators easier, for example:

/// <summary> 
/// Constructs a validator 
/// </summary> 
/// <param name="isValid">Determines if value is valid</param> 
/// <param name="errorMsg">Builds the error message</param> 
let err (isValid: 'a -> bool) (errorMsg: 'a -> string) : 'a Validator = 
    let addError value xml = 
        [ Tag("span", ["class","errorinput"], xml) 
          Tag("span", ["class","error"], [Text(errorMsg value)]) ]
    isValid, addError

Now we can code inputInt as:

let inputInt = input |> satisfies (err isInt (sprintf "'%s' is not a valid number")) |> lift int

We can chain as many validators as we want for any formlet. For example, here's a formlet that checks that the submitted value falls within a specified range:

let isInRange min max n = n >= min && n <= max
let inputRange min max = inputInt |> satisfies (err (isInRange min max) (fun _ -> sprintf "Value must be between %d and %d" min max))
let input10to40 = inputRange 10 40

Note how I used inputInt as the starting point for inputRange. inputInt is already validated for integers, so any non-integers values fed to inputRange will fail with the same error message as before. inputRange adds further validation on top of inputInt's validation. This is another example of the composability of formlets.

I put up a fully-fledged implementation of formlets, including validation, on github. There are some minor differences with what I've written so far about formlets. I'll blog about these differences soon.

As with all my previous articles on formlets, I have to give credit to the Links team; these blog posts and code are mostly just a rehash of their papers, which I can't recommend enough.