Bug squash: Password strength, entropy, and compression

For some reason a couple of days ago I got hooked up with password strength stuff.

First, a disclaimer: I'm no security or math expert. There's some handwaving here, and it's been too many years since I studied this, but I hope I didn't make any gross conceptual errors. If I did, please tell me.

Now, if you look around you'll find lots of password strength checkers, but most of them work with a set of rules, each adding a fixed number of 'points' for password length, using numbers, symbols, etc. That's good enough for many cases, but we can do better.

The better way involves calculating the entropy of the password. In a nutshell, information entropy is a measure of unpredictability/uncertainty.

For example, random bytes are unpredictable. Truly random bytes (i.e. not pseudo-random) would make terrific passwords if it weren't because nobody can remember them. "902c5cce62e11db18264" (10 random bytes represented in hex, courtesy of random.org) is a pretty good password. Its entropy is high. But it's also very hard to remember, so it's not very practical.

On the other hand, there's plain text. Text is much more memorable, but it's also more predictable, so plain text passwords are usually easily guessable. Unless they're long enough, but then they aren't memorable.

So there's a tension between password strength and usefulness, which prompts some to claim that passwords are broken, and gurus to refute that. Most people pick trivial memorable passwords, while security experts recommend creating strong passwords and writing them down. One way to do that would be using password managers.

Anyway, back to password strength checkers. One of the best I found is this one in Javascript. It checks a dictionary for common words (in English), and also calculates entropy using a table of letter pair frequencies in English. The character set is semi-dynamic depending on the input password. It's pretty sophisticated.

A cheaper way (for my little brain that is, not necessarily for the computer) to estimate entropy indirectly is by using compression. Entropy and compression rate are directly related: the higher the entropy in a piece of data, the harder it is to compress it, so the resulting compressed data will be longer. (EDIT: turns out this is theoretically incorrect, see discussion below)

This of course relies on having built-in compression in your platform, which Javascript on browsers has not, but in .NET we can easily write (F#):

open System.IO
open System.IO.Compression
let compressedLength (s: string) =
    use buffer = new MemoryStream()
    use comp = new DeflateStream(buffer, CompressionMode.Compress)
    use w = new StreamWriter(comp)
    w.Write(s)
    w.Flush()
    buffer.Length

Let's see how this function behaves for various potential passwords:

["a"; "god"; "123456"; "password"; "password1"; "password12"; "password1!"; "q!1[Ba5"; "902c5cce62e11db18264"; "iloveyou"; "ILoveYou"; "qwerty123456"; "q1w2e3r4t5y6"; "this is a good password"; "12345678"; "111111"; "letmein"; "qwerty"]
|> List.map (fun p -> p,compressedLength p) |> List.sortBy snd

Password	Compressed length
a	98
god	100
111111	100
123456	104
password	104
iloveyou	104
letmein	104
qwerty	104
password1	106
password12	106
q!1[Ba5	106
ILoveYou	106
12345678	106
password1!	108
qwerty123456	110
q1w2e3r4t5y6	110
this is a good password	116
902c5cce62e11db18264	118

Due to header overhead, even a single character compresses to 98 bytes. Still, the results reflect quite nicely most of the common recommendations for passwords: make them as long as possible; use letters, numbers and symbols; mix upper and lower case; don't repeat characters.

I wouldn't accept a password that 'scores' less than 106. Requiring 108+ is safer but might piss off some users.

Note that this is highly platform-dependent. For example, on Python:

import zlib
p = ["a", "god", "123456", "password", "password1", "password12", "password1!", "q!1[Ba5", "902c5cce62e11db18264", "iloveyou", "ILoveYou", "qwerty123456", "q1w2e3r4t5y6", "this is a good password", "12345678", "111111", "letmein", "qwerty"]
l = [(w,len(zlib.compress(w,1))) for w in p]
sorted(l, key=lambda a:a[1])

Password	Compressed length
a	9
god	11
111111	11
123456	14
qwerty	14
q!1[Ba5	15
letmein	15
password	16
iloveyou	16
ILoveYou	16
12345678	16
password1	17
password12	18
password1!	18
qwerty123456	20
q1w2e3r4t5y6	20
902c5cce62e11db18264	28
this is a good password	29

And there could be differences even between .NET versions. It would be safer to compare against some reference values instead of comparing with a 'magic' value.

Of course, entropy is but one dimension of password strength that only considers brute-force attacks, you still have to couple this with a dictionary check to filter out common passwords. And some even go as far as saying that password entropy is rarely relevant any more!

4 comments:

oleksii said...: Hi Mauricio,

Nice post, but...
Entropy cannot be estimated by compression. This is entrily wrong. See this guy for example: http://cscs.umich.edu/~crshalizi/notebooks/cep-gzip.html

Oleksii; February 27, 2011 at 7:26 PM
oleksii said...: Not sure if anybody use PI as a password, but let's say someone does. So, PI = 3,14159....
This number pass most statistical tests for randomness, and you can't compress it using standard compressors, therefore the length of compressed output will be larger (or equals at most) than the input. It doesn't mean the password is strong as it is just PI.; February 27, 2011 at 7:50 PM
Mauricio Scheffer said...: @Oleksii: thanks for the pointer. At least I'm glad that people much smarter than me made the same mistake ;-)
Still, even though it's theoretically wrong, I think gzip might be a good tool for this specific case. I'd definitely use this over other solutions that use fixed sets of rules with points. Maybe it's not such a bad estimator for short data as are passwords.
About your pi argument: I think it's a bad password because it's a well-known value, but not because of its entropy, so it's out of the scope of what's discussed here. It would have to be filtered by a dictionary check.; February 27, 2011 at 9:52 PM
oleksii said...: Mauricio,
Yeah I am actually using gzip to estimate Kolmogorov Complexity and it is still pretty good approximation.

Gzip will work most of the cases for simple checks, although it has some caveats.

The sample with PI is a contradiction actually. PI will not be compressed and you will have large compression length, it shall mean that it is a good password, but is not.

Finally, you can try compression ratio (uncompressed/compressed size). This value will show you how much you can shrink the password and it is independent of the password lenght, as pure compressed lenght depends on the size you feed in + you will have a small header and footer (see for example compression of one symbol).
There are some limitation of the algorithm, the pwd must not be greater the the look up window, which is by default set to 32 Kb (just to keep in mind if you want to go to some long sequences).; February 28, 2011 at 5:13 AM

Bug squash

Saturday, February 26, 2011

Password strength, entropy, and compression

4 comments:

About Me

Contact

Labels

Blog Archive

License