john-users - False positives essay :)

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <541e07c58ced7b4c4af5328f03e123a0@smtp.hushmail.com>
Date: Sun, 23 Feb 2014 23:55:30 +0100
From: magnum <john.magnum@...hmail.com>
To: Jeff Keller <jakeller@...r.com>, john-users@...ts.openwall.com
Subject: False positives essay :)

This ended up long and might be of public interest so I cc the list:

On 2014-02-23 22:40, Jeff Keller wrote:
> Out of curiosity, how does it come up with false positives?
> Isn’t there only *one* password that would work?

It sure is, but in this case we have a hard time detecting when we got 
the right one. For login passwords it's simple: The hash is made my 
feeding the password through a one-way function F, that in it's crudest 
form is just a single SHA1, for example. So

	hash = SHA1("correct password")

When we brute force, we know the hash but not the password. So we just 
do lots of

	hash2 = SHA1("password candidate")

and compare our hash2 with the correct hash. When they match, we have 
found the correct password. Very simple, no false positives, no false 
negatives.

Now, for disk or file encryption we don't have a hash! The DMG image was 
basically made like this when created:

	key = F("correct password")
	encrypted_data = AES(key, plain_data)

We obviously do not know the key. We do have an encrypted block (eg. 
first or last block from filesystem, not a block from any particular 
file) from the DMG. The encryption is symmetrical so you decrypt using 
the same key. When we brute force, we do

	key = F("password candidate")
	decrypted_block = AES(key, encrypted_block)

The thing is that if we use the wrong key [ie. derived from the wrong 
password], AES will NOT emit any "error" but a perfectly fine output 
block of random garbage. Herein lies the problem: How do we know the 
decrypted_block is correct or not? We don't have any correct data to 
compare with!

So we are forced to check for "known plaintext" within the decrypted 
block. All false positives in this case came from a test for the string 
"Apple" within the decrypted block. While the chance for that string 
appearing in a specific spot in random data would be about one in a 
thousand billions, the chance for it appearing *anywhere* within a block 
of 8 KB is much greater. I'm not sure but I think it's in the order of 
one in 134 millions. We got 5-6 false positives in
~20 hours at ~5000 c/s which might indicate it's slightly more probable 
so I'm not sure about the exact figures.

All other known-plain tests in the format are at least 8 bytes and that 
wont give any false positive in 20,000 years or so. The problem is we 
are not 100% sure we can do without the "Apple" test. We think we can, 
but we are not sure. If we drop that test and your decrypted block 
really has the "Apple" string but none of the other strings we test for, 
we'd end up with a false NEGATIVE instead, which as you can imagine is 
much, much worse. Shrug.

magnum
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.