Message-ID: <20110329172520.GA16758@openwall.com>
Date: Tue, 29 Mar 2011 21:25:20 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: binary hashes and BINARY_SIZE

Hi,

I think there's some confusion regarding what goes into the "binary"
hashes and thus what BINARY_SIZE and params.binary_size of a "format"
should be.  I'll try to explain.

A typical "hash encoding" that we load from a password hash file might
look like:

$1$7Uu2iTBB$Y4hQl2WvrOA3LBbLDxbAf0

This one consists of a hash type identifier ("$1$"), a salt
("7Uu2iTBB"), and the hash itself ("Y4hQl2WvrOA3LBbLDxbAf0").

Now, John could simply keep these strings in memory, compute similar
strings for every candidate password being tested, and compare the
strings.  However, this is suboptimal because in that case it has to
ASCII-encode all computed hashes, whereas instead it can ASCII-decode
the loaded hashes into their binary representations and compare those
while cracking.  For example, "Y4hQl2WvrOA3LBbLDxbAf0" may be decoded
into a 16-byte binary value at startup.  (For this specific hash type,
this does not make much of a performance difference because these
hashes are so slow to compute.  For some other hash types, it does.)

However, John still needs the ASCII strings for storage into john.pot
whenever a password gets cracked.  Thus, it could either keep both
things in memory or encode the loaded binary representations back into
ASCII when needed.  However, the latter would require extra code (per
hash type), and more code means potentially more bugs.  So John uses
the former approach - keep both things in memory - but then many
"formats" make an optimization where only partial binary hashes are
kept in memory.  Really, say, 32 bits per hash is enough to weed out
most candidate passwords.  And in the relatively rare cases that
partial hashes match, John can always re-decode the ASCII strings
(which it has code for anyway) and do the full comparison.  Thus,
cmp_all(), cmp_one(), and all of the binary_hash[]() and get_hash[]()
functions work on potentially partial binary hashes, whereas
cmp_exact() works on full ASCII-encoded hashes.

To continue the same example, MD5_fmt.c uses a BINARY_SIZE of 4.
That's 4 bytes, or 32-bit partial binary hashes.  Compared to the full
size of 16, this saves 12 bytes of memory per hash loaded for
cracking.  (It may also improve cache usage, which matters for faster
hashes.)

To give another example, here's how I converted the "dummy" "format"
(to be introduced in 1.7.7) from full to partial binary hashes:

http://cvsweb.openwall.com/cgi/cvsweb.cgi/Owl/packages/john/john/src/dummy.c.diff?r1=1.2;r2=1.3

This has resulted in huge memory savings because otherwise I'd have to
make BINARY_SIZE large enough to hold the longest supported plaintext
password (we don't want any user-visible "hash collisions", nor do we
want to use a real and thus not-very-fast cryptographic hash function
in "dummy").

Now that I look at NT_fmt.c in jumbo, I see that its BINARY_SIZE is
16.  So it wastes memory on storage of full binary hashes, even though
their halves (8 bytes) would be more than enough.  We might want to
correct this, moving some code from cmp_one() to cmp_exact().  Alain?
Many other "formats" are similarly not optimized in this respect.
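To make the above more concrete, here's a rough sketch of what the
relevant methods of a "format" using partial binary hashes could look
like.  This is not code from any actual format - decode_full_hash(),
crypt_out[], and the size macros are made-up placeholders - but the
method names and prototypes follow the usual formats.h interface:

/* Rough sketch only - not taken from an actual format. */

#include <stdint.h>
#include <string.h>

#define FULL_BINARY_SIZE	16	/* e.g. raw MD5 output */
#define BINARY_SIZE		4	/* partial: 32 bits for weeding out */
#define MAX_KEYS_PER_CRYPT	64

/* Assume crypt_all() left the computed full hashes here. */
static unsigned char crypt_out[MAX_KEYS_PER_CRYPT][FULL_BINARY_SIZE];

/* Hypothetical helper: ASCII-decode a ciphertext into 16 binary bytes. */
extern void decode_full_hash(const char *ciphertext, unsigned char *out);

/* binary(): decode the loaded hash, but the loader only keeps
 * BINARY_SIZE bytes of it - hence the memory savings. */
static void *get_binary(char *ciphertext)
{
	static unsigned char full[FULL_BINARY_SIZE];
	decode_full_hash(ciphertext, full);
	return full;	/* only the first BINARY_SIZE bytes get stored */
}

/* cmp_all() and cmp_one() look at the partial binary hash only. */
static int cmp_all(void *binary, int count)
{
	int i;
	for (i = 0; i < count; i++)
		if (!memcmp(binary, crypt_out[i], BINARY_SIZE))
			return 1;
	return 0;
}

static int cmp_one(void *binary, int index)
{
	return !memcmp(binary, crypt_out[index], BINARY_SIZE);
}

/* cmp_exact() gets the ASCII ciphertext, so it can re-decode the full
 * 16 bytes and do the complete comparison for the rare partial matches. */
static int cmp_exact(char *source, int index)
{
	unsigned char full[FULL_BINARY_SIZE];
	decode_full_hash(source, full);
	return !memcmp(full, crypt_out[index], FULL_BINARY_SIZE);
}

/* The binary_hash_N()/get_hash_N() pairs also only ever look at the
 * partial hash (shown here for the smallest hash table size). */
static int binary_hash_0(void *binary)
{
	return *(uint32_t *)binary & 0xF;
}

static int get_hash_0(int index)
{
	uint32_t h;
	memcpy(&h, crypt_out[index], sizeof(h));
	return h & 0xF;
}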
Not that I care much (with modern RAM sizes), but Alain started to
compare memory usage of different tools here:
http://www.hashsuite.webuda.com/index.php?p=1_6 ;-)

JtR's memory usage at NTLM will grow by 4 MB or 8 MB with the large
hash tables patch, but we can reduce it by 8 MB or 12 MB with the
BINARY_SIZE optimization (that's 8 or 12 bytes saved per loaded hash,
depending on whether we cut BINARY_SIZE to 8 or to 4).

Alexander