|
Message-ID: <F25F93930D034671A97CCEB19088838E@D9VGLK61> Date: Wed, 29 Jun 2011 23:33:01 -0500 From: "JimF" <jfoug@....net> To: <john-dev@...ts.openwall.com> Subject: Re: Lukas's Status Report - #7 of 15 ----- Original Message ----- From: "Lukasz Odzioba" <lukas.odzioba@...il.com> > I'm not familiar with mscash2, so can't understand all you said, but > after a week or two I'll contact you in that case. I was not familiar with cash2 (DCC2) until recently. The orignal code (current 'released' code), was pretty opaque, and I found it very hard to figure out what was wrong. We had some problems with utf8->unicode conversions compared with iso8851->unicode, and we had some issues with the salt length (user name). So I spent a little time trying to figure things out. Magnum and I nailed the issues in the code (after quite a bit of work). Well, after I understood some of the code, I assumed it would be easy and nice to port to SSE, so I started again researching the algorithm, and not using the code as implemented. I wanted to simplify as much as possible the code, so I chose to do it all using OpenSSL function calls. When doing this, I found that even though in the inner loop, there are 2 84 byte SHA1 encryptions (which require 2 64 byte SHA block encryptions each), that the first block in each of these encryptions was exactly the same EVERY time. Only the trailing 20 bytes varied with each encryptions. Thus, I found that if you perform the 2 1st limb encrptions, keeping the state of the SHA, and then restore each of these states (within the loop), and perform the 2nd half of the SHA crypt. Since I have spent the time, and have a decent handle on the algorithm, I can explain it in simple terms. first there are 2 variable peices of information. The password (key), and the username (salt). Both of these are converted to UTF16 unicode. Then there are 2 primative encryption algorthms used, MD4 and SHA1. There are also some intermediate steps that are the 'same' as other password keying algorithms. These are the NTLM, the DCC1 (mscash1), and then finally, the final DCC2 (mscash2) is produced. So the steps are: 1. convert key into unicode. 2. convert salt into unicode. 3. Perform MD4 on the unicode key. This produces an NTLM hash value. (16 bytes) 4. Append the unicode salt (user name) to this NTLM hash. This will produce a variable sized 'key'. It appears the user name is maxed out at 22 unicode characters (possibly 21). The original code only allowed 19, but I believe that was too short. 5. Perform an MD4 on the data buffer from step 4. This produces a DCC1 (mscash) hash value. (16 bytes). 6. Perform this logic PBKDF2(HMAC_SHA1, DCC1, salt (the unicode user name), 10240, 16). A good place to see what is meant by the PBKDF2(...) stuff is at: http://en.wikipedia.org/wiki/PBKDF2 HMAC_SHA1 means to use SHA1 internally within the hmac function. The password is the 16 bytes of DCC1. The salt is the unicode user name. 10240 is the number of hmac iterations. 16 is the number of bytes to use (only the lower 16 bytes, not the full 20 bytes of the SHA1). Here is step 6 in 'english': produce ipad and opad. These are each 64 bytes. They are the password ^ 0x36363636..... for the ipad, and the password ^ 0x5c5c5c5c.... for the opad. a. temp_hash is 20 bytes. return_hash is 16 byte buffer. The first iteration is different. a. perform SHA1 of ipad . salt . 0001 -> tmp_hash. The 0001 is a 4 byte integer 1, in big endian format. b. perform SHA1 of opad . tmp_hash -> tmp_hash c. return_hash = tmp_hash it is actually return_hash ^= tmp_hash with in step 0, return_hash starting out as 0. For the remaining 10239 steps, perform this: a. SHA1 of ipad . temp_hash -> tmp_hash b. SHA1 of opad . temp_hash -> tmp_hash c. return_hash ^= tmp_hash When all iterations are done, the final 'hash', is the value of return_hash. What i noticed was that all 'first' SHA's of ipad and opad values sipmly sets up the SHA environment. It is always 'constant', since it is an even 64 bytes long. Thus, I precompute these one time only, and use that information to setup the SHA and simply encrypt the 2nd buffer each time. Works great, and only does 1/2 the work. Now, for SSE, I am ONLY going to add SSE code, to the internal loop (actually, to the 10239 iterations of the very inside 'loop'. This should gain 99.99% of the 'possible' gain, but this 'intrusion' and change to get the SSE2 is as trivial as I can get it. However, do to differences between the SSE and other runs (like 'generic' or OMP), I will likely create a special pbkdf2 function, that is called one time to perform work on all items at once (after loading them), while the existing pbkdf2 works on one item at a time (for allowing multi-threading to work). To show what was done (you can also look at the code I posted earlier today, and see the memcpy's to see what is happening). prepare SSE variables for (loop=1; loop < 10240; ++loop) { restore_ipad_state SSE_SHA1 tmp_hash -> tmp_hash restore_opad_stat SSE_SHA1 tmp_hash -> tmp_hash xor results } I am wondering if for the GPU code, this might also be the 'right' place to send the data over to the multiple GPU's to do their magic. I am not sure exactly how the 'best' way to integrate into john, but remember, this 1 loop is 99.99% of the work being done. The other IMHO is simply setup code for this loop. Even if you speed up this loop 100x, then there is still only 1% of the time in the entire rest of the 'setup' code. That is why I made a choice to ONLY optimize this inner loop, and pretty much ignore the rest. The other other optimization I can see (other than pinhole speedup's), is that the NTLM only needs to be performed if a new password is loaded. The way john works, is it loads X number of passwords, then sets a salt, calls crypt_all, checks for hits, then sets the next salt, calls crypt_all, checks, etc, etc, until all salts have been run. Thus, if there are many salts, the NTLM hash (the first MD4 of the unicode password), only needs to be done when new password(s) is(are) loaded. Hopefully, this is a little more information for you, and will allow you to get a better feel for just what is happening under the hood of mscash2. I do hope to get the low down on how to get SSE SHA1 working for multi-buffer encryptions. Once I get that, it should not be long, until I have a faster working cash2 format, which has been trimmed down to about 600 lines (which about only 150 are the 'core' encryption stuff). I really did not plan on making any announcement about the code changes, until I was done with the format (generic, OMP and SSE), but since you posted that you may be starting on GPU code, I wanted to make sure it was known that we have a much more easy to follow format, even if only working for generic/omp. Jim.
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.