|
Message-ID: <BANLkTimTvBkK4qamWpF1+TgxCp_Oj+W_Fg@mail.gmail.com> Date: Sat, 2 Jul 2011 00:04:57 +0200 From: Łukasz Odzioba <lukas.odzioba@...il.com> To: john-dev@...ts.openwall.com Subject: Re: Lukas's Status Report - #7 of 15 2011/6/30 JimF <jfoug@....net>: > ----- Original Message ----- From: "Lukasz Odzioba" > <lukas.odzioba@...il.com> >> >> I'm not familiar with mscash2, so can't understand all you said, but >> after a week or two I'll contact you in that case. > > I was not familiar with cash2 (DCC2) until recently. The orignal code > (current 'released' code), was pretty opaque, and I found it very hard to > figure out what was wrong. We had some problems with utf8->unicode > conversions compared with iso8851->unicode, and we had some issues with the > salt length (user name). So I spent a little time trying to figure things > out. Magnum and I nailed the issues in the code (after quite a bit of > work). > > Well, after I understood some of the code, I assumed it would be easy and > nice to port to SSE, so I started again researching the algorithm, and not > using the code as implemented. I wanted to simplify as much as possible the > code, so I chose to do it all using OpenSSL function calls. When doing this, > I found that even though in the inner loop, there are 2 84 byte SHA1 > encryptions (which require 2 64 byte SHA block encryptions each), that the > first block in each of these encryptions was exactly the same EVERY time. > Only the trailing 20 bytes varied with each encryptions. Thus, I found that > if you perform the 2 1st limb encrptions, keeping the state of the SHA, and > then restore each of these states (within the loop), and perform the 2nd > half of the SHA crypt. > > Since I have spent the time, and have a decent handle on the algorithm, I > can explain it in simple terms. > > first there are 2 variable peices of information. The password (key), and > the username (salt). Both of these are converted to UTF16 unicode. > > Then there are 2 primative encryption algorthms used, MD4 and SHA1. There > are also some intermediate steps that are the 'same' as other password > keying algorithms. These are the NTLM, the DCC1 (mscash1), and then > finally, the final DCC2 (mscash2) is produced. > > So the steps are: > > 1. convert key into unicode. > 2. convert salt into unicode. > 3. Perform MD4 on the unicode key. This produces an NTLM hash value. (16 > bytes) > 4. Append the unicode salt (user name) to this NTLM hash. This will produce > a variable sized 'key'. It appears the user name is maxed out at 22 > unicode characters (possibly 21). The original code only allowed 19, but I > believe that was too short. > 5. Perform an MD4 on the data buffer from step 4. This produces a DCC1 > (mscash) hash value. (16 bytes). > 6. Perform this logic PBKDF2(HMAC_SHA1, DCC1, salt (the unicode user name), > 10240, 16). > > A good place to see what is meant by the PBKDF2(...) stuff is at: > http://en.wikipedia.org/wiki/PBKDF2 > > HMAC_SHA1 means to use SHA1 internally within the hmac function. The > password is the 16 bytes of DCC1. The salt is the unicode user name. 10240 > is the number of hmac iterations. 16 is the number of bytes to use (only > the lower 16 bytes, not the full 20 bytes of the SHA1). > > Here is step 6 in 'english': > > produce ipad and opad. These are each 64 bytes. They are the password ^ > 0x36363636..... for the ipad, and the password ^ 0x5c5c5c5c.... for the > opad. > a. temp_hash is 20 bytes. return_hash is 16 byte buffer. > The first iteration is different. > a. perform SHA1 of ipad . salt . 0001 -> tmp_hash. The 0001 is a 4 byte > integer 1, in big endian format. > b. perform SHA1 of opad . tmp_hash -> tmp_hash > c. return_hash = tmp_hash it is actually return_hash ^= tmp_hash with in > step 0, return_hash starting out as 0. > > > For the remaining 10239 steps, perform this: > a. SHA1 of ipad . temp_hash -> tmp_hash > b. SHA1 of opad . temp_hash -> tmp_hash > c. return_hash ^= tmp_hash > > When all iterations are done, the final 'hash', is the value of return_hash. > > What i noticed was that all 'first' SHA's of ipad and opad values sipmly > sets up the SHA environment. It is always 'constant', since it is an even > 64 bytes long. Thus, I precompute these one time only, and use that > information to setup the SHA and simply encrypt the 2nd buffer each time. > Works great, and only does 1/2 the work. > > Now, for SSE, I am ONLY going to add SSE code, to the internal loop > (actually, to the 10239 iterations of the very inside 'loop'. This should > gain 99.99% of the 'possible' gain, but this 'intrusion' and change to get > the SSE2 is as trivial as I can get it. However, do to differences between > the SSE and other runs (like 'generic' or OMP), I will likely create a > special pbkdf2 function, that is called one time to perform work on all > items at once (after loading them), while the existing pbkdf2 works on one > item at a time (for allowing multi-threading to work). > > To show what was done (you can also look at the code I posted earlier today, > and see the memcpy's to see what is happening). > > prepare SSE variables > for (loop=1; loop < 10240; ++loop) { > restore_ipad_state > SSE_SHA1 tmp_hash -> tmp_hash > restore_opad_stat > SSE_SHA1 tmp_hash -> tmp_hash > xor results > } > > I am wondering if for the GPU code, this might also be the 'right' place to > send the data over to the multiple GPU's to do their magic. I am not sure > exactly how the 'best' way to integrate into john, but remember, this 1 loop > is 99.99% of the work being done. The other IMHO is simply setup code for > this loop. Even if you speed up this loop 100x, then there is still only 1% > of the time in the entire rest of the 'setup' code. That is why I made a > choice to ONLY optimize this inner loop, and pretty much ignore the rest. > The other other optimization I can see (other than pinhole speedup's), is > that the NTLM only needs to be performed if a new password is loaded. The > way john works, is it loads X number of passwords, then sets a salt, calls > crypt_all, checks for hits, then sets the next salt, calls crypt_all, > checks, etc, etc, until all salts have been run. Thus, if there are many > salts, the NTLM hash (the first MD4 of the unicode password), only needs to > be done when new password(s) is(are) loaded. > > Hopefully, this is a little more information for you, and will allow you to > get a better feel for just what is happening under the hood of mscash2. > > I do hope to get the low down on how to get SSE SHA1 working for > multi-buffer encryptions. Once I get that, it should not be long, until I > have a faster working cash2 format, which has been trimmed down to about 600 > lines (which about only 150 are the 'core' encryption stuff). I really did > not plan on making any announcement about the code changes, until I was done > with the format (generic, OMP and SSE), but since you posted that you may be > starting on GPU code, I wanted to make sure it was known that we have a much > more easy to follow format, even if only working for generic/omp. > > Jim. > > > Hopefully, this is a little more information for you, and will allow you to > get a better feel for just what is happening under the hood of mscash2. Thanks Jim for an exhaustive explaination, I think it's enough to me to do implement that properly, and it possibly will save me lot of time. > I am wondering if for the GPU code, this might also be the 'right' place to > send the data over to the multiple GPU's to do their magic. I am not sure > exactly how the 'best' way to integrate into john, but remember, this 1 loop > is 99.99% of the work being done. Besides pure sha256cuda patch all of my previous patches were doing everything on gpu. Maybe it's wasn't a good approach, and was going to check that and forgot. Now I'll try do only this "real slow" part on gpu, and the rest on cpu, hopefully it will be faster that way. Lukas
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.