Message-ID: <20150919135848.GA28734@openwall.com>
Date: Sat, 19 Sep 2015 16:58:48 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: fast hash early exit vs. large hash list

On Sat, Sep 19, 2015 at 02:33:21AM +0200, magnum wrote:
> On 18/09/15 17:25, Solar Designer wrote:
> >For raw-md5, we currently have early exit before the last 3 steps.
> >Aside from this being extremely far from what state of the art fast
> >hash crackers do in terms of steps reversal,
>
> Can we reverse any more without considering the actual candidate (or
> its length)? I haven't looked into that.

I suggest you take a look at BarsWFopensource.zip (I did not) and see
what the requirements are for what it does.  I know from other crackers'
authors that it is possible to fully reverse all the I()'s (16 steps)
and a bit more at least in some cases.  At least when cracking a single
raw MD5 hash, we're looking at 40-something steps performed per
candidate (IIRC, 43 was mentioned, although optimized crackers vary a
bit in this respect).  Not our current 61.

> I doubt the shared functions are suitable for hard-core reversal. It
> might be better to do so in formats like the -ng ones that doesn't use
> shared code. Or at least do them first.

Doing this in an -ng format first, then seeing what/how we can move to
shared code makes sense to me.

> >as currently implemented it also hurts performance when cracking
> >large hash lists.  Only 32 bits of the hash result are preserved and
> >the rest are recomputed with scalar code in cmp_exact(), and when the
> >target hash list is large this happens quite often.  In the 29M
> >testcase, cmp_exact() is called a few million times on wrong
> >passwords (as well as 1.7+ million on correct passwords).
> >
> >I think we should support a mode where the SIMD code would exit
> >early, but would record the entire state (all four vectors), so that
> >cmp_*() wouldn't have to recompute from scratch.
> >Ideally, we'd only enable this mode when the number of loaded hashes
> >is large, although this can be tricky in terms of (avoiding) code
> >complexity.

> For MD4/5, I was already considering improvements like cmp_one() of
> Alain's NT format (and introduce a shared function for doing those
> last steps). Maybe we should have a separate function for
> single-round, with the few branches that would be needed for that
> (reversed or not, full output state stored or not). This would benefit
> iterated formats too because of less branches.

BTW, here's a compiler trick to produce specialized functions without
cluttering the source code:

static MAYBE_INLINE void generic(int flag1, int flag2, int choice1)
{
	...
	if (flag1) {
		...
	}
	...
	switch (choice1) {
	...
	}
	...
	if (flag2) {
		...
	}
	...
}

static void special1(void)
{
	generic(1, 0, 5);
}

static void special2(void)
{
	generic(0, 1, 4);
}

and so on.  Where MAYBE_INLINE is our macro that tries to force function
inlining regardless of limits on function size.  Optimizing compilers
are able to substitute the constants and eliminate the branches and dead
code (that would have resulted from naive inlining with no
optimization).

> For SHA-1/SHA-2 though, would we not need a good part of the 80 word
> internal state to be able to continue? Should we store that too? That
> ought to be slower than not reversing at all. In self-contained
> formats like eg. SHA-256-ng we (can) have the whole state available,
> but with the shared code we can't. I hope I'm wrong though?

I haven't looked into reversal of fast functions myself, and most of
what I heard from others was about MD4/5.  It does appear to me that
some of W[] would be needed for SHA, but then it similarly appears that
MD5's last 16 steps depend on the input being hashed, yet folks are
reversing all of those.
Regardless, there are certainly cases where we do face this tradeoff:
store more in case we have to complete computation later, or store less
and risk having to recompute from scratch (or from an earlier point).
This isn't even fast hash specific: e.g., we also face this with the
OpenMP code for bcrypt, where the 4+ KB state is optimally kept on each
thread's stack, and copying it from there in case we have to complete
computation (the 128 Blowfish encryptions that we can usually skip)
results in a measurable performance penalty at low (yet supported)
bcrypt costs.  (The current OpenMP code computes those extra 128
Blowfish encryptions all the time, to avoid the need for copying.  This
is probably sub-optimal as 128*16 = 2048 Blowfish rounds got to cost
more than copying 4 KB.  I need to improve this.)

Alexander