Message-ID: <c4a39162ea29647d6727ef16953693a7@smtp.hushmail.com>
Date: Sun, 3 Feb 2013 03:01:54 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: NetNTLMv1

On 2 Feb, 2013, at 20:30 , magnum <john.magnum@...hmail.com> wrote:
> On 2 Feb, 2013, at 16:25 , Solar Designer <solar@...nwall.com> wrote:
>> On Fri, Feb 01, 2013 at 07:45:12AM +0400, Solar Designer wrote:
>>> With a generic+OpenMP build, it is ~3150M c/s for one process (8
>>> threads). This puzzles me, because generic's MD4 computations are
>>> slower, whereas the comparisons are not supposed to be faster since
>>> OpenMP is only being made use of for the MD4s, not for comparisons, in
>>> that code version. So I would have expected its performance to be
>>> around ~850M at "many salts" - same as I'm getting for one process with
>>> the XOP build (on otherwise idle system). I don't understand where a
>>> further 4x speedup comes from.
>>
>> I think I figured this out: generic+OpenMP uses much higher
>> max_keys_per_crypt than SIMD-enabled non-OpenMP builds do. Can you
>> rework the latter to allow for increasing their max_keys_per_crypt?
>> My gut feeling is that a value of around 0x100 will be optimal (need to
>> make it a multiple of MMX_COEF and maybe MD4_SSE_PARA as appropriate for
>> a given build, of course).
>
> Yes I figured I should try that. In NT2 there is a BLOCK_LOOPS macro
> that is a multiplier for SIMD number of keys. That was for OMP
> experiments but same code can be used for a single thread loop. BTW we
> can actually get up to ~80M for NT2 with 2xOMP but I haven't had any
> success in making it ready for production use: Only hardcoded values
> will work. As soon as I turn any of it into run-time variables, the
> overhead eats the gain. This can probably be worked out. And this
> should apply to NTLMv1 and MSCHAPv2 too.

Lol, I hit a luxury problem:

Benchmarking: NTLMv1 C/R MD4 DES (ESS MD5) [128/128 SSE2 intrinsics 12x]... DONE
Many salts:     4294M c/s real, 4294M c/s virtual
Only one salt:  38731K c/s real, 38348K c/s virtual

I can't tune BLOCK_LOOPS, because I hit the 32 bit limit and always see 4294M. I am pretty sure I fixed this for MPI, maybe we should always use the MPI version of that code in bench.c.

magnum
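
For anyone reading this in the archive, here is a minimal sketch of why a figure like 4294M can stay the same however BLOCK_LOOPS is tuned. It is not the actual bench.c code: the helper names saturate_u32() and cps_in_millions() are hypothetical, and it assumes the rate is clamped to a 32-bit maximum rather than wrapping around. Under that assumption, any rate above 2^32 - 1 c/s prints as 4294M, while a 64-bit counter keeps the real value.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: clamp a 64-bit rate into a 32-bit field
 * (assumed behaviour, not taken from bench.c). */
static uint32_t saturate_u32(uint64_t v)
{
	return v > UINT32_MAX ? UINT32_MAX : (uint32_t)v;
}

/* Hypothetical helper: rate in whole millions, truncated. */
static uint64_t cps_in_millions(uint64_t cps)
{
	return cps / 1000000ULL;
}

#ifdef DEMO_MAIN
int main(void)
{
	/* Two made-up rates, both above the 32-bit limit. */
	const uint64_t rates[] = { 5000000000ULL, 8000000000ULL };

	for (size_t i = 0; i < sizeof(rates) / sizeof(rates[0]); i++) {
		printf("32-bit: %" PRIu64 "M c/s, 64-bit: %" PRIu64 "M c/s\n",
		    cps_in_millions(saturate_u32(rates[i])),
		    cps_in_millions(rates[i]));
	}
	return 0;
}
#endif
```

Build with -DDEMO_MAIN to run the demo. With a 32-bit field, the two different rates become indistinguishable, so tuning against that number is impossible. Widening the counter to 64 bits is one way around it.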