Message-ID: <20150425113905.GA19072@openwall.com>
Date: Sat, 25 Apr 2015 14:39:05 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: [GSoC] John the Ripper support for PHC finalists

Hi Agnieszka,

On Sat, Apr 25, 2015 at 04:27:49AM +0200, Agnieszka Bielec wrote:
> I'm sending more benchmarking tests.

These are very nice.  Maybe you'd format this spreadsheet such that we
could export it into a reasonable-looking PDF?  And add descriptions of
our systems in there (what actual hardware dev=1 corresponds to, etc.)
Oh, and the actual sizes corresponding to the different m_cost settings,
in kilobytes.  Then we'll "deliver" this as a report to the PHC
community.

> Sorry, in this week I had a lot of work activities related to my
> university

No problem.

> Isn't it strange that sse2 is faster than avx2 for greater costs
> values??

Where are you seeing that?  Are you perhaps looking at "sse2" on super
vs. "avx2" on well?  If so, no, it's not strange.  super is a much
faster machine than well: 16 cores (32 threads) vs. 4 cores (8 threads).
Also, super has 8 memory channels, and well has 2.  (This matters for
high m_cost, when we're out of cache.)  super has a total of 40 MB of L3
cache (2.5 MB per core); well has 8 MB (2 MB per core).  well's higher
CPU clock rate and AVX2 can't fully compensate for super's many
advantages.

What could be considered a bit strange is the opposite: that super is
slower than well at any cost settings at all.  But there's an
explanation for this: at high c/s rates, overhead plays more of a role,
especially with OpenMP.  Indeed, when doing a million hashes per second,
even slight desynchronization between the threads results in some of the
threads waiting for others at the end of an OpenMP parallel block.
(FWIW, higher OMP_SCALE helps reduce this effect by letting OpenMP
(re)allocate work dynamically.  With low OMP_SCALE, there's simply too
little work for that.)  At lower c/s rates, this effect is not as
pronounced, because slight discrepancies in the threads' performance
correspond to a much smaller fraction of their total running time.

If you benchmark with --fork rather than with OpenMP, you'll likely see
super performing better than well at all cost settings, without
exception.  You'd use --fork=8 on well and --fork=32 on super, and add
up the individual processes' speeds.  Yes, this is inconvenient if you
need to run many such benchmarks, so I don't actually suggest it.  I am
just pointing out that the OpenMP overhead is avoidable.

Also, maybe you didn't use "export GOMP_CPU_AFFINITY=0-31" in some (or
all?) of your tests on super.  It usually needs that setting, especially
at high c/s rates.

BTW, surely your "sse2" is actually AVX.  You used the "SSE2" version of
the source code, but when those same intrinsics are compiled with AVX
enabled, the compiler produces the corresponding AVX instructions for
them.  We should probably document these benchmarks as "AVX" and "AVX2"
when bringing them to PHC.

What's actually puzzling is the sharp decrease in performance with
higher t_cost on GPUs.  A 4x decrease is expected when you increase
t_cost by 2, but e.g. for "private dev=5", m_cost=4 we see a 100x+
decrease when going from t_cost=2 to t_cost=4.  My guess is that your
OpenCL kernel does not fit in the GPUs' L1 caches: with shorter running
times, the different wavefronts/warps reuse each other's instruction
fetches, but with higher t_cost they gradually become more out of sync
and that reuse is lost.  If so, you can probably improve performance in
those high t_cost cases (and possibly at lower costs as well) by
reducing your kernel's code size.  But that's just a guess, which might
well be wrong.  You may want to check the code size at the GPU ISA level
first.

A major task that you haven't approached yet is instruction interleaving
on the CPUs.  Do you understand this concept, including why it helps?
While we use it in JtR in various formats, I think it's better
illustrated by the evolution of my php_mt_seed program:

http://www.openwall.com/php_mt_seed/

You may start with an older version of it, and see how much faster it
became since then and _why_.  You may also try reimplementing some of
those same optimizations on your own, just to practice, without looking
_too_ closely at the already-made optimizations (just skim over them to
get an overall idea of the approach taken).  I think this is a good way
to learn both SIMD programming and interleaving (which is a concept
relevant with and without SIMD).

http://cvsweb.openwall.com/cgi/cvsweb.cgi/projects/php_mt_seed/php_mt_seed.c
http://download.openwall.net/pub/projects/php_mt_seed/
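To show what I mean by interleaving, here's a trivial made-up sketch
(not code from php_mt_seed or JtR; step() is just a placeholder for
some purely serial computation):

#include <stdint.h>

/* One step of a serial computation: each iteration depends on the
 * previous one, so a single chain is latency-bound and leaves the
 * CPU's execution units partially idle. */
static inline uint32_t step(uint32_t x)
{
	return x * 1664525U + 1013904223U;
}

/* Non-interleaved: one long dependency chain. */
uint32_t run1(uint32_t x, unsigned int n)
{
	while (n--)
		x = step(x);
	return x;
}

/* 2x interleaved: two candidates are processed in one loop.  The two
 * chains are independent, so the CPU overlaps the latency of one
 * chain's multiply with the other chain's work. */
void run2(uint32_t *xa, uint32_t *xb, unsigned int n)
{
	uint32_t a = *xa, b = *xb;
	while (n--) {
		a = step(a);
		b = step(b);
	}
	*xa = a;
	*xb = b;
}

Given enough registers and execution units, run2() processes two
candidates in roughly the time run1() takes for one, and the same idea
extends to 4x, 8x, and to interleaving whole SIMD vectors.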
Unfortunately, the oldest version of php_mt_seed already includes 2x
interleaving, but I brought it much further in later versions (to 8x
interleaving and SIMD at once).

Thanks,

Alexander