Message-ID: <20120301213735.GB2793@openwall.com>
Date: Fri, 2 Mar 2012 01:37:35 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Best performance MPI vs OMP

On Thu, Mar 01, 2012 at 06:47:32PM +0100, Javier González del Tánago Liberal wrote:
> I've been trying John and I noticed a big difference in performance
> between MPI and OMP (LM, NTLM overall). These are the results:

Yes, "fast" hashes show poor OpenMP scaling.  However, the c/s rate is
not the only thing to consider - also relevant are ease of use and order
in which the candidate passwords are tried.  This is where OpenMP works
better, although on a 48-way machine the performance hit in terms of c/s
rate for "fast" hashes is just too large, so MPI is currently a better
option for those.

> - DES
> OMP
> Benchmarking: Traditional DES [128/128 BS SSE2-16]... (48xOMP) DONE
> Many salts: 47087K c/s real, 982834 c/s virtual
> Only one salt: 21921K c/s real, 457370 c/s virtual
> MPI
> Benchmarking: Traditional DES [128/128 BS SSE2-16]... (48xMPI) DONE
> Many salts: 50263K c/s real, 50263K c/s virtual
> Only one salt: 48422K c/s real, 48422K c/s virtual

As you can see, OpenMP shows a 93.7% efficiency for "many salts" here,
which is actually surprisingly good (it's usually at around 90% even for
much smaller systems).  OpenMP will only show better efficiency for
slower hashes - such as for Blowfish-based and MD5-based crypt(3), and
for MSCash2.

> - LM
> OMP
> Benchmarking: LM DES [128/128 BS SSE2-16]... (48xOMP) DONE
> Raw: 39714K c/s real, 828600 c/s virtual
> MPI
> Benchmarking: LM DES [128/128 BS SSE2-16]... (48xMPI) DONE
> Raw: 684363K c/s real, 677587K c/s virtual

Yes, LM almost does not scale with the current OpenMP code.  It will
scale a bit for thread counts in the range of 2 to 8 or so, but going
further will just slow it down.  You may try OMP_NUM_THREADS=2 and
increase it slowly if you're curious what the optimal value and max
performance for LM with OpenMP is.
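
The OMP_NUM_THREADS experiment suggested above can be scripted.  What
follows is only a minimal sketch in Python, under these assumptions: the
john binary lives at the hypothetical path ./john, "--test --format=LM"
runs the LM benchmark on your build, and the thread counts listed are
illustrative starting points rather than recommended values.

```python
import os
import subprocess

# Illustrative thread counts: start at 2 and increase slowly (assumption).
THREAD_COUNTS = [2, 3, 4, 6, 8, 12, 16]


def benchmark_command(john_path="./john", fmt="LM"):
    """Build the benchmark command line; john_path is a hypothetical location."""
    return [john_path, "--test", "--format=" + fmt]


def env_for_threads(n, base_env=None):
    """Copy the environment and set OMP_NUM_THREADS to n."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(n)
    return env


def run_sweep(counts, john_path="./john", fmt="LM"):
    """Run the benchmark once per thread count; return {count: raw output}."""
    results = {}
    for n in counts:
        proc = subprocess.run(
            benchmark_command(john_path, fmt),
            env=env_for_threads(n),
            capture_output=True,
            text=True,
            check=True,
        )
        results[n] = proc.stdout
    return results
```

Calling run_sweep(THREAD_COUNTS) returns the raw benchmark text for each
thread count, so the "Raw:" c/s figures can be compared side by side.
Other environment variables can be varied the same way by adding them to
the dictionary returned by env_for_threads().
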
Another setting to experiment with is GOMP_SPINCOUNT (try 10000, 100000,
1000000, 10000000).

> - NETHALFLM
> OMP
> Benchmarking: HalfLM C/R DES [nethalflm]... (48xOMP) DONE
> Many salts: 29949K c/s real, 647146 c/s virtual
> Only one salt: 1622K c/s real, 262564 c/s virtual
> MPI
> Benchmarking: HalfLM C/R DES [nethalflm]... (48xMPI) DONE
> Many salts: 52215K c/s real, 52215K c/s virtual
> Only one salt: 26010K c/s real, 26010K c/s virtual
>
> - NETLM
> OMP
> Benchmarking: LM C/R DES [netlm]... (48xOMP) DONE
> Many salts: 28550K c/s real, 647123 c/s virtual
> Only one salt: 857480 c/s real, 231109 c/s virtual
> MPI
> Benchmarking: LM C/R DES [netlm]... (48xMPI) DONE
> Many salts: 52331K c/s real, 51813K c/s virtual
> Only one salt: 17337K c/s real, 17337K c/s virtual

These are reasonable numbers.

> Is that normal?

Yes.

> I suppose that in the same machine, the OMP implementation should work
> faster, isn't?

No, it should not.  Why would it?

OpenMP means close coordination between the threads, which involves
overhead (one thread may sometimes wait for another, data may need to be
transferred between the different CPUs' caches), not to mention that
MT-safe code is often slower on its own (because of higher register
pressure and more complicated addressing modes).  BTW, the latter means
that you might be able to get better MPI performance for some of the hash
types by building without OpenMP.  From your benchmarks above, it is
unclear whether your MPI ones are for an MPI-only or an MPI+OpenMP build.

With MPI, there are separate processes, which are not synchronized to
each other.  So the order in which candidate passwords are tried is less
optimal (it does not reflect decreasing estimated probabilities as
closely), but the c/s rate is higher (no waiting, no extra data
transfers, no extra register pressure, no complicated addressing modes).
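
For reference, the ratios behind these comparisons can be tabulated from
the quoted benchmarks.  This is a small sketch in Python, assuming that
"efficiency" means the OpenMP "real" c/s rate divided by the MPI "real"
c/s rate for the same hash type, which is the reading that reproduces the
93.7% figure for traditional DES with many salts.  The dictionary name
and layout are illustrative; the numbers are copied from the benchmarks
quoted in this message.

```python
# "Real" c/s rates in thousands (K), as (OpenMP, MPI) pairs, copied from
# the quoted benchmarks. 857480 c/s is written here as 857.48K.
BENCHMARKS = {
    "DES, many salts": (47087, 50263),
    "DES, one salt": (21921, 48422),
    "LM, raw": (39714, 684363),
    "nethalflm, many salts": (29949, 52215),
    "nethalflm, one salt": (1622, 26010),
    "netlm, many salts": (28550, 52331),
    "netlm, one salt": (857.48, 17337),
}


def efficiency(omp_kcs, mpi_kcs):
    """OpenMP rate as a percentage of the MPI rate (assumed definition)."""
    return 100.0 * omp_kcs / mpi_kcs


def report(benchmarks):
    """Return one formatted line per benchmark."""
    return [
        f"{name:24s} {efficiency(omp, mpi):5.1f}%"
        for name, (omp, mpi) in benchmarks.items()
    ]


for line in report(BENCHMARKS):
    print(line)
```

The first row comes out at 93.7%, while the LM row comes out below 6%,
which is the gap between "slow" and "fast" hashes discussed here.
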
So far, the only exception I am aware of - where OpenMP is actually
faster in terms of c/s rate by a few percent - is the Blowfish-based
crypt(3) code on UltraSPARC T2.  My guess is that this is due to sharing
of code and mostly read-only data between the threads, which helps use
the CPU's L1 caches more optimally.

> 1.7.9-jumbo-5_mpi+omp [linux-x86-64]

So that's it - you need an MPI-only build for even better performance.
Also, instead of linux-x86-64 you may try linux-x86-64i (with the "i")
for much better performance at MD5-based hashes.

I would very much appreciate it if you submit your OpenMP and single CPU
core benchmarks to the wiki:

http://openwall.info/wiki/john/benchmarks

Please also post the corresponding MPI benchmarks in here, or maybe add a
third table to the wiki.  The benchmark results you posted so far are
probably those relevant to your intended use, which is just right, but
for the wiki we need info on specific hash types.

Thanks,

Alexander