Message-ID: <20150622180309.GB17277@openwall.com>
Date: Mon, 22 Jun 2015 21:03:09 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Interleaving of intrinsics

On Mon, Jun 22, 2015 at 09:31:14PM +0800, Lei Zhang wrote:
> Below are the latest results from benchmarking interleaving factors. We're using all PBKDF2-HMAC formats here, and highlighted values are the highest in their rows.
>
> magnum's laptop (OS X), gcc 5.1.0
>
> hash\para  |    1     |    2     |    3     |    4     |    5     |
> -----------|----------|----------|----------|----------|----------|
> md4        |    18836 |    29861 |    33100 | **33424**|    30780 |
> md4-omp    |    79520 |   120128 |**121920**|   120192 |   112000 |
> md5        |    13532 |    21584 | **24420**|    22976 |    21465 |
> md5-omp    |    60736 |    86400 | **87920**|    84352 |    79360 |
> sha1       |    10736 | **10952**|     8928 |     4032 |     3740 |
> sha1-omp   | **41312**|    39744 |    34176 |    19968 |    19840 |
> sha256     |  **4664**|     2384 |     3516 |     3952 |     4120 |
> sha256-omp | **16736**|    10560 |    13782 |    15207 |    14891 |
> sha512     |  **1881**|      839 |     1290 |     1512 |     1524 |
> sha512-omp |  **6848**|     3808 |     4800 |     5639 |     5386 |

I think the -omp speeds here don't matter much, except possibly for
SHA-512. Efficiency at OpenMP for these fast hashes is low. What would
matter more is cumulative or per-process speed with --fork.

> MIC, icc 14.0.0
>
> hash\para  |    1     |    2     |    3     |    4     |    5     |
> -----------|----------|----------|----------|----------|----------|
> md4        |     5687 |  **6526**|     6510 |     6209 |     6196 |
> md4-omp    |   669148 |**737882**|   711529 |   662588 |   466019 |
> md5        |     4182 |     4942 |     5037 |     5005 |  **5048**|
> md5-omp    |   520871 |**536854**|   513267 |   462291 |   447378 |
> sha1       |  **2598**|     2321 |     1411 |     1415 |     1346 |
> sha1-omp   |**282352**|   253514 |   180705 |   173886 |   163018 |
> sha256     |  **1077**|      855 |      830 |      887 |      880 |
> sha256-omp |**119300**|    97882 |    96000 |    98642 |    97627 |
> sha512     |      123 |      137 |      154 |      165 |   **172**|
> sha512-omp |    15567 |    17614 |    19525 |    20389 | **21333**|

Are all of those speeds consistently in thousands c/s?
If so, I don't understand how we may possibly achieve e.g. 737882
thousand(?) c/s, thus almost 738 million c/s, on MIC with our current
approach at OpenMP parallelization. We surely should in fact achieve
such speed, and even higher than that, with proper OpenMP
parallelization, but we don't have proper OpenMP parallelization for
fast hashes yet.

Is this possibly just 737882 c/s? Thus, almost 10x _slower_ than the
single-thread speed reported on the line above? If so, this is
realistic... unfortunately. But it is also totally irrelevant to the
interleaving task. This makes me wonder why you bothered to benchmark
it and record those numbers at all.

--fork=240 speeds would actually make sense here. The -omp speeds do
not: these hashes are way too fast for sane efficiency at OpenMP on
MIC, with our current poor approach.

Unless I am missing something?

> As stated in my previous messages, the '*_PARA_DO' stuff used prevalently for interleaving isn't always unrolled as expected. OTOH, when manually unrolling those '_PARA_DO's, the resulting code gets significantly higher register pressure, and runs slower (on x86).
>
> We've been stalling on this issue for a while. Should we refine the method of interleaving or just stay with the current approach? What to do next?

It is difficult for me to provide advice on this without actually
diving into the task myself, and essentially replacing you. I'd be
reviewing the generated assembly code, making changes, and reviewing
the code again. And indeed benchmarking, too.

One thing that is clear is that non-fully-unrolled *_PARA_DO are not
acceptable. If there are not enough registers for fully unrolling
these without incurring spilling, then the interleaving factor should
be smaller. On MIC, there should be enough registers for the
interleaving factors considered above (up to 5x).

Another thing that is clear is that you, Lei, need to have a better
understanding and feeling for what performance figures are sane vs.
insane.
And for when our current OpenMP parallelization makes sense vs. does
not. For the hashes benchmarked above, it does not: it's just a
formality, a correctness test. It makes sense for slow hashes, and
does not make sense for fast hashes. For fast hashes, --fork=240
speeds matter. (And maybe eventually we'll rework our OpenMP
parallelization such that it'd achieve similar speeds at fast hashes
too, or alternatively maybe some particular fast hash formats will get
built-in mask mode candidate password generators and hash comparisons,
like Sayantan implemented for a few of them in OpenCL.)

Alexander
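
For readers unfamiliar with the term, here is a minimal sketch of what
"interleaving" with a compile-time factor means, in plain scalar C
rather than SIMD intrinsics. Everything named here (PARA, PARA_DO,
TOY_K, toy_round, toy_single, toy_interleaved) is hypothetical and made
up for illustration; these are not John the Ripper's actual macros, and
the round function is a toy, not MD4, MD5 or any SHA. The point is only
that each lane is an independent dependency chain, and that the inner
loop has a constant trip count so that a compiler may fully unroll it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interleaving factor (number of independent lanes). */
#define PARA 3

/* Hypothetical stand-in for a *_PARA_DO style macro: a loop whose trip
 * count is a compile-time constant, so full unrolling is possible. */
#define PARA_DO(i) for ((i) = 0; (i) < PARA; (i)++)

/* Arbitrary constant used by the toy round function. */
#define TOY_K 0x5a827999u

static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    /* n must be in 1..31; this sketch only ever passes 5. */
    return (x << n) | (x >> (32 - n));
}

/* Toy round: each result depends on the previous value of x, which
 * forms a serial dependency chain within one lane. */
static inline uint32_t toy_round(uint32_t x, uint32_t r)
{
    return rotl32(x + TOY_K + r, 5) ^ x;
}

/* Reference: one lane, no interleaving. */
uint32_t toy_single(uint32_t x, uint32_t rounds)
{
    uint32_t r;
    for (r = 0; r < rounds; r++)
        x = toy_round(x, r);
    return x;
}

/* Interleaved: PARA independent lanes advance one round at a time.
 * If the inner loop is fully unrolled and all PARA values stay in
 * registers, the lanes' dependency chains can overlap in the CPU. */
void toy_interleaved(uint32_t x[PARA], uint32_t rounds)
{
    uint32_t r;
    unsigned i;

    assert(x != NULL);
    for (r = 0; r < rounds; r++) {
        PARA_DO(i)
            x[i] = toy_round(x[i], r);
    }
}
```

Calling toy_interleaved() on an array of PARA starting values should
give the same per-lane results as calling toy_single() on each value
separately. Whether the inner loop is really unrolled, and whether the
PARA live values fit in registers without spilling, can only be
confirmed by reading the generated assembly for a given compiler and
target.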