Message-Id: <3A609CAC-128C-41E7-8C94-0AD2A8A879BA@gmail.com>
Date: Mon, 22 Jun 2015 21:31:14 +0800
From: Lei Zhang <zhanglei.april@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Interleaving of intrinsics

Below are the latest results from benchmarking interleaving factors. All PBKDF2-HMAC formats are covered, and highlighted values are the highest in their rows.

magnum's laptop (OS X), gcc 5.1.0

hash\para  | 1        | 2        | 3        | 4        | 5        |
-----------|----------|----------|----------|----------|----------|
md4        | 18836    | 29861    | 33100    | **33424**| 30780    |
md4-omp    | 79520    | 120128   |**121920**| 120192   | 112000   |
md5        | 13532    | 21584    | **24420**| 22976    | 21465    |
md5-omp    | 60736    | 86400    | **87920**| 84352    | 79360    |
sha1       | 10736    | **10952**| 8928     | 4032     | 3740     |
sha1-omp   | **41312**| 39744    | 34176    | 19968    | 19840    |
sha256     | **4664** | 2384     | 3516     | 3952     | 4120     |
sha256-omp | **16736**| 10560    | 13782    | 15207    | 14891    |
sha512     | **1881** | 839      | 1290     | 1512     | 1524     |
sha512-omp | **6848** | 3808     | 4800     | 5639     | 5386     |

MIC, icc 14.0.0

hash\para  | 1        | 2        | 3        | 4        | 5        |
-----------|----------|----------|----------|----------|----------|
md4        | 5687     | **6526** | 6510     | 6209     | 6196     |
md4-omp    | 669148   |**737882**| 711529   | 662588   | 466019   |
md5        | 4182     | 4942     | 5037     | 5005     | **5048** |
md5-omp    | 520871   |**536854**| 513267   | 462291   | 447378   |
sha1       | **2598** | 2321     | 1411     | 1415     | 1346     |
sha1-omp   |**282352**| 253514   | 180705   | 173886   | 163018   |
sha256     | **1077** | 855      | 830      | 887      | 880      |
sha256-omp |**119300**| 97882    | 96000    | 98642    | 97627    |
sha512     | 123      | 137      | 154      | 165      | **172**  |
sha512-omp | 15567    | 17614    | 19525    | 20389    | **21333**|

As stated in my previous messages, the '*_PARA_DO' macros used throughout for interleaving aren't always unrolled as expected. OTOH, when those '*_PARA_DO' loops are manually unrolled, the resulting code suffers significantly higher register pressure and runs slower (on x86). We've been stalled on this issue for a while.
Should we refine the method of interleaving, or stick with the current approach? What should we do next?


Lei

> On Jun 12, 2015, at 1:52 AM, magnum <john.magnum@...hmail.com> wrote:
>
> On 2015-06-11 15:58, Lei Zhang wrote:
>>
>>> On Jun 11, 2015, at 3:30 PM, magnum <john.magnum@...hmail.com> wrote:
>>>
>>> Now we're getting somewhere. What if you build the "unrolled" topic branch instead, using para 2 (I think I didn't add code for higher para yet). This will be manually unrolled. How many vmovdqu can you see in that? Do you see other differences compared to the bleeding code (at same para)?
>>
>> The manually unrolled version generates significantly longer asm code, with ~8000 instructions in SSESHA256body. This number is ~5000 in the auto-unrolled version.
>>
>> The number of vmovdqu is also a lot bigger: 1170, versus only 260 in the auto-unrolled version. The register pressure seems to be much heavier when loops are fully unrolled.
>>
>> Then performance (pbkdf2-hmac-sha256, x2):
>>
>> [auto-unrolled]
>> Raw: 235 c/s real, 235 c/s virtual
>> [fully-unrolled]
>> Raw: 133 c/s real, 133 c/s virtual
>>
>> Specs: laptop, icc, OpenMP disabled, turboboost disabled
>>
>> We can see the fully unrolled one is much slower. I think register pressure is playing a big role here.
>
> That version is manually interleaved x2, with no loop constructs and with non-array temp variables. We'll see what Solar says, but I presume this means we can just forget about interleaving SHA-2 on x86. And from what I've gathered I believe this also goes for SHA-1.
>
> magnum