Message-Id: <421472C5-7E1B-4959-AF80-9B91777B7D5A@gmail.com>
Date: Tue, 9 Jun 2015 20:46:32 +0800
From: Lei Zhang <zhanglei.april@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Interleaving of intrinsics

> On Jun 4, 2015, at 10:22 PM, magnum <john.magnum@...hmail.com> wrote:
> 
> On 2015-06-02 13:01, Solar Designer wrote:
>> Would it be reasonable for us to try my usual approach, with separate
>> variables at the outer scope (inside the hashing function, but not
>> inside the individual steps)?  And if those are in fact separate
>> variables rather than array elements, this implies manual or cpp level
>> loop unrolling.
> 
> I tried this out with MD5 and SHA256 in a topic branch. It doesn't seem to make any difference compared to loops and arrays.
> 
> https://github.com/magnumripper/JohnTheRipper/commit/1ccc69541fef79c0f20f3143a2fcf3bedac55d30
> 
> Also, other tests (before that) indicate per-line loops vs. block loops for interleaving does not make any difference either, at least not for gcc. Perhaps it does for icc (as tested on super), but all results are so fluctuating and inconclusive I just get more confused the more I test. Perhaps turbo boost and stuff are playing up.
> 
> Perhaps Lei can make some conclusions from generated asm code. I think that's the only way of telling what actually happens.
> 
> Maybe we under-estimate the compilers. I'm starting to think MD4 and MD5 interleaves fine poorly coded or not, while SHA1/SHA2 formats simply does not interleave well regardless of coding. If that's the case it would be a relief in a way: We could just keep the readable and straight-forward code...
> 
> magnum

I looked at the 'size' output for sse-intrinsics.o built with different interleaving factors, compiled with clang and icc respectively.

lei-mac:src lei$ size clang/*
__TEXT  __DATA  __OBJC  others  dec     hex
122863  0       0       26572   149435  247bb   clang/x1.o
127951  0       0       28699   156650  263ea   clang/x2.o
128479  0       0       28614   157093  265a5   clang/x3.o
127679  0       0       28527   156206  2622e   clang/x4.o

lei-mac:src lei$ size icc/*
__TEXT  __DATA  __OBJC  others  dec     hex
102084  7545    0       50442   160071  27147   icc/x1.o
113012  9799    0       49375   172186  2a09a   icc/x2.o
113348  9799    0       51275   174422  2a956   icc/x3.o
114740  9799    0       53235   177774  2b66e   icc/x4.o

It seems clang refuses to unroll some loops when the interleaving factor is increased to 4, but icc unrolls just fine.

icc has a relatively unique feature: it can emit an optimization report while compiling. I went through the reports produced by icc under different interleaving factors and counted the number of loops fully unrolled.

interleaving    loops unrolled
--------------------------------------
x1              215
x2              225
x3              225
x4              225

We can see that the number of loops unrolled doesn't change across interleaving factors 2-4. It's a bit lower for x1, which I guess is because icc considers loops that iterate only once not worth unrolling.

I haven't experimented with gcc, but I think it's quite possible that the *_PARA_DO() approach doesn't eventually lead to fully unrolled code. Explicit unrolling may be needed for interleaving. As for the precise implications for performance, I'm not sure yet.

Lei
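
To make the two coding styles discussed above concrete, here is a minimal sketch. It is not the actual JtR code: PARA, step_loop(), step_unrolled() and STEP1() are hypothetical names, and plain uint32_t values stand in for the SIMD vectors so that it builds anywhere. The first variant keeps the interleaved state in arrays and relies on the compiler to fully unroll the loop; the second uses separate variables with cpp-level unrolling, so the interleaving does not depend on the optimizer.

```c
#include <assert.h>
#include <stdint.h>

#define PARA 4  /* hypothetical interleaving factor for this sketch */

/* Rotate left by s bits; s must be in 1..31. */
static uint32_t rotl32(uint32_t v, unsigned s)
{
    assert(s >= 1 && s <= 31);
    return (v << s) | (v >> (32 - s));
}

/* MD5-style round function F. */
static uint32_t f(uint32_t x, uint32_t y, uint32_t z)
{
    return z ^ (x & (y ^ z));
}

/* Variant 1: state in arrays, one loop per step.
 * Interleaving only happens if the compiler fully unrolls this loop. */
static void step_loop(uint32_t a[PARA], const uint32_t b[PARA],
                      const uint32_t c[PARA], const uint32_t d[PARA],
                      const uint32_t x[PARA], unsigned s)
{
    unsigned i;
    for (i = 0; i < PARA; i++)
        a[i] = rotl32(a[i] + f(b[i], c[i], d[i]) + x[i], s) + b[i];
}

/* Variant 2: separate variables, unrolled by the preprocessor.
 * Written out for exactly four lanes, so it must match PARA == 4. */
#define STEP1(n) \
    a##n = rotl32(a##n + f(b[n], c[n], d[n]) + x[n], s) + b[n];

static void step_unrolled(uint32_t a[PARA], const uint32_t b[PARA],
                          const uint32_t c[PARA], const uint32_t d[PARA],
                          const uint32_t x[PARA], unsigned s)
{
    uint32_t a0 = a[0], a1 = a[1], a2 = a[2], a3 = a[3];
    STEP1(0) STEP1(1) STEP1(2) STEP1(3)
    a[0] = a0; a[1] = a1; a[2] = a2; a[3] = a3;
}

/* Returns 1 if both variants compute the same result on sample data. */
static int variants_agree(void)
{
    uint32_t a1[PARA] = { 1, 2, 3, 4 }, a2[PARA] = { 1, 2, 3, 4 };
    const uint32_t b[PARA] = { 0x67452301u, 0xefcdab89u, 5, 6 };
    const uint32_t c[PARA] = { 0x98badcfeu, 7, 8, 9 };
    const uint32_t d[PARA] = { 0x10325476u, 10, 11, 12 };
    const uint32_t x[PARA] = { 13, 14, 15, 16 };
    unsigned i;

    step_loop(a1, b, c, d, x, 7);
    step_unrolled(a2, b, c, d, x, 7);
    for (i = 0; i < PARA; i++)
        if (a1[i] != a2[i])
            return 0;
    return 1;
}
```

Both variants compute the same values, so any difference would show up only in the generated code. Comparing the asm or object size of the two, built with the same compiler and flags, is one way to check whether the loop version really ends up fully unrolled.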