|
Message-ID: <20150311212102.GA20481@openwall.com> Date: Thu, 12 Mar 2015 00:21:02 +0300 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: bitslice MD*/SHA*, AVX2 On Wed, Mar 11, 2015 at 11:55:09PM +0300, Solar Designer wrote: > There's definitely some further room for optimization in md5slice.c - > for example, it does not yet take advantage of the common subexpressions > in the invocations of HH() (we can save 8*32 XORs). BTW, it appears that we don't currently use this optimization. Not even in the copy of my portable MD5 implementation in JtR, which I've since updated to include this optimization here: http://openwall.info/wiki/people/solar/software/public-domain-source-code/md5 In my testing, this might not be beneficial on 2-operand archs such as plain x86, but it should be on 3-operand archs such as AVX. So we should update the code in sse-intrinsics.c, and benchmark. And we should update the plain C code anyway, such as for non-x86 archs (which are mostly 3-operand RISC). magnum, Jim? > FF() has everything > merged into it manually and specialized, whereas GG(), HH(), and II() > use other functions. Also, explicit (such as via cpp or with a script) > unrolling and inlining might (or might not) work better than gcc's (the > various *ptr++'s will be replaced with explicit array accesses at fixed > indices, which I think gcc does for us anyway). BTW, the bitselect instructions on XOP and GCN, and the ternary logic instructions on AVX-512 and Maxwell are usable for both straightforward and bitslice implementations, but it might be tough to come up with their optimal usage for the latter. We get a large circuit that we need to map onto those bitselect's or LUT-3's optimally. In straightforward implementations, we simply see which portions of the MD*/SHA* basic functions fit bitselect's or LUT-3's. With bitslice implementations, we can do the same, but it is also possible that portions of logic coming from ADDs or crossing boundaries between former bitwise ops and ADDs (even across former bit rotates) fit bitselect's or LUT-3's. In some cases, there might be tradeoffs here - e.g., don't fully put a basic MD*/SHA* function into a LUT-3, but instead put a portion of it intermixed with nearby logic coming from ADDs into two LUT-3's. Or even add redundant extra logic to make the whole thing fit LUT-3's better. In other words, a greedy algorithm putting everything into the most capable instructions sequentially might not produce the best results. I did not look into this closely yet, so this is pure speculation on what optimization opportunities and challenges we might face. Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.