Message-Id: <C1CBCB14-47FA-455E-A822-0FB0A08F75C3@gmail.com>
Date: Thu, 11 Jun 2015 10:11:13 +0800
From: Lei Zhang <zhanglei.april@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Interleaving of intrinsics

> On Jun 11, 2015, at 1:19 AM, magnum <john.magnum@...hmail.com> wrote:
>
> On 2015-06-10 17:59, Lei Zhang wrote:
>> I further did some investigation into the asm code generated under x1
>> & x2 (SIMD_PARA_SHA256) by icc on my laptop (AVX). In SSESHA256body,
>> there're about 200 vmovdqu instructions generated under x1, and the
>> number is 260 under x2. Most of the vmovdqu instructions seem to be
>> used for loading & storing xmm registers, only a few for
>> inter-register moving. I think it's likely those additional vmovdqu
>> instructions under x2 are for register spilling.
>
> So we get 30% more load/store for 100% more work done. That should be a win! But this assumes we're not having actual loops in the code.

I manually checked the report given by icc under interleaving x2. By checking the line numbers of the unrolled loops in the report, I can tell whether a specific loop in the source is unrolled.

There are 13 instances of SHA256_PARA_DO in SSESHA256body. According to icc's report, 10 of them are fully unrolled. In addition, there are 64 instances of SHA256_STEP, each of which in turn invokes SHA256_PARA_DO. But none of those are unrolled according to the report.

So in total there are 13 + 64 = 77 loops contributed by SHA256_PARA_DO, but only 10 of them are unrolled. That doesn't look good.


Lei
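
For readers who do not have the source at hand, here is a minimal sketch of the kind of loop being counted above. It is not the actual John the Ripper code: PARA, PARA_DO, vec4, vec4_add and interleaved_add are hypothetical stand-ins (PARA for SIMD_PARA_SHA256 at x2, PARA_DO for SHA256_PARA_DO, vec4 for one xmm register), and plain C is used instead of intrinsics so it compiles anywhere. It only illustrates why it matters whether such a short loop is fully unrolled.

```c
#include <stdint.h>

/* Hypothetical stand-ins, not the real definitions:
   PARA plays the role of SIMD_PARA_SHA256 (x2 here),
   PARA_DO plays the role of SHA256_PARA_DO,
   vec4 plays the role of one 128-bit xmm register with four 32-bit lanes. */
#define PARA 2
#define PARA_DO(x) for ((x) = 0; (x) < PARA; ++(x))

typedef struct { uint32_t lane[4]; } vec4;

/* Lane-wise 32-bit add, standing in for a SIMD add intrinsic. */
static vec4 vec4_add(vec4 a, vec4 b)
{
    vec4 r;
    for (int k = 0; k < 4; ++k)
        r.lane[k] = a.lane[k] + b.lane[k];
    return r;
}

/* One interleaved operation: the same add is issued for each of the PARA
   independent vectors. If the compiler fully unrolls this short loop, a[0]
   and a[1] can each be kept in a register. If it stays a real loop, the
   arrays are indexed at run time and generally have to live in memory,
   which shows up as extra loads and stores. */
static void interleaved_add(vec4 a[PARA], const vec4 b[PARA])
{
    unsigned int i;
    PARA_DO(i)
        a[i] = vec4_add(a[i], b[i]);
}
```

With PARA set to 1 the loop body runs once and the question of unrolling does not arise, so the concern applies only at x2 and above.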