john-dev - Re: Interleaving of intrinsics



Message-Id: <C1CBCB14-47FA-455E-A822-0FB0A08F75C3@gmail.com>
Date: Thu, 11 Jun 2015 10:11:13 +0800
From: Lei Zhang <zhanglei.april@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Interleaving of intrinsics

> On Jun 11, 2015, at 1:19 AM, magnum <john.magnum@...hmail.com> wrote:
> 
> On 2015-06-10 17:59, Lei Zhang wrote:
>> I further did some investigation into the asm code generated under x1
>> & x2 (SIMD_PARA_SHA256) by icc on my laptop (AVX). In SSESHA256body,
>> there're about 200 vmovdqu instructions generated under x1, and the
>> number is 260 under x2. Most of the vmovdqu instructions seem to be
>> used for loading & storing xmm registers, only a few for
>> inter-register moving. I think it's likely those additional vmovdqu
>> instructions under x2 are for register spilling.
> 
> So we get 30% more load/store for 100% more work done. That should be a win! But this assumes we're not having actual loops in the code.

I manually checked the optimization report icc produces with x2 interleaving. By matching the line numbers of the unrolled loops listed in the report against the source, I can tell whether a specific loop was unrolled.

There are 13 instances of SHA256_PARA_DO in SSESHA256body. According to icc's report, 10 of them are fully unrolled.

In addition, there are 64 instances of SHA256_STEP, each of which in turn invokes SHA256_PARA_DO. But according to the report, none of them is unrolled.

So in total there are 13 + 64 = 77 loops contributed by SHA256_PARA_DO, but only 10 of them are unrolled. That doesn't look good.
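
To make concrete what "unrolled" means here, below is a minimal sketch in plain C. It is not the actual John the Ripper code: PARA, PARA_DO, rotr32 and both function names are hypothetical stand-ins, and I'm assuming SHA256_PARA_DO expands to a short counted loop over the interleave factor. Plain uint32_t values take the place of SIMD vectors so the sketch builds anywhere.

```c
#include <stdint.h>

/* Hypothetical interleave factor, standing in for SIMD_PARA_SHA256 at x2. */
#define PARA 2

/* Hypothetical loop macro; assumed to resemble SHA256_PARA_DO. */
#define PARA_DO(i) for ((i) = 0; (i) < PARA; (i)++)

/* 32-bit rotate right; only called with 0 < n < 32 in this sketch. */
static inline uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* Looped form: SHA-256 big Sigma0 applied to PARA independent values.
 * If the compiler leaves this as a real loop, each iteration carries
 * loop-control overhead and the two lanes cannot be freely interleaved. */
void sigma0_looped(uint32_t out[PARA], const uint32_t in[PARA])
{
    unsigned i;
    PARA_DO(i)
        out[i] = rotr32(in[i], 2) ^ rotr32(in[i], 13) ^ rotr32(in[i], 22);
}

/* What full unrolling amounts to for PARA == 2: two independent
 * straight-line statements that the scheduler may interleave. */
void sigma0_unrolled(uint32_t out[PARA], const uint32_t in[PARA])
{
    out[0] = rotr32(in[0], 2) ^ rotr32(in[0], 13) ^ rotr32(in[0], 22);
    out[1] = rotr32(in[1], 2) ^ rotr32(in[1], 13) ^ rotr32(in[1], 22);
}
```

Both functions compute the same result; the only difference is whether the lanes are expressed as a loop or as straight-line code. The second form is what we want the compiler to produce for every SHA256_PARA_DO, including those inside SHA256_STEP.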

Lei
