Date: Fri, 14 Sep 2012 20:10:53 +0200
From: Simon Marechal <simon@...quise.net>
To: john-dev@...ts.openwall.com
Subject: Re: intrinsics: speed up for linux-x86-64-native

On 09/14/2012 07:21 PM, magnum wrote:
> It's not very intuitive but AFAIK it was made that way on purpose: the intention is to get code that hides latency, much like GPU coding. This is pretty compiler-dependent, which is why icc can make such a great difference.
> 
> That is about all I know so I'll leave the details to the experts. I hope someone can give a more thorough explanation - I'd read it with interest.

This is exactly the right explanation :)

I haven't read an Intel optimization guide since the Pentium, but AFAIK
this should mainly hide instruction dependency latency.

Processors are designed around an instruction pipeline: instructions
are decoded and pre-processed in stages as they flow through it.

Every time you increase the pipeline length, you can reduce the
complexity of each stage, and thus increase the clock frequency. The
problem is that several kinds of events can prevent instructions from
entering the pipeline, or stall them inside it, reducing the
instructions-per-cycle ratio. With a long pipeline, those events carry
huge penalties. This is why the Netburst architecture (Pentium 4)
under-performed even though it was designed for raw clock speed: it
had a very long pipeline and poor reordering and prediction
facilities.

For example, on a conditional branch only one of the two possible
paths is fed into the pipeline: the one the processor predicts will be
taken. If the prediction turns out to be wrong, the pipeline must be
flushed and refilled. This is why so much effort is spent, in both CPU
and compiler design, on predicting the execution flow correctly.
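
As an illustration, this is also why a hard-to-predict branch is
sometimes replaced with arithmetic on a mask. A toy C sketch, purely
illustrative and not anything from JtR:

#include <stdint.h>

/* Branchy select: a mispredicted branch costs a pipeline flush. */
uint32_t select_branchy(uint32_t cond, uint32_t a, uint32_t b)
{
    if (cond)
        return a;
    return b;
}

/* Branchless select: the condition becomes an all-ones/all-zeroes
   mask, so there is nothing for the branch predictor to get wrong. */
uint32_t select_branchless(uint32_t cond, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (cond != 0); /* 0xffffffff or 0 */
    return (a & mask) | (b & ~mask);
}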

Another case, which is what this optimization specifically addresses,
is instruction dependency. If an instruction uses the result of
another, it stalls in the pipeline until the latter has completed. So
we replace:

b <- a
c <- b

(full penalty between those two steps)
...

by

b1 <- a1
b2 <- a2
b3 <- a3
c1 <- b1
c2 <- b2
c3 <- b3

(penalty is reduced by 2 pipeline steps here)
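
In C, the same transformation could look like this (a toy sketch with
made-up helpers f() and g(), not actual JtR code):

static unsigned f(unsigned x) { return x * 2654435761u; }
static unsigned g(unsigned x) { return x ^ (x >> 13); }

/* Dependent chain: each g(b) must wait for the f() just before it. */
void chain(const unsigned *a, unsigned *c, int n)
{
    for (int i = 0; i < n; i++) {
        unsigned b = f(a[i]);
        c[i] = g(b);
    }
}

/* Three interleaved streams: while g(b1) waits for b1, the CPU can
   already execute f(a[i + 1]) and f(a[i + 2]), which are
   independent. */
void interleaved(const unsigned *a, unsigned *c, int n)
{
    for (int i = 0; i + 2 < n; i += 3) {
        unsigned b1 = f(a[i]);
        unsigned b2 = f(a[i + 1]);
        unsigned b3 = f(a[i + 2]);
        c[i]     = g(b1);
        c[i + 1] = g(b2);
        c[i + 2] = g(b3);
    }
}

In JtR's intrinsics the independent streams are, as far as I
understand it, different candidate keys being hashed in parallel, so
there is always independent work available to interleave.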

Modern processors can also reorder instructions dynamically
(out-of-order execution) to reduce the impact of such dependencies.

In practice it is a bit more complicated, because our instructions
work on registers (memory local to a core that is very fast and very
scarce), and data movement between registers and other memory must be
kept to a minimum. If we overdo the interleaving, we won't have enough
registers to hold all the intermediate values, the compiler starts
spilling to slow memory, and that kills performance. ICC is really
good at register allocation and scheduling, which explains the huge
speed increase.
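
To make the register pressure concrete for the x86-64 case: there are
only 16 xmm registers, and each interleaved stream needs its own set
of live values. A purely illustrative SSE2 sketch (made-up round step,
not JtR code):

#include <emmintrin.h> /* SSE2 */

/* One made-up "round step" computed on three interleaved 4-wide
   vectors. Each stream keeps its own temporaries live, so the
   interleave factor is bounded by the 16 xmm registers of x86-64;
   push it further and the compiler spills to the stack. */
void step3(__m128i a[3], const __m128i b[3], const __m128i c[3])
{
    __m128i t0 = _mm_add_epi32(b[0], c[0]);
    __m128i t1 = _mm_add_epi32(b[1], c[1]);
    __m128i t2 = _mm_add_epi32(b[2], c[2]);

    a[0] = _mm_xor_si128(a[0], t0);
    a[1] = _mm_xor_si128(a[1], t1);
    a[2] = _mm_xor_si128(a[2], t2);
}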

On a GPU, what you mostly want to hide is memory latency. You do that
by running many more threads than you have processing units available;
each processing unit has many times more registers than a CPU for this
purpose. When a thread performs a memory access, another thread takes
over while the data is transferred. I believe that optimizing GPU
workloads is a LOT of fun.

I hope this roughly explains what is going on, and that it is not too wrong.
