john-dev - Re: AVX in Intel Sandy Bridge

Message-ID: <20110505195056.GA1500@openwall.com>
Date: Thu, 5 May 2011 23:50:56 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: AVX in Intel Sandy Bridge

On Sat, Apr 30, 2011 at 02:24:13PM +0400, Solar Designer wrote:
> Apparently, it might be possible to achieve some speedup by mixing
> 128-bit and 256-bit AVX ops.  I tried many different combinations, but
> not this one.  I will likely give it a try.

I gave it a try yesterday on a non-overclocked Core i7-2600K (3.4 GHz
base, up to 3.8 GHz with Turbo Boost) running Ubuntu 11.04 x86_64 with
Ubuntu's gcc 4.5.2.  No luck: the mixed 256+128 build is slower than
plain 128-bit AVX.

1.7.7 as-is (128-bit AVX):

Benchmarking: Traditional DES [128/256 BS AVX]... DONE
Many salts:     5004K c/s real, 5054K c/s virtual
Only one salt:  4184K c/s real, 4184K c/s virtual

384-bit virtual vectors (256+128):

Benchmarking: Traditional DES [256/256 BS AVX + 128/128 BS AVX]... DONE
Many salts:     3661K c/s real, 3698K c/s virtual
Only one salt:  3188K c/s real, 3188K c/s virtual

I am still going to commit the code (disabled by default) so that it can
be tested on future CPUs, etc.
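
To illustrate what a 384-bit "virtual vector" means for a bitslice
implementation, here is a minimal sketch.  It is not the code being
committed: the type and function names (v256, v128, vtype384, vxor384)
are hypothetical, and plain 64-bit words stand in for the AVX registers
so that the example builds without AVX support.  The point is only that
one logical gate on a virtual vector expands into one 256-bit op plus
one 128-bit op.

```c
#include <stdint.h>

/* Hypothetical stand-ins for one 256-bit and one 128-bit AVX register. */
typedef struct { uint64_t w[4]; } v256;
typedef struct { uint64_t w[2]; } v128;

/* One 384-bit virtual vector: a 256-bit part plus a 128-bit part. */
typedef struct { v256 wide; v128 narrow; } vtype384;

/*
 * Bitslice gates are plain bitwise operations applied to whole vectors.
 * XOR is shown here; AND, OR and NOT would follow the same pattern.
 */
static vtype384 vxor384(vtype384 a, vtype384 b)
{
    vtype384 r;
    int i;

    for (i = 0; i < 4; i++)     /* would be a single 256-bit op */
        r.wide.w[i] = a.wide.w[i] ^ b.wide.w[i];
    for (i = 0; i < 2; i++)     /* would be a single 128-bit op */
        r.narrow.w[i] = a.narrow.w[i] ^ b.narrow.w[i];

    return r;
}

/* In a bitslice layout each bit position carries one DES instance. */
static unsigned int vtype384_depth(void)
{
    return (unsigned int)(sizeof(vtype384) * 8);
}
```

In this sketch every gate costs two instructions instead of one, in
exchange for processing 384 instances per pass rather than 128 or 256.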

I also experimented with 32-bit builds (after "aptitude install
libc6-dev-i386" and adding "-m32" to the proper make targets).
As I had predicted but not actually tested before, 256-bit AVX is
faster than 128-bit for 32-bit builds, where only 8 vector registers
are available.

Here are the numbers for 32-bit builds.

256-bit AVX (default in 1.7.7 for 32-bit builds):

Benchmarking: Traditional DES [256/256 BS AVX]... DONE
Many salts:     4238K c/s real, 4281K c/s virtual
Only one salt:  3456K c/s real, 3456K c/s virtual

128-bit AVX (a trivial change to x86-sse.h):

Benchmarking: Traditional DES [128/256 BS AVX]... DONE
Many salts:     3490K c/s real, 3525K c/s virtual
Only one salt:  2938K c/s real, 2938K c/s virtual
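
For reference, the relative differences between the "many salts" real
c/s figures quoted in this message can be worked out with a small
helper.  This is just arithmetic on the numbers above; the function
name percent_change is hypothetical.

```c
/*
 * Relative change of `measured` versus `baseline`, in percent.
 * Positive means faster than the baseline, negative means slower.
 */
static double percent_change(double measured, double baseline)
{
    return (measured - baseline) / baseline * 100.0;
}

/* 64-bit build: 256+128 virtual vectors (3661K) vs. 128-bit AVX (5004K). */
static double change_64bit_mixed(void)
{
    return percent_change(3661.0, 5004.0);
}

/* 32-bit build: 256-bit AVX (4238K) vs. 128-bit AVX (3490K). */
static double change_32bit_256(void)
{
    return percent_change(4238.0, 3490.0);
}
```

With these inputs the mixed 256+128 build comes out roughly 27% slower
than 128-bit AVX on x86_64, while 256-bit AVX comes out roughly 21%
faster than 128-bit AVX in the 32-bit build.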

BTW, I think that my use of 128/256 in the algorithm name was wrong -
we're actually telling the CPU that we're using 128-bit vectors, so it
is not entirely correct to say that we're only using 128 out of 256
bits, which is what the notation was meant to be used for (e.g., it is
used like that when it says 48/64 for non-bitslice DES).  Thus, I am
going to change these to say 128/128 for the next version.

Alexander
