Date: Thu, 15 Oct 2020 19:08:32 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: SIMD performance impact

On Tue, Sep 01, 2020 at 11:48:02AM +0200, Vincent wrote:
> On 01/09/2020 02:01, Rich Rumble wrote:
> >On Mon, Aug 31, 2020 at 5:09 PM Vincent wrote:
> >>John can use instruction set specific optimizations to fully exploit
> >>technology like SIMD. I haven't got CPUs that support AVX-512 but I'm
> >>very interested in the possible performance gains. So my question is:
> >>can someone with the latest generation CPU run a 'john --test' with
> >>different instruction set binaries (for example SSE4.2, AVX, AVX2,
> >>AVX-512) on the same CPU?
> >>
> >You may find some of what you're after on the WIKI, not only in terms of
> >instructions optimizations, but threading as well as parallel/workload
> >splitting methods like MPI, HT and thread count. You could certainly force
> >JtR to build and favor those instructions, but I think by default it tries
> >to optimize on what is detected with ./configure and when built too I
> >believe it will try to figure out what is present.
> 
> For sure it's possible. So I would be interested in a:
> 
> cd $john/src/
> lscpu
> for simd in sse4.2 avx avx2 avx512f avx512bw
> do
> 	./configure --enable-simd=$simd
> 	make -s clean
> 	make -s
> 	echo "$simd"
> 	../run/john --test
> done
> 
> I just haven't got the latest and greatest AMD and Intel CPUs to run it 
> myself ;)

I'm sorry I haven't bothered running deliberately suboptimal builds on
recent CPUs yet, but I've just added benchmarks of AWS EC2 c5.24xlarge
(2x Intel Xeon Platinum 8275CL, 3.6 GHz all-core turbo) and AWS EC2
c5a.24xlarge (AMD EPYC 7R32, ~3.3 GHz sustained turbo) as text files
linked from these AWS EC2 instance names at:

https://www.openwall.com/john/cloud/

The Intel benchmark uses AVX-512, the AMD one uses AVX2, except where
the corresponding JtR format doesn't support SIMD (e.g., bcrypt) or
doesn't support wide SIMD (e.g., scrypt uses plain AVX).
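The bracketed tag that "john --test" prints for each format already tells
you which SIMD ISA and vector width a build ended up with, so you don't
have to guess.  A small sketch (using a sample line from the Intel run
below) of pulling that information out:

```python
import re

# Parse the bracketed tag from a "john --test" benchmark line,
# e.g. "[SHA512 512/512 AVX512BW 8x]": primitive, vector width in bits,
# and the SIMD instruction set the build uses for that format.
line = ('Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) '
        '[SHA512 512/512 AVX512BW 8x]... (96xOMP) DONE')

m = re.search(r'\[(\w+) (\d+)/(\d+) (\S+)', line)
algo, width, _, isa = m.groups()
print(algo, isa, f'{width}-bit vectors')  # SHA512 AVX512BW 512-bit vectors
```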

AVX-512 wins by a large margin, but on the other hand it's two Intel
chips for the 96 vCPUs vs. just one AMD chip for the same vCPU count.
Much higher TDP for the two chips, too.

Some highlights, Intel:

Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... (96xOMP) DONE
Many salts:     561512K c/s real, 5906K c/s virtual
Only one salt:  85685K c/s real, 1415K c/s virtual

Benchmarking: md5crypt, crypt(3) $1$ (and variants) [MD5 512/512 AVX512BW 16x3]... (96xOMP) DONE
Many salts:     7621K c/s real, 79396 c/s virtual
Only one salt:  6967K c/s real, 72432 c/s virtual

Benchmarking: md5crypt-long, crypt(3) $1$ (and variants) [MD5 32/64]... (96xOMP) DONE
Raw:    534528 c/s real, 5568 c/s virtual

Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X3]... (96xOMP) DONE
Speed for cost 1 (iteration count) of 32
Raw:    80382 c/s real, 838 c/s virtual

Benchmarking: scrypt (16384, 8, 1) [Salsa20/8 128/128 AVX]... (96xOMP) DONE
Speed for cost 1 (N) of 16384, cost 2 (r) of 8, cost 3 (p) of 1
Raw:    2946 c/s real, 30.8 c/s virtual

Benchmarking: sha256crypt, crypt(3) $5$ (rounds=5000) [SHA256 512/512 AVX512BW 16x]... (96xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    304481 c/s real, 3166 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 512/512 AVX512BW 8x]... (96xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    197071 c/s real, 2056 c/s virtual

Benchmarking: phpass ($P$9) [phpass ($P$ or $H$) 512/512 AVX512BW 16x3]... (96xOMP) DONE
Speed for cost 1 (iteration count) of 2048
Many salts:     4829K c/s real, 50306 c/s virtual
Only one salt:  3474K c/s real, 36118 c/s virtual

Benchmarking: Drupal7, $S$ (x16385) [SHA512 512/512 AVX512BW 8x]... (96xOMP) DONE
Speed for cost 1 (iteration count) of 16384
Raw:    65280 c/s real, 678 c/s virtual

AMD:

Benchmarking: descrypt, traditional crypt(3) [DES 256/256 AVX2]... (96xOMP) DONE
Many salts:     408354K c/s real, 4262K c/s virtual
Only one salt:  64290K c/s real, 668373 c/s virtual

Benchmarking: md5crypt, crypt(3) $1$ (and variants) [MD5 256/256 AVX2 8x3]... (96xOMP) DONE
Many salts:     4525K c/s real, 47138 c/s virtual
Only one salt:  3732K c/s real, 38803 c/s virtual

Benchmarking: md5crypt-long, crypt(3) $1$ (and variants) [MD5 32/64]... (96xOMP) DONE
Raw:    508800 c/s real, 5300 c/s virtual

Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X3]... (96xOMP) DONE
Speed for cost 1 (iteration count) of 32
Raw:    85104 c/s real, 884 c/s virtual

Benchmarking: scrypt (16384, 8, 1) [Salsa20/8 128/128 AVX]... (96xOMP) DONE
Speed for cost 1 (N) of 16384, cost 2 (r) of 8, cost 3 (p) of 1
Raw:    2211 c/s real, 23.2 c/s virtual

Benchmarking: sha256crypt, crypt(3) $5$ (rounds=5000) [SHA256 256/256 AVX2 8x]... (96xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    92739 c/s real, 984 c/s virtual

Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 256/256 AVX2 4x]... (96xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    64060 c/s real, 667 c/s virtual

Benchmarking: phpass ($P$9) [phpass ($P$ or $H$) 256/256 AVX2 8x3]... (96xOMP) DONE
Speed for cost 1 (iteration count) of 2048
Many salts:     2359K c/s real, 24528 c/s virtual
Only one salt:  1834K c/s real, 19163 c/s virtual

Benchmarking: Drupal7, $S$ (x16385) [SHA512 256/256 AVX2 4x]... (96xOMP) DONE
Speed for cost 1 (iteration count) of 16384
Raw:    21312 c/s real, 221 c/s virtual

It's curious that AVX-512 speeds up algorithms based on SHA-2 by a
factor of 3.  I guess that's due to the bit-rotate and "ternary logic"
instructions (3-input LUTs).  It's also curious that the same doesn't
help the faster hashes as much, especially not descrypt (even though
the "ternary logic" instructions are in use there, too), maybe because
we're exceeding L1 data cache with all the in-flight hashes.
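The factor-of-3 observation can be checked against the "real" c/s figures
quoted above.  Keep in mind this is not a same-CPU ISA comparison: it's
two Intel chips vs. one AMD chip at similar total vCPU count, so the
ratios only illustrate the point, they don't isolate AVX-512 itself.

```python
# Intel (AVX-512) vs. AMD (AVX2) "real" c/s from the benchmarks above.
intel = {'descrypt (many salts)': 561512e3, 'sha256crypt': 304481,
         'sha512crypt': 197071, 'Drupal7': 65280}
amd   = {'descrypt (many salts)': 408354e3, 'sha256crypt': 92739,
         'sha512crypt': 64060, 'Drupal7': 21312}

# SHA-2-based formats come out near 3x, descrypt only ~1.4x:
# descrypt 1.38x, sha256crypt 3.28x, sha512crypt 3.08x, Drupal7 3.06x
for fmt in intel:
    print(f'{fmt}: {intel[fmt] / amd[fmt]:.2f}x')
```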

Alexander
