Date: Wed, 11 Nov 2020 21:10:59 +0100
From: Solar Designer <>
Subject: Re: SIMD performance impact

On Wed, Nov 11, 2020 at 08:58:05PM +0100, Solar Designer wrote:
> On Thu, Oct 15, 2020 at 07:08:32PM +0200, Solar Designer wrote:
> > I've just added benchmarks of AWS EC2 c5.24xlarge
> > (2x Intel Xeon Platinum 8275CL, 3.6 GHz all-core turbo) and AWS EC2
> > c5a.24xlarge (AMD EPYC 7R32, ~3.3 GHz sustained turbo) as text files
> > linked from these AWS EC2 instance names at:
> > 
> >

> > Some highlights, Intel:
> > 
> > Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... (96xOMP) DONE
> > Many salts:     561512K c/s real, 5906K c/s virtual
> > Only one salt:  85685K c/s real, 1415K c/s virtual
> > It's curious that AVX-512 speeds up algorithms based on SHA-2 by a
> > factor of 3.  I guess that's due to the bit rotate and "ternary logic"
> > instructions (3-input LUTs).  It's also curious the same isn't of as
> > much help for the faster hashes, especially not for descrypt (even
> > though the "ternary logic" instructions are also in use there), maybe
> > because we're exceeding L1 data cache with all the in-flight hashes.
> Turns out it was primarily Amdahl's law - for descrypt, the comparisons
> against loaded hashes were performed by a single thread (but in a smart
> manner, not directly comparing hashes one to one).  I added a minimal
> fix for this on Nov 1 in 645a378764d1, which resulted in:
> Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... (96xOMP) DONE
> Many salts:	831258K c/s real, 8784K c/s virtual
> Only one salt:	88033K c/s real, 1417K c/s virtual
> Today, I merged a further fix ec6d12bab175, which brought the speeds to:
> Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... (96xOMP) DONE
> Many salts:	861929K c/s real, 9087K c/s virtual
> Only one salt:	90046K c/s real, 1416K c/s virtual
> Single core speed with a non-OpenMP build:
> Benchmarking: descrypt, traditional crypt(3) [DES 512/512 AVX512F]... DONE
> Many salts:	27218K c/s real, 27218K c/s virtual
> Only one salt:	21903K c/s real, 21903K c/s virtual
> The candidate password stream is still produced by a single thread,
> which is why the "Only one salt" speed is much lower.  We only address
> that limitation for fast hashes with mask mode in OpenCL so far.  However,
> for "Many salts" I think this is a very impressive speed to have on CPU.
> Actual cracking at ~1.4 hashes/salt achieves ~840M.  With "--fork=2" (2
> processes with 48 threads each), it's ~870M combined.  With "--fork=6",
> it's ~900M.  With "--fork=48" and a non-OpenMP build, it's ~21.5M*48 =
> 1032M.  These are with incremental mode locked to length 7 and few
> successful cracks (speeds are somewhat lower when frequently detecting
> successful cracks).  Going for mask mode also at length 7 improves the
> last one of these to ~22M*48 = 1056M.
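
For reference, the arithmetic behind the "--fork=48" totals quoted above can be
reproduced with a few lines of Python.  This is only a sketch: the function
names are made up for illustration, and the last figure is a plain ratio of
two quoted benchmark numbers, not an efficiency measurement (clock rates and
thread counts differ between the two runs).

```python
# Figures copied from the quoted benchmarks, "Many salts", in K c/s.
SINGLE_CORE_MANY_SALTS_K = 27218   # non-OpenMP build, one process
OMP_MANY_SALTS_K = 861929          # 96 OpenMP threads, after ec6d12bab175


def aggregate_mcs(per_process_mcs, processes):
    """Combined speed in M c/s if every forked process runs at the same rate."""
    return per_process_mcs * processes


def ratio(parallel, serial):
    """Plain ratio of two benchmark figures (hypothetical helper)."""
    return parallel / serial


print(aggregate_mcs(21.5, 48))   # incremental mode, length 7: 1032.0
print(aggregate_mcs(22, 48))     # mask mode, length 7: 1056
print(round(ratio(OMP_MANY_SALTS_K, SINGLE_CORE_MANY_SALTS_K), 1))  # 31.7
```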

Oh, I realized I need to clarify: despite the ~1.4 hashes/salt, the
speeds I posted are c/s (hashes computed per second), not C/s
(effective combinations tested per second) - the latter were accordingly
~1.4x higher than those I posted.  The ~1.4 is simply to have a
realistic test case, as descrypt salt collisions are very common.  The
c/s figures would actually be very slightly higher with only one hash
per salt (due to fewer comparisons to perform), but then it'd be weird
to have many salts without collisions.
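
To illustrate the conversion, here is a minimal Python sketch, assuming C/s is
simply c/s multiplied by the average number of loaded hashes per salt.  The
function name and the 1400/1000 hash and salt counts are hypothetical, picked
only to give the ~1.4 ratio; the 840M figure is the OpenMP cracking speed
mentioned above.

```python
def effective_rate(cs, loaded_hashes, distinct_salts):
    """Convert c/s (hashes computed per second) into C/s (candidate/hash
    combinations tested per second), assuming each computed hash is
    compared against every loaded hash that shares its salt."""
    return cs * loaded_hashes / distinct_salts


# Hypothetical hash file: 1400 hashes over 1000 distinct salts (1.4 per salt).
cs_millions = 840
print(effective_rate(cs_millions, 1400, 1000))  # 1176.0, i.e. ~1176M C/s
```

With exactly one hash per salt the two rates coincide, which is the case
described above as slightly faster in c/s but unrealistic for descrypt.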

