Openwall GNU/*/Linux - a small security-enhanced Linux distro for servers
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Sun, 4 Jul 2010 00:47:59 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: bitslice DES parallelization with OpenMP

Hi,

How do you like 356M c/s at LM hashes on a single system (no GPU,
seriously), and 15M c/s at traditional DES-based crypt(3) with just a
single hash being attacked (the slower case)?  And that's with no
hand-crafted assembly code at all.  Please read on. ;-)

This time, I came up with and implemented a new bitslice DES key setup
algorithm (currently with SSE2 intrinsics only - a proof of concept).
Although the new algorithm was meant to allow for vectorization and
parallelization - and it does - it's also a lot faster even in a single
process, single thread build (no OpenMP involved), albeit with SSE2
vectorization where appropriate.  Here are the new single process,
single thread benchmarks on Core i7 920 2.67 GHz:

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     2573K c/s real, 2573K c/s virtual
Only one salt:  2484K c/s real, 2484K c/s virtual

At first glance, this one is not exciting, but notice how the gap
between multi-salt and single salt speed became much narrower.  This
will be important for subsequent parallelization (see below).

Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:    38781K c/s real, 38781K c/s virtual

And this one is exciting: it is more than twice faster than what a clean
1.7.6 achieves on this machine (around 17.5M c/s at LM).  That's because
LM hash computation is mostly about key setup (or at least it was prior
to this change).

Another exiting result: with 8 separate processes running on the Core i7
at once, I am getting a combined c/s rate of 173M c/s at LM hashes.
For dual Xeon X5460 3.16 GHz, the numbers are 45M c/s (non-OpenMP build,
single process, single thread) and as much as 356M c/s combined for
8 simultaneous processes (still no OpenMP).

This is with john-1.7.6-fast-des-key-setup-3.diff.gz, which I uploaded
to the wiki:

http://openwall.info/wiki/john/patches

I am also uploading these patches to:

http://download.openwall.net/pub/projects/john/contrib/

so they get propagated to the mirrors.

Now let's get back to OpenMP.  1.7.6-omp-des-7, also uploaded to the
wiki, integrates the DES key setup code above and parallelizes it.  For
DES-based crypt(3), things are quite good.  On the Core i7:

$ GOMP_SPINCOUNT=10000 ../run/john -te -fo=des
Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     9584K c/s real, 1202K c/s virtual
Only one salt:  8565K c/s real, 1165K c/s virtual

During the first part of the benchmark (multi-salt), system load stays
at over 99%.  During the second part, it drops to 92% (leaving 8% idle),
which is quite normal given that some code that runs per-key hasn't been
parallelized (doing so wouldn't fit well in the OpenMP model).  With 8
separate non-OpenMP-enabled processes, the same system does around
10800K c/s combined for the single salt case, so the efficiency of the
OpenMP-enabled build at this test is around 80%.

Unfortunately, the speed for multi-salt has dropped from around 10200K
for -omp-des-4 to around 9600K.  I previously traced this to the move of
the "expand key" step from DES_bs.c into "crypt bodies" in DES_bs_b.c,
but why this causes a slowdown is a mystery to me.  I haven't
investigated this further yet - if anyone does, please let me know.
Meanwhile, I am keeping the older -omp-des-4 patch on the wiki as well.
I had to make this "expand key move" in this newer patch in order to
move key setup inside the parallel region, which was required to achieve
decent performance for single-salt.

Let's see if it's achieved in practice:

$ GOMP_SPINCOUNT=10000 ./john -i=alpha pw
Loaded 1 password hash (Traditional DES [128/128 BS SSE2-16])
guesses: 0  time: 0:00:00:01  c/s: 2850K  trying: lkuku - botink
guesses: 0  time: 0:00:00:05  c/s: 6596K  trying: droopapr - stpgehed
guesses: 0  time: 0:00:00:14  c/s: 7464K  trying: fedsuyog - fesidanc
guesses: 0  time: 0:00:00:28  c/s: 7788K  trying: vwhwca - vwcvfr
guesses: 0  time: 0:00:00:53  c/s: 7965K  trying: dcerkncs - alecismc
guesses: 0  time: 0:00:01:09  c/s: 8002K  trying: sejaxn - siqtui
guesses: 0  time: 0:00:01:45  c/s: 8071K  trying: eicomdm - eicvdic
guesses: 0  time: 0:00:02:21  c/s: 8103K  trying: huaugost - huluigra
apmahoop         (user)
guesses: 1  time: 0:00:02:41  c/s: 8110K  trying: apffjkbt - acchucrm

Yes, it is.  (It is normal for "incremental" mode, when not locked to a
specific length, to take a while to get up to full speed.  This is
because length switches involve overhead, but those become very
infrequent when running for a while.  The speed would continue to
increase even further if the above session were running for longer.)

Now to LM hash parallelization.  In short, there's some speedup from
parallelization, but things are not great - and my recommendation is to
run separate processes.  Here's what I am getting on the Core i7 with
1.7.6-omp-des-7 running 8 threads:

$ GOMP_SPINCOUNT=10000 ../run/john -te=1 -fo=lm
Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:    65126K c/s real, 16120K c/s virtual

That's about the best I can achieve with the current code.  It leaves
about 50% of CPU time idle (although due to SMT this corresponds to less
than 50% of actual CPU resources staying idle).  By running two 4-thread
processes simultaneously, I achieve a combined c/s rate of 90M c/s:

Raw:    45386K c/s real, 17443K c/s virtual
Raw:    45386K c/s real, 17429K c/s virtual
[1]-  Done                    GOMP_SPINCOUNT=10000 OMP_NUM_THREADS=4 ./john -te -fo=lm
[2]+  Done                    GOMP_SPINCOUNT=10000 OMP_NUM_THREADS=4 ./john -te -fo=lm

That's better, but it's still very far from the combined c/s rate for 8
separate non-OpenMP-enabled processes, which is as high as 173M c/s.

Now let's see how this behaves in actual usage.  I'll use just one hash
for this test for clarity, although multi-hash benchmarks are also very
interesting and are of more practical relevance.  Those show similar
results ("trust me" - or better try on your own and post in here).

First, just the new key setup (1.7.6-nsk-3), no OpenMP yet, back on the
Core i7 machine:

$ ./john -i -u=u3047 ~/john/pw-fake-nt
Loaded 1 password hash (LM DES [128/128 BS SSE2-16])
guesses: 0  time: 0:00:00:02  c/s: 19644K  trying: 09ACEBU - 09ACOS3
guesses: 0  time: 0:00:00:08  c/s: 26005K  trying: BBOSNCR - BBOTPHA
guesses: 0  time: 0:00:00:16  c/s: 28044K  trying: PW0DSK - PW0D72
guesses: 0  time: 0:00:00:20  c/s: 28688K  trying: COHJF9 - COVDFT
guesses: 0  time: 0:00:00:27  c/s: 29225K  trying: 34JYDW - 34JY31
guesses: 0  time: 0:00:00:43  c/s: 30049K  trying: KS54WG - KS58TL
guesses: 0  time: 0:00:01:06  c/s: 30444K  trying: 408I1$O - 407929T
NE1410S          (u3047)
guesses: 1  time: 0:00:01:11  c/s: 30387K  trying: NECQ9SU - NE1411B

This achieved over 30M c/s out of the "theoretical maximum" of 39M c/s.
With a longer session (or with a length lock), the speed would be
slightly higher.  Note that only one CPU core was in use by John in the
above run.

Now 1.7.6-omp-des-7 running 8 threads, same machine:

$ GOMP_SPINCOUNT=10000 ./john -i -u=u3047 ~/john/pw-fake-nt
Loaded 1 password hash (LM DES [128/128 BS SSE2-16])
guesses: 0  time: 0:00:00:00  c/s: 6156K  trying: PY2TY - M02TS
guesses: 0  time: 0:00:00:04  c/s: 31746K  trying: MW2C90 - MWIZDK
guesses: 0  time: 0:00:00:11  c/s: 39202K  trying: N1BAKK - NGCHCY
guesses: 0  time: 0:00:00:17  c/s: 41378K  trying: 0BAYBBS - 0BLMSD1
guesses: 0  time: 0:00:00:25  c/s: 43049K  trying: 20RPPAD - 20MMNTS
guesses: 0  time: 0:00:00:36  c/s: 44219K  trying: ALY2AS8 - AR62899
NE1410S          (u3047)
guesses: 1  time: 0:00:00:48  c/s: 44756K  trying: NECDRD3 - NE10TWO

Yes, it's 47% faster, which is consistent with "--test" benchmarks.
Those reported more like a 68% speedup, but we know that performance
with OpenMP is less stable (it's very sensitive to system load) and that
there's cracking-mode-specific overhead (not parallelized), which is not
included in the "--test" benchmarks.  So it is not surprising to see
a difference like this for this very fast hash type.  For slower hash
types, "--test" benchmarks match actual runs much closer.

BTW, the specific password "NE1410S" would normally be cracked instantly
(given its LM hash) because it is found on JtR's bundled common
passwords list, but I intentionally forced JtR to use the "incremental"
mode to have this session take a little while.  I did not even give JtR
any hint on the character set used.

Finally, some quick benchmarks for dual Xeon X5460 3.16 GHz (briefly
mentioned above), under some unrelated load:

Just the new key setup (no OpenMP), only one CPU core in use by John:

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     2982K c/s real, 2982K c/s virtual
Only one salt:  2880K c/s real, 2880K c/s virtual

Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:    45243K c/s real, 45243K c/s virtual

1.7.6-omp-des-7:

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     20054K c/s real, 2506K c/s virtual
Only one salt:  15532K c/s real, 1941K c/s virtual

That's over 15M c/s for the single-salt case.  Quite good.

Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:    63799K c/s real, 7974K c/s virtual

Good speed, but nothing exciting now that we've already seen Core i7
perform a bit better.  I think it's the system load (in combination with
this very fast hash) that prevented the Xeon from showing a better
result.  This is as little as 18% of what this same system achieves with
8 separate non-OpenMP-enabled processes (64M out of 356M).

It should be possible to achieve a better result on an idle dual-Xeon
system - 100M c/s anyone?

As usual, feedback is welcome.  I'd like this stuff to become a bit more
mature before I approach integrating it.

Alexander

Powered by blists - more mailing lists

Your e-mail address:

Powered by Openwall GNU/*/Linux - Powered by OpenVZ