john-users - bitslice DES parallelization with OpenMP

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20100627224402.GA3058@openwall.com>
Date: Mon, 28 Jun 2010 02:44:02 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: bitslice DES parallelization with OpenMP

Hi,

I'll start with a benchmark result to get you to read the rest of this
lengthy message. ;-)

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     17444K c/s real, 2202K c/s virtual
Only one salt:  5561K c/s real, 694300 c/s virtual

Attached to this message and uploaded to the wiki is a quick and really
dirty patch, which is nevertheless a successful attempt at parallelizing
John the Ripper's bitslice DES code with OpenMP directives (requires gcc
4.2+ or the like).  So far, I've only tested this with gcc 4.5.0 and the
linux-x86-64 make target.  The current revision is 1.7.6-omp-des-2.

http://openwall.info/wiki/john/patches

This patch provides good performance for traditional DES-based crypt(3)
hashes in the multi-salt case, and even better (vs. non-patched) for
BSDI-style crypt(3) hashes (since those are slower), but it usually does
not provide a speedup for LM hashes (too much overhead, key setup not
parallelized, the ordering of candidate passwords is non-optimal for the
underlying key setup algorithm).

Another drawback of the OpenMP approach (or of any other "synchronous"
approach) is that efficiency drops when there's other load on the system.
While on an idle dual-core laptop I am getting 90% efficiency for
traditional DES-based crypt(3) hashes with multiple salts, the
efficiency drops to between 70% and 85% on 8-core servers with almost no
load, and to under 50% when the JtR-unrelated server load is at around 10%.
(These efficiency percentages are relative to combined c/s rates
possible with multiple separate processes run on the same systems.)

So this is for use on otherwise idle systems only.

(The OpenMP parallelization code for slower hashes that is integrated
into JtR 1.7.6 is a lot less sensitive to system load, likely because
the synchronization is less frequent.  Maybe the same could be achieved
by buffering a lot more candidate passwords with these faster hashes -
beyond the current setting of 1024 - although the candidates would not
fit in L1 data cache then, likely resulting in a minor slowdown when
there's no other load.)

Below are some benchmark results.  These are to be compared against
non-parallelized single process benchmark results on the wiki:

http://openwall.info/wiki/john/benchmarks

Without parallelization, my Core 2 Duo T7100 1.8 GHz laptop does (using
just one CPU core at a time):

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     1713K c/s real, 1713K c/s virtual
Only one salt:  1452K c/s real, 1467K c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     55680 c/s real, 55680 c/s virtual
Only one salt:  54144 c/s real, 54144 c/s virtual

With 1.7.6-omp-des-2 built with gcc 4.5.0, it does:

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     3099K c/s real, 1549K c/s virtual
Only one salt:  2302K c/s real, 1151K c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     97280 c/s real, 48398 c/s virtual
Only one salt:  92261 c/s real, 46360 c/s virtual

3099K is roughly 90% of 2x1713K.  Let's see if this is achieved in
practice, intentionally avoiding any matching salts for now:

host!solar:~/john$ ./john-omp-des-2 -e=double --salts=-2 pw-fake-unix
Loaded 1458 password hashes with 1458 different salts (Traditional DES [128/128 BS SSE2-16])
mimi             (u3044-des)
aaaa             (u1638-des)
xxxx             (u845-des)
aaaaaa           (u156-des)
bebe             (u1731-des)
gigi             (u2082-des)
jojo             (u3027-des)
lulu             (u3034-des)
booboo           (u171-des)
cloclo           (u2989-des)
cccccc           (u982-des)
guesses: 11  time: 0:00:00:02  c/s: 3073K  trying: iciici - jprjpr
jamjam           (u2207-des)
guesses: 12  time: 0:00:00:04  c/s: 3080K  trying: odwodw - prfprf
simsim           (u2671-des)
ssssss           (u3087-des)
guesses: 14  time: 0:00:00:13  c/s: 3086K  trying: aqyeaqye - aslnasln

Yes, it is.

Now let's try some quad-core and 8-core servers under low load.

On a Q6600 2.4 GHz, which normally does 2300K to 2400K c/s for traditional
DES-based crypt(3) in the multi-salt case for a single process (using
just one CPU core) with current versions of JtR under no other load, an
OpenMP-enabled build of 1.7.6-omp-des-2 achieves:

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     8062K c/s real, 2005K c/s virtual
Only one salt:  4558K c/s real, 1136K c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     259072 c/s real, 64445 c/s virtual
Only one salt:  236544 c/s real, 58695 c/s virtual

That's around 85% efficiency (and a lot worse than that for the
single-salt case, but the code was not supposed to perform very well for
that case).  Let's confirm this:

host!solar:~/john$ ./john-omp-des-2 -e=double --salts=-2 pw-fake-unix
Loaded 1458 password hashes with 1458 different salts (Traditional DES [128/128 BS SSE2-16])
mimi             (u3044-des)
aaaa             (u1638-des)
xxxx             (u845-des)
aaaaaa           (u156-des)
bebe             (u1731-des)
gigi             (u2082-des)
jojo             (u3027-des)
lulu             (u3034-des)
booboo           (u171-des)
cloclo           (u2989-des)
cccccc           (u982-des)
jamjam           (u2207-des)
guesses: 12  time: 0:00:00:01  c/s: 6645K  trying: ldcldc - mqlmql
simsim           (u2671-des)
ssssss           (u3087-des)
guesses: 14  time: 0:00:00:05  c/s: 4858K  trying: abuiabui - adhradhr
guesses: 14  time: 0:00:00:11  c/s: 6157K  trying: bvfwbvfw - bwtfbwtf
guesses: 14  time: 0:00:00:21  c/s: 6726K  trying: eszceszc - eumleuml
guesses: 14  time: 0:00:00:27  c/s: 6673K  trying: ghwmghwm - gjjvgjjv
guesses: 14  time: 0:00:00:35  c/s: 6842K  trying: iqlwiqlw - irzfirzf
guesses: 14  time: 0:00:01:02  c/s: 7095K  trying: qomeqome - qpznqpzn
woofwoof         (u1435-des)
guesses: 15  time: 0:00:01:36  c/s: 7118K  trying: zzwmzzwm - zzzzzzzz

Well, it's a bit worse in practice.  Perhaps I would have captured a
result like this (or worse) if I simply re-ran the "--test" benchmark or
if I let "--test" run for longer.  There's some load on the server, but
not enough to cause such a performance hit on JtR "directly".  Rather,
it does this by causing JtR to wait for termination of all of its
threads when it temporarily switches from parallel to sequential
execution, which it does very often (with this patch).

That's roughly 75% of the full speed of this CPU (achievable with four
separate JtR processes on an idle system), whereas the server load was
directly attributable for maybe 5% of the CPU time.  Another 10% was
lost to thread-safety of the code in this build, and presumably at least
another 10% to thread synchronization (an indirect performance hit from
other load).

Next test system, Core i7 920 2.67 GHz (quad-core, 2 logical CPUs per
core) with a little bit of other load (perhaps under 5%).  Normally, it
does around 2600K c/s for multi-salt with one non-parallelized process,
11500K c/s combined with 8 separate processes.

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     9802K c/s real, 1237K c/s virtual
Only one salt:  4683K c/s real, 585472 c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     324608 c/s real, 40576 c/s virtual
Only one salt:  285696 c/s real, 35712 c/s virtual

Again, roughly 85% efficiency (9800K of 11500K).  Let's confirm:

host!solar:~/john/john-1.7.6-omp-des-2/run$ ./john -e=double --salts=-2 ~/john/pw-fake-unix
Loaded 1458 password hashes with 1458 different salts (Traditional DES [128/128 BS SSE2-16])
mimi             (u3044-des)
aaaa             (u1638-des)
xxxx             (u845-des)
aaaaaa           (u156-des)
bebe             (u1731-des)
gigi             (u2082-des)
jojo             (u3027-des)
lulu             (u3034-des)
booboo           (u171-des)
cloclo           (u2989-des)
cccccc           (u982-des)
jamjam           (u2207-des)
guesses: 12  time: 0:00:00:01  c/s: 6952K  trying: mqmmqm - odvodv
simsim           (u2671-des)
ssssss           (u3087-des)
guesses: 14  time: 0:00:00:09  c/s: 9121K  trying: cnkmcnkm - coxvcoxv
guesses: 14  time: 0:00:00:24  c/s: 9412K  trying: ieiuieiu - ifwdifwd
guesses: 14  time: 0:00:00:34  c/s: 9491K  trying: maiemaie - mbvnmbvn

This looks about right, 82% of full speed on this test.  Maybe the load
changed a bit.

Finally, a faster system - dual Xeon X5460 (8 cores total) at 3.16 GHz,
but with a bit more load (around 10%).  The benchmarks vary a lot:

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     17632K c/s real, 2204K c/s virtual
Only one salt:  3697K c/s real, 480216 c/s virtual

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     11803K c/s real, 1526K c/s virtual
Only one salt:  5240K c/s real, 655923 c/s virtual

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     11730K c/s real, 1519K c/s virtual
Only one salt:  5500K c/s real, 693533 c/s virtual

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     12247K c/s real, 1582K c/s virtual
Only one salt:  5326K c/s real, 668362 c/s virtual

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     17444K c/s real, 2202K c/s virtual
Only one salt:  5561K c/s real, 694300 c/s virtual

Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:     13569K c/s real, 1741K c/s virtual
Only one salt:  5575K c/s real, 696089 c/s virtual

A few were totally off - showing a speed of less than 1M c/s (for a
one-second interval), even though the load was never very high.  This
code is really sensitive to any other load.

Normally, this system would do around 3000K c/s for multi-salt per-core,
or up to 24000K c/s total.  So the best measured speed is only 73%.  Yet
it is an impressive number, which compares favorably against that for an
FPGA implementation:

http://www.sump.org/projects/password/

(and JtR lacks the many limitations of that implementation).

For completeness' sake, here are some BSDI-style crypt(3) benchmarks on
this system:

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     660480 c/s real, 82456 c/s virtual
Only one salt:  395264 c/s real, 51266 c/s virtual

Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:     535552 c/s real, 68837 c/s virtual
Only one salt:  493568 c/s real, 62556 c/s virtual

As expected, these were less sensitive to load (a slower hash).

Let's test:

host!solar:~/john$ ./john-omp-des-2 -e=double --salts=-2 pw-fake-unix
Loaded 1458 password hashes with 1458 different salts (Traditional DES [128/128 BS SSE2-16])
mimi             (u3044-des)
aaaa             (u1638-des)
xxxx             (u845-des)
aaaaaa           (u156-des)
bebe             (u1731-des)
gigi             (u2082-des)
jojo             (u3027-des)
lulu             (u3034-des)
booboo           (u171-des)
cloclo           (u2989-des)
cccccc           (u982-des)
jamjam           (u2207-des)
guesses: 12  time: 0:00:00:00  c/s: 10624K  trying: jpsjps - ldbldb
simsim           (u2671-des)
ssssss           (u3087-des)
guesses: 14  time: 0:00:00:04  c/s: 10836K  trying: bbnwbbnw - bdbfbdbf
guesses: 14  time: 0:00:00:25  c/s: 8989K  trying: icvkicvk - ieitieit
guesses: 14  time: 0:00:01:10  c/s: 7275K  trying: thrgthrg - tjeptjep
woofwoof         (u1435-des)
guesses: 15  time: 0:00:01:23  c/s: 8203K  trying: zzwmzzwm - zzzzzzzz

I actually let it complete this time.  Well, the speed is down to 47% of
what we saw on the best benchmark run, perhaps because the load was
changing (which was seen between the benchmark runs as well).  Looking at
it another way, this is only 34% of what this system could have achieved
with 8 separate processes and if it were 100% idle.  So the 10% load (my
guesstimate based on looking at "top", etc.) and our desire to simplify
the JtR invocation (just one 8-thread process instead of 8 processes)
and usage (just one .rec file, with one entry in it) combined have cost
us a 3x performance hit in this case...

Overall, I think the approach is usable and reasonable in enough cases
and its added code complexity is low enough for me to integrate
something like it, but it shouldn't be the only way to parallelize JtR.

I am going to try out other approaches later, if time permits.  Two of
the ideas are a specific hybrid process + thread model and partially
asynchronous interfaces (to enable new candidate passwords to be
generated while hashes for the previous bunch are still being computed).

I also haven't given up on the idea of parallelizing JtR in a generic
and scalable way, with separate candidate password streams - somewhat
like what the MPI patch does, but with better usability and reliability
(and without a dependency on MPI).  I am just keeping this on hold and
implementing whatever I can do quickly instead, also considering that
these OpenMP patches and the like may be beneficial under a distributed
model as well, in cases when good efficiency is achieved (fewer nodes to
manage "from the console" if one node is a machine rather than an
individual logical CPU).

As usual, any feedback is welcome.

Alexander

View attachment "john-1.7.6-omp-des-2.diff" of type "text/plain" (26215 bytes)
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.