john-users - sha512crypt & Drupal 7+ password cracking on FPGA

Message-ID: <20180723152751.GA11178@openwall.com>
Date: Mon, 23 Jul 2018 17:27:51 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Cc: Denis Burykin <apingis@...nwall.net>
Subject: sha512crypt & Drupal 7+ password cracking on FPGA

Hi,

As many of you are aware, we support descrypt and bcrypt password hash
cracking on the old ZTEX 1.15y quad-FPGA boards.  Threads:

http://www.openwall.com/lists/john-users/2016/11/06/1
http://www.openwall.com/lists/john-users/2017/06/25/1

Now Denis has also added support for sha512crypt and Drupal 7+ SHA-512
based password hashes on those same old boards.

We had achieved energy-efficiency improvement over current high-end GPUs
at descrypt and bcrypt, and in the case of bcrypt also decent speed
improvement per board and per rig (see further messages in the above
threads).  However, for sha512crypt and Drupal 7+ hashes we're merely on
par with current high-end GPUs in terms of energy-efficiency and our
speeds per-board are lower (it takes four or so boards to match one
high-end GPU).  Thus, for practical purposes this is useful to those who
have those boards anyway or would acquire such boards primarily for
bcrypt and descrypt, so that the boards can also be put to more uses.

This is also valuable as being, to the best of my knowledge, the very
first implementation of these two hash types on FPGA.  And it is also
our first attempt to use specialized soft CPU cores(*) along with
cryptographic cores in an FPGA design to combine some limited
flexibility (in this case, used to implement two higher-level hash types
in one bitstream) with resource savings (no need to waste logic on
sha512crypt's higher-level algorithm specifics) and efficient
cryptographic cores (in this case, SHA-512).  Application of a similar
approach to newer and much larger FPGAs (such as those available on AWS
F1) should result in improvement over current GPUs at least in
energy-efficiency (and for the largest FPGAs probably also in
performance).

(*) Denis' bcrypt design uses microcode to save on logic, but it's a
closer match to historical CPUs' wide microcode than to a CPU program.
Maybe it'll help us implement bcrypt-pbkdf at some point, though.

Denis wrote a good description of the design with some ASCII diagrams,
currently found here:

https://github.com/magnumripper/JohnTheRipper/tree/bleeding-jumbo/src/ztex/fpga-sha512crypt

Each soft CPU core is 16-way SMT (runs 16 hardware threads with their
separate register files) and it controls four SHA-512 cores with each of
those capable of up to four in-flight hash computations (most of the
time only two are being computed, but there's some overlap between
finishing processing on one pair of hashes and starting on the next).

One soft CPU core (plus its memory and glue logic) and four SHA-512
cores form a unit.  The SHA-512 cores occupy 80% of the unit's area,
so in those terms the overhead of using soft CPUs is at most 25% (the
remaining 20% of the area relative to the cores' 80%), but they
actually help save on algorithm-specific logic.

10 units fit in one Spartan-6 LX150 FPGA.  This means 10 soft CPU cores,
160 hardware threads, 40 SHA-512 cores, up to 160 in-flight SHA-512 per
FPGA.  Four times that per board.
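
The per-FPGA and per-board totals above follow directly from the per-unit
figures.  Here is a small Python sketch of that arithmetic; the constant
and function names are only illustrative and are not taken from the
design's sources:

```python
# Figures from the message; names are hypothetical, for illustration only.
UNITS_PER_FPGA = 10          # units that fit in one Spartan-6 LX150
THREADS_PER_CPU = 16         # 16-way SMT soft CPU core, one per unit
SHA512_CORES_PER_UNIT = 4    # SHA-512 cores controlled by each soft CPU
MAX_IN_FLIGHT_PER_CORE = 4   # peak in-flight hash computations per core
FPGAS_PER_BOARD = 4          # ZTEX 1.15y is a quad-FPGA board


def per_fpga():
    """Return resource totals for a single FPGA."""
    cores = UNITS_PER_FPGA * SHA512_CORES_PER_UNIT
    return {
        "soft_cpus": UNITS_PER_FPGA,
        "hw_threads": UNITS_PER_FPGA * THREADS_PER_CPU,
        "sha512_cores": cores,
        "max_in_flight": cores * MAX_IN_FLIGHT_PER_CORE,
    }


def per_board():
    """Scale the per-FPGA totals to one board."""
    return {name: count * FPGAS_PER_BOARD for name, count in per_fpga().items()}


if __name__ == "__main__":
    print("per FPGA: ", per_fpga())
    print("per board:", per_board())
```

Note that the in-flight figure is a peak: as described above, most of the
time only two computations per core are in progress.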

Also included are on-device candidate password generator (for mask mode,
including in hybrid modes along with a wordlist coming from host, etc.)
and hash comparator (capable of up to 512 loaded hashes per salt; no
limit on total loaded hashes as that's handled on host).  This is
similar to what Denis' designs for descrypt and bcrypt also have.

sha512crypt and Drupal 7+ hashes are two entry points into the program
memory.  (The Drupal 7+ program is much simpler than sha512crypt's.
It could also be more efficient on a more specialized design since it
does not need unaligned access to the buffers, which we support for
sha512crypt.  Yet it's good to have it along with sha512crypt
essentially for free.)
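
To illustrate why the Drupal 7+ program is the simpler one, here is a
Python sketch of the core loop as I understand the Drupal 7 scheme.  It
is not a description of the FPGA program.  The function name is
hypothetical, and the final custom base-64 encoding and truncation of the
stored "$S$" string are deliberately left out.  In this loop the previous
digest always sits at the start of the next SHA-512 input, which is
presumably why no unaligned buffer access is needed:

```python
import hashlib


def drupal7_raw_digest(password: bytes, salt: bytes, log2_count: int) -> bytes:
    """Return the raw 64-byte digest before Drupal's encoding step.

    Sketch only: assumes the scheme is one initial SHA-512 over
    salt + password followed by 2**log2_count rounds of
    SHA-512 over (previous digest + password).
    """
    count = 1 << log2_count
    digest = hashlib.sha512(salt + password).digest()
    for _ in range(count):
        # The 64-byte digest comes first, then the password.
        digest = hashlib.sha512(digest + password).digest()
    return digest


if __name__ == "__main__":
    # 2**14 = 16384 matches the iteration count of the test hash in this
    # message; the salt here is a made-up example value.
    raw = drupal7_raw_digest(b"password", b"abcdefgh", 14)
    print(len(raw), raw.hex()[:16])
```

sha512crypt, by contrast, mixes password, salt, and digest in varying
orders and lengths across its rounds, so its inputs do not stay aligned.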

Per Xilinx tools, this design was supposed to work at 225 MHz.
Unfortunately, in our testing it only works at this frequency with very
few units built into the bitstream.  We don't know exactly why (maybe
it's the power draw).  With 10 units, the design works reliably for us
at 135 MHz on many boards tested, so that's what we set as the current
default.  It also sometimes works at higher frequencies such as 160 MHz,
but other times not.  This is configurable in john.conf.

Here's a test run against 512 same-salt sha512crypt hashes (good for
quick reliability testing as all 512 are supposed to be cracked) on one
board (4 FPGAs) at 135 MHz:

$ ./john -2='1A2B3C4D5E6F7G8H9I0J' --mask='?2?2?2?2?2' --format=sha512crypt-ztex --verbosity=1 pw-sha512crypt
[...]
Loaded 512 password hashes with no different salts (sha512crypt-ztex, crypt(3) $6$ [sha512crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
327g 0:00:00:42 62.00% (ETA: 15:55:22) 7.746g/s 47003p/s 47003c/s 16282KC/s 40447..40137
512g 0:00:01:05 DONE (2018-07-23 15:55) 7.825g/s 46950p/s 46950c/s 12179KC/s 40500..40190
Session completed

Four boards (16 FPGAs), 135 MHz:

$ ./john -2='1A2B3C4D5E6F7G8H9I0J' --mask='?2?2?2?2?2' --format=sha512crypt-ztex --verbosity=1 pw-sha512crypt
[...]
Loaded 512 password hashes with no different salts (sha512crypt-ztex, crypt(3) $6$ [sha512crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
378g 0:00:00:12 72.00% (ETA: 15:53:55) 30.45g/s 185656p/s 185656c/s 62318KC/s 40348..1AF58
512g 0:00:00:16 DONE (2018-07-23 15:53) 30.89g/s 185395p/s 185395c/s 51138KC/s 40000..40140
Session completed

Scaling efficiency 185395/46950/4 = 98.7%.
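
For anyone repeating this on a different number of boards, the same
figure can be computed as below.  This is just the division above wrapped
in a Python function (the function name is mine); the rates are the final
c/s values from the two runs:

```python
def scaling_efficiency(multi_rate: float, single_rate: float, n_boards: int) -> float:
    """Fraction of ideal linear scaling achieved by n_boards boards."""
    return multi_rate / (single_rate * n_boards)


if __name__ == "__main__":
    # 185395 c/s on four boards vs. 46950 c/s on one board, both at 135 MHz.
    eff = scaling_efficiency(185395, 46950, 4)
    print(f"{eff:.1%}")  # prints 98.7%
```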

Four boards (16 FPGAs), 160 MHz:

$ ./john -2='1A2B3C4D5E6F7G8H9I0J' --mask='?2?2?2?2?2' --format=sha512crypt-ztex --verbosity=1 pw-sha512crypt
[...]
Loaded 512 password hashes with no different salts (sha512crypt-ztex, crypt(3) $6$ [sha512crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
174g 0:00:00:04 32.00% (ETA: 15:57:33) 36.78g/s 216490p/s 216490c/s 94714KC/s 40044..1AF54
512g 0:00:00:14 DONE (2018-07-23 15:57) 36.44g/s 218647p/s 218647c/s 60310KC/s 40000..40340
Session completed

This is similar speed to what Jeremi Gosney reported for hashcat on one
GTX 1080 Ti at stock clocks:

https://gist.github.com/epixoip/973da7352f4cc005746c627527e4d073

Hashtype: sha512crypt, SHA512(Unix)

Speed.Dev.#1.....:   216.0 kH/s (53.53ms)

Somehow a newer benchmark of 8x GTX 1080 Ti shows slightly higher speed
per GPU:

https://gist.github.com/epixoip/ace60d09981be09544fdd35005051505

Hashtype: sha512crypt $6$, SHA512 (Unix)

Speed.Dev.#1.....:   235.9 kH/s (96.29ms)
Speed.Dev.#2.....:   228.3 kH/s (50.67ms)
Speed.Dev.#3.....:   230.4 kH/s (50.22ms)
Speed.Dev.#4.....:   230.5 kH/s (50.18ms)
Speed.Dev.#5.....:   230.6 kH/s (50.16ms)
Speed.Dev.#6.....:   230.1 kH/s (50.27ms)
Speed.Dev.#7.....:   232.0 kH/s (49.85ms)
Speed.Dev.#8.....:   231.3 kH/s (50.01ms)
Speed.Dev.#*.....:  1849.1 kH/s

We're probably consuming around 160W for the boards (Denis measured 3.4A
at 12V per board at 160 MHz, which translates to ~40W/board) or 180W at
the wall at ~90% PSU efficiency.
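
The power estimate, and a rough hashes-per-joule figure derived from it,
can be reproduced with the Python sketch below.  The 90% PSU efficiency
is the same assumption as above, not a measurement, and the variable
names are only illustrative:

```python
BOARD_CURRENT_A = 3.4    # measured by Denis per board at 160 MHz
SUPPLY_VOLTAGE_V = 12.0
N_BOARDS = 4
PSU_EFFICIENCY = 0.90    # assumed, not measured
RATE_CPS = 218647        # sha512crypt c/s, four boards at 160 MHz

board_w = BOARD_CURRENT_A * SUPPLY_VOLTAGE_V   # about 40.8 W per board
dc_w = board_w * N_BOARDS                      # about 163 W for the boards
wall_w = dc_w / PSU_EFFICIENCY                 # about 181 W at the wall
hashes_per_joule = RATE_CPS / wall_w           # roughly 1200 hashes per joule

if __name__ == "__main__":
    print(f"per board: {board_w:.1f} W")
    print(f"boards:    {dc_w:.1f} W")
    print(f"at wall:   {wall_w:.1f} W")
    print(f"hashes/J:  {hashes_per_joule:.0f}")
```

The same division applied to a measured GPU power draw would give a
directly comparable figure.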

I guess GTX 1080 Ti might consume a little bit more at this benchmark
(it's a 250W TDP card).  Jeremi (or someone else who has one of those
cards) can probably check via nvidia-smi while running hashcat.

Drupal 7+ hash, one board (4 FPGAs) at 135 MHz:

$ ./john -2='pasword' --mask='?2?2?2?2?2?2?2?2' --format=drupal7-ztex pw-drupal7
[...]
Loaded 1 password hash (Drupal7-ztex, $S$ [SHA512 ZTEX])
Cost 1 (iteration count) is 16384 for all loaded hashes
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:10 2.49% (ETA: 16:08:54) 0g/s 14250p/s 14250c/s 14250C/s prdowaap..oooarsap
0g 0:00:02:03 30.91% (ETA: 16:08:49) 0g/s 14421p/s 14421c/s 14421C/s awoppaas..rssoasas
0g 0:00:03:31 52.93% (ETA: 16:08:50) 0g/s 14427p/s 14427c/s 14427C/s wdwdwdow..pdawrprw
0g 0:00:06:20 95.21% (ETA: 16:08:51) 0g/s 14430p/s 14430c/s 14430C/s wpddwood..ppowrrod
password         (?)
1g 0:00:06:28 DONE (2018-07-23 16:08) 0.002571g/s 14428p/s 14428c/s 14428C/s password..orpadord
Use the "--show" option to display all of the cracked passwords reliably
Session completed

Four boards (16 FPGAs), 135 MHz:

$ ./john -2='pasword' --mask='?2?2?2?2?2?2?2?2' --format=drupal7-ztex pw-drupal7
[...]
Loaded 1 password hash (Drupal7-ztex, $S$ [SHA512 ZTEX])
Cost 1 (iteration count) is 16384 for all loaded hashes
Warning: Slow communication channel to the device. Increase mask or expect performance degradation.
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:10 10.23% (ETA: 16:01:23) 0g/s 56120p/s 56120c/s 56120C/s oaoopprp..rooddwrp
0g 0:00:00:35 35.24% (ETA: 16:01:26) 0g/s 56590p/s 56590c/s 56590C/s dwpadaws..ppawrrws
0g 0:00:01:01 60.25% (ETA: 16:01:27) 0g/s 56662p/s 56662c/s 56662C/s adwoowao..ssodwpso
password         (?)
1g 0:00:01:39 DONE (2018-07-23 16:01) 0.01005g/s 56678p/s 56678c/s 56678C/s password..wsrssdrd
Use the "--show" option to display all of the cracked passwords reliably
Session completed

Scaling efficiency 56678/14428/4 = 98.2% despite the complaint about
too small a mask (too few different characters for the mask positions
handled on device).
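
As a sanity check on the run times, the mask's keyspace and the time for
a full sweep at the reported rates can be estimated as follows.  This is
a Python sketch with helper names of my own choosing, and it assumes
every mask position uses the same custom charset, as in these tests:

```python
def mask_keyspace(charset: str, length: int) -> int:
    """Number of candidates for a mask of `length` positions over `charset`."""
    return len(set(charset)) ** length


def full_sweep_seconds(keyspace: int, rate: float) -> float:
    """Seconds needed to try every candidate at `rate` candidates per second."""
    return keyspace / rate


if __name__ == "__main__":
    keyspace = mask_keyspace("pasword", 8)   # 7**8 = 5764801 candidates
    for label, rate in (("one board", 14428), ("four boards", 56678)):
        secs = full_sweep_seconds(keyspace, rate)
        print(f"{label}: {secs:.0f} s for the full mask")
```

This gives roughly 400 seconds for one board and 102 seconds for four,
in line with the runs above, which stopped slightly earlier (at 6:28 and
1:39) because the password was found before the keyspace was exhausted.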

Four boards (16 FPGAs), 160 MHz:

$ ./john -2='pasword' --mask='?2?2?2?2?2?2?2?2' --format=drupal7-ztex pw-drupal7
[...]
Loaded 1 password hash (Drupal7-ztex, $S$ [SHA512 ZTEX])
Cost 1 (iteration count) is 16384 for all loaded hashes
Warning: Slow communication channel to the device. Increase mask or expect performance degradation.
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:12 14.78% (ETA: 16:11:22) 0g/s 65890p/s 65890c/s 65890C/s rpdroapa..dwdporpa
0g 0:00:00:31 36.38% (ETA: 16:11:25) 0g/s 66386p/s 66386c/s 66386C/s apawrrws..swarosos
0g 0:00:01:16 88.67% (ETA: 16:11:26) 0g/s 66586p/s 66586c/s 66586C/s soapawad..wpssppsd
password         (?)
1g 0:00:01:24 DONE (2018-07-23 16:11) 0.01180g/s 66541p/s 66541c/s 66541C/s password..wsrssdrd
Use the "--show" option to display all of the cracked passwords reliably
Session completed

We'd appreciate more testing, such as on Royce's larger cluster of these
boards maybe.  Please post your results as follow-ups to this message.

Alexander