Message-ID: <20180827160716.GA13109@openwall.com>
Date: Mon, 27 Aug 2018 18:07:16 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Cc: Denis Burykin <apingis@...nwall.net>
Subject: sha256crypt password cracking on FPGA

Hi,

As many of you are aware, we support descrypt, bcrypt, sha512crypt, and
Drupal7 password hash cracking on the old ZTEX 1.15y quad-FPGA boards.
Threads:

http://www.openwall.com/lists/john-users/2016/11/06/1
http://www.openwall.com/lists/john-users/2017/06/25/1
http://www.openwall.com/lists/john-users/2018/07/23/1

Now Denis has also added support for sha256crypt on those same boards.

Similarly to sha512crypt and Drupal7, this addition is not so much to
compete with GPUs as it is to provide a way to put those FPGA boards to
more uses.  Also, as with sha512crypt and Drupal7, this is, to the best
of my knowledge, the very first implementation of sha256crypt on FPGA.

Denis wrote a good description of the design with some ASCII diagrams,
currently found here:

https://github.com/magnumripper/JohnTheRipper/tree/bleeding-jumbo/src/ztex/fpga-sha256crypt

Similarly to Denis' design for sha512crypt and Drupal7, the new one for
sha256crypt uses specialized soft CPU cores along with cryptographic
cores.  However, the specific parameters of those cores changed.  While
the sha512crypt and Drupal7 design used 32-bit 16-way SMT CPU cores, the
one for sha256crypt uses smaller 16-bit 6-way SMT CPU cores.  And while
the SHA-512 cores handled up to 4 in-flight hashes each, the SHA-256
ones handle only up to 2.  Accordingly, the ratio of SHA-2 to CPU cores
was 4 to 1, and is now 3 to 1.

These changes are in part due to SHA-256 being smaller and faster (so
without the smaller CPU cores the ratio would have been even lower), and
in part due to Denis not optimizing this design for maximum theoretical
clock rate (as reported by the design tools) to the same extent.  That
clock rate turned out to be unreachable in practice on the ZTEX boards
for sha512crypt and Drupal7 anyway: as a reminder, for those hashes the
toolset-reported clock rate was 225 MHz, while the actual stable clock
rate under full device utilization was up to 160 MHz.

Three SHA-256 cores, one soft CPU core, and memory and glue logic form a
unit.  The SHA-256 cores occupy 2/3 of the unit's area, and the soft CPU
core occupies 10%.  The rest goes primarily to shared SHA-256 context
logic such as buffering and padding, which isn't in the cores.

25 units fit in one Spartan-6 LX150 FPGA.  This means 25 soft CPU cores,
150 hardware threads, 75 SHA-256 cores, up to 150 in-flight SHA-256 per
FPGA.  Four times that per board.
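
The totals above follow directly from the unit composition.  A minimal
Python sketch of that arithmetic (the constant and function names are
made up for illustration, they are not from Denis' design):

```python
# Resource totals for the sha256crypt design.  Only the numbers come from
# this message; all names are illustrative.
UNITS_PER_FPGA = 25
SHA256_CORES_PER_UNIT = 3
INFLIGHT_PER_SHA256_CORE = 2
SMT_THREADS_PER_CPU = 6
FPGAS_PER_BOARD = 4


def totals(fpgas=1):
    """Return core/thread counts for the given number of FPGAs."""
    units = UNITS_PER_FPGA * fpgas  # one soft CPU core per unit
    sha256_cores = units * SHA256_CORES_PER_UNIT
    return {
        "cpu_cores": units,
        "hw_threads": units * SMT_THREADS_PER_CPU,
        "sha256_cores": sha256_cores,
        "inflight_sha256": sha256_cores * INFLIGHT_PER_SHA256_CORE,
    }


if __name__ == "__main__":
    print("per FPGA: ", totals(1))
    print("per board:", totals(FPGAS_PER_BOARD))
```

Note that the number of hardware threads equals the number of in-flight
SHA-256 computations (6 threads vs. 3 cores times 2 per unit).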

Also included are an on-device candidate password generator (for mask
mode, including in hybrid modes along with a wordlist coming from the
host, etc.) and a hash comparator (capable of up to 512 loaded hashes
per salt; no limit on total loaded hashes as that's handled on the
host).  Denis' design for sha512crypt and Drupal7 has the same.

Per Xilinx tools, this design was supposed to work at 166 MHz.  In our
testing on actual boards, the design works reliably for us at 135 MHz on
many boards tested, and at 160 MHz on some.  The frequency is
configurable in john.conf, where we set the default to 135 MHz.

As discussed in the Twitter thread below, sha256crypt's performance is
very sensitive to the combination of the salt and password lengths (and
this is also a reason to avoid using sha256crypt defensively - you get
major timing leaks of the password length even for realistically small
lengths such as 7 vs. 8 or 11 vs. 12 characters, with the exact
thresholds varying by salt length):

https://twitter.com/solardiz/status/1031235063181189120
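
To illustrate where such thresholds can come from, here is a rough
Python sketch that counts SHA-256 block computations in sha256crypt's
main loop for a given password and salt length.  It assumes the loop
structure of the published SHA-crypt specification (each round hashes
the previous digest and the password, plus the salt when the round
number isn't a multiple of 3 and the password again when it isn't a
multiple of 7), and it ignores the one-time setup digests.  The function
names are made up for this example.

```python
# Count SHA-256 compression-function calls in sha256crypt's main loop.
# Assumption: loop structure as in the published SHA-crypt specification;
# the setup digests computed before the loop are not counted.
DIGEST_LEN = 32  # SHA-256 output size in bytes


def sha256_blocks(msg_len):
    """64-byte blocks needed for a message of msg_len bytes.

    Padding adds one 0x80 byte and an 8-byte length field.
    """
    return (msg_len + 8) // 64 + 1


def main_loop_blocks(pw_len, salt_len, rounds=5000):
    total = 0
    for i in range(rounds):
        n = DIGEST_LEN + pw_len  # previous digest and password, every round
        if i % 3:
            n += salt_len        # salt is added unless i is a multiple of 3
        if i % 7:
            n += pw_len          # password again unless i is a multiple of 7
        total += sha256_blocks(n)
    return total


if __name__ == "__main__":
    for pw_len in range(6, 14):
        print(pw_len, main_loop_blocks(pw_len, salt_len=8))
```

With salt length 8, the block count under these assumptions steps up
between password lengths 7 and 8 and again between 11 and 12, which is
consistent with the thresholds mentioned above.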

For consistency with Hashcat benchmarks, I chose to use salt length 8
and password length 7, generating a test password hash with:

$ perl -e 'print crypt("pass256", "\$5\$saltsalt"), "\n";' > pw-sha256crypt-1
$ cat pw-sha256crypt-1
$5$saltsalt$ntUtUcOovI4zhuDuXQTtZ4lD7F8GHhVVRI4q1SIfQN3

Here's a test run against one sha256crypt hash on one board (4 FPGAs) at
135 MHz:

$ ./john -form=sha256crypt-ztex -mask='pas?a?a?a?a' pw-sha256crypt-1
[...]
Loaded 1 password hash (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Cost 1 (iteration count) is 5000 for all loaded hashes
Press 'q' or Ctrl-C to abort, almost any other key for status
pass256          (?)
1g 0:00:02:57 DONE (2018-08-27 15:41) 0.005640g/s 112392p/s 112392c/s 112392C/s pass256..pas##u6

Four boards (16 FPGAs), 135 MHz:

$ ./john -form=sha256crypt-ztex -mask='pas?a?a?a?a' pw-sha256crypt-1
[...]
Loaded 1 password hash (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Cost 1 (iteration count) is 5000 for all loaded hashes
Press 'q' or Ctrl-C to abort, almost any other key for status
pass256          (?)
1g 0:00:00:44 DONE (2018-08-27 15:57) 0.02234g/s 445201p/s 445201c/s 445201C/s pass256..pas##u6

Scaling efficiency 445201/112392/4 = 99.0%.
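
The same calculation as a small Python sketch, in case you want to
apply it to your own runs (the function name is made up for this
example):

```python
def scaling_efficiency(multi_rate, single_rate, n):
    """Ratio of the measured n-board rate to n times the one-board rate."""
    return multi_rate / (single_rate * n)


if __name__ == "__main__":
    # c/s figures from the one-board and four-board 135 MHz runs above
    eff = scaling_efficiency(445201, 112392, 4)
    print(f"{eff * 100:.1f}%")
```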

At 445 kH/s, the four boards together run at roughly 74% of the speed
of one GTX 1080 Ti, which is reported to achieve around 600 kH/s in
Jeremi Gosney's Hashcat benchmarks:

https://gist.github.com/epixoip/ace60d09981be09544fdd35005051505

Hashtype: sha256crypt $5$, SHA256 (Unix)

Speed.Dev.#1.....:   599.8 kH/s (75.76ms)
Speed.Dev.#2.....:   593.7 kH/s (76.53ms)
Speed.Dev.#3.....:   593.1 kH/s (76.59ms)
Speed.Dev.#4.....:   590.5 kH/s (76.94ms)
Speed.Dev.#5.....:   596.1 kH/s (76.24ms)
Speed.Dev.#6.....:   596.2 kH/s (76.22ms)
Speed.Dev.#7.....:   603.7 kH/s (75.27ms)
Speed.Dev.#8.....:   601.5 kH/s (75.53ms)
Speed.Dev.#*.....:  4774.6 kH/s

With lucky ZTEX boards doing this at 160 MHz, it'd be ~88% of a 1080 Ti.
(Only two of my four boards tested here are lucky enough.  All four
might pass the one password test, but from more extensive testing I know
that two would often miss guesses when running at 160 MHz.)
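
For reference, a sketch of how these percentages come about in Python.
The 600 kH/s GPU figure is the rounded per-device rate from the
benchmark above, and the 160 MHz figure extrapolates the one-board rate
of 133016 c/s (from the run shown next) to four boards assuming perfect
scaling, which is my assumption rather than a measurement:

```python
GPU_RATE = 600_000  # ~600 kH/s per GTX 1080 Ti, rounded from the benchmark


def fraction_of_gpu(rate, gpu_rate=GPU_RATE):
    """Cracking rate expressed as a fraction of one GPU's rate."""
    return rate / gpu_rate


if __name__ == "__main__":
    # Four boards at 135 MHz, as measured
    print(f"135 MHz: {fraction_of_gpu(445201):.1%}")
    # Four boards at 160 MHz, extrapolated from one board (assumption)
    print(f"160 MHz: {fraction_of_gpu(4 * 133016):.1%}")
```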

One board (4 FPGAs), 160 MHz:

$ ./john -form=sha256crypt-ztex -mask='pas?a?a?a?a' pw-sha256crypt-1
[...]
Loaded 1 password hash (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Cost 1 (iteration count) is 5000 for all loaded hashes
Press 'q' or Ctrl-C to abort, almost any other key for status
pass256          (?)
1g 0:00:02:29 DONE (2018-08-27 15:44) 0.006675g/s 133016p/s 133016c/s 133016C/s pass256..pas##u6

Denis says his board consumes 2.6A at 12V running this at 160 MHz, which
is 31.2W.  Comparing this to atom's "This same hash, running a GTX1080
and capped at 90W, is doing 355kH/s" (referring to a different hash with
the same salt length, so should be a valid comparison), we get 383 kH/s
per 90W for the FPGAs, which is slightly more energy-efficient than the
power-capped GPU's 355 kH/s.
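
The normalization to 90W is simple proportional scaling.  A minimal
Python sketch, with a made-up function name, assuming the rate scales
linearly with the number of boards (and thus with power):

```python
def rate_at_power(rate, watts, target_watts=90.0):
    """Scale a measured rate linearly to a target power budget."""
    return rate * target_watts / watts


if __name__ == "__main__":
    board_watts = 2.6 * 12  # 2.6A at 12V, per Denis' measurement
    fpga = rate_at_power(133016, board_watts)
    print(f"FPGA: {fpga / 1000:.1f} kH/s per 90W vs. GPU: 355 kH/s per 90W")
```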

Now to some multi-hash runs for reliability testing:

$ perl -e 'for ($i = 100; $i < 612; $i++) { print crypt("pass$i", "\$5\$saltsalt"), "\n"; }' > pw-sha256crypt

One board (4 FPGAs), 160 MHz:

$ ./john -form=sha256crypt-ztex -mask='pas?a?a?a?a' -dev=04A3465XXX -verb=1 pw-sha256crypt
[...]
Loaded 512 password hashes with no different salts (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:05 0.93% (ETA: 15:54:43) 0g/s 130932p/s 130932c/s 67037KC/s pasaa"a..pasat"a
52g 0:00:00:23 3.86% (ETA: 15:55:41) 2.195g/s 132574p/s 132574c/s 65162KC/s pasaaVi..pasatVi
206g 0:00:01:33 15.29% (ETA: 15:55:53) 2.197g/s 132848p/s 132848c/s 57501KC/s pasaaYc..pasatYc
461g 0:00:02:38 25.93% (ETA: 15:55:55) 2.901g/s 132895p/s 132895c/s 42594KC/s pasaa*b..pasat*b
512g 0:00:02:43 DONE (2018-08-27 15:48) 3.124g/s 132846p/s 132846c/s 41449KC/s pass577..pas##E7

Note that it's almost the same p/s and c/s rate as we had for one hash
(just slightly slower: 133.0k vs. 132.8k), but a much higher C/s rate
(comparisons per second) due to the matching salts (in fact, only one
salt for all hashes).
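
As a sanity check, the C/s figure on the first status line is simply the
c/s rate times the number of loaded hashes sharing the salt.  A Python
sketch (the function name is made up; the averaged C/s then drops later
in the run as hashes get cracked and removed):

```python
def comparisons_per_second(crypts_per_second, hashes_sharing_salt):
    """Each computed hash is compared against every loaded hash with that salt."""
    return crypts_per_second * hashes_sharing_salt


if __name__ == "__main__":
    # First status line above: 130932c/s with 512 hashes still loaded
    print(comparisons_per_second(130932, 512) // 1000, "KC/s")
```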

Two boards (8 FPGAs), 160 MHz:

$ ./john -form=sha256crypt-ztex -mask='pas?a?a?a?a' -dev=04A3465XXX,04A3466XXX -verb=1 pw-sha256crypt
[...]
Loaded 512 password hashes with no different salts (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
52g 0:00:00:27 9.04% (ETA: 15:54:11) 1.868g/s 264620p/s 264620c/s 125368KC/s pasaaKs..pasa6Ks
206g 0:00:00:47 15.42% (ETA: 15:54:16) 4.345g/s 264982p/s 264982c/s 114756KC/s pasaa@...pasa6@c
410g 0:00:01:07 22.07% (ETA: 15:54:16) 6.037g/s 264729p/s 264729c/s 96949KC/s pasaa{4..pasa6{4
512g 0:00:01:22 DONE (2018-08-27 15:50) 6.194g/s 264657p/s 264657c/s 82757KC/s pass177..pas##D7

Four boards (16 FPGAs), 160 MHz:

$ ./john -form=sha256crypt-ztex -mask='pas?a?a?a?a' -verb=1 pw-sha256crypt
[...]
Loaded 512 password hashes with no different salts (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:03 2.13% (ETA: 15:53:17) 0g/s 514183p/s 514183c/s 263262KC/s pasaa11..pasa.11
50g 0:00:00:08 5.32% (ETA: 15:53:26) 6.024g/s 521927p/s 521927c/s 254178KC/s pas32%o..pasa.nn
168g 0:00:00:21 13.83% (ETA: 15:53:27) 7.835g/s 525335p/s 525335c/s 239972KC/s pass223..pasa.33
503g 0:00:00:46 30.32% (ETA: 15:53:28) 10.72g/s 526490p/s 526490c/s 151516KC/s pasaaQp..pasa.Qp
[...]
503g 0:00:02:34 DONE (2018-08-27 15:53) 3.252g/s 526652p/s 526652c/s 48574KC/s pasaa||..pas||}|

Oops, like I said, the other two boards don't manage this frequency -
only 503 out of 512 passwords got cracked.  (The longer runtime and
lower C/s rate are explained by this run having done more work: it
continued to test other candidate passwords against the remaining 9
hashes past the point where the previous two runs had stopped upon
cracking all passwords.)
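
The runtime of such an exhaustive run can be estimated from the mask's
keyspace.  A Python sketch, assuming ?a stands for the 95 printable
ASCII characters (which is consistent with the 2:34 runtime above, but
is my assumption about this build's settings); the names are made up:

```python
ALL_PRINTABLE = 95  # assumption: ?a covers the 95 printable ASCII characters


def mask_keyspace(placeholders, charset_size=ALL_PRINTABLE):
    """Number of candidates for a mask with the given number of ?a's."""
    return charset_size ** placeholders


def seconds_to_exhaust(keyspace, rate):
    return keyspace / rate


if __name__ == "__main__":
    keyspace = mask_keyspace(4)  # 'pas?a?a?a?a' has four placeholders
    print(keyspace, "candidates")
    print(f"{seconds_to_exhaust(keyspace, 526652):.0f} s at 526652 p/s")
```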

Four boards (16 FPGAs), 135 MHz:

$ ./john -form=sha256crypt-ztex -mask='pas?a?a?a?a' -verb=1 pw-sha256crypt
[...]
Loaded 512 password hashes with no different salts (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:03 1.60% (ETA: 15:57:39) 0g/s 428910p/s 428910c/s 219602KC/s pasaaBe..pasa.Be
115g 0:00:00:19 10.64% (ETA: 15:57:30) 5.888g/s 443625p/s 443625c/s 208548KC/s pass302..pasa.22
206g 0:00:00:27 14.89% (ETA: 15:57:32) 7.545g/s 444307p/s 444307c/s 196239KC/s pasaacc..pasa.cc
410g 0:00:00:39 21.81% (ETA: 15:57:30) 10.26g/s 444808p/s 444808c/s 167448KC/s pass374..pasa./4
512g 0:00:00:49 DONE (2018-08-27 15:55) 10.29g/s 444352p/s 444352c/s 139720KC/s pass477..pas##\7
Session completed

This worked OK.

Now to some wordlist mode runs, using RockYou top 1 million passwords
sorted for decreasing number of occurrences.  (More precisely, 1136144
passwords to have a consistent cut-off number of occurrences.)

Two boards (8 FPGAs), 160 MHz:

$ ./john -form=sha256crypt-ztex -w=rtop1m -dev=04A3465XXX,04A3466XXX -verb=1 pw-sha256crypt
[...]
Loaded 512 password hashes with no different salts (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Note: This format may be a lot faster with --mask acceleration (see doc/MASK).
Press 'q' or Ctrl-C to abort, almost any other key for status
11g 0:00:00:05 DONE (2018-08-27 16:26) 1.867g/s 192888p/s 192888c/s 97829KC/s br0926..  nam

That was too quick, let's add a digit:

$ ./john -form=sha256crypt-ztex -w=rtop1m -mask='?w?d' -dev=04A3465XXX,04A3466XXX -verb=1 pw-sha256crypt
[...]
Loaded 512 password hashes with no different salts (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:01 4.16% (ETA: 16:28:03) 0g/s 180431p/s 180431c/s 92381KC/s scofield1..1403927
20g 0:00:00:03 8.41% (ETA: 16:28:14) 5.089g/s 191450p/s 191450c/s 97384KC/s pass116..tiamarie7
60g 0:00:00:06 12.70% (ETA: 16:28:26) 9.118g/s 190577p/s 190577c/s 95288KC/s pass146..070319957
70g 0:00:00:09 17.02% (ETA: 16:28:31) 7.567g/s 189794p/s 189794c/s 92294KC/s pass154..0116007
80g 0:00:00:10 19.21% (ETA: 16:28:31) 7.554g/s 189461p/s 189461c/s 91083KC/s pass217..danny937
100g 0:00:00:13 23.57% (ETA: 16:28:34) 7.513g/s 188429p/s 188429c/s 88561KC/s pass107..240425257
150g 0:00:00:18 32.32% (ETA: 16:28:34) 7.991g/s 187064p/s 187064c/s 83885KC/s pass167..hanneman7
180g 0:00:00:27 45.63% (ETA: 16:28:38) 6.649g/s 185297p/s 185297c/s 78288KC/s pass267..im1ru127
210g 0:00:00:35 59.02% (ETA: 16:28:38) 5.920g/s 183839p/s 183839c/s 73620KC/s hickling1..gunpowder17
280g 0:00:01:02 DONE (2018-08-27 16:28) 4.480g/s 181778p/s 181778c/s 61995KC/s 060850#..-----

(Like in some other runs, I pressed a key a few times during this one to
see status.  "pass" is seen so often at the start of a range due to a
peculiarity of JtR's internal "formats" interface: when successful
guesses are found in the range, the same interface returns them.)

Note that the c/s rate is much lower than it was for the 7 character
mask runs (was 264657c/s, now 181778c/s).  That's primarily because of
sha256crypt's sensitivity to (candidate) password lengths.  Our wordlist
contains many lines longer than 7, and they're not sorted by length.

Let's try sorting them for increasing length:

$ awk '{ print length, $0 }' < rtop1m | sort -n | cut -d' ' -f2- > rtop1m-by-length
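
For those who prefer a script, here is a rough Python equivalent (a
sketch, not what I used for the runs in this message).  It differs from
the pipeline above in two ways: it measures length in bytes rather than
in the locale's characters, and Python's sort is stable, so passwords of
equal length keep their original (popularity) order instead of being
reordered by sort(1).  The function name and command-line usage are made
up for this example.

```python
import sys


def sort_by_length(words):
    """Sort by length; sorted() is stable, so ties keep their input order."""
    return sorted(words, key=len)


if __name__ == "__main__" and len(sys.argv) == 3:
    src, dst = sys.argv[1], sys.argv[2]
    # Work on bytes so that odd encodings in the wordlist can't cause errors
    with open(src, "rb") as f:
        words = f.read().splitlines()
    with open(dst, "wb") as f:
        f.writelines(w + b"\n" for w in sort_by_length(words))
```

Hypothetical usage, if saved as sort-by-length.py:
python3 sort-by-length.py rtop1m rtop1m-by-length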

$ ./john -form=sha256crypt-ztex -w=rtop1m-by-length -mask='?w?d' -dev=04A3465XXX,04A3466XXX -verb=1 pw-sha256crypt
[...]
Loaded 512 password hashes with no different salts (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:03 8.40% (ETA: 16:34:54) 0g/s 254619p/s 254619c/s 130365KC/s 1618251..1783107
0g 0:00:00:05 12.01% (ETA: 16:35:00) 0g/s 257671p/s 257671c/s 131927KC/s 6805021..7533577
280g 0:00:00:17 32.63% (ETA: 16:35:11) 16.14g/s 245882p/s 245882c/s 109692KC/s ashes011..baby6217
280g 0:00:00:24 42.94% (ETA: 16:35:14) 11.32g/s 223203p/s 223203c/s 88713KC/s pencere1..pipe1237
280g 0:00:00:29 49.62% (ETA: 16:35:16) 9.605g/s 215094p/s 215094c/s 81219KC/s 181019611..198602257
280g 0:00:00:35 58.90% (ETA: 16:35:17) 7.984g/s 207390p/s 207390c/s 74145KC/s funy65411..giants247
280g 0:00:00:51 86.55% (ETA: 16:35:17) 5.455g/s 195479p/s 195479c/s 63139KC/s allahlove11..ashleybabe7
280g 0:00:01:00 DONE (2018-08-27 16:35) 4.662g/s 189163p/s 189163c/s 59085KC/s andresydaniela#..-----

This is now slightly faster: 182k to 189k c/s overall, and 191k to 254k
early on (on smaller password lengths, especially in the second run).
And in this case the final guess count is reached at least twice as
quickly.

BTW, such sorting by length is also relevant on GPU (and also for
md5crypt).  On the original HD 7970 (925 MHz), without sorting:

$ ./john -form=sha256crypt-opencl -w=rtop1m -mask='?w?d' -verb=1 pw-sha256crypt
Using default input encoding: UTF-8
Loaded 512 password hashes with no different salts (sha256crypt-opencl, crypt(3) $5$ [SHA256 OpenCL])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:04 4.36% (ETA: 16:41:54) 0g/s 64250p/s 64250c/s 32896KC/s 1228859..godsgrace4
10g 0:00:00:06 6.57% (ETA: 16:41:54) 1.647g/s 86373p/s 86373c/s 44223KC/s godsgrace6..leontina0
30g 0:00:00:11 11.03% (ETA: 16:42:02) 2.710g/s 94722p/s 94722c/s 47787KC/s royden5..koreans7
60g 0:00:00:14 15.54% (ETA: 16:41:53) 4.261g/s 111709p/s 111709c/s 54960KC/s morena239..lavie4
80g 0:00:00:17 17.80% (ETA: 16:41:58) 4.686g/s 107499p/s 107499c/s 52275KC/s lavie6..0905760
100g 0:00:00:23 24.62% (ETA: 16:41:56) 4.332g/s 113580p/s 113580c/s 53269KC/s rebecca261..yngrid3
150g 0:00:00:30 31.56% (ETA: 16:41:58) 4.986g/s 113293p/s 113293c/s 51295KC/s nigga192..ambermay8
180g 0:00:00:44 45.42% (ETA: 16:41:59) 4.080g/s 112916p/s 112916c/s 47888KC/s karen4565..chachi27
210g 0:00:01:07 66.18% (ETA: 16:42:04) 3.131g/s 109438p/s 109438c/s 42704KC/s 178417842..velvet98
280g 0:00:01:16 75.82% (ETA: 16:42:03) 3.680g/s 110260p/s 110260c/s 41809KC/s nodarse6..marklt0
280g 0:00:01:42 DONE (2018-08-27 16:42) 2.727g/s 110683p/s 110683c/s 37708KC/s 0743382..  nam7

With sorting:

$ ./john -form=sha256crypt-opencl -w=rtop1m-by-length -mask='?w?d' -verb=1 pw-sha256crypt
Using default input encoding: UTF-8
Loaded 512 password hashes with no different salts (sha256crypt-opencl, crypt(3) $5$ [SHA256 OpenCL])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:04 5.03% (ETA: 16:38:42) 0g/s 129134p/s 129134c/s 66117KC/s tk1236..0923250
0g 0:00:00:07 8.81% (ETA: 16:38:42) 0g/s 148313p/s 148313c/s 75936KC/s 1901025..2908127
0g 0:00:00:20 23.90% (ETA: 16:38:46) 0g/s 156270p/s 156270c/s 80010KC/s niki886..sammy10
280g 0:00:00:35 36.46% (ETA: 16:38:58) 7.986g/s 134586p/s 134586c/s 58440KC/s felix132..jhennel8
280g 0:00:01:40 DONE (2018-08-27 16:39) 2.791g/s 113266p/s 113266c/s 35791KC/s sporting4ever2..????????7

This is also a slight speedup overall, and a larger increase in the c/s
rate early on (while on smaller lengths).

We'd appreciate more testing, such as on Royce's larger cluster of ZTEX
boards maybe.  Please post your results as follow-ups to this message.

Alexander
