Message-ID: <20181012185956.GA23135@openwall.com>
Date: Fri, 12 Oct 2018 20:59:57 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Cc: Denis Burykin <apingis@...nwall.net>
Subject: md5crypt & phpass password cracking on FPGA

Hi,

As many of you are aware, we support descrypt, bcrypt, sha512crypt,
sha256crypt, and Drupal7 password hash cracking on the old ZTEX 1.15y
quad-FPGA boards.  Threads:

https://www.openwall.com/lists/john-users/2016/11/06/1
https://www.openwall.com/lists/john-users/2017/06/25/1
https://www.openwall.com/lists/john-users/2018/07/23/1
https://www.openwall.com/lists/john-users/2018/08/27/11

Now Denis has also added support for md5crypt and for phpass "portable
hashes" on those same boards.  (While phpass as released by Openwall
primarily uses bcrypt, it also includes a "last resort fallback" to
MD5-based "portable hashes", which many popular web apps forced use of
for portability.)

Similarly to sha512crypt, sha256crypt, and Drupal7, this addition is not
so much to compete with GPUs as it is to provide a way to put those FPGA
boards to more uses.  Also, just like with our implementations of
sha512crypt, sha256crypt, and Drupal7 hashes on FPGA, this is, to the
best of my knowledge, the very first time md5crypt and phpass "portable
hashes" have been implemented on FPGA.

Denis wrote a good description of the design with some ASCII diagrams,
currently found here:

https://github.com/magnumripper/JohnTheRipper/tree/bleeding-jumbo/src/ztex/fpga-md5crypt

Similarly to Denis' designs for sha512crypt + Drupal7 and sha256crypt,
the new one for md5crypt + phpass uses specialized soft CPU cores along
with cryptographic cores.  However, the specific parameters of those
cores changed once again: this time, it's 16-bit 12-way SMT CPU cores
along with sets of 3 MD5 cores each capable of up to 4 in-flight hashes.

Three MD5 cores, one soft CPU core, and memory and glue logic form a
unit.  32 units fit in one Spartan-6 LX150 FPGA.  This means 32 soft CPU
cores, 384 hardware threads, 96 MD5 cores, up to 384 in-flight MD5 per
FPGA.  Four times that - meaning 1536 in-flight hashes - per board.
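The totals above follow directly from the per-unit figures.  A minimal
Python sketch of that arithmetic (the variable names are made up for
illustration and do not come from the design sources):

```python
# Per-unit parameters as stated in the message.
MD5_CORES_PER_UNIT = 3
IN_FLIGHT_PER_MD5_CORE = 4
THREADS_PER_CPU_CORE = 12    # 12-way SMT soft CPU core, one per unit
UNITS_PER_FPGA = 32          # Spartan-6 LX150
FPGAS_PER_BOARD = 4          # ZTEX 1.15y is a quad-FPGA board

cpu_cores = UNITS_PER_FPGA
hw_threads = cpu_cores * THREADS_PER_CPU_CORE
md5_cores = UNITS_PER_FPGA * MD5_CORES_PER_UNIT
in_flight_per_fpga = md5_cores * IN_FLIGHT_PER_MD5_CORE
in_flight_per_board = in_flight_per_fpga * FPGAS_PER_BOARD

print(cpu_cores, hw_threads, md5_cores,
      in_flight_per_fpga, in_flight_per_board)
# prints: 32 384 96 384 1536
```

Note that the number of hardware threads per unit (12) equals the number
of in-flight hashes per unit (3 x 4), which is why both totals are 384.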

Also included are an on-device candidate password generator (for mask
mode, including in hybrid modes along with a wordlist coming from the
host, etc.) and a hash comparator (capable of up to 512 loaded hashes
per salt; no limit on total loaded hashes as that's handled on the
host).  The designs for sha512crypt + Drupal7 and sha256crypt include
these same components.

Per Xilinx tools, this design was supposed to work at 202 MHz.  In our
testing on actual boards, the design works reliably for us at 180 MHz,
which we set as the default and made configurable in john.conf.

Some other implementations of md5crypt and phpass "portable hashes" that
we have in JtR have password length limitations of 15 and 39 characters,
respectively - as an optimization.  This FPGA implementation of these
hashes is capable of password lengths up to 64.  Actual performance
varies by salt and password length.  For the benchmarks below, I'll use
a salt length of 8 and a password length of 7 to match Hashcat
benchmarks.

Here's a test run against one md5crypt hash on one board (4 FPGAs) at
180 MHz:

$ perl -e 'print crypt("passMD5", "\$1\$saltsalt"), "\n";' > pw-md5crypt-1
$ cat pw-md5crypt-1
$1$saltsalt$TwZH0EJ82F8jZZW.s.uLn/
$ ./john -form=md5crypt-ztex -mask='pas?a?a?a?a' pw-md5crypt-1
[...]
Loaded 1 password hash (md5crypt-ztex, crypt(3) $1$ [md5crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
passMD5          (?)
1g 0:00:00:18 DONE (2018-10-12 19:31) 0.05506g/s 944245p/s 944245c/s 944245C/s passMD5..pas##|5

A higher frequency might work, but isn't reliable across the boards we
tested.  That said, here's a lucky run on a lucky board at 210 MHz just
to hit and exceed 1M c/s:

$ ./john -form=md5crypt-ztex -mask='pas?a?a?a?a' pw-md5crypt-1 -dev=04A3466XXX
ZTEX 04A3466XXX bus:2 dev:8 Frequency:210 210 210 210
Using default input encoding: UTF-8
Loaded 1 password hash (md5crypt-ztex, crypt(3) $1$ [md5crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
passMD5          (?)
1g 0:00:00:15 DONE (2018-10-12 19:43) 0.06389g/s 1095Kp/s 1095Kc/s 1095KC/s passMD5..pas##|5

For all further tests, I'll use 180 MHz.  Let's pretend more of the
password is unknown, for a longer test on one board (4 FPGAs) and also
to test a hybrid mode:

$ ./john -form=md5crypt-ztex -inc -mask='?w?a?a?a' -min-len=7 -max-len=7 pw-md5crypt-1
[...]
Loaded 1 password hash (md5crypt-ztex, crypt(3) $1$ [md5crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
passMD5          (?)
1g 0:00:02:23 DONE (2018-10-12 19:55) 0.006945g/s 952837p/s 952837c/s 952837C/s passMD5..pass##|

Four boards (16 FPGAs):

$ ./john -form=md5crypt-ztex -inc -mask='?w?a?a?a' -min-len=7 -max-len=7 pw-md5crypt-1
[...]
Loaded 1 password hash (md5crypt-ztex, crypt(3) $1$ [md5crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
passMD5          (?)
1g 0:00:00:36 DONE (2018-10-12 19:55) 0.02751g/s 3773Kp/s 3773Kc/s 3773KC/s passMD5..pass##|

Scaling efficiency 3773000/952837/4 = 99.0%.
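For anyone repeating this with a different number of boards, the same
calculation as a small Python sketch (the function name is hypothetical;
the rates are the c/s figures from the two runs above, with "K" taken as
1000 as in the formula):

```python
def scaling_efficiency(multi_rate, single_rate, n_devices):
    """Fraction of ideal linear speedup achieved by n_devices."""
    return multi_rate / (single_rate * n_devices)

one_board_cs = 952837       # c/s on one board (4 FPGAs)
four_boards_cs = 3773000    # c/s on four boards, reported as 3773K

eff = scaling_efficiency(four_boards_cs, one_board_cs, 4)
print(f"{eff:.1%}")         # prints: 99.0%
```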

Modern high-end GPUs are several times faster than that.  However, this
speed is on par with what was achieved with Hashcat on AMD HD 7970 aka
Tahiti, the fastest GPU contemporary to these FPGA boards (circa 2012),
and ours is achieved at moderately lower power consumption.  Denis says
the boards running this design consume around 2.8A at 12V at 190 MHz (we
don't readily have a figure for 180 MHz and I don't want to delay
posting this), which means about 135W for the four boards.  HD 7970 was
a 250W TDP GPU card; its actual power usage could be less, and
underclocking could provide better power efficiency, but probably not to
the extent of reaching 135W at this performance level.
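The 135W figure is just current times voltage times the number of
boards.  A rough Python sketch of that estimate, assuming all four
boards draw the same 2.8A that Denis measured at 190 MHz (so it slightly
overstates the draw at 180 MHz):

```python
current_a = 2.8     # per board, measured at 190 MHz
voltage_v = 12.0
boards = 4
hd7970_tdp_w = 250  # TDP, an upper bound rather than measured usage

per_board_w = current_a * voltage_v
total_w = per_board_w * boards

print(f"{per_board_w:.1f} W per board, {total_w:.1f} W for {boards} boards")
print(f"{total_w / hd7970_tdp_w:.0%} of the HD 7970 TDP")
```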

Now to some multi-hash runs for reliability testing.  Four boards, mask:

$ perl -e 'for ($i = 100; $i < 612; $i++) { print crypt("pass$i", "\$1\$saltsalt"), "\n"; }' > pw-md5crypt
$ ./john -form=md5crypt-ztex -mask='pas?a?a?a?a' -verb=1 pw-md5crypt
[...]
Loaded 512 password hashes with no different salts (md5crypt-ztex, crypt(3) $1$ [md5crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
512g 0:00:00:06 DONE (2018-10-11 20:55) 74.41g/s 3672Kp/s 3672Kc/s 1263MC/s pass477..pas##Mj

Four boards, larger mask for a longer run:

$ ./john -form=md5crypt-ztex -mask='pa?l?a?a?a?a' -verb=1 pw-md5crypt
[...]
Loaded 512 password hashes with no different salts (md5crypt-ztex, crypt(3) $1$ [md5crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:06 1.06% (ETA: 21:19:48) 0g/s 3711Kp/s 3711Kc/s 1922MC/s paaaaee..paaa9ee
52g 0:00:00:18 3.37% (ETA: 21:19:19) 2.744g/s 3764Kp/s 3764Kc/s 1968MC/s paaaa5i..paaa95i
155g 0:00:01:00 10.81% (ETA: 21:19:38) 2.555g/s 3776Kp/s 3776Kc/s 1779MC/s pass372..pass242
359g 0:00:01:57 20.92% (ETA: 21:19:43) 3.065g/s 3783Kp/s 3783Kc/s 1414MC/s paaaa%5..paaa9%5
507g 0:00:02:28 26.59% (ETA: 21:19:41) 3.404g/s 3781Kp/s 3781Kc/s 1188MC/s pass367..pass587
512g 0:00:02:30 DONE (2018-10-11 21:12) 3.413g/s 3779Kp/s 3779Kc/s 1172MC/s pass477..pa###R7

(I pressed a key a few times.)

This shows roughly the same speed as we have for a large enough mask
when running against one hash, meaning the comparator against 512 loaded
hashes (sharing this same salt for testing) doesn't slow things down.

Four boards, wordlist and mask:

$ ./john -form=md5crypt-ztex -w=rtop1m -mask='?w?d' -verb=1 pw-md5crypt
[...]
Loaded 512 password hashes with no different salts (md5crypt-ztex, crypt(3) $1$ [md5crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
280g 0:00:00:03 DONE (2018-10-11 21:01) 80.92g/s 3283Kp/s 3283Kc/s 1468MC/s 1alyssak#..Tom_Ere_2k9#

The number 280 is correct - you can also see it for a similar test in
the posting about sha256crypt referenced above.  But 3 seconds is too
quick for a speed measurement, and transferring a word from host over
USB for every 10 hashes computed is probably slow.  Let's pretend we
didn't know the last character is a digit, so we can increase our "mask
amplifier":

$ ./john -form=md5crypt-ztex -w=rtop1m -mask='?w?a' -verb=1 pw-md5crypt
[...]
Loaded 512 password hashes with no different salts (md5crypt-ztex, crypt(3) $1$ [md5crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
280g 0:00:00:29 DONE (2018-10-11 21:04) 9.582g/s 3693Kp/s 3693Kc/s 1271MC/s 004811010124i..-----

Now we get performance figures closer to the maximum we've seen before.
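The "mask amplifier" is the number of candidates generated on the device
per word sent by the host.  Here is a minimal Python sketch for
estimating it.  It handles only ?w, literal characters, and the four
placeholders used in this message, and it assumes the default ASCII
sizes for them (95 printable characters for ?a); the function name is
hypothetical and this is not how JtR itself parses masks:

```python
# Assumed ASCII sizes of the mask placeholders used in this message.
PLACEHOLDER_SIZES = {"l": 26, "u": 26, "d": 10, "a": 95}

def mask_amplifier(mask):
    """Candidates generated on the device per host-supplied word."""
    count = 1
    i = 0
    while i < len(mask):
        if mask[i] == "?" and i + 1 < len(mask):
            kind = mask[i + 1]
            if kind != "w":          # ?w stands for the word itself
                count *= PLACEHOLDER_SIZES[kind]
            i += 2
        else:
            i += 1                   # a literal character adds no factor
    return count

for m in ("?w?d", "?w?a", "?w?d?d", "?w?a?a?a"):
    print(m, mask_amplifier(m))
# prints:
# ?w?d 10
# ?w?a 95
# ?w?d?d 100
# ?w?a?a?a 857375
```

With '?w?d' the host has to deliver a new word for every 10 hashes
computed, whereas with '?w?a' it is one word per 95.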

A similarly amplifying mask is two digits:

$ ./john -form=md5crypt-ztex -w=rtop1m -mask='?w?d?d' -verb=1 pw-md5crypt
[...]
Loaded 512 password hashes with no different salts (md5crypt-ztex, crypt(3) $1$ [md5crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
300g 0:00:00:30 DONE (2018-10-12 20:27) 9.708g/s 3676Kp/s 3676Kc/s 1282MC/s 1alyssak##..-----

Again, 300 is the correct number here, as confirmed by a run on the
original HD 7970 (925 MHz):

$ ./john -form=md5crypt-opencl -w=rtop1m -mask='?w?d?d' -verb=1 pw-md5crypt
Using default input encoding: UTF-8
Loaded 512 password hashes with no different salts (md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL])
Press 'q' or Ctrl-C to abort, almost any other key for status
300g 0:00:00:55 DONE (2018-10-12 20:35) 5.404g/s 2043Kp/s 2043Kc/s 691495KC/s !ejr8!69..  nam77

Incidentally, "  nam" is in fact the last line in the "rtop1m" wordlist.
Unfortunately, the reporting of the range of candidate passwords is
wrong when using on-device mask (which for this hash type we have on
FPGA, but not on GPU).

Now to phpass tests.  Four boards, mask only:

$ cat pw-phpass
$P$9saltstriXeNc.xV8N.K9cTs/XEn13.
$ ./john -form=phpass-ztex -mask='a?l?l?l?l?l?l' pw-phpass
[...]
Loaded 1 password hash (phpass-ztex [phpass ($P$ or $H$)])
Cost 1 (iteration count) is 2048 for all loaded hashes
Press 'q' or Ctrl-C to abort, almost any other key for status
abcdefg          (?)
1g 0:00:01:53 DONE (2018-10-11 21:42) 0.008816g/s 1881Kp/s 1881Kc/s 1881KC/s abcdefg..a###eqg

As expected, this is roughly one half of md5crypt's speed because the
number of iterations of MD5 is increased from 1000 to 2048.  (It can
vary between different phpass hashes.  It's just that 2048 is used for
benchmarks for historical reasons.  This is also the value that phpBB3
uses, whereas WordPress uses 8192 - thus, cracking of WordPress phpass
hashes is 4 times slower yet - but is also supported on FPGA now.)
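The iteration count is stored in the hash string itself: the character
right after the "$P$" or "$H$" prefix encodes log2 of the count in
phpass' base-64 alphabet.  A minimal Python sketch of decoding it (the
function name is hypothetical; only the alphabet and the encoding come
from phpass):

```python
# phpass' base-64 alphabet ("itoa64").
ITOA64 = "./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def phpass_iterations(hash_string):
    """Return the MD5 iteration count encoded in a phpass portable hash."""
    if not hash_string.startswith(("$P$", "$H$")):
        raise ValueError("not a phpass portable hash")
    return 1 << ITOA64.index(hash_string[3])

count = phpass_iterations("$P$9saltstriXeNc.xV8N.K9cTs/XEn13.")
print(count)            # prints: 2048
print(8192 // count)    # prints: 4 (the WordPress slowdown factor)
```

The '9' in the test hash used here decodes to 2^11 = 2048, matching the
"Cost 1 (iteration count)" line that JtR prints, and a 'B' in that
position decodes to 8192.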

It's possible to implement phpass "portable hashes" on FPGA slightly
more efficiently, without bothering with the soft CPUs and reclaiming
their area for more MD5 cores, since it's a much simpler algorithm than
md5crypt.  However, we preferred to get the implementation almost for
free on top of the md5crypt design, by having phpass implemented as a
different entry point into the soft CPU program running on the exact
same hardware design (same bitstream).  (It's the exact same approach we
used for having sha512crypt and Drupal7 share the hardware design.)

Hybrid with somewhat low mask amplifier (one letter):

$ ./john -form=phpass-ztex -inc=lower -mask='?w?l' -min-len=7 -max-len=7 pw-phpass
[...]
Loaded 1 password hash (phpass-ztex [phpass ($P$ or $H$)])
Cost 1 (iteration count) is 2048 for all loaded hashes
Press 'q' or Ctrl-C to abort, almost any other key for status
abcdefg          (?)
1g 0:00:00:05 DONE (2018-10-11 21:43) 0.1904g/s 1797Kp/s 1797Kc/s 1797KC/s abcdefg..llabat#

No mask, every candidate password is transferred over USB from host:

$ ./john -form=phpass-ztex -inc=lower -min-len=7 -max-len=7 pw-phpass
[...]
Loaded 1 password hash (phpass-ztex [phpass ($P$ or $H$)])
Cost 1 (iteration count) is 2048 for all loaded hashes
Note: This format may be a lot faster with --mask acceleration (see doc/MASK).
Warning: Slow communication channel to the device. Increase mask or expect performance degradation.
Press 'q' or Ctrl-C to abort, almost any other key for status
abcdefg          (?)
1g 0:00:00:01 DONE (2018-10-11 21:43) 0.7812g/s 1228Kp/s 1228Kc/s 1228KC/s abcdefg..lyziesi

The c/s rate is lower by about a third, but the attack finishes much
sooner due to the more optimal ordering of candidate passwords.  In
fact, the attack duration is still low even if we don't specify the
password length (let incremental mode try different lengths):

$ ./john -form=phpass-ztex -inc=lower pw-phpass
[...]
Loaded 1 password hash (phpass-ztex [phpass ($P$ or $H$)])
Cost 1 (iteration count) is 2048 for all loaded hashes
Note: This format may be a lot faster with --mask acceleration (see doc/MASK).
Warning: Slow communication channel to the device. Increase mask or expect performance degradation.
Press 'q' or Ctrl-C to abort, almost any other key for status
abcdefg          (?)
1g 0:00:00:02 DONE (2018-10-11 21:45) 0.4149g/s 1305Kp/s 1305Kc/s 1305KC/s abcdefg..shunnas

And it's essentially zero if we let JtR use its default wordlist, as
this is a common password:

$ ./john -form=phpass-ztex pw-phpass
[...]
Loaded 1 password hash (phpass-ztex [phpass ($P$ or $H$)])
Cost 1 (iteration count) is 2048 for all loaded hashes
Note: This format may be a lot faster with --mask acceleration (see doc/MASK).
Warning: Slow communication channel to the device. Increase mask or expect performance degradation.
Press 'q' or Ctrl-C to abort, almost any other key for status
abcdefg          (?)
1g 0:00:00:00 DONE 2/3 (2018-10-11 21:45) 1.694g/s 265840p/s 265840c/s 265840C/s abcdefg..Sssing

This shows that while raw performance is important, being smart is even
more important.

We'd appreciate more testing of this and other ZTEX formats by the
community.  Please post your results as follow-ups to the appropriate
threads, or start a new thread if your posting isn't hash type specific.

Alexander
