Message-ID: <20180827160716.GA13109@openwall.com>
Date: Mon, 27 Aug 2018 18:07:16 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Cc: Denis Burykin <apingis@...nwall.net>
Subject: sha256crypt password cracking on FPGA

Hi,

As many of you are aware, we support descrypt, bcrypt, sha512crypt, and Drupal7 password hash cracking on the old ZTEX 1.15y quad-FPGA boards. Threads:

http://www.openwall.com/lists/john-users/2016/11/06/1
http://www.openwall.com/lists/john-users/2017/06/25/1
http://www.openwall.com/lists/john-users/2018/07/23/1

Now Denis has also added support for sha256crypt on those same boards. As with sha512crypt and Drupal7, this addition is not so much to compete with GPUs as it is to provide a way to put those FPGA boards to more uses. Also, just like sha512crypt and Drupal7, this is, to the best of my knowledge, the very first time sha256crypt has been implemented on FPGA.

Denis wrote a good description of the design with some ASCII diagrams, currently found here:

https://github.com/magnumripper/JohnTheRipper/tree/bleeding-jumbo/src/ztex/fpga-sha256crypt

Similarly to Denis' design for sha512crypt and Drupal7, the new one for sha256crypt uses specialized soft CPU cores along with cryptographic cores. However, the specific parameters of those cores changed: while the sha512crypt and Drupal7 design used 32-bit 16-way SMT CPU cores, the one for sha256crypt uses smaller 16-bit 6-way SMT CPU cores, and while the SHA-512 cores handled up to 4 in-flight hashes, the SHA-256 ones handle only up to 2. Accordingly, the ratio of SHA-2 to CPU cores was 4 to 1, and is now 3 to 1.
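To see how these parameters fit together, here is a back-of-the-envelope sketch in Python. It multiplies out the per-unit figures (one soft CPU core per unit, with the SMT width, SHA-2 core count and in-flight depth given above) and uses the 25 units per Spartan-6 LX150 and 4 FPGAs per board stated in this post. For the older sha512crypt design it assumes the same one-CPU-core unit structure. The class and function names are hypothetical and nothing here models the actual HDL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitConfig:
    """Hypothetical description of one design unit (illustrative names)."""
    cpu_cores: int = 1          # soft CPU cores per unit
    smt_ways: int = 6           # hardware threads per soft CPU core
    sha_cores: int = 3          # SHA-2 cores per unit
    inflight_per_sha: int = 2   # hashes each SHA-2 core can hold in flight


def totals(unit: UnitConfig, units: int) -> dict:
    """Multiply per-unit figures out to `units` identical units."""
    sha_cores = unit.sha_cores * units
    return {
        "cpu_cores": unit.cpu_cores * units,
        "threads": unit.cpu_cores * unit.smt_ways * units,
        "sha_cores": sha_cores,
        "inflight": sha_cores * unit.inflight_per_sha,
    }


sha256crypt_unit = UnitConfig()
# Assumed unit structure for the older design: 1 CPU core, 4 SHA-512 cores.
sha512crypt_unit = UnitConfig(smt_ways=16, sha_cores=4, inflight_per_sha=4)

# In both designs, hardware threads per unit match in-flight hash slots.
print(totals(sha256crypt_unit, 1))   # threads 6, inflight 6
print(totals(sha512crypt_unit, 1))   # threads 16, inflight 16

# 25 units per Spartan-6 LX150, 4 FPGAs per ZTEX 1.15y board.
print(totals(sha256crypt_unit, 25))
print(totals(sha256crypt_unit, 25 * 4)["inflight"])  # 600
```

The per-FPGA line reproduces the 25 CPU cores, 150 threads, 75 SHA-256 cores and 150 in-flight hashes quoted below, and in both designs there is one hardware thread per in-flight slot.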
These changes in core parameters are in part due to SHA-256 being smaller and faster (so without the smaller CPU cores the ratio would have been even lower), and in part due to Denis not optimizing this design for maximum theoretical clock rate (per design tools) to the same extent, as that clock rate for sha512crypt and Drupal7 turned out to be unreachable in practice on the ZTEX boards anyway. (As a reminder, for those hashes the toolset-reported clock rate was 225 MHz, while the actual stable rate under full device utilization was up to 160 MHz.)

Three SHA-256 cores, one soft CPU core, and memory and glue logic form a unit. The SHA-256 cores occupy 2/3 of the unit's area, and the soft CPU core occupies 10%. The rest goes primarily to shared SHA-256 context logic such as buffering and padding, which isn't in the cores. 25 units fit in one Spartan-6 LX150 FPGA. This means 25 soft CPU cores, 150 hardware threads, 75 SHA-256 cores, and up to 150 in-flight SHA-256 computations per FPGA, and four times that per board.

Also included are an on-device candidate password generator (for mask mode, including in hybrid modes along with a wordlist coming from the host, etc.) and a hash comparator (capable of up to 512 loaded hashes per salt; there's no limit on total loaded hashes, as that's handled on the host). Denis' design for sha512crypt and Drupal7 has these as well.

Per Xilinx tools, this design was supposed to work at 166 MHz. In our testing on actual boards, the design works reliably for us at 135 MHz on many boards tested, and at 160 MHz on some. The frequency is configurable in john.conf, where we set the default to 135 MHz.

As discussed in the Twitter thread below, sha256crypt's performance is very sensitive to the combination of salt and password lengths. This is also a reason to avoid using sha256crypt defensively: you get major timing leaks of the password length even for realistically small lengths such as 7 vs. 8 or 11 vs. 12 characters, with the exact thresholds varying by salt length.

https://twitter.com/solardiz/status/1031235063181189120

For consistency with Hashcat benchmarks, I chose to use salt length 8 and password length 7, generating a test password hash with:

$ perl -e 'print crypt("pass256", "\$5\$saltsalt"), "\n";' > pw-sha256crypt-1
$ cat pw-sha256crypt-1
$5$saltsalt$ntUtUcOovI4zhuDuXQTtZ4lD7F8GHhVVRI4q1SIfQN3

Here's a test run against one sha256crypt hash on one board (4 FPGAs) at 135 MHz:

$ ./john -form=sha256crypt-ztex -mask='pas?a?a?a?a' pw-sha256crypt-1
[...]
Loaded 1 password hash (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Cost 1 (iteration count) is 5000 for all loaded hashes
Press 'q' or Ctrl-C to abort, almost any other key for status
pass256 (?)
1g 0:00:02:57 DONE (2018-08-27 15:41) 0.005640g/s 112392p/s 112392c/s 112392C/s pass256..pas##u6

Four boards (16 FPGAs), 135 MHz:

$ ./john -form=sha256crypt-ztex -mask='pas?a?a?a?a' pw-sha256crypt-1
[...]
Loaded 1 password hash (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Cost 1 (iteration count) is 5000 for all loaded hashes
Press 'q' or Ctrl-C to abort, almost any other key for status
pass256 (?)
1g 0:00:00:44 DONE (2018-08-27 15:57) 0.02234g/s 445201p/s 445201c/s 445201C/s pass256..pas##u6

Scaling efficiency: 445201/112392/4 = 99.0%.

This is roughly 74% of the speed of one GTX 1080 Ti, which is reported to achieve around 600 kH/s in Jeremi Gosney's Hashcat benchmarks:

https://gist.github.com/epixoip/ace60d09981be09544fdd35005051505

Hashtype: sha256crypt $5$, SHA256 (Unix)
Speed.Dev.#1.....: 599.8 kH/s (75.76ms)
Speed.Dev.#2.....: 593.7 kH/s (76.53ms)
Speed.Dev.#3.....: 593.1 kH/s (76.59ms)
Speed.Dev.#4.....: 590.5 kH/s (76.94ms)
Speed.Dev.#5.....: 596.1 kH/s (76.24ms)
Speed.Dev.#6.....: 596.2 kH/s (76.22ms)
Speed.Dev.#7.....: 603.7 kH/s (75.27ms)
Speed.Dev.#8.....: 601.5 kH/s (75.53ms)
Speed.Dev.#*.....: 4774.6 kH/s

With lucky ZTEX boards doing this at 160 MHz, it'd be ~88% of a 1080 Ti.
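A rough way to see where length thresholds such as 7 vs. 8 come from is to count the 64-byte SHA-256 blocks processed in sha256crypt's 5000-round main loop. The sketch below follows the round structure in Ulrich Drepper's "Unix crypt using SHA-256 and SHA-512" specification. It ignores the setup hashing before the loop and says nothing about how the FPGA or GPU code schedules its work, so treat the block count as a proxy for cost, not a timing model. The function names are hypothetical.

```python
def sha256_blocks(msg_len: int) -> int:
    """Number of 64-byte SHA-256 blocks for a msg_len-byte message,
    counting the padding: one 0x80 byte plus an 8-byte length field."""
    return (msg_len + 9 + 63) // 64


def main_loop_blocks(pw_len: int, salt_len: int, rounds: int = 5000) -> int:
    """SHA-256 blocks processed by the sha256crypt main loop only
    (the setup hashing before the loop is not counted)."""
    total = 0
    for i in range(rounds):
        # Every round hashes the 32-byte previous digest and the
        # password-derived P sequence (pw_len bytes) once each; only their
        # order depends on whether i is odd, which does not affect length.
        length = 32 + pw_len
        if i % 3 != 0:   # salt-derived S sequence (salt_len bytes)
            length += salt_len
        if i % 7 != 0:   # P sequence a second time
            length += pw_len
        total += sha256_blocks(length)
    return total


# Salt length 8, as in the benchmarks in this post.
for pw_len in (7, 8, 11, 12):
    print(f"password length {pw_len:2d}: {main_loop_blocks(pw_len, 8)} blocks")
```

With salt length 8 this gives 5000 blocks for a 7-character password, 7857 for both 8 and 11 characters, and 9285 for 12. A single SHA-256 block holds at most 55 message bytes, so one extra password character pushes some of the round types into a second block. That matches the 7 vs. 8 and 11 vs. 12 thresholds mentioned above.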
Only two of my four boards tested here are lucky enough to run reliably at 160 MHz. All four might pass the one-password test, but from more extensive testing I know that the other two would often miss guesses at that frequency.

One board (4 FPGAs), 160 MHz:

$ ./john -form=sha256crypt-ztex -mask='pas?a?a?a?a' pw-sha256crypt-1
[...]
Loaded 1 password hash (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Cost 1 (iteration count) is 5000 for all loaded hashes
Press 'q' or Ctrl-C to abort, almost any other key for status
pass256 (?)
1g 0:00:02:29 DONE (2018-08-27 15:44) 0.006675g/s 133016p/s 133016c/s 133016C/s pass256..pas##u6

Denis says his board consumes 2.6A at 12V when running this at 160 MHz, which is 31.2W. Comparing this to atom's "This same hash, running a GTX1080 and capped at 90W, is doing 355kH/s" (referring to a different hash with the same salt length, so this should be a valid comparison), we get 383 kH/s per 90W for the FPGAs, which is slightly more energy-efficient than the power-capped GPU's 355 kH/s.

Now to some multi-hash runs for reliability testing:

$ perl -e 'for ($i = 100; $i < 612; $i++) { print crypt("pass$i", "\$5\$saltsalt"), "\n"; }' > pw-sha256crypt

One board (4 FPGAs), 160 MHz:

$ ./john -form=sha256crypt-ztex -mask='pas?a?a?a?a' -dev=04A3465XXX -verb=1 pw-sha256crypt
[...]
Loaded 512 password hashes with no different salts (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:05 0.93% (ETA: 15:54:43) 0g/s 130932p/s 130932c/s 67037KC/s pasaa"a..pasat"a
52g 0:00:00:23 3.86% (ETA: 15:55:41) 2.195g/s 132574p/s 132574c/s 65162KC/s pasaaVi..pasatVi
206g 0:00:01:33 15.29% (ETA: 15:55:53) 2.197g/s 132848p/s 132848c/s 57501KC/s pasaaYc..pasatYc
461g 0:00:02:38 25.93% (ETA: 15:55:55) 2.901g/s 132895p/s 132895c/s 42594KC/s pasaa*b..pasat*b
512g 0:00:02:43 DONE (2018-08-27 15:48) 3.124g/s 132846p/s 132846c/s 41449KC/s pass577..pas##E7

Note that the p/s and c/s rates are almost the same as we had for one hash (just slightly slower: 133.0k vs. 132.8k), but the C/s rate (comparisons per second) is much higher due to the matching salts (in fact, there's only one salt for all hashes).

Two boards (8 FPGAs), 160 MHz:

$ ./john -form=sha256crypt-ztex -mask='pas?a?a?a?a' -dev=04A3465XXX,04A3466XXX -verb=1 pw-sha256crypt
[...]
Loaded 512 password hashes with no different salts (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
52g 0:00:00:27 9.04% (ETA: 15:54:11) 1.868g/s 264620p/s 264620c/s 125368KC/s pasaaKs..pasa6Ks
206g 0:00:00:47 15.42% (ETA: 15:54:16) 4.345g/s 264982p/s 264982c/s 114756KC/s pasaa@...pasa6@c
410g 0:00:01:07 22.07% (ETA: 15:54:16) 6.037g/s 264729p/s 264729c/s 96949KC/s pasaa{4..pasa6{4
512g 0:00:01:22 DONE (2018-08-27 15:50) 6.194g/s 264657p/s 264657c/s 82757KC/s pass177..pas##D7

Four boards (16 FPGAs), 160 MHz:

$ ./john -form=sha256crypt-ztex -mask='pas?a?a?a?a' -verb=1 pw-sha256crypt
[...]
Loaded 512 password hashes with no different salts (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:03 2.13% (ETA: 15:53:17) 0g/s 514183p/s 514183c/s 263262KC/s pasaa11..pasa.11
50g 0:00:00:08 5.32% (ETA: 15:53:26) 6.024g/s 521927p/s 521927c/s 254178KC/s pas32%o..pasa.nn
168g 0:00:00:21 13.83% (ETA: 15:53:27) 7.835g/s 525335p/s 525335c/s 239972KC/s pass223..pasa.33
503g 0:00:00:46 30.32% (ETA: 15:53:28) 10.72g/s 526490p/s 526490c/s 151516KC/s pasaaQp..pasa.Qp
[...]
503g 0:00:02:34 DONE (2018-08-27 15:53) 3.252g/s 526652p/s 526652c/s 48574KC/s pasaa||..pas||}|

Oops, as I said, the other two boards don't manage this frequency: only 503 out of 512 passwords got cracked. (The longer runtime and lower C/s rate are explained by this run having done more work: it continued to test other candidate passwords against the remaining 9 hashes past the point where the previous two runs had stopped upon cracking all passwords.)

Four boards (16 FPGAs), 135 MHz:

$ ./john -form=sha256crypt-ztex -mask='pas?a?a?a?a' -verb=1 pw-sha256crypt
[...]
Loaded 512 password hashes with no different salts (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:03 1.60% (ETA: 15:57:39) 0g/s 428910p/s 428910c/s 219602KC/s pasaaBe..pasa.Be
115g 0:00:00:19 10.64% (ETA: 15:57:30) 5.888g/s 443625p/s 443625c/s 208548KC/s pass302..pasa.22
206g 0:00:00:27 14.89% (ETA: 15:57:32) 7.545g/s 444307p/s 444307c/s 196239KC/s pasaacc..pasa.cc
410g 0:00:00:39 21.81% (ETA: 15:57:30) 10.26g/s 444808p/s 444808c/s 167448KC/s pass374..pasa./4
512g 0:00:00:49 DONE (2018-08-27 15:55) 10.29g/s 444352p/s 444352c/s 139720KC/s pass477..pas##\7
Session completed

This worked OK.

Now to some wordlist mode runs, using the RockYou top 1 million passwords sorted by decreasing number of occurrences. (More precisely, 1136144 passwords, to have a consistent cut-off number of occurrences.)
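Before the wordlist runs, a note on the C/s (comparisons per second) figures in the multi-hash mask runs above: they can be roughly reproduced with simple arithmetic. The sketch below assumes that C/s is about c/s times the number of not-yet-cracked hashes sharing the salt. This model is inferred from the numbers in this post, not taken from JtR's source. It checks the model against the first status line of each 512-hash, single-salt mask run, taken while nothing had been cracked yet. Later status lines are averages over the whole session while the uncracked count shrinks, so they don't fit this simple product. The function name is hypothetical.

```python
def comparisons_per_sec(crypts_per_sec: float, remaining_hashes: int) -> float:
    """Assumed model: each computed hash is compared against every
    not-yet-cracked loaded hash that shares its salt."""
    return crypts_per_sec * remaining_hashes


# (c/s, reported KC/s) from the first status line of each 512-hash,
# single-salt mask run above, taken while nothing was cracked yet.
observed = [
    (130932, 67037),   # one board, 160 MHz
    (514183, 263262),  # four boards, 160 MHz
    (428910, 219602),  # four boards, 135 MHz
]

for cps, reported_kcps in observed:
    model_kcps = comparisons_per_sec(cps, 512) / 1000
    print(f"{cps} c/s x 512 hashes = {model_kcps:.0f}K C/s (reported {reported_kcps}K)")
```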
Two boards (8 FPGAs), 160 MHz:

$ ./john -form=sha256crypt-ztex -w=rtop1m -dev=04A3465XXX,04A3466XXX -verb=1 pw-sha256crypt
[...]
Loaded 512 password hashes with no different salts (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Note: This format may be a lot faster with --mask acceleration (see doc/MASK).
Press 'q' or Ctrl-C to abort, almost any other key for status
11g 0:00:00:05 DONE (2018-08-27 16:26) 1.867g/s 192888p/s 192888c/s 97829KC/s br0926.. nam

That was too quick, so let's add a digit:

$ ./john -form=sha256crypt-ztex -w=rtop1m -mask='?w?d' -dev=04A3465XXX,04A3466XXX -verb=1 pw-sha256crypt
[...]
Loaded 512 password hashes with no different salts (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:01 4.16% (ETA: 16:28:03) 0g/s 180431p/s 180431c/s 92381KC/s scofield1..1403927
20g 0:00:00:03 8.41% (ETA: 16:28:14) 5.089g/s 191450p/s 191450c/s 97384KC/s pass116..tiamarie7
60g 0:00:00:06 12.70% (ETA: 16:28:26) 9.118g/s 190577p/s 190577c/s 95288KC/s pass146..070319957
70g 0:00:00:09 17.02% (ETA: 16:28:31) 7.567g/s 189794p/s 189794c/s 92294KC/s pass154..0116007
80g 0:00:00:10 19.21% (ETA: 16:28:31) 7.554g/s 189461p/s 189461c/s 91083KC/s pass217..danny937
100g 0:00:00:13 23.57% (ETA: 16:28:34) 7.513g/s 188429p/s 188429c/s 88561KC/s pass107..240425257
150g 0:00:00:18 32.32% (ETA: 16:28:34) 7.991g/s 187064p/s 187064c/s 83885KC/s pass167..hanneman7
180g 0:00:00:27 45.63% (ETA: 16:28:38) 6.649g/s 185297p/s 185297c/s 78288KC/s pass267..im1ru127
210g 0:00:00:35 59.02% (ETA: 16:28:38) 5.920g/s 183839p/s 183839c/s 73620KC/s hickling1..gunpowder17
280g 0:00:01:02 DONE (2018-08-27 16:28) 4.480g/s 181778p/s 181778c/s 61995KC/s 060850#..-----

(As in some other runs, I pressed a key a few times during this one to see the status. "pass" is seen so often at the start of a range due to a peculiarity of JtR's internal "formats" interface: when successful guesses are found in the range, the same interface returns them.)

Note that the c/s rate is much lower than it was for the 7-character mask runs (264657 c/s then, 181778 c/s now). That's primarily because of sha256crypt's sensitivity to (candidate) password lengths. Our wordlist contains many lines longer than 7 characters, and they're not sorted by length. Let's try sorting them by increasing length:

$ awk '{ print length, $0 }' < rtop1m | sort -n | cut -d' ' -f2- > rtop1m-by-length
$ ./john -form=sha256crypt-ztex -w=rtop1m-by-length -mask='?w?d' -dev=04A3465XXX,04A3466XXX -verb=1 pw-sha256crypt
[...]
Loaded 512 password hashes with no different salts (sha256crypt-ztex, crypt(3) $5$ [sha256crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:03 8.40% (ETA: 16:34:54) 0g/s 254619p/s 254619c/s 130365KC/s 1618251..1783107
0g 0:00:00:05 12.01% (ETA: 16:35:00) 0g/s 257671p/s 257671c/s 131927KC/s 6805021..7533577
280g 0:00:00:17 32.63% (ETA: 16:35:11) 16.14g/s 245882p/s 245882c/s 109692KC/s ashes011..baby6217
280g 0:00:00:24 42.94% (ETA: 16:35:14) 11.32g/s 223203p/s 223203c/s 88713KC/s pencere1..pipe1237
280g 0:00:00:29 49.62% (ETA: 16:35:16) 9.605g/s 215094p/s 215094c/s 81219KC/s 181019611..198602257
280g 0:00:00:35 58.90% (ETA: 16:35:17) 7.984g/s 207390p/s 207390c/s 74145KC/s funy65411..giants247
280g 0:00:00:51 86.55% (ETA: 16:35:17) 5.455g/s 195479p/s 195479c/s 63139KC/s allahlove11..ashleybabe7
280g 0:00:01:00 DONE (2018-08-27 16:35) 4.662g/s 189163p/s 189163c/s 59085KC/s andresydaniela#..-----

This is now slightly faster: up from 182k to 189k c/s overall, and from 191k to 254k c/s early on (while on smaller password lengths, as seen in the sorted run). In this case, the final guess count is also reached in at most half the time.

BTW, such sorting by length is also relevant on GPU (and also for md5crypt).
On the original HD 7970 (925 MHz), without sorting:

$ ./john -form=sha256crypt-opencl -w=rtop1m -mask='?w?d' -verb=1 pw-sha256crypt
Using default input encoding: UTF-8
Loaded 512 password hashes with no different salts (sha256crypt-opencl, crypt(3) $5$ [SHA256 OpenCL])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:04 4.36% (ETA: 16:41:54) 0g/s 64250p/s 64250c/s 32896KC/s 1228859..godsgrace4
10g 0:00:00:06 6.57% (ETA: 16:41:54) 1.647g/s 86373p/s 86373c/s 44223KC/s godsgrace6..leontina0
30g 0:00:00:11 11.03% (ETA: 16:42:02) 2.710g/s 94722p/s 94722c/s 47787KC/s royden5..koreans7
60g 0:00:00:14 15.54% (ETA: 16:41:53) 4.261g/s 111709p/s 111709c/s 54960KC/s morena239..lavie4
80g 0:00:00:17 17.80% (ETA: 16:41:58) 4.686g/s 107499p/s 107499c/s 52275KC/s lavie6..0905760
100g 0:00:00:23 24.62% (ETA: 16:41:56) 4.332g/s 113580p/s 113580c/s 53269KC/s rebecca261..yngrid3
150g 0:00:00:30 31.56% (ETA: 16:41:58) 4.986g/s 113293p/s 113293c/s 51295KC/s nigga192..ambermay8
180g 0:00:00:44 45.42% (ETA: 16:41:59) 4.080g/s 112916p/s 112916c/s 47888KC/s karen4565..chachi27
210g 0:00:01:07 66.18% (ETA: 16:42:04) 3.131g/s 109438p/s 109438c/s 42704KC/s 178417842..velvet98
280g 0:00:01:16 75.82% (ETA: 16:42:03) 3.680g/s 110260p/s 110260c/s 41809KC/s nodarse6..marklt0
280g 0:00:01:42 DONE (2018-08-27 16:42) 2.727g/s 110683p/s 110683c/s 37708KC/s 0743382.. nam7

With sorting:

$ ./john -form=sha256crypt-opencl -w=rtop1m-by-length -mask='?w?d' -verb=1 pw-sha256crypt
Using default input encoding: UTF-8
Loaded 512 password hashes with no different salts (sha256crypt-opencl, crypt(3) $5$ [SHA256 OpenCL])
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:04 5.03% (ETA: 16:38:42) 0g/s 129134p/s 129134c/s 66117KC/s tk1236..0923250
0g 0:00:00:07 8.81% (ETA: 16:38:42) 0g/s 148313p/s 148313c/s 75936KC/s 1901025..2908127
0g 0:00:00:20 23.90% (ETA: 16:38:46) 0g/s 156270p/s 156270c/s 80010KC/s niki886..sammy10
280g 0:00:00:35 36.46% (ETA: 16:38:58) 7.986g/s 134586p/s 134586c/s 58440KC/s felix132..jhennel8
280g 0:00:01:40 DONE (2018-08-27 16:39) 2.791g/s 113266p/s 113266c/s 35791KC/s sporting4ever2..????????7

This is also a slight speedup overall (113266 vs. 110683 c/s), and a larger increase in the c/s rate early on (while on smaller lengths).

We'd appreciate more testing, perhaps on Royce's larger cluster of ZTEX boards. Please post your results as follow-ups to this message.

Alexander