Message-ID: <20200812172333.GA9198@openwall.com>
Date: Wed, 12 Aug 2020 19:23:33 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: sha512crypt-opencl / Self test failed (cmp_all(1))

To include anything on-topic (sha512crypt-opencl) in this thread again:

Until recently, sha512crypt-opencl and sha256crypt-opencl didn't use
optimal internal settings for NVIDIA Volta and NVIDIA Turing cards.
Claudio has fixed this in very recent commits, first by recognizing Volta
specially and then (on my advice) by simply treating all newer and future
NVIDIA GPUs the same as the last NVIDIA GPU family we tuned for.

With this, latest jumbo's sha512crypt-opencl delivers speeds on NVIDIA
Tesla V100 and "GeForce RTX 2070 with Max-Q Design" (a laptop GPU) that
are on par with hashcat's. (Just whatever I happen to have results for.
We still don't have an NVIDIA RTX 20xx GPU in a JtR dev box.)

My own test on AWS p3.2xlarge, V100 16GB, "Driver Version: 418.87.00
CUDA Version: 10.1", before the commits mentioned above:

Device 1: Tesla V100-SXM2-16GB
[...]
LWS=32 GWS=20480 (640 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    266240 c/s real, 264915 c/s virtual

After:

Device 1: Tesla V100-SXM2-16GB
Benchmarking: sha512crypt-opencl, crypt(3) $6$ (rounds=5000) [SHA512 OpenCL]... LWS=256 GWS=2621440 (10240 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    393019 c/s real, 392725 c/s virtual

With these changes, the auto-tuning results in a very large GWS, which
may sometimes be inconvenient (2621440/393019 = ~7 seconds per salt, so
a long time to advance to the next batch of candidates when there are
many salts). However, forcing a lower GWS nevertheless delivers
reasonable speed (just moderately lower):

$ ./john -test -form=sha512crypt-opencl -gws=20480
Device 1: Tesla V100-SXM2-16GB
Benchmarking: sha512crypt-opencl, crypt(3) $6$ (rounds=5000) [SHA512 OpenCL]... LWS=32 GWS=20480 (640 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    368640 c/s real, 368640 c/s virtual

For comparison, on the same AWS instance with current hashcat, CUDA API:

$ ./hashcat -b -O -w4 -m1800
hashcat (v6.1.1-20-gdc9a2468) starting in benchmark mode...
[...]
CUDA API (CUDA 10.1)
====================
* Device #1: Tesla V100-SXM2-16GB, 15814/16130 MB, 80MCU
[...]
Hashmode: 1800 - sha512crypt $6$, SHA512 (Unix) (Iterations: 5000)

Speed.#1.........:   398.6 kH/s (326.13ms) @ Accel:8 Loops:1024 Thr:1024 Vec:1

OpenCL API:

$ ./hashcat -b -O -w4 -m1800 -d2
hashcat (v6.1.1-20-gdc9a2468) starting in benchmark mode...
[...]
OpenCL API (OpenCL 1.2 CUDA 10.1.236) - Platform #1 [NVIDIA Corporation]
========================================================================
* Device #2: Tesla V100-SXM2-16GB, 15744/16130 MB (4032 MB allocatable), 80MCU
[...]
Hashmode: 1800 - sha512crypt $6$, SHA512 (Unix) (Iterations: 5000)

Speed.#2.........:   382.5 kH/s (340.11ms) @ Accel:8 Loops:1024 Thr:1024 Vec:1

So best hashcat is 398k+ (CUDA) and best John is 393k (OpenCL). The
"-w4" option made a difference - speeds were lower with "-w3".

"GeForce RTX 2070 with Max-Q Design" in a Windows laptop, latest build
of JtR for Windows from:

https://github.com/openwall/john-packages/releases

Benchmarking: sha512crypt-opencl, crypt(3) $6$ (rounds=5000) [SHA512 OpenCL]... LWS=32 GWS=147456 (4608 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    156576 c/s real, 156618 c/s virtual

BTW, credit for making these builds also goes to Claudio. Thanks!

hashcat:

Hashmode: 1800 - sha512crypt $6$, SHA512 (Unix) (Iterations: 5000)

Speed.#1.........:   151.7 kH/s (385.92ms) @ Accel:8 Loops:1024 Thr:1024 Vec:1

(I'm not sure which exact version and command line were used, but I had
asked for "hashcat -b -O -w4 -m1800" to be used for this test.)
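The "seconds per salt" figure is just batch size divided by hash rate; a
minimal sketch of that arithmetic, using the GWS and c/s numbers from the
benchmark output above:

```python
def seconds_per_salt(gws, cps):
    """Time to run one batch of GWS candidates against one salt,
    given a real hash rate of cps hashes per second."""
    return gws / cps

# Auto-tuned GWS on the V100: long turnaround between candidate batches
print(round(seconds_per_salt(2621440, 393019), 1))  # 6.7

# Forced lower GWS: much quicker turnaround at a moderate speed cost
print(round(seconds_per_salt(20480, 368640), 3))    # 0.056
```

With many salts loaded, the batch time is paid per salt, which is why a
smaller forced GWS can be the more convenient trade-off despite the lower
raw c/s.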
Now to the ZTEX stuff, which is on-topic for john-users, but not so much
for this thread:

On Wed, Aug 12, 2020 at 07:41:07AM -0800, Royce Williams wrote:
> On Thu, Aug 6, 2020 at 5:21 AM Solar Designer <solar@...nwall.com> wrote:
> > On Wed, Aug 05, 2020 at 09:45:26PM -0800, Royce Williams wrote:
> > > When this happened to me, I dropped the speed on the specific boards by
> > > 10MHz or so until it stopped,
> >
> > When errors are infrequent, it's generally more efficient to just let
> > them happen once in a while, giving a higher average c/s rate than you'd
> > have at a lower clock rate.
>
> Indeed. There's definitely a sweet spot there. I'm sure that the various
> Bitcoin forums from ZTEX have similar wisdom.

It's actually quite different for password cracking vs. cryptocurrency
mining. It's also different for password security audits that need to be
reliable vs. those that are opportunistic (or are contests, indeed).

For mining, all that matters is maximizing the effective hashrate (for
shares accepted by a pool). Occasional errors (both false negatives aka
missed winning nonces and false positives aka nonces that don't actually
produce a valid share) are OK as long as the average effective hashrate
is higher.

For password cracking, one has to decide whether and what error rate to
tolerate. This is almost exclusively about false negatives (missing some
otherwise crackable passwords), not about false positives (reporting a
cracked password when there isn't one). False positives are almost
impossible because we're checking for an exact match (not for the
computed hash being below a target, which is the case for mining).

Another aspect is what kind of errors we're getting. Detected errors are
not the worst - we waste time, but we repeat the computation. Undetected
errors are worse. Seeing (or not seeing) occasional detected errors
doesn't tell you whether there are also undetected errors, although
there might be a correlation.
We detect errors in communication (we use checksums) and some kinds of
errors in computation (if a computation result isn't reported in time,
something might have gone wrong with a state machine or control flow on
a soft CPU). We do not detect most other potential errors in computation
(we can't do that without duplicating the work or using some kind of
slower error-detecting computation primitives).

So if the errors you're getting look like they're related to stress on
the USB subsystem, it may be OK to ignore them and to optimize for the
highest average c/s rate despite the errors. And regardless of whether
you're getting any detected errors, it may be a good idea to test that
you're getting passwords cracked sufficiently reliably for your use case
(e.g., 99%+ for a contest and 100% for a policy audit).

Yet another aspect is that Bitcoin in particular involves just two
computations of SHA-256. An error rate of, say, 1% per one SHA-256
computation would result in around a 2% error rate for the whole thing,
which is acceptable and likely profitable (compared to trying to avoid
errors by using a lower clock rate). However, the password hashes we
have implemented on ZTEX are all of the "slow" kind - they use large
numbers of iterations of a primitive. For example, sha512crypt hashes
use 5000 iterations by default. A 1% error rate per one SHA-512
computation would turn into an almost 100% error rate for the whole
construction. This means that a much more conservative clock rate is
needed to keep the error rate acceptable for any kind of password
cracking (even for a contest).

When the computation error rate is non-zero, it will also tend to
increase for higher "cost" settings of the variable-cost hashes (bcrypt,
sha512crypt, sha256crypt, Drupal7, phpass). For example, in GitHub issue
#3851 Aleksey reports a 0.1% error rate for bcrypt cost 10 on one of his
ZTEX boards at 150 MHz, but no errors for bcrypt cost 5 nor at 148 MHz
or below.
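The compounding effect described above is easy to put numbers on. A quick
sketch, assuming independent errors per primitive computation (the 1%
per-computation rate is the illustrative figure from the text, not a
measurement):

```python
def whole_hash_error_rate(p, n):
    """Probability that at least one of n independent primitive
    computations fails, given a per-computation error rate p."""
    return 1 - (1 - p) ** n

# Bitcoin: two chained SHA-256 computations, so a 1% per-computation
# error rate compounds to roughly 2% overall - tolerable for mining.
print(whole_hash_error_rate(0.01, 2))     # ~0.0199

# sha512crypt at the default 5000 rounds: the same 1% per-computation
# rate makes failure of the whole hash a near-certainty.
print(whole_hash_error_rate(0.01, 5000))  # ~1.0
```

This is why a clock rate that is "good enough" for mining can be useless
for slow password hashes: the per-primitive error rate has to be orders
of magnitude lower before the whole construction becomes reliable.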
I guess "no errors" for cost 5 at 150 MHz merely meant the error rate
was 32x lower, as expected (the difference in iteration count between
bcrypt costs 5 and 10), which apparently put it below the detection
threshold of Aleksey's tests.

> > Please remember that there's generally no point in adjusting frequencies
> > per board (except for testing) if you use all of your boards as one big
> > cluster. John is currently only able to use the boards synchronously,
> > so the slowest board will determine the cluster's overall performance.
> >
> > This changes when you use "--fork" or "--devices", but in particular
> > with "--fork" it'd probably be inconvenient for you to have some forked
> > processes terminate much sooner than others. So the per-board frequency
> > adjustment is generally only useful when you run per-board-set attacks,
> > explicitly targeting attacks to same-frequency lists of "--devices". Of
> > course, you'd also use "--session" to launch multiple attacks from the
> > same "run" directory.
>
> Ah, yes - I'd been tuning relative to "--fork" ... once I re-discovered it.
> :)

Right, and please note that usage of "--fork" with ZTEX puts much more
stress on the USB subsystem, so more frequent communication errors are
expected (thus, errors of the kind you can ignore as long as they're
infrequent enough that the average speed improves).

> In my experience, for contests and similar time-critical scenarios, having
> some boards finish sooner also means (theoretically) getting some *results*
> sooner - which may be worth the inconvenience.

Right. BTW, sha512crypt-ztex might be more suitable than
sha512crypt-opencl for quick experiments in a contest because it becomes
reasonably efficient at lower candidate password counts. For
sha512crypt-ztex, you currently need at least 768 candidates per board
(preferably an exact multiple of that, but exactly 768 is OK), which is
much less than the GWS figures seen above.

Alexander
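The 32x figure follows from bcrypt performing 2^cost iterations of its
core loop, so (for small rates) the error rate scales roughly linearly
with the iteration count. A sketch of that estimate, using the 0.1%
measurement from the issue cited above:

```python
def bcrypt_iteration_ratio(cost_a, cost_b):
    """bcrypt runs 2**cost iterations of its core loop, so this is the
    ratio of work (and, approximately, of small error rates)."""
    return 2 ** cost_a / 2 ** cost_b

ratio = bcrypt_iteration_ratio(10, 5)
print(ratio)  # 32.0

# Aleksey's ~0.1% error rate at cost 10 would predict roughly
# 0.1% / 32, i.e. about 0.003%, at cost 5 - plausibly below what his
# tests could detect.
print(0.001 / ratio)
```

The linear scaling is only an approximation that holds while the
per-iteration error rate is small; once the compounded rate approaches 1
(as in the 5000-iteration example earlier), it saturates instead.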