Message-ID: <20171112181913.GA17100@openwall.com>
Date: Sun, 12 Nov 2017 19:19:13 +0100
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: max_keys_per_crypt tuning

Hi,

As recently discussed on these GitHub issues:

Poor OpenMP scalability of MultiBit format
https://github.com/magnumripper/JohnTheRipper/issues/2846

Auto-tune instead of OMP_SCALE
https://github.com/magnumripper/JohnTheRipper/issues/2858

our current default/initial MAX_KEYS_PER_CRYPT for non-OpenMP builds,
its value for the case of running 1 thread with OpenMP builds, and
finally OMP_SCALE for running multiple threads are very wrong for many
of JtR jumbo formats.

While ideally we'd rework JtR OpenMP support entirely, first reworking
the JtR formats interface, this isn't going to happen, at least not in
time for the next jumbo release (and would be too invasive a change),
and besides, as I point out above, some of this is wrong even without
OpenMP.

Here's what I think our current approach should be:

The default/initial MAX_KEYS_PER_CRYPT used for non-OpenMP builds and
when running 1 thread with OpenMP builds should be no less than
CRK_PREFETCH, which is currently 64 in our typical builds, for fast
hash formats where having many hashes (per salt, where applicable) is
a possibility, because otherwise cracker.c's prefetching of bitmap
cache lines can't work well.  And that prefetching is sometimes
important: per my recent testing on an i7-4770K running the JtR core
tree against 20M LM hashes, disabling prefetching has a 2.5x
performance impact.

Another reason to use a value significantly larger than 1 even in
those builds is to amortize function call overhead, etc.  Even before
OpenMP support and prefetching were introduced, we commonly ended up
tuning a format's MAX_KEYS_PER_CRYPT to 96 or 128 or so.  We probably
need to perform such tuning again, but also ensure the resulting value
is no less than 64 unless the hash type is slow, in which case bitmap
prefetching would be relatively insignificant anyway.

For slow hashes like scrypt, it is OK to have a MAX_KEYS_PER_CRYPT of
1, or whatever number of hashes our implementation computes
simultaneously (e.g., 2 or 3 per thread for our bcrypt code).  For
those, setting MAX_KEYS_PER_CRYPT as low as our code allows may even
be preferable, as it improves interactivity and lets JtR move on to
further candidate passwords sooner when running against many salts
(desirable when the candidate passwords are in an optimized order).

Currently, many formats only support max_keys_per_crypt > 1 when
running with OpenMP, using code like:

static int crypt_all(int *pcount, struct db_salt *salt)
{
	const int count = *pcount;
	int index = 0;

#ifdef _OPENMP
#pragma omp parallel for
	for (index = 0; index < count; index++)
#endif
	{

(example taken from dahua_fmt_plug.c).  This is wrong, as it doesn't
let us set the non-OpenMP max_keys_per_crypt higher than 1.  Loops
like this should either be compiled unconditionally, or logic like
this may be used:

#if defined(_OPENMP) || MAX_KEYS_PER_CRYPT > 1
	for (index = 0; index < count; index++)
#endif

(example taken from cq_fmt_plug.c).  I think unconditional is
preferred: either the format is fast enough that it should always have
max_keys_per_crypt > 1, or it's slow enough that the loop overhead
doesn't matter.
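For concreteness, the preferred unconditional form could look roughly
like this.  This is a minimal sketch loosely modeled on the dahua
example above, not the actual dahua_fmt_plug.c code; the loop body and
the typical crypt_out[]/saved_key[] arrays are stand-ins:

static int crypt_all(int *pcount, struct db_salt *salt)
{
	const int count = *pcount;
	int index;

#ifdef _OPENMP
#pragma omp parallel for
#endif
	for (index = 0; index < count; index++) {
		/* compute crypt_out[index] from saved_key[index] here,
		   using whatever the format's actual hashing code is */
	}

	return count;
}

Only the pragma stays conditional, so a non-OpenMP build still
processes all max_keys_per_crypt candidates per crypt_all() call.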
Then, OMP_SCALE - the additional scaling factor to compensate for
OpenMP thread desynchronization (to reduce the relative wait times for
getting the threads back in sync) - should only be applied when
actually running more than 1 thread.  Currently we usually apply it in
the 1-thread case as well, which I now think is wrong.  We just got
this fixed in multibit_fmt_plug.c with commit 00655ad63366.  We need
changes like this:

-	self->params.min_keys_per_crypt *= omp_t;
-	omp_t *= OMP_SCALE;
-	self->params.max_keys_per_crypt *= omp_t;
+	if (omp_t > 1) {
+		self->params.min_keys_per_crypt *= omp_t;
+		omp_t *= OMP_SCALE;
+		self->params.max_keys_per_crypt *= omp_t;
+	}

We might also want to move common code like this into a function
(maybe in formats.c) that we'd call from formats' init(), instead of
keeping it duplicated across formats like we do now.
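As a rough illustration, such a shared helper might look like this.
It's only a sketch: the function name omp_autotune_keys() and its
placement in formats.c are made up here, not existing jumbo API,
though struct fmt_main and its params fields are real:

#ifdef _OPENMP
#include <omp.h>
#endif

#include "formats.h"

/*
 * Hypothetical shared helper: scale a format's key counts by the
 * actual thread count, applying the format's OMP_SCALE factor only
 * when more than one thread will actually run.
 */
static void omp_autotune_keys(struct fmt_main *self, int omp_scale)
{
	int omp_t = 1;

#ifdef _OPENMP
	omp_t = omp_get_max_threads();
#endif
	if (omp_t > 1) {
		self->params.min_keys_per_crypt *= omp_t;
		omp_t *= omp_scale;
		self->params.max_keys_per_crypt *= omp_t;
	}
}

A format's init() would then reduce to a single
omp_autotune_keys(self, OMP_SCALE) call instead of the duplicated
scaling code.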
After correcting the above, we need to (re-)adjust the split between
MAX_KEYS_PER_CRYPT and OMP_SCALE.  For example, right now a format
might have MAX_KEYS_PER_CRYPT 1 (which also kills prefetching) and
OMP_SCALE 128, whereas MAX_KEYS_PER_CRYPT 64 and OMP_SCALE 2 might be
more optimal (that was the case per my testing of
multibit_fmt_plug.c).  We'd then have max_keys_per_crypt = 64 for
non-OpenMP builds and for a single thread, but 256 for 2 threads, 512
for 4 threads, etc.

Ideally, such (re-)adjustments require running benchmarks for
non-OpenMP builds, for 1 thread, and for multiple threads - and
ideally not only "--test", but also actual cracking with varying
hash-per-salt counts.  That's a lot of work, so we'd want to automate
it, even if as a result we end up merely adjusting the hard-coded
values.

For now, though, things are so bad that even simply setting each fast
format's values to MAX_KEYS_PER_CRYPT 64 and OMP_SCALE 2 (along with
making the code changes proposed above) would likely be an
improvement.  We could then relbench jumbo as a whole before and after
this mass change, separately for non-OpenMP and OpenMP builds and for
varying thread counts (1, 2, 4, and on up to 32 or whatever the number
of logical CPUs on the test system is).  relbench should detect any
regressions, which we'd then deal with manually.

Any takers?

Alexander