Message-ID: <20210515121001.GA23510@openwall.com>
Date: Sat, 15 May 2021 14:10:02 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Cracking stats: p/s, c/s and C/s. Hashing cost factors.

David,

On Fri, May 14, 2021 at 09:00:51PM -0700, David Sontheimer wrote:
> I should add the high-level goal:
>
> I'd like to test my password generation heuristics as you are testing
> your .chr files' candidate generation - using the same number of
> max-candidates across trials, and recording the percentage of passwords
> cracked in each trial. This is preferable to comparing heuristics by total
> time to crack all passwords.

OK.

> Further, I'd like to calculate the average number of candidates required to
> successfully guess a hash.

I assume it'd be the average only over those hashes that are successfully
guessed in your experiments.  This might not be a very useful metric,
because it would grow the longer you let your experiments run.  Longer
runs would produce a larger total number and percentage of guesses, but
since the additional guesses would be the harder ones, the average would
grow as well.  In fact, if you were to include the unreached guesses too -
since all hashes would theoretically be cracked in infinite time - then
your average is kind of half of infinity, or just nonsense.

If you do reach 50% guessed in your experiments, then that's your median
(not average) number of candidates to guess a password.  I think that's a
more reasonable metric.  If you don't reach 50%, then you just call the
metric e.g. "number of candidates to successfully guess 10% of passwords
in the test set", etc.

Average is a useful metric when the total keyspace is within reach.
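In Python, a minimal sketch of the percentile-style metric suggested above
(the function name and the made-up guess counts are mine, purely for
illustration - this is not anything JtR computes or outputs):

```python
# Sketch: given, for each cracked password, the number of candidates
# tested before it was guessed, report "candidates to guess X% of the
# test set" instead of a (misleading, ever-growing) average.
def candidates_to_guess_fraction(guess_counts, total_hashes, fraction):
    """Return the candidate count at which `fraction` of all hashes
    (cracked or not) had been guessed, or None if never reached."""
    needed = int(total_hashes * fraction)
    ranked = sorted(guess_counts)
    if needed == 0 or len(ranked) < needed:
        return None  # the experiment never cracked that many hashes
    return ranked[needed - 1]

# Made-up example: 10 hashes total, 6 cracked at these candidate counts.
counts = [120, 5, 900, 43, 7, 30000]
print(candidates_to_guess_fraction(counts, 10, 0.10))  # -> 5
print(candidates_to_guess_fraction(counts, 10, 0.50))  # -> 900 (median)
print(candidates_to_guess_fraction(counts, 10, 0.80))  # -> None
```

Note that the 80% threshold correctly comes back as "never reached" rather
than skewing an average with the uncracked hashes.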
For example, you can compare how much more efficient
"--incremental=Digits --length=6" is than "--mask='?d' --length 6", and
how much more efficient the latter possibly is than "--mask='[0-9]'
--length 6" (yes, there's a difference between the latter two),
including in terms of average candidates to achieve a successful guess
within this well-defined subset of 1 million possible passwords.

> A binary search wrapper was discussed, but
> StatusShowCandidates appears to be perfect (and more efficient).
>
> Using StatusShowCandidates = Y shows the candidate count of a successful
> candidate

This is a feature I wasn't involved in implementing and have never used.
However, my understanding is that yes, it's suitable for your needs.

> for a single fork, not all candidates across all forks, correct?

Correct.

> If so, then the best use of my hardware appears to be running each trial on
> a single fork,

I think so, given that your use case is research of candidate password
generators, not maximizing the number of passwords cracked per unit of
real time.

> and running many trials in parallel.

Either that, or you can use OpenMP.  With OpenMP, you can adjust
OMP_NUM_THREADS to your liking (such as to maximize real-time speeds),
which won't affect your research.

> Unique, session-named
> log files will be required for exact candidate counts (as opposed to the
> multiple of batch size printed to stdout) and to differentiate the results
> between trials.

Sure.  You have that by default.

> On Fri, May 14, 2021 at 8:46 PM David Sontheimer
> <david.sontheimer@...il.com> wrote:
>
> >> You might find slides 7 to 10 in this presentation helpful:
> >>
> >> https://www.openwall.com/presentations/BSidesLjubljana2017-Yescrypt-Large-scale-Password-Hashing/
>
> > This is helpful - thank you.
>
> > When parallelized, would optimal be that the set s0 for fork 0 contains
> > the most-likely candidate, the set s1 for fork 1 contains the second
> > most-likely candidate, ..., set sN for fork N contains the Nth most-likely
> > candidate?

I'm not sure what sets you refer to - the entire sets that the forked
processes will ever test, or the smaller sets being tested at a time.
Regardless, no: especially in the former case, what you describe would be
very far from optimal.

Rather, with "--fork" we generally use a round-robin distribution of
candidates, or of groups of candidates, taken off a (virtual) list
ordered from most likely to least likely.  So e.g. with two processes,
process 1 will take candidates/groups 1, 3, 5, 7, ..., and process 2
will take 2, 4, 6, 8, ...  This way, the two most likely candidates or
groups (1 and 2 in this example) would be tested first (concurrently),
and the less likely candidates would be tested later.  (Of course, this
is a simplified example.  In reality, each process' candidate stream
will also be buffered and tested in batches.)

Here, by "groups" I refer to a cracking mode specific grouping - for
some modes it's non-existent (individual candidates are in round-robin),
for some others (incremental mode) it can be pretty large groups (e.g.,
commonly much larger than what we refer to as "batches").  In the latter
case, optimality of the search order is negatively impacted by the large
group size.

In general, "--fork" is potentially less optimal than OpenMP in terms of
candidate password ordering (with OpenMP there's just one
optimally-ordered candidate password stream), but its lack of
synchronization achieves a higher cumulative c/s rate.

> > Now, with a fixed runtime/fixed candidates to test, I'd like to run
> > separate cracking sessions in parallel. Where do I find the named log file
> > based on session name, as described in:
> >
> > https://github.com/openwall/john/blob/bleeding-jumbo/doc/OPTIONS?
> >
> > It should be in the same directory as ${JOHN}/john.log, correct?

Yes.
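Returning to the "--fork" distribution described above, here's a minimal
sketch in Python (my simplification, not JtR's actual code; fork count and
group size are arbitrary):

```python
# Sketch of round-robin distribution of an ordered candidate stream
# across forked processes (simplified; real JtR also buffers and tests
# each process' stream in batches).
def round_robin(candidates, nforks, group_size=1):
    """Deal consecutive groups of `group_size` candidates (ordered most
    to least likely) out to `nforks` per-fork streams, round-robin."""
    streams = [[] for _ in range(nforks)]
    for i in range(0, len(candidates), group_size):
        fork = (i // group_size) % nforks
        streams[fork].extend(candidates[i:i + group_size])
    return streams

# With two forks and single-candidate groups, fork 1 takes candidates
# 1, 3, 5, 7 and fork 2 takes 2, 4, 6, 8 - so the two most likely
# candidates are tested first, concurrently.
print(round_robin([1, 2, 3, 4, 5, 6, 7, 8], 2))
# -> [[1, 3, 5, 7], [2, 4, 6, 8]]
```

With a larger group_size (as in incremental mode), each fork still gets
groups in round-robin, but the ordering within the merged search is
coarser - which is the group-size impact on optimality mentioned above.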
However, it is wrong to use those curly braces to refer to $JOHN, because
it is not an environment variable - it just looks like one, but it's
actually a different concept, specific and internal to JtR.  (I now see we
have that bug in the descriptions of SessionFileProtect and
LogFileProtect.  We should fix those.)

> > Here is my command (for example, SESSION =
> > "experiment_2021-05-12_16:13:06.153197"):
> >
> > ${JOHN}/john --session=${SESSION} --fork=1
> >> --pot=${EXPERIMENTS}/${SESSION}/cracked_len_10.pot --incremental=Digits
> >> --min-len=10 --max-len=10 --max-candidates=10000000
> >> ${EXPERIMENTS}/${SESSION}/pwdfile_sha1_crypt_10.txt 2>
> >> ${EXPERIMENTS}/${SESSION}/cracked.err 1>
> >> ${EXPERIMENTS}/${SESSION}_cracked_len_10.out

Looks OK to me, except that "--fork=1" is pointless.  If you effectively
don't use "--fork", you'd want to control OMP_NUM_THREADS instead to
ensure you never overbook the system - OpenMP's performance is very
sensitive to overbooking.

> > Further, how does --session interact with LogFileProtect in john.conf?
> >
> > I would expect "LogFileProtect = Named" to not allow named sessions to
> > write to john.log, nor to write to other named session logs, yet they would
> > write to their own session's NAME.log.
> >
> > Instead, when uniquely naming sessions and with "LogFileProtect = Named" I
> > do not see session-named logs created, and no new data is added to john.log
> > (the latter which is expected).

These are features I wasn't involved in implementing and have never used,
and have no comments on.

> > I have looked through both doc/OPTIONS and john.conf for a separate,
> > independent variable to enable session-named logs, but no luck.

This is enabled by default.  In fact, there's no way to disable it short
of disabling the logging altogether.

Alexander