john-users - Re: Markov Sampling



Message-ID: <20180213164701.GA10666@openwall.com>
Date: Tue, 13 Feb 2018 17:47:02 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov Sampling

On Tue, Feb 13, 2018 at 04:44:03PM +0100, Matlink wrote:
> > The pre-defined --external=Parallel mode will do what you ask for.
> > You'll just need to customize the "node" and "total" numbers in its
> > init() in john.conf.
> Well, I guess it's only 'not printing' generated candidates? Does it
> really speed up the process, since generating a password candidate is
> more costly than printing it?

It doesn't speed up the processing inside JtR; it actually adds extra
processing.

> Concretely, is --markov --stdout --external=Parallel with node 1/100,
> 100 times faster than with node 1/1?

No.  It's probably roughly the same speed: the external mode adds
overhead inside JtR, but then the skipped candidates don't need to be
written to the Unix pipe.
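
As a rough illustration of why the speed stays about the same, here is a
minimal Python sketch of an every-Nth filter.  It is not the actual
john.conf external mode code: the names keep_for_node and count_generated
and the index convention are assumptions made up for this example.  The
point is only that every candidate is still generated and inspected, even
though few are emitted.

```python
def keep_for_node(candidates, node, total):
    """Yield only the candidates assigned to this node (node is 1-based)."""
    for index, word in enumerate(candidates):
        # Every candidate is still generated and looked at here;
        # the ones meant for other nodes are dropped before output.
        if index % total == node - 1:
            yield word


def count_generated(n, counter):
    """Hypothetical stand-in for a cracking mode; counts what it generates."""
    for i in range(n):
        counter[0] += 1
        yield f"cand{i}"


counter = [0]
kept = list(keep_for_node(count_generated(1000, counter), node=1, total=100))
print(len(kept), counter[0])  # prints: 10 1000
```

Only 10 candidates come out, but all 1000 were generated, so the
generation cost is unchanged.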

> > However, note that "every 10th" doesn't necessarily produce a
> > representative sample: the underlying cracking mode (in this case,
> > Markov) might happen to have some periodicity in its output, and one of
> > its period lengths might just happen to be a multiple of 10 or whatever.
> > So ideally you'd want to randomize the order (if the order somehow
> > doesn't matter for your research) over a larger number of candidate
> > passwords - say, pass a million of them through GNU coreutils' shuf(1) -
> > and then take every 10th out of that randomized list.
> 
> My issue is that I can't get the whole output because gathering it
> through the Unix pipe is too costly for me. I would like
> 
>     john --stdout --markov --sample=100 | my_sublime_post-process
> 
> to be roughly 100 times faster than
> 
>     john --stdout --markov --sample=1 | my_sublime_post-process

You could use the built-in --node=1/100 feature, which will probably
speed things up a lot, but it almost certainly doesn't give you a
representative sample.  It's just a way to split the work between
multiple nodes, with no guarantee that each node gets a representative
sample or would crack a similar percentage of real-world passwords as
the other nodes.  That probably won't be the case, which makes this
approach unsuitable for use in research.

The same applies to incremental mode.

> Your solution requires getting the whole output of john and then
> post-processing it, but I can't find a satisfactory way to get its whole
> output (since john generates candidates really fast).

A question is whether you actually need to get this many candidates (or
a sample from this many), or whether fewer would suffice.  That depends
on what your ultimate goal is.
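
If a bounded pool of candidates turns out to be enough, the earlier
suggestion (randomize the order of, say, a million candidates, then take
every 10th) could be sketched in Python roughly as follows.  This is only
an illustration under assumptions: the function name shuffled_stride_sample
and its parameters are made up here, and it stands in for shuf(1) plus a
stride filter rather than reproducing them exactly.

```python
import random
from itertools import islice


def shuffled_stride_sample(candidates, block_size=1_000_000, stride=10, seed=None):
    """Read at most block_size candidates, shuffle them, keep every stride-th."""
    rng = random.Random(seed)
    # Only the first block_size candidates are consumed from the stream.
    block = list(islice(candidates, block_size))
    # Shuffling breaks any periodicity in the generator's output order.
    rng.shuffle(block)
    return block[::stride]


# Example with a stand-in candidate stream instead of real john output.
stream = (f"cand{i}" for i in range(5000))
sample = shuffled_stride_sample(stream, block_size=1000, stride=10, seed=1)
print(len(sample))  # prints: 100
```

The block has to fit in memory, and the result is a sample of only the
first block_size candidates, not of the whole keyspace.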

Alexander
