john-users - Re: Markov Sampling

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJ9ii1Ek65usPPCWmqZcfZwhdZOi6yb7XN9yjTHr0L_Fn_a8qA@mail.gmail.com>
Date: Tue, 13 Feb 2018 13:16:19 -0500
From: Matt Weir <cweir@...edu>
To: john-users@...ts.openwall.com
Subject: Re: Markov Sampling

So my initial reaction was that I'm doubtful printing out 1/10 of all
the password guesses will give you the end results you are looking
for. Here are a couple of other options:

1) You can always work backwards if your target plaintexts are known.
Aka calculate the level required to generate them, (that's easy to do.
Sum up all of the transitions based on your character file). Once you
have that, calculate how many guesses are required to get to that
level which the included tools in JtR will do for you. If you need
additional help on this approach I can write up something more
detailed. This is *super* fast and gives you a pretty good
approximation.

2) You can rely on JtR's own logging to see when passwords would be
cracked so that way you are not slowed down by piping the output to
stdout or using a 3rd party tool to figure out how many passwords you
cracked. If you want to go this route I'm sure the people on this list
can help out.

There's other options as well I guess. You could always modify the
Markov code directly in JtR. If you want other examples of the Markov
code, I re-implemented it in Python for my PCFG cracker. You can see
the guess generation code at:

 https://github.com/lakiw/pcfg_cracker/blob/master/python_pcfg_cracker/pcfg_manager/markov_cracker.py

Of course, my code is slow so that almost certainly is not the way to
go, but it may be useful as a reference. Long story short, if you let
us know more what you are trying to do I'm sure we can brainstorm some
better options.

Matt



On Tue, Feb 13, 2018 at 11:47 AM, Solar Designer <solar@...nwall.com> wrote:
> On Tue, Feb 13, 2018 at 04:44:03PM +0100, Matlink wrote:
>> > The pre-defined --external=Parallel mode will do what you ask for.
>> > You'll just need to customize the "node" and "total" numbers in its
>> > init() in john.conf.
>> Well, I guess it's only 'not printing' generated candidates? Does it
>> really speed up the process, since generating a password candidate is
>> more costly than printing it?
>
> It doesn't speed up the processing inside JtR; it actually adds extra
> processing.
>
>> Concretely, is --markov --stdout --external=Parallel with node 1/100,
>> 100 times faster than with node 1/1?
>
> No.  It's probably roughly same speed: the external mode adds overhead
> internally to JtR, but then those skipped candidates don't need to be
> printed to the Unix pipe.
>
>> > However, note that "every 10th" doesn't necessarily produce a
>> > representative sample: the underlying cracking mode (in this case,
>> > Markov) might happen to have some periodicity in its output, and one of
>> > its period lengths might just happen to be a multiple of 10 or whatever.
>> > So ideally you'd want to randomize the order (if the order somehow
>> > doesn't matter for your research) over a larger number of candidate
>> > passwords - say, pass a million of them through GNU coreutils' shuf(1) -
>> > and then take every 10th out of that randomized list.
>>
>> My issue is that I can't get the whole output because it is too costly
>> for me to gather them due to UNIX pipe. I would like to my
>>
>>     john --stdout --markov --sample=100 | my_sublime_post-process
>>
>> be somewhat 100 times faster than
>>
>>     john --stdout --markov --sample=1 | my_sublime_post-process
>
> You could use the built-in --node=1/100 feature, which probably will
> speed things up a lot, but then it almost certainly doesn't result in a
> representative sample - it's just a way to split the work between
> multiple nodes, without regard as to whether each node would get a
> representative sample and be expected to crack a similar percentage of
> real-world passwords that other nodes crack or not (so this probably
> won't be the case, making this approach unsuitable for use in research).
>
> The same applies to incremental mode.
>
>> Your solution requires to get the whole output of john and then
>> post-process it, but I can't find a satisfiable way to get its whole
>> output (since john is really fast to generate candidates).
>
> A question is whether you actually need to get this many candidates (or
> a sample from this many), or whether fewer would suffice.  That depends
> on what your ultimate goal is.
>
> Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.