john-users - Re: Markov Sampling

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAJ9ii1HaA0M+FcWfe-iza8x5TrEx61oOmDz27=cUNa7R-Z3knA@mail.gmail.com>
Date: Thu, 15 Feb 2018 10:17:58 -0500
From: Matt Weir <cweir@...edu>
To: john-users@...ts.openwall.com
Subject: Re: Markov Sampling

Realized I had a typo in my genmkvpwd description. That would be for a
max length 12, not 120. Should have picked my example level a bit
differently ;p

Matt

On Thu, Feb 15, 2018 at 10:00 AM, Matt Weir <cweir@...edu> wrote:
> This is for JtR's Markov mode. You can do similar options with more
> advanced Markov methods like OMEN but I'm going to skip that for now.
>
> 1) Make sure you have your stats file trained already. The following
> instructions will calculate the level based upon that.
> 2) Open up the stats file. You'll see a lot of lines like:
>
> 97=proba1[32]
> 51=proba2[32*256+35]
> 44=proba2[32*256+38]
> 51=proba2[32*256+39]
> 40=proba2[32*256+48]
> 35=proba2[32*256+49]
> 37=proba2[32*256+50]
> ....
>
> 3) The format is a bit wonky if you have never dug into it before. The
> main thing to keep in mind is the level is the leftmost value.
> 'proba1' means the probability for the first letter. 'proba2' means a
> conditional probability for a letter following the previous letter.
>
> These examples taken from the code comments in my pcfg implementation.
> Note when I'm using probability/level interchangeably, (not ideal I
> know) : https://github.com/lakiw/pcfg_cracker/blob/master/python_pcfg_cracker/pcfg_trainer/markov.py
>
> # Format:
> # Probability=proba1[ORD_REP1]
> # Probability=proba2[ORD_REP1*256+ORD_REP2]
> #
> # Example:
> # 27=proba1[97] //'a' has probability 27
> # 85=proba2[97*256+114] //'r' given 'a' has a probability of 85
> # 83=proba2[97*256+100] //'d' given 'a' has a probability of 83
>
> 4) This is all based on the ASCII representation so 'a' = '97'. You
> can ignore the *256. That's a holdover from the JtR implementation. I
> suspect at one point there was a plan to support non-ASCII characters.
>
> 5) Now take the password you want to find the Markov level/probability
> for. Let's assume it is 'cats'. You start out by looking up the proba1
> for [c]. That's your starting level. Then look up the prob2['c' +
> 'a']. Add that level to your running total. Next look up proba2['a' +
> t']. Add that to your running total. Finally look up proba2['t' + s'].
> Add that to your running total and the sum of all of that is your
> Markov level.
>
> 6) If you have a transition that isn't listed, (for example it uses a
> non-ascii character) that guess will never be generated.
>
> 7) Use ./genmkvpwd to calculate how many password guesses would be
> made for that level of the target password. While it won't tell you
> where in that level the password would be guessed, you can probably
> say it'll be (keysize for level)/2 and have it average out. With
> Markov it isn't weighted to the front which is why cracking sessions
> look a lot like a straight line with a few bumps in it.
>
> ./genmkvpwd statfile max_lvl [max_len] [start] [stop]
>
> Example (if you are looking up for level 120, max length 120)
> ./genmkvpwd statfile 120 12 1 120
>
> 8) Yeah you have to specify a maximum length for guesses and that does
> impact the keysize. Also I can't remember if the results of genmkvpwd
> are total for 0-level or just that level and you have to add all the
> previous levels to get the guesses for it so you might need to play
> with that. I think they are additive, (aka take the last value for the
> total keysize), but I could be wrong as it's been a while since I
> played with that.
>
> Good luck, and if this is confusing or didn't make sense let me know.
>
> Matt
>
> On Thu, Feb 15, 2018 at 9:10 AM, Matlink <matlink@...link.fr> wrote:
>> I am interested by knowing how you calculate the level required to
>> generate plaintext passwords (as you explain in 1)).
>>
>>
>> Le 13/02/2018 à 19:16, Matt Weir a écrit :
>>> So my initial reaction was that I'm doubtful printing out 1/10 of all
>>> the password guesses will give you the end results you are looking
>>> for. Here are a couple of other options:
>>>
>>> 1) You can always work backwards if your target plaintexts are known.
>>> Aka calculate the level required to generate them, (that's easy to do.
>>> Sum up all of the transitions based on your character file). Once you
>>> have that, calculate how many guesses are required to get to that
>>> level which the included tools in JtR will do for you. If you need
>>> additional help on this approach I can write up something more
>>> detailed. This is *super* fast and gives you a pretty good
>>> approximation.
>>>
>>> 2) You can rely on JtR's own logging to see when passwords would be
>>> cracked so that way you are not slowed down by piping the output to
>>> stdout or using a 3rd party tool to figure out how many passwords you
>>> cracked. If you want to go this route I'm sure the people on this list
>>> can help out.
>>>
>>> There's other options as well I guess. You could always modify the
>>> Markov code directly in JtR. If you want other examples of the Markov
>>> code, I re-implemented it in Python for my PCFG cracker. You can see
>>> the guess generation code at:
>>>
>>>  https://github.com/lakiw/pcfg_cracker/blob/master/python_pcfg_cracker/pcfg_manager/markov_cracker.py
>>>
>>> Of course, my code is slow so that almost certainly is not the way to
>>> go, but it may be useful as a reference. Long story short, if you let
>>> us know more what you are trying to do I'm sure we can brainstorm some
>>> better options.
>>>
>>> Matt
>>>
>>>
>>>
>>> On Tue, Feb 13, 2018 at 11:47 AM, Solar Designer <solar@...nwall.com> wrote:
>>>> On Tue, Feb 13, 2018 at 04:44:03PM +0100, Matlink wrote:
>>>>>> The pre-defined --external=Parallel mode will do what you ask for.
>>>>>> You'll just need to customize the "node" and "total" numbers in its
>>>>>> init() in john.conf.
>>>>> Well, I guess it's only 'not printing' generated candidates? Does it
>>>>> really speed up the process, since generating a password candidate is
>>>>> more costly than printing it?
>>>> It doesn't speed up the processing inside JtR; it actually adds extra
>>>> processing.
>>>>
>>>>> Concretely, is --markov --stdout --external=Parallel with node 1/100,
>>>>> 100 times faster than with node 1/1?
>>>> No.  It's probably roughly same speed: the external mode adds overhead
>>>> internally to JtR, but then those skipped candidates don't need to be
>>>> printed to the Unix pipe.
>>>>
>>>>>> However, note that "every 10th" doesn't necessarily produce a
>>>>>> representative sample: the underlying cracking mode (in this case,
>>>>>> Markov) might happen to have some periodicity in its output, and one of
>>>>>> its period lengths might just happen to be a multiple of 10 or whatever.
>>>>>> So ideally you'd want to randomize the order (if the order somehow
>>>>>> doesn't matter for your research) over a larger number of candidate
>>>>>> passwords - say, pass a million of them through GNU coreutils' shuf(1) -
>>>>>> and then take every 10th out of that randomized list.
>>>>> My issue is that I can't get the whole output because it is too costly
>>>>> for me to gather them due to UNIX pipe. I would like to my
>>>>>
>>>>>     john --stdout --markov --sample=100 | my_sublime_post-process
>>>>>
>>>>> be somewhat 100 times faster than
>>>>>
>>>>>     john --stdout --markov --sample=1 | my_sublime_post-process
>>>> You could use the built-in --node=1/100 feature, which probably will
>>>> speed things up a lot, but then it almost certainly doesn't result in a
>>>> representative sample - it's just a way to split the work between
>>>> multiple nodes, without regard as to whether each node would get a
>>>> representative sample and be expected to crack a similar percentage of
>>>> real-world passwords that other nodes crack or not (so this probably
>>>> won't be the case, making this approach unsuitable for use in research).
>>>>
>>>> The same applies to incremental mode.
>>>>
>>>>> Your solution requires to get the whole output of john and then
>>>>> post-process it, but I can't find a satisfiable way to get its whole
>>>>> output (since john is really fast to generate candidates).
>>>> A question is whether you actually need to get this many candidates (or
>>>> a sample from this many), or whether fewer would suffice.  That depends
>>>> on what your ultimate goal is.
>>>>
>>>> Alexander
>>
>> --
>> Matlink - Sysadmin matlink.fr
>> Sortez couverts, chiffrez vos mails : https://café-vie-privée.fr/
>> XMPP/Jabber : matlink@...link.fr
>> Clé publique PGP : 0x186BB3CA
>> Empreinte Off-the-record : 572174BF 6983EA74 91417CA7 705ED899 DE9D05B2
>>
>>
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.