Message-ID: <20140415104307.GA18823@openwall.com>
Date: Tue, 15 Apr 2014 14:43:07 +0400
From: Aleksey Cherepanov <lyosha@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: proof of concept converter of rexgen-like syntax into john rules

On Mon, Apr 14, 2014 at 07:06:28PM +0200, magnum wrote:
> On 2014-04-14 18:42, Aleksey Cherepanov wrote:
> >On Sun, Mar 23, 2014 at 12:17:05PM +0100, magnum wrote:
> >>On 2014-03-22 22:13, Aleksey Cherepanov wrote:
> >>>I think rexgen-like syntax could be a part of rules and performed at
> >>>preprocessor level like []. For instance
> >>
> >>This would only work for trivial regexes, a complicated one would produce
> >>too many rules. I'd still want a pure regexp mode.
> >>
> >>Anyways, your script is something we could supply with Jumbo almost as-is. A
> >>"rule generator script" is probably what many people need, including Nima.
> >>Just add a shebang line, license blurb, brief docs and wrap "while ($rule =
> >><>)" around the lot and I think we're set.
> >
> >I am sorry for the delay.
> >
> >I've implemented all missing features except -i option.
> >
> >I did not test things well. Tests are still TODO. Also I'd like to
> >improve interface and add other whistles like named groups and
> >relative back refs.
>
> Thanks! Committed now. I haven't had time to test it a lot yet but I believe
> this can be an excellent tool even once we got a native rexgen mode.

Thanks!

My pipeline is quite simple. I do not store the expression as a tree.
I expand each rule into a list of variants that have only \0 as an
active construction (and also [], because those are passed into john).
Most transformations are straightforward and not tightly coupled. The
only complex part is the expansion of alternatives/groups, because
there is a tree/recursion there, and back references have to be
expanded together with the alternatives to support a back reference
from a group to itself in a previous iteration of a quantifier (as in
(a(?:\1|b)){2} ).

So the pipeline is a sequence of transformations starting with the
rule as a string:

1) Split the string into tokens.

2) Transform the tokens to expand aliases: ? -> {0,1} and so on.

Steps 1 and 2 are in the function to_tokens(). (expand_questions() is
obsolete, I forgot to cut it.)

3) Traverse all tokens to check parity and enumerate the groups. Show
errors on unbalanced parentheses and on back refs to groups that have
not been seen yet.

Ok:    (a)\1, (a\1)
Error: \1(a)

Step 3 is check_parity().

4) expand_quantifiers(): say we have the group with number 1:

(#1...){2,3}

After the expansion we have two variants:

(#1...)(#1...)
(#1...)(#1...)(#1...)

All these groups still have number 1. The expansion is done by
copying.

Then I apply join_tokens() to merge some tokens into bigger strings.
It should not be done at this point, because I parse the tokens again
and again later; that is a flaw of the implementation, not a part of
the algorithm.

5) expand_parenthesis(), for each variant: it expands
groups/alternatives and back refs. This part is recursive. At the
deepest level of the recursion we expand groups into sets of variants
and combine the sets together:

(a|b|c)(x|y|z) -> [a, b, c] * [x, y, z] -> [ax, ay, az, bx, ...]
asdf(a|b|c)    -> [asdf] * [a, b, c] -> ...

Things are not that easy in practice, because we attach
metainformation about each group to each variant:

(a|b|c)(x|y|z)
  -> [a (group 1 is "a"), b (g1 is "b"), c (g1 is "c")]
     * [x (g2 is "x"), y (g2 is "y"), z (g2 is "z")]
  -> [ax (g1 is "a", g2 is "x"), ay (g1 is "a", g2 is "y"), ...]

Actually we use the numbers that the groups got during enumeration.
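To make the combination step concrete, here is a minimal stand-alone
Perl sketch of it. This is not the code of the actual script: the
hash-based variant representation is made up for the illustration,
and the back-ref handling described below is left out.

    #!/usr/bin/perl
    # Combine two sets of variants, tracking which group produced
    # which value. A variant here is a hash: 'str' is the text built
    # so far, 'groups' maps a group number to the value it captured.
    use strict;
    use warnings;

    sub combine {
        my ($left, $right) = @_;
        my @result;
        for my $l (@$left) {
            for my $r (@$right) {
                push @result, {
                    str    => $l->{str} . $r->{str},
                    # on a conflict the right side wins
                    groups => { %{ $l->{groups} }, %{ $r->{groups} } },
                };
            }
        }
        return \@result;
    }

    # (a|b|c)(x|y|z): group 1 is the first alternation, group 2 the
    # second one
    my $set1 = [ map { +{ str => $_, groups => { 1 => $_ } } } qw(a b c) ];
    my $set2 = [ map { +{ str => $_, groups => { 2 => $_ } } } qw(x y z) ];

    for my $v (@{ combine($set1, $set2) }) {
        printf qq{%s (g1 is "%s", g2 is "%s")\n},
            $v->{str}, $v->{groups}{1}, $v->{groups}{2};
    }

This prints ax (g1 is "a", g2 is "x"), ay (g1 is "a", g2 is "y") and
so on, i.e. the cross product above with the metainformation attached.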
We put a group's value into the metainformation of a variant once the
variant is fully formed.

So combine() combines sets of variants while tracking metainformation
and expanding back refs. It traverses all variants of the left set,
and for each of them it traverses all variants of the right set:

- if the left variant has back refs to groups defined in its own
  metainformation, then those are back refs from a group to itself,
  and we skip such variants;

- if the right variant has back refs to groups defined in the
  metainformation of the left variant, then we substitute them;

- merge the metainformation: every value defined in the right variant
  is kept, and values from the left variant are kept when they are not
  redefined in the right one.

Skipping a variant updates a global variable in my implementation; I
think it could be put into the metainformation too. The counter is not
very accurate: (a(?:\1|b)){2} is 4 variants including 2 skipped ones,
but the counter says 1 skipped, because it skips (a\1) before it could
be counted as 2 variants.

At the end of the expansion I expand all back refs to defined groups
and skip variants with back refs to undefined groups. At this moment
the metainformation gets wiped and we have a plain list of variants
again.

6) The last action: expand \0 and turn the remaining backslash escapes
into plain characters. expand_0_john() transforms the variants into
rules. A word generator would do a different thing here, and it would
also have to expand []. This point is the right place to join the
tokens into strings.
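As a very rough illustration of this last step: here is a sketch, not
the script's actual code. It assumes the variant is already a plain
string containing a single \0, and it ignores the other escapes and
the quoting corner cases; variant_to_rule() is a made-up name.

    # Turn one variant into a john rule. In john's rule syntax,
    # A0"STR" inserts STR at the start of the word and Az"STR"
    # appends it; a real converter must pick a delimiter that does
    # not occur in STR instead of '"'.
    use strict;
    use warnings;

    sub variant_to_rule {
        my ($variant) = @_;
        # [] passes through untouched, john's preprocessor expands it
        my ($prefix, $suffix) = split /\\0/, $variant, 2;
        die "no \\0 in variant" unless defined $suffix;
        my @rule;
        push @rule, qq(A0"$prefix") if length $prefix;
        push @rule, qq(Az"$suffix") if length $suffix;
        return @rule ? join(' ', @rule) : ':';   # ':' is the no-op rule
    }

    print variant_to_rule('pass\0word'), "\n";   # A0"pass" Az"word"
    print variant_to_rule('\0[0-9]'), "\n";      # Az"[0-9]"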
expand_to_john() runs the whole pipeline.

BTW, should -i affect \0?

-- 
Regards,
Aleksey Cherepanov