Message-ID: <20241205014553.GA8319@openwall.com>
Date: Thu, 5 Dec 2024 02:45:53 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

Hi,

I'm keeping track of issues with and enhancement ideas for the tokenizer
and its uses here:

https://github.com/openwall/john/issues/5597

This lists 12 items now.  Some of these have been mentioned in here,
others are "brand new".

On Wed, Dec 04, 2024 at 07:24:57PM -0500, Matt Weir wrote:
> I wrote a new blog post about additional analysis of the Tokenizer and OMEN
> attacks.
>
> Link:
> https://reusablesec.blogspot.com/2024/12/analyzing-tokenizer-part-2-omen.html
>
> TLDR: I ran into significant problems training on the output of tokenize.pl.
> The control characters it inserts into the training data cause problems
> both when I was reading in the data as well as when I was trying to write
> the OMEN rulesets to disk. Therefore I was unable to successfully combine
> the two attack techniques. I plan on continuing to look into this, but it's
> quickly turning into a much bigger project than I originally expected.

Thanks.  Yes, projects like these quickly expand in many directions.

Regarding the control characters, it should be fairly easy for you to
modify the script so that it does not use single-byte ones.  Just reduce
$maxtok from 158 to 128, and adjust this loop:

for (my $c = 1; $c < 0x100; $c++) {

to be:

for (my $c = 0x80; $c < 0x100; $c++) {

As to multi-byte strings that are somehow special in UTF-8 (you show
"\u2028" and "\u0085"), you could exclude (skip in the loop above) their
individual bytes such as 0xc2 and 0xe2 (if I got these right).  You'd
also need to decrease $maxtok further to 126.

Regarding your first 25 for RockYou training not matching mine, I think
it is important because it indicates that something went wrong and it
potentially invalidates your other results.
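For what it's worth, the byte arithmetic above can be sanity-checked with a
few lines of Python (just a sketch; tokenize.pl itself is Perl, and the
exact exclusion logic there is left to the reader):

```python
# Sanity check (not part of tokenize.pl): confirm which UTF-8 lead bytes
# the two special characters use, and how many single-byte token codes
# remain once the range is restricted to 0x80-0xff and those lead bytes
# are excluded.

# U+2028 (LINE SEPARATOR) and U+0085 (NEXT LINE) as UTF-8 byte strings
assert "\u2028".encode("utf-8") == b"\xe2\x80\xa8"  # lead byte 0xe2
assert "\u0085".encode("utf-8") == b"\xc2\x85"      # lead byte 0xc2

# Bytes 0x80..0xff avoid all single-byte ASCII control characters
candidates = list(range(0x80, 0x100))
print(len(candidates))  # 128, matching the suggested $maxtok

# Excluding the two lead bytes leaves 126 usable token codes,
# matching the further-decreased $maxtok
usable = [c for c in candidates if c not in (0xC2, 0xE2)]
print(len(usable))  # 126
```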
Sure, everything looks similar, but if it's unexpectedly not the same,
then we have no idea if e.g. incremental+tokenize would actually perform
better than OMEN with whatever error there was corrected.

You found one puzzling display error with piping into "less" under WSL.
However, even with that corrected, your first 25 list still doesn't
match mine, right?

Anyway, all of this was presumably on a version of tokenize.pl from
before the fixing of now-known issues in it (which had made it
sub-optimal compared to the current version).  You don't mention what
version you use now.  I hope you've switched to the latest?  If so, it
isn't supposed to match the older version's output I had posted.  But I
can get a new test case to you.

Alexander