Message-ID: <33b86fe5cce1011e19188732b13eae58@smtp.hushmail.com>
Date: Thu, 5 Dec 2024 09:53:37 +0100
From: magnum <magnumripper@...hmail.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

On 2024-12-05 02:45, Solar Designer wrote:
> As to multi-byte strings that are somehow special in UTF-8 (you show
> "\u2028" and "\u0085"), you could exclude (skip in the loop above) their
> individual bytes such as 0xc2 and 0xe2 (if I got these right). You'd
> also need to decrease $maxtok further to 126.

U+2028 shouldn't be special in any way, but it will look like crap if
your terminal font can't show it (which is likely). U+0085 is indeed
special.

I'm not sure I understand the mentioned change to that script, but if
you want to exclude all UTF-8 first bytes, they are 0xc2, 0xe0, 0xe2,
0xe8 and 0xf0, and you'd decrease $maxtok to 123. With those five
excluded, the tokenizer should never produce anything that can be
parsed as valid UTF-8.

Also, Matt mentions using LC_CTYPE=C; perhaps LC_ALL=C is more
effective?

magnum
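A rough Python sketch of the arithmetic discussed above, not the script from the thread. It shows which byte each of the two code points starts with in UTF-8, and how the token limit shrinks as bytes are excluded. The starting value of 128 is an assumption inferred from the numbers quoted here (126 after excluding two bytes, 123 after five); the function names are made up for illustration.

```python
# Bytes listed in the message as the ones to exclude (copied verbatim).
EXCLUDED = [0xC2, 0xE0, 0xE2, 0xE8, 0xF0]

# ASSUMPTION: limit before any exclusion, inferred from 126 = 128 - 2
# and 123 = 128 - 5 in the thread. The real script may differ.
BASE_MAXTOK = 128


def first_utf8_byte(char: str) -> int:
    """Return the first byte of a single character's UTF-8 encoding."""
    return char.encode("utf-8")[0]


def adjusted_maxtok(excluded, base=BASE_MAXTOK) -> int:
    """Reduce the token limit by one for each distinct excluded byte."""
    return base - len(set(excluded))


def is_valid_utf8(data: bytes) -> bool:
    """True if the byte string decodes as UTF-8 under strict rules."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for ch in ("\u2028", "\u0085"):
        print(f"U+{ord(ch):04X} starts with 0x{first_utf8_byte(ch):02x}")
    print("maxtok, two bytes excluded: ", adjusted_maxtok([0xC2, 0xE2]))
    print("maxtok, five bytes excluded:", adjusted_maxtok(EXCLUDED))
    # A continuation byte on its own is not valid UTF-8.
    print("b'\\x85' valid UTF-8?", is_valid_utf8(b"\x85"))
```

The sketch only checks the first bytes of the two code points named in the quoted text; it makes no attempt to determine which bytes a given tokenizer can actually emit.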