Message-ID: <555F9D72.2060207@gmail.com>
Date: Fri, 22 May 2015 23:19:46 +0200
From: Marek Wrzosek <marek.wrzosek@...il.com>
To: john-users@...ts.openwall.com
Subject: Re: Bleeding jumbo now defaults to UTF-8

On 22.05.2015 at 22:25, magnum wrote:
> On 2015-05-22 21:17, Marek Wrzosek wrote:
>> On 22.05.2015 at 18:33, magnum wrote:
>>> On 2015-05-22 16:48, Marek Wrzosek wrote:
>>>> What is the simplest way to "repair" all.lst from Openwall?
>>>
>>> I bet it's a mix of encodings so can't simply be converted. No tool in
>>> the world will correctly guess each individual line's encoding (I have
>>> seen tools that try, but never one that was any good at it).
>>>
>>> But all.lst is just a mix of all the separate smaller files. Ideally
>>> each of them should be converted to UTF-8 (from whatever respective
>>> codepage), and a new all_utf8.lst could then be created from this.
>>
>> I've already created something like this using latin1, koi8-r and
>> cp1251, but the latter two are Russian-only, so after unique there is
>> only one of them. I also created a file ru_all.lst_utf8 with
>> Russian-only passwords (for use with e.g. --rules=jumbo).
>> It's against netiquette to attach such big files to e-mails, so here
>> are links:
>> https://dl.dropboxusercontent.com/u/68111957/all.lst_utf8.gz
>> https://dl.dropboxusercontent.com/u/68111957/ru_all.lst_utf8.gz
>>
>> I hope they are fine.
>
> Cool. However, all.lst_utf8 looks OK at first but contains half a
> million lines of double-encoded Unicode. I should probably mention there
> is a tool in Jumbo, cprepair, that has some good heuristics for fixing
> that very problem and some others. I usually don't talk about it because
> I haven't had the inspiration to document it :-)
> "../run/cprepair -h" will show usage though.
>
> Check files (no output from "-s -d" means they seem to be fine):
> $ ../run/cprepair -s -d ru_all.lst_utf8
> filename: ru_all.lst_utf8
>
> $ ../run/cprepair -s -d all.lst_utf8 | head
> filename: all.lst_utf8
> abergläubischen => abergläubischen
> abfällt => abfällt
> abgeändert => abgeändert
> abgeänderte => abgeänderte
> abgeänderten => abgeänderten
> abgehängten => abgehängten
> abgeklärt => abgeklärt
> abgekürzt => abgekürzt
> abgelöscht => abgelöscht
>
> Fix the latter:
> $ ../run/cprepair all.lst_utf8 | ../run/unique all2.lst_utf8
> Total lines read 4917041 Unique lines written 4435306
>
> magnum
>

Thank you, magnum! :-)

-- 
Marek Wrzosek
marek.wrzosek@...il.com
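
For readers unfamiliar with the "double-encoded Unicode" problem mentioned above, here is a minimal Python sketch of the general idea. It is not cprepair's code and does not reproduce its heuristics. It assumes the simplest case only: UTF-8 bytes that were misread as Latin-1 and then encoded to UTF-8 a second time. The function names repair_double_utf8 and unique are made up for this illustration.

```python
def repair_double_utf8(line: str) -> str:
    """Undo one round of Latin-1 double encoding, or return the line unchanged.

    Assumption: the damage came from reading UTF-8 bytes as Latin-1.
    Real word lists often involve cp1252 or other codepages instead.
    """
    try:
        # Map each character back to the single byte it came from, then
        # check whether those bytes form valid UTF-8 on their own.
        return line.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Characters above U+00FF, or bytes that are not valid UTF-8:
        # the line is probably fine as it is, so leave it alone.
        return line


def unique(lines):
    """Yield each line once, keeping first-seen order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


if __name__ == "__main__":
    # Build a double-encoded sample rather than hard-coding one.
    broken = "abgeändert".encode("utf-8").decode("latin-1")
    words = ["abgeändert", broken, "password"]
    for word in unique(repair_double_utf8(w) for w in words):
        print(word)
    # prints "abgeändert" and "password", one per line
```

A correctly encoded word such as "abgeändert" survives untouched because its Latin-1 bytes are not valid UTF-8, and pure ASCII lines round-trip to themselves. After repair, the broken and the correct spelling collapse into one entry, which is consistent with the unique-line count above being lower than the lines read.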