Message-ID: <d79e879ad5be0aba058d17ec9ac836fa@smtp.hushmail.com>
Date: Fri, 22 May 2015 22:25:45 +0200
From: magnum <john.magnum@...hmail.com>
To: john-users@...ts.openwall.com
Subject: Re: Bleeding jumbo now defaults to UTF-8

On 2015-05-22 21:17, Marek Wrzosek wrote:
> On 22.05.2015 at 18:33, magnum wrote:
>> On 2015-05-22 16:48, Marek Wrzosek wrote:
>>> What is the simplest way to "repair" all.lst from Openwall?
>>
>> I bet it's a mix of encodings so can't simply be converted. No tool in
>> the world will correctly guess each individual line's encoding (I have
>> seen tools that try, but never one that was any good at it).
>>
>> But all.lst is just a mix of all the separate smaller files. Ideally
>> each of them should be converted to UTF-8 (from whatever respective
>> codepage), and a new all_utf8.lst could then be created from this.
>
> I've already created something like this using latin1, koi8-r and
> cp1251, but the latter two are Russian-only, so after unique there is
> only one of them. I also created the file ru_all.lst_utf8 with
> Russian-only passwords (for use with e.g. --rules=jumbo).
> It's against netiquette to attach such big files to e-mails, so here
> are links:
> https://dl.dropboxusercontent.com/u/68111957/all.lst_utf8.gz
> https://dl.dropboxusercontent.com/u/68111957/ru_all.lst_utf8.gz
>
> I hope they are fine.

Cool. However, all.lst_utf8 looks OK at first but contains half a
million lines of double-encoded Unicode. I should probably mention there
is a tool in Jumbo, cprepair, that has some good heuristics for fixing
that very problem and some others. I usually don't talk about it because
I haven't had the inspiration to document it :-) "../run/cprepair -h"
will show usage though.

Check files (no output from "-s -d" means they seem to be fine):

$ ../run/cprepair -s -d ru_all.lst_utf8
filename: ru_all.lst_utf8

$ ../run/cprepair -s -d all.lst_utf8 | head
filename: all.lst_utf8
abergläubischen => abergläubischen
abfällt => abfällt
abgeändert => abgeändert
abgeänderte => abgeänderte
abgeänderten => abgeänderten
abgehängten => abgehängten
abgeklärt => abgeklärt
abgekürzt => abgekürzt
abgelöscht => abgelöscht

Fix the latter:

$ ../run/cprepair all.lst_utf8 | ../run/unique all2.lst_utf8
Total lines read 4917041 Unique lines written 4435306

magnum
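
For readers unfamiliar with the term, "double-encoded" here means that text already in UTF-8 was treated as Latin-1 and converted to UTF-8 a second time, so "ä" ends up stored as "Ã¤". The following is only a minimal Python sketch of the basic idea of undoing one such round and then de-duplicating. It is not cprepair's actual heuristics (which are not described in this message), and the function names repair_double_encoded and repair_and_unique are made up for illustration. It also assumes the wrong intermediate codepage was plain Latin-1; lines mangled through cp1252 or other codepages would be left untouched.

```python
def repair_double_encoded(line: str) -> str:
    """Undo one round of UTF-8 -> (read as Latin-1) -> UTF-8, if present.

    Hypothetical helper for illustration; not the cprepair algorithm.
    """
    try:
        # A double-encoded line contains only code points <= U+00FF, and the
        # corresponding byte values form valid UTF-8 when decoded again.
        return line.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the line has characters outside Latin-1 (e.g. Cyrillic) or
        # its bytes are not valid UTF-8, so it is assumed to be fine as-is.
        return line


def repair_and_unique(lines):
    """Repair each line, then drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for line in lines:
        fixed = repair_double_encoded(line)
        if fixed not in seen:
            seen.add(fixed)
            result.append(fixed)
    return result
```

With this sketch, a correctly encoded "abgekürzt" and its double-encoded twin collapse into a single entry, which is the same kind of reduction as between the "lines read" and "lines written" counts above. Pure ASCII lines and lines in other scripts pass through unchanged.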