Message-ID: <72c8293342fdc8482297fd1574fd2580@smtp.hushmail.com>
Date: Thu, 27 Jul 2023 01:12:04 +0200
From: magnum <magnumripper@...hmail.com>
To: john-users@...ts.openwall.com
Subject: Re: Using Chinese characters

On 2023-07-20 07:41, Giuseppe Kalos wrote:
> I'm trying to crack a 7z password that contains chinese characters, but I'm
> having some problems. First of all, I'm working on Windows and in john.conf
> I have DefaultEncoding set to UTF-8 and DefaultMSCodepage set to CP850.

"DefaultEncoding = UTF-8" is fine, the other setting doesn't matter in
this case.

> If I try to make a custom charset with --make-charset, it will, no matter
> what, take only 51 of the ca. 5000 characters I put in john.pot. If I try

Incremental is actually byte-oriented under the hood, and UTF-8 is a
multibyte encoding. So you won't ever see 5000 there. For example, 爸 is
a byte sequence of (in hex) e7 88 b8 - and 皑 is e7 9a 91. If you train
on just these two characters, incremental will learn e7, 88, 91, 9a and
b8. That's five "characters" (actually that output should read "bytes"
in this case) for just two inputs. But as the first bytes are picked
from a small set, the re-use of bytes (such as e7 for both inputs above)
will be higher with more input, so the count grows much more slowly than
the number of distinct characters.

> with a mask, it gives the error "Error in Unicode conversion. Ensure
> --input-encoding is correct", but even using --input-encoding=UTF-8 it

Mask mode can't do Unicode, except when a supported legacy codepage
covers what we need to produce (such as CP1252 for producing ß and ü).

> gives the same error. The only way I managed to make it kinda work is using
> --subset: and pasting all the characters directly in cmd. This way it
> starts, but when a session is aborted it cannot be restored, once again
> because of the same Unicode conversion error I cited before. Adding
> encoding commands like --input-encoding, --target-encoding does not solve
> the issue.

I have no idea why that doesn't work for you.
You could try --subset=full-unicode instead, but it will only crack
really short passwords (or just a single character repeated several
times). And it will waste time on *all* of Unicode, not just Chinese.

Anyway, for incremental I just tried this, using default settings and
the latest bleeding-jumbo:

$ wget https://github.com/cfbao/chinese-diceware/raw/master/pinyin8k.wordlist
(...)
2023-07-27 00:24:06 (4.29 MB/s) - ‘pinyin8k.wordlist’ saved [111071/111071]

$ file pinyin8k.wordlist
pinyin8k.wordlist: Unicode text, UTF-8 text

# Create a dummy pot file from it, simply replacing tabs with ":"
$ sed -r 's/\t/:/' < pinyin8k.wordlist > dummy.pot

$ head -3 dummy.pot
aba:阿爸
afuhan:阿富汗
aiai:皑皑

$ ./john --make-charset=chinese.chr --pot=dummy.pot
Loaded 8192 plaintexts
Generating charsets........................ DONE
Generating cracking order..------- Stable order (9 recalculations)
Successfully wrote charset file: chinese.chr (71 characters)

See, that's only 71 bytes learned - but it did learn thousands of
different ways to put them together, and tries to be clever about that
detail when you run the mode later. Now let's print the top 10
candidates:

$ ./john -incremental:chinese.chr --stdout -max-candidates=10
Using charset file supplied as option: chinese.chr
Warning: only 71 characters available
大入
本头
大战
本业
?
?
?
?
?
10p 0:00:00:00 0.00% (ETA: 00:44:18) 111.1p/s ?

As seen above, Incremental mode doesn't play that well with UTF-8, so
candidates 5..10 ended up shown as ? on my terminal because individual
bytes were put together in ways that made no sense in that encoding.
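The byte-level effect can be reproduced outside of john with a small
Python sketch. This is only an illustration: it is not how john or its
filter_utf8 external mode is implemented, and the helper names
distinct_bytes and is_valid_utf8 are made up for this example. It
"trains" on the two characters mentioned earlier, collects the distinct
bytes, and then checks how many three-byte combinations of those bytes
are valid UTF-8.

```python
from itertools import product


def distinct_bytes(words):
    """Return the sorted distinct byte values in the UTF-8 encodings of words."""
    return sorted({b for w in words for b in w.encode("utf-8")})


def is_valid_utf8(candidate: bytes) -> bool:
    """True if the byte string decodes as strict UTF-8 (hypothetical helper)."""
    try:
        candidate.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Two input characters yield five distinct bytes, not two "characters".
alphabet = distinct_bytes(["爸", "皑"])
print([f"{b:02x}" for b in alphabet])  # ['88', '91', '9a', 'b8', 'e7']

# A byte-oriented generator may combine those bytes in any order.
candidates = [bytes(combo) for combo in product(alphabet, repeat=3)]
valid = [c for c in candidates if is_valid_utf8(c)]
decoded = [c.decode("utf-8") for c in valid]

print(len(valid), "of", len(candidates), "combinations are valid UTF-8")
print(decoded)
```

Only the combinations that start with the lead byte e7 and continue with
two of the other four bytes decode cleanly (16 of 125 here). The rest
are the kind of candidates a terminal shows as "?".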
We can easily filter them out though:

$ ./john -inc:chinese.chr --stdout -external:filter_utf8 -max-cand=25
Using charset file supplied as option: chinese.chr
Warning: only 71 characters available
大入
本头
大战
本业
光
大肉
本意
过头
过物
过气
成
大头
大有
大自
本口
本人
本有
过子
过目
过思
特别
特头
特有
特文
25p 0:00:00:00 0.00% (ETA: 00:46:31) 277.8p/s 特文
Session stopped (max candidates reached)

Now drop that "--max-cand" option and it will produce millions of valid
candidates (although, I presume, not always sensible in Chinese). I
tried it for a few seconds and it produced 17803462 unique candidates
despite the "only 71 characters available".

I'm not sure why it doesn't work out for you. You should probably use a
fairly recent build of john - not the "latest release" but one of the
binaries you can find on our GitHub repo, or a build of your own from
the same GitHub sources. I can't recall anything important being changed
in the last 1-2 years, but the latest release is over four years old
now.

Perhaps you should try the exact things I did above, and show if/how it
fails? BTW I did also test the above against 7z samples - no problem
seen.

Feel free to post follow-up questions. I may be slow to respond but I
will eventually do so.

magnum