Date: Thu, 27 Jul 2023 01:12:04 +0200
From: magnum <magnumripper@...hmail.com>
To: john-users@...ts.openwall.com
Subject: Re: Using Chinese characters

On 2023-07-20 07:41, Giuseppe Kalos wrote:
> I'm trying to crack a 7z password that contains chinese characters, but I'm
> having some problems. First of all, I'm working on Windows and in john.conf
> I have DefaultEncoding set to UTF-8 and DefaultMSCodepage se to CP850.

"DefaultEncoding = UTF-8" is fine, the other setting doesn't matter in 
this case.

> If I try to make a custom charset with --make-charset, it will, no matter
> what, take only 51 of the ca. 5000 characters I put in john.pot. If I try

Incremental is actually byte-oriented under the hood, and UTF-8 is a 
multibyte encoding, so you won't ever see 5000 there. For example, 爸 is 
the byte sequence (in hex) e7 88 b8 - and 皑 is e7 9a 91. If you train 
on just these two characters, incremental will learn e7, 88, 91, 9a and 
b8. That's five "characters" (that output really should read "bytes" in 
this case) for just two inputs, but since the first bytes are picked 
from a small set, the re-use of bytes (such as e7 for the two different 
inputs above) will be higher.
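You can verify those byte sequences, and the five-byte count, with 
standard POSIX tools (a quick sketch, not using john itself):

```shell
# Dump the UTF-8 bytes of each character
printf '爸' | od -An -tx1        # e7 88 b8
printf '皑' | od -An -tx1        # e7 9a 91

# Count the distinct bytes across both characters - five, as described above
printf '爸皑' | od -An -tx1 -v | tr -s ' ' '\n' | grep . | sort -u | wc -l
```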

> with a mask, it gives the error "Error in Unicode conversion. Ensure
> --input-encoding is correct", but even using --input-encoding=UTF-8 it

Mask mode can't do Unicode except for when we have a supported legacy 
codepage covering what we need to produce (such as CP1252 for producing 
ß and ü).
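The distinction is easy to see with iconv (a sketch outside of john): a 
character like ß maps to a single CP1252 byte, while a Chinese character 
has no legacy single-byte representation at all:

```shell
# ß exists in CP1252 as the single byte 0xdf, so a legacy-codepage mask can produce it
printf 'ß' | iconv -f UTF-8 -t CP1252 | od -An -tx1

# 爸 has no CP1252 mapping, so the conversion fails outright
printf '爸' | iconv -f UTF-8 -t CP1252 2>/dev/null || echo "no CP1252 mapping"
```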

> gives the same error. The only way I managed to make it kinda work is using
> --subset: and pasting all the characters directly in cmd. This way it
> starts, but when a session is aborted it cannot be restored, one again
> because of the same Unicode conversion error I cited before. Adding
> encoding commands like --input-encoding, --target-encoding does not solve
> the issue.

I have no idea why that doesn't work for you.  You could try 
--subset=full-unicode instead, but it will only crack really short 
passwords (or just a single character repeated several times).  And it 
will waste time on *all* of Unicode, not just Chinese.


Anyway for incremental I just tried this, using default settings and 
latest bleeding jumbo:

$ wget 
https://github.com/cfbao/chinese-diceware/raw/master/pinyin8k.wordlist
(...)
2023-07-27 00:24:06 (4.29 MB/s) - ‘pinyin8k.wordlist’ saved [111071/111071]

$ file pinyin8k.wordlist
pinyin8k.wordlist: Unicode text, UTF-8 text

# Create a dummy pot file from it, simply replacing tabs with ":"
$ sed -r 's/\t/:/' < pinyin8k.wordlist > dummy.pot

$ head -3 dummy.pot
aba:阿爸
afuhan:阿富汗
aiai:皑皑

$ ./john --make-charset=chinese.chr --pot=dummy.pot
Loaded 8192 plaintexts
Generating charsets........................ DONE
Generating cracking order..------- Stable order (9 recalculations)
Successfully wrote charset file: chinese.chr (71 characters)

See, that's only 71 bytes learned - but it did learn thousands of 
different ways to put them together, and the mode tries to be clever 
about that when you run it later. Now let's print the top 10 candidates:

$ ./john -incremental:chinese.chr --stdout -max-candidates=10
Using charset file supplied as option: chinese.chr
Warning: only 71 characters available
大入
本头
大战
本业

?
?
?
?
?
10p 0:00:00:00 0.00% (ETA: 00:44:18) 111.1p/s ?

As seen above, Incremental mode doesn't play that well with UTF-8, so 
candidates 5..10 ended up shown as ? on my terminal because individual 
bytes were put together in ways that made no sense in that encoding.  We 
can easily filter them out though:

$ ./john -inc:chinese.chr --stdout -external:filter_utf8 -max-cand=25
Using charset file supplied as option: chinese.chr
Warning: only 71 characters available
大入
本头
大战
本业

光
大肉
本意
过头
过物
过气
成
大头
大有
大自
本口
本人
本有
过子
过目
过思
特别
特头
特有
特文
25p 0:00:00:00 0.00% (ETA: 00:46:31) 277.8p/s 特文
Session stopped (max candidates reached)
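As an aside, the same kind of UTF-8 validity filtering can be 
demonstrated outside john with iconv's -c option (just a sketch; john's 
filter_utf8 is implemented differently, as an external mode):

```shell
# -c silently drops input that is not valid in the source encoding,
# leaving only well-formed UTF-8 - here the stray e7 88 pair is removed
printf 'ok\347\210ok' | iconv -f UTF-8 -t UTF-8 -c
```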

Now drop that "--max-cand" option and it will produce millions of valid 
candidates (although, I presume, not always sensible Chinese). I 
tried it for a few seconds and it produced 17803462 unique candidates 
despite the "only 71 characters available" warning.

I'm not sure why it doesn't work out for you.  You should probably use a 
fairly recent build of john - not the "latest release" but one of the 
binaries you can find on our GitHub repo, or a build of your own from 
the same GitHub sources. I can't recall anything important changing in 
the last 1-2 years, but the latest release is over four years old now.

Perhaps you should try the exact things I did above, and show if/how it 
fails? BTW I also tested the above against 7z samples - no problem seen.

Feel free to post follow-up questions.  I may be slow to respond but I 
will eventually do so.

magnum
