john-users - Re: Rules characters unicode support.

Message-ID: <4c755269adfe36e3c73050f9edff50b7@smtp.hushmail.com>
Date: Tue, 3 Nov 2020 19:26:36 +0100
From: magnum <john.magnum@...hmail.com>
To: john-users@...ts.openwall.com
Subject: Re: Rules characters unicode support.

On 2020-11-03 15:48, François wrote:
> While running my tool on a very large (and old) leak, I realized that some
> character substitutions from ASCII to Unicode were hitting some results (a
> few hits on a large leak) for example:
> seé
> (...)
> They're making sense, because some old RFC or specs prevent non ASCII
> characters to be used in email address or login information but passwords
> fields actually take them now. For example, we could imagine that a
> password associated to my email address francois.pesce@...il.com could be
> close to the way my French first name is actually written, thus "françois"
> (possibly generated by a single rule substituting c to ç such as:  scç ).
> 
> However, it seems that currently, john(-jumbo) does not support Unicode
> characters for all rules commands (except for the content of command A"..."
> ). Is anyone working on supporting that use case, should I just try to use
> the A"..." command for my niche finding ? What are your thoughts?

While the Unicode support could be better, there are ways to achieve 
what you need. First of all, we need to tell John what encoding we're 
expecting the hashes to be made from. Nowadays that's usually a 
no-brainer: it tends to be UTF-8, and that's also the default in john.conf.

Now if you had needed e.g. CP1252, things would be simpler 
since such legacy codepages are all single-byte: you'd simply write your 
rules such as scç and then be sure to save that config file with CP1252 
encoding. Run with --encoding=cp1252 and all should work just fine.

With UTF-8, however, things currently aren't quite that easy because the 
rule engine does not (yet) honor multi-byte characters. But we have a 
work-around called --internal-codepage. With it, we still expect UTF-8 
input (the hash file, any wordlists) and we still produce hashes from a 
UTF-8 encoded cleartext, but internally within the rule engine we run 
the legacy codepage you picked. Just pick any encoding that can hold all 
the characters you need to use.
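
The round trip that --internal-codepage implies can be imitated with 
standard tools. The following is only a sketch of the idea, not how John 
implements it: iconv stands in for the conversion at the edges, and a 
byte-wise tr stands in for the rule scç (octal \347 is 0xE7, which is 
"ç" in CP1252).

```shell
# Sketch only: imitates the data flow, not John's actual implementation.
# UTF-8 in -> CP1252 inside -> byte-wise substitution -> UTF-8 out.
out=$(printf 'francois\n' \
  | iconv -f UTF-8 -t CP1252 \
  | LC_ALL=C tr 'c' '\347' \
  | iconv -f CP1252 -t UTF-8)
echo "$out"
```

The substitution in the middle only ever sees single bytes, which is why 
the chosen codepage must be able to represent every character the rules 
mention.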

So let's try it out:

$ echo francois > words.lst

$ cat john-local.conf
[List.Rules:subs]
seé
suü
scç

$ ./john -stdout -w:words.lst -rules=subs -internal-codepage=cp1252
Invalid rule in (null) at line 2: Unknown command seé

We get this error because john-local.conf contains UTF-8. John should 
actually be smarter here and handle that, but it does not yet. So let's 
encode our config file in CP1252 instead:

$ mv john-local.conf john-local.utf8
$ iconv -t cp1252 < john-local.utf8 > john-local.conf

$ ./john -stdout -w:words.lst -rules=subs -internal-codepage=cp1252
francois
françois
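
The iconv call above names only the target encoding, so the source 
encoding is taken from the current locale. A slightly more defensive 
variant, sketched here in a scratch directory so it touches no real 
config, names both encodings and then checks that the result converts 
back to the original byte for byte. The octal escapes are simply the 
UTF-8 bytes of é, ü and ç, used to keep the snippet itself pure ASCII.

```shell
# Work in a scratch directory; file names mirror the ones used above.
tmp=$(mktemp -d)
printf '[List.Rules:subs]\nse\303\251\nsu\303\274\nsc\303\247\n' \
  > "$tmp/john-local.utf8"

# Name the source encoding explicitly instead of relying on the locale.
iconv -f UTF-8 -t CP1252 < "$tmp/john-local.utf8" > "$tmp/john-local.conf"

# Round trip: converting back must reproduce the UTF-8 original exactly.
if iconv -f CP1252 -t UTF-8 < "$tmp/john-local.conf" \
     | cmp -s - "$tmp/john-local.utf8"; then
  roundtrip=ok
else
  roundtrip=mismatch
fi

utf8_size=$(wc -c < "$tmp/john-local.utf8" | tr -d ' ')
cp1252_size=$(wc -c < "$tmp/john-local.conf" | tr -d ' ')
echo "round trip: $roundtrip, $utf8_size bytes as UTF-8, $cp1252_size as CP1252"
rm -rf "$tmp"
```

The CP1252 file comes out three bytes smaller, one byte for each of the 
three non-ASCII characters, so every substitution rule is back to the 
three bytes of command, source character and target character.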

Another way of achieving the same is to use \xHH hex encoding. The 
value for "ç" in CP1252 is \xe7 so you'd just write it as sc\xe7 instead 
of scç. This way there's less risk of your editor messing things up in 
the future. This notation can also be handy when specifying a rule 
directly on the command line, like so:

$ ./john -stdout -w:words.lst -internal-c=cp1252 -ru=':sc\xe7 u'
FRANÇOIS

As you can see, once you run with an internal codepage things like 
case-shifting (and nearly all other commands and character classes) will 
work for non-ASCII letters as well. We wouldn't want that to end up as 
FRANçOIS.
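
To find the \xHH value for a given character, one option is to convert 
it to the chosen codepage and dump the resulting byte. A small sketch, 
where the helper name cp_hex is made up for this example and is not 
part of John:

```shell
# cp_hex is a hypothetical helper, not part of John: it prints the \xHH
# escape for a single UTF-8 character (read from stdin) in the codepage
# named by its first argument.
cp_hex() {
  printf '\\x%s\n' "$(iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n')"
}

# "ç" passed as its UTF-8 bytes in octal so the snippet stays pure ASCII.
c_cedilla=$(printf '\303\247' | cp_hex CP1252)
echo "$c_cedilla"
```

This prints \xe7, matching the value used in the rule above. If iconv 
reports an error instead, the character does not exist in that codepage 
and a different internal codepage is needed.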

A final note: you can set DefaultInternalCodepage in your config file, 
saving you from giving the -internal-codepage option every time. I'd 
actually recommend doing so; the default is empty for backwards 
compatibility only.
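
As a sketch, and assuming the setting sits in the [Options] section as 
it does in the stock john.conf shipped with jumbo (worth checking 
against your own copy), the entry might look like this:

```
[Options]
# Assumed section name; CP1252 is just the codepage used in this thread.
DefaultInternalCodepage = CP1252
```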

magnum