Message-ID: <4D7555F6.5040506@bredband.net>
Date: Mon, 07 Mar 2011 23:02:30 +0100
From: magnum <rawsmooth@...dband.net>
To: john-dev@...ts.openwall.com
Subject: --utf8 option, proof of concept
Here is a PoC of how I think we could get much better flexibility for
all Unicode formats. The code quality is so-so, so I don't intend to put
this on the wiki.
This patch adds the option flag "--utf8". Without this flag, John
behaves as usual, that is, for any format internally converting to
Unicode (most notably NT), the conversion assumes ISO-8859-1 input.
Using this flag makes John assume UTF-8 input instead. That is, you
should feed it wordlists encoded in UTF-8, or hash files with user
info encoded in UTF-8, for --single mode to work best.
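To make the difference between the two input assumptions concrete, here is a minimal sketch of both conversions to UTF-16. It is not the ConvertUTF-based code from the patch: the function names are made up for illustration, only one- to three-byte UTF-8 sequences are handled, and overlong forms and surrogates are not rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ISO-8859-1: every input byte is the code point of the same value. */
size_t latin1_to_utf16(const unsigned char *src, uint16_t *dst, size_t max)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (*src && n < max)
        dst[n++] = *src++;
    return n;
}

/*
 * UTF-8, limited to the Basic Multilingual Plane (1 to 3 byte sequences).
 * Conversion stops at the first byte sequence this sketch cannot decode.
 * Returns the number of UTF-16 code units written.
 */
size_t utf8_to_utf16(const unsigned char *src, uint16_t *dst, size_t max)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (*src && n < max) {
        unsigned c = *src++;

        if (c < 0x80) {
            dst[n++] = (uint16_t)c;
        } else if ((c & 0xE0) == 0xC0 && (src[0] & 0xC0) == 0x80) {
            dst[n++] = (uint16_t)(((c & 0x1F) << 6) | (src[0] & 0x3F));
            src += 1;
        } else if ((c & 0xF0) == 0xE0 && (src[0] & 0xC0) == 0x80 &&
                   (src[1] & 0xC0) == 0x80) {
            dst[n++] = (uint16_t)(((c & 0x0F) << 12) |
                                  ((src[0] & 0x3F) << 6) | (src[1] & 0x3F));
            src += 2;
        } else {
            break; /* invalid or unsupported sequence */
        }
    }
    return n;
}
```

With these, the single byte "\xFC" read as ISO-8859-1 and the two bytes "\xC3\xBC" read as UTF-8 both give the one code unit U+00FC, whereas "\xC3\xBC" read as ISO-8859-1 gives two code units and therefore a different hash.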
The base for the UTF-8 conversion is a ConvertUTF.[ch] from Unicode Inc.
I have stripped and modified it a lot but the original files are
included too. It's apparently free for us to use.
There are bound to be hideous bugs in the code and it probably does not
even compile properly for all targets. It works fine on Linux-x86-64.
It's primarily meant for experimenting with the pros and cons of this
idea. Or maybe try out how many new passwords you can crack using
"--single --utf8" on your favourite raw-md5 dataset.
Parts of the code are barely working and optimisations can be made for sure.
Supported formats right now are NT, the various NET*LM* formats,
mschapv2 and both mscash formats. Plus there is a separate
raw-md5-unicode mode included that does unicode($p) and supports the
--utf8 flag. This format is based on the old "thick" format so it
performs at half the speed of the latest'n'greatest. My suggestions for
md5_gen come from this.
Formats not yet fixed are mssql and probably a couple more.
Other ideas on my to-do list unless someone talks me out of this:
* A couple of new reject rules, maybe -u for rejecting a rule unless the
--utf8 flag is used, and -U for the opposite.
* Maybe even a few UTF-8-aware word rules. It's not that complicated.
* I can see a need for a new format property flag, FMT_UNICODE, that
tells that this format uses Unicode internally. In particular, a format
that does not yet support --utf8 should bail out if you try it.
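The flag idea in the last item could look roughly like the sketch below. Everything in it is an assumption made for illustration: the struct, the flag values, and the second flag FMT_UTF8 (marking a format that already handles --utf8) are hypothetical and not taken from the patch or from John's formats.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits, chosen only for this sketch. */
#define FMT_UNICODE 0x1u /* format converts the password to Unicode internally */
#define FMT_UTF8    0x2u /* format has been taught to honour --utf8 */

/* Hypothetical stand-in for a format descriptor. */
struct fmt_sketch {
    const char *label;
    unsigned flags;
};

/*
 * Returns 1 when the run should be refused: --utf8 was requested and the
 * format converts to Unicode but still assumes ISO-8859-1 input.
 * Formats that never convert to Unicode are unaffected by the flag.
 */
int utf8_must_bail(const struct fmt_sketch *fmt, int utf8_requested)
{
    assert(fmt != NULL);
    if (!utf8_requested)
        return 0;
    if (!(fmt->flags & FMT_UNICODE))
        return 0;
    return !(fmt->flags & FMT_UTF8);
}
```

A format such as mssql would then carry only FMT_UNICODE until it is fixed, and be refused with --utf8, while NT would carry both bits.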
Some problems:
* I currently have all 8-bit test strings commented out from the code,
as they need to be different when the --utf8 flag is used. For example,
{"$NT$8bd6e4fb88e01009818749c5443ea712", "\xC3\xBC"}, // ü, UTF-8
{"$NT$8bd6e4fb88e01009818749c5443ea712", "\xFC"}, // ü in 8859-1
{"$NT$cc1260adb6985ca749f150c7e0b22063", "\xFC\xFC"}, // Two of them
If I leave the first one uncommented, I can build and test the --utf8
mode, but the normal mode will fail. And vice versa. This is a big
problem when e.g. trying to optimise the conversions. I normally work
with all three lines commented out, so sometimes bugs are not discovered
immediately. Ideally the correct line would be picked at runtime, but
they are all constants. I'm not sure what's the best way to
achieve this. Maybe an optional third string for utf8? Or a separate struct.
* In a couple of formats (e.g. NT) I have duplicated the set_salt function
and call it via a pointer, in order to mitigate the performance hit for
non-utf8. I'm not sure how to do it better, but I'm not particularly
satisfied. It's a hack.
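For the first problem, the "optional third string" idea might be sketched as follows. The struct and the helper are hypothetical and not part of the patch; the two 8-bit entries reuse the hashes quoted above, and the ASCII entry uses the well-known NT hash of "password" to show the fallback when no UTF-8 variant is given.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical test-vector layout: utf8_plaintext is the optional third
 * string. It is NULL when the plaintext is pure ASCII and therefore has
 * the same bytes in both modes.
 */
struct test_sketch {
    const char *ciphertext;
    const char *plaintext;      /* ISO-8859-1 bytes (normal mode) */
    const char *utf8_plaintext; /* UTF-8 bytes for --utf8, or NULL */
};

static const struct test_sketch tests[] = {
    {"$NT$8bd6e4fb88e01009818749c5443ea712", "\xFC", "\xC3\xBC"},
    {"$NT$cc1260adb6985ca749f150c7e0b22063", "\xFC\xFC", "\xC3\xBC\xC3\xBC"},
    {"$NT$8846f7eaee8fb117ad06bdd830b7586c", "password", NULL},
    {NULL, NULL, NULL}
};

/* Pick the plaintext for the current mode at runtime. */
const char *test_plaintext(const struct test_sketch *t, int utf8)
{
    assert(t != NULL);
    if (utf8 && t->utf8_plaintext != NULL)
        return t->utf8_plaintext;
    return t->plaintext;
}
```

The table itself stays constant; only the self-test loop would need to call the helper instead of reading the plaintext field directly, so all 8-bit vectors could stay enabled in both modes.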
Just try it out and flame or praise. One test I did was to sort out all
lines in the Rockyou dataset that have 8-bit characters. Most of these
are UTF-8 but not all. Then I created fake NT and raw-md5-unicode
password files from them and tried to crack them with and without --utf8.
Works like a charm.
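A rough sketch of how such a filtering step could be done in C is given below, assuming each wordlist line is available as a NUL-terminated string. The helper names are invented for this example and are not from the patch; the second helper also tells apart the lines that are well-formed UTF-8 from those that are not.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the line contains any byte with the high bit set. */
int has_8bit(const char *line)
{
    assert(line != NULL);
    for (; *line; line++)
        if ((unsigned char)*line & 0x80)
            return 1;
    return 0;
}

/*
 * Returns 1 if the line is well-formed UTF-8. Overlong encodings,
 * surrogates and code points above U+10FFFF are rejected.
 */
int is_valid_utf8(const char *line)
{
    const unsigned char *s = (const unsigned char *)line;

    assert(line != NULL);
    while (*s) {
        unsigned c = *s++;
        unsigned cp;
        int extra;

        if (c < 0x80)
            continue;
        else if (c >= 0xC2 && c <= 0xDF) { extra = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0)     { extra = 2; cp = c & 0x0F; }
        else if (c >= 0xF0 && c <= 0xF4) { extra = 3; cp = c & 0x07; }
        else
            return 0; /* stray continuation byte or invalid lead byte */

        while (extra--) {
            if ((*s & 0xC0) != 0x80)
                return 0; /* also catches the terminating NUL */
            cp = (cp << 6) | (*s++ & 0x3F);
        }
        if (c >= 0xE0 && c <= 0xEF && cp < 0x800)
            return 0; /* overlong three-byte form */
        if (c >= 0xF0 && (cp < 0x10000 || cp > 0x10FFFF))
            return 0; /* overlong or out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* UTF-16 surrogate */
    }
    return 1;
}
```

Lines for which has_8bit() is true would go into the test set; of those, the ones failing is_valid_utf8() are presumably in ISO-8859-1 or some other legacy encoding and would only be expected to crack without --utf8.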
magnum
Download attachment "john-1.7.6-jumbo12-utf8.diff.gz" of type "application/x-gzip" (14533 bytes)