john-dev - Re: Character casing question for U+0131

Message-ID: <4E37176E.9090408@bredband.net>
Date: Mon, 01 Aug 2011 23:15:26 +0200
From: magnum <rawsmooth@...dband.net>
To: john-dev@...ts.openwall.com
Subject: Re: Character casing question for U+0131

On 2011-08-01 00:03, jfoug wrote:
> This is asked to the list as a whole, in hopes that someone will have
> the answer. I am hoping this can be easily answered, and then all
> formats can conform to a single method.

I'm afraid there is no one-size-fits-all answer. We need to establish 
the behaviour for every format that uppercases and does not use Unicode 
internally. Fortunately, the number of formats that uppercase is low!

> Ok, in cp850, character at 0xD5 is U+0131. This is the undotted lower
> case ‘I’ character. Now, in Unicode, this character DOES upcase. It
> upcases to normal cap I. Thus, this is a non-circular upcase. In cp850,
> lc(uc(char(0xD5))) != char(0xD5), but instead == char(0x69)
>
> If this is the proper behavior, then what I have right now, in Unicode.c
> and rules.c is correct. I handle cp850 differently, as after the normal
> building of the upcase/downcase set of arrays, I change one element in
> the upcase array, to handle character 0xD5 upcasing into character 0x49.
> This works great, AS LONG AS the actual formats, and or OS code page
> logic works that way (for cp850).
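
To make the table patch described above concrete, here is a minimal Python sketch. It is not the code in Unicode.c or rules.c. The function name build_case_tables and the rule "keep only circular pairs" are assumptions made for illustration, and Python's own cp850 codec and Unicode case data stand in for whatever tables the real code uses.

```python
CODEPAGE = "cp850"  # assumption: Python's codec matches the OEM code page


def build_case_tables(codepage):
    """Build 256-entry upcase/downcase tables from circular pairs only.

    Hypothetical helper: a byte gets a case mapping only when
    lc(uc(c)) == c and the uppercase form exists in the code page.
    """
    up = list(range(256))
    low = list(range(256))
    for b in range(256):
        ch = bytes([b]).decode(codepage)
        u = ch.upper()
        # Skip multi-character results (ß -> SS), characters without an
        # uppercase form, and non-circular mappings such as U+0131 -> I.
        if len(u) != 1 or u == ch or u.lower() != ch:
            continue
        try:
            ub = u.encode(codepage)
        except UnicodeEncodeError:
            continue  # uppercase form is not in this code page
        up[b] = ub[0]
        low[ub[0]] = b
    return up, low


UP, LOW = build_case_tables(CODEPAGE)

# The one manual fix: dotless i (0xD5 in cp850) upcases to plain 'I' (0x49).
UP[0xD5] = 0x49

print(hex(UP[0xD5]), hex(LOW[UP[0xD5]]))
```

With the patch in place, UP[0xD5] is 0x49 while LOW[0x49] stays 0x69, which is exactly the non-circular behaviour described in the quoted text.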

Empirical data is what moves us forward. Here are *real* hashes from a 
Windows XP machine running with OEM codepage 437:

pound:1009:ED731A96A0C79241AAD3B435B51404EE:E1AE1BF327FBCC23730F7DB73A56AC44:::
dotless-i:1010:F7E62F36F8DB5AE6AAD3B435B51404EE:5CF982AC5D8263F6F42A88C1816218C4:::
german-ss:1011:83DC881CE3412BC5AAD3B435B51404EE:D0EE6EDA1C675ED9196A449872AEEA84:::
micro:1012:866B72239BB4C2CBAAD3B435B51404EE:0F6CE4C114FB6047318D15A2F0EBBFAC:::
o-diaeresis:1013:350AACEB37EDB148AAD3B435B51404EE:F8B057EF7946389887E5C5868A0969B1:::

Now, you can crack the NT hashes using -enc=utf8 and the following 
dictionary (encoded as UTF-8 of course):
£
ı
ß
µ
ö

Notes for NT:
1. The German double-s (ß) is NOT uppercased to SS (just as we thought)
2. The micro sign (µ) is NOT uppercased to the Greek uppercase version of 
that character (the Unicode specs suggest that could be done)
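
For anyone who wants to check the NT column without running John, the sketch below recomputes NT hashes as MD4 over the UTF-16LE password with no case conversion, and compares them with the pwdump lines above. The MD4 routine is a plain pure-Python rendering of RFC 1320, written out because hashlib does not always provide MD4. The names md4, nt_hash and SAMPLES are made up for this example.

```python
import struct


def _rotl(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def md4(data):
    """Pure-Python MD4 following RFC 1320; returns the 16-byte digest."""
    msg = data + b"\x80"
    msg += b"\x00" * ((56 - len(msg)) % 64)
    msg += struct.pack("<Q", len(data) * 8)
    h = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476]
    # (round function, additive constant, shift amounts, message word order)
    rounds = (
        (lambda x, y, z: (x & y) | (~x & z), 0, (3, 7, 11, 19),
         tuple(range(16))),
        (lambda x, y, z: (x & y) | (x & z) | (y & z), 0x5A827999,
         (3, 5, 9, 13),
         (0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15)),
        (lambda x, y, z: x ^ y ^ z, 0x6ED9EBA1, (3, 9, 11, 15),
         (0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15)),
    )
    for off in range(0, len(msg), 64):
        x = struct.unpack("<16I", msg[off:off + 64])
        a, b, c, d = h
        for func, const, shifts, order in rounds:
            for i, k in enumerate(order):
                t = (a + func(b, c, d) + x[k] + const) & 0xFFFFFFFF
                a = _rotl(t, shifts[i % 4])
                a, b, c, d = d, a, b, c  # rotate the register roles
        h = [(v + w) & 0xFFFFFFFF for v, w in zip(h, (a, b, c, d))]
    return struct.pack("<4I", *h)


def nt_hash(password):
    """NT hash: MD4 of the UTF-16LE password, no uppercasing."""
    return md4(password.encode("utf-16-le")).hex().upper()


# Account name -> (candidate password, NT hash copied from the dump above)
SAMPLES = {
    "pound": ("\u00a3", "E1AE1BF327FBCC23730F7DB73A56AC44"),
    "dotless-i": ("\u0131", "5CF982AC5D8263F6F42A88C1816218C4"),
    "german-ss": ("\u00df", "D0EE6EDA1C675ED9196A449872AEEA84"),
    "micro": ("\u00b5", "0F6CE4C114FB6047318D15A2F0EBBFAC"),
    "o-diaeresis": ("\u00f6", "F8B057EF7946389887E5C5868A0969B1"),
}

for name, (candidate, expected) in SAMPLES.items():
    print(name, nt_hash(candidate) == expected)
```

The candidates are written as escapes so the file encoding cannot get in the way; they are the same five characters as in the UTF-8 dictionary above.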

And here are the characters that will crack the LM part (if encoded in 
cp437):
£
I
ß
µ
Ö

Notes for LM:
1. The dotless i is not present in cp437! *Regardless* of that, it was 
uppercased to I (which of course does exist). I know that using a euro 
sign as the password produces an "empty" LM hash, and that was what I 
expected here too. Very interesting.
2. The German double-s (ß) is NOT uppercased to SS (just as we thought)
3. The German/Nordic o-with-diaeresis (ö) is uppercased, as expected
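
The LM observations above can be summed up as a small Python sketch. It only reproduces the five data points plus the euro case and makes no claim about how Windows does it internally. The function name lm_candidate and the "uppercase in Unicode first, then convert to the OEM code page" order are assumptions chosen because they fit the data.

```python
def lm_candidate(password, codepage="cp437"):
    """Return the OEM bytes that would be fed to LM, or None if impossible.

    Hypothetical helper that mimics the observed behaviour:
    - uppercase each character in Unicode, one character at a time
    - keep the original when the uppercase form is longer than one
      character (ß must not become SS)
    - keep the original when the uppercase form is not in the code page
      (µ must not become Greek capital mu)
    - give up when the character cannot be encoded at all (euro sign)
    """
    out = bytearray()
    for ch in password:
        up = ch.upper()
        if len(up) != 1:
            up = ch
        try:
            out += up.encode(codepage)
        except UnicodeEncodeError:
            try:
                out += ch.encode(codepage)
            except UnicodeEncodeError:
                return None  # no usable LM representation
    return bytes(out)


for name, pw in [("pound", "\u00a3"), ("dotless-i", "\u0131"),
                 ("german-ss", "\u00df"), ("micro", "\u00b5"),
                 ("o-diaeresis", "\u00f6"), ("euro", "\u20ac")]:
    print(name, lm_candidate(pw))
```

Under these assumptions the dotless i ends up as a plain I (0x49) even though cp437 has no dotless i, because the uppercasing happens before the code page conversion.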

Conclusions:
That dotless-i thing in LM was news to me. This means you do the right 
thing for cp850, but probably not for cp437...

Other than that, the behaviour is what I thought. Something similar to 
this should be done for Oracle on Unix, Oracle on Windows, and a number 
of other formats.

magnum
