john-dev - Re: Character encoding 'how-to' and patch 0009

Message-ID: <BLU0-SMTP40E2CA29B2B94E00F14951FD320@phx.gbl>
Date: Tue, 26 Jul 2011 23:13:59 +0200
From: Frank Dittrich <frank_dittrich@...mail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Character encoding 'how-to' and patch 0009

On 25.07.2011 23:50, jfoug wrote:
>> From: Frank Dittrich [mailto:frank_dittrich@...mail.com]
>> Another example: ß (U+00DF, latin small letter sharp s, aka German
>> Eszett) is a lower case character which doesn't have an upper case
>> version.
>
> Oh, but if you dig into the Unicode documents, that one does have an
> upcase. It is also one which is not cyclical. You can compute
> uc(U+00DF), but lc(uc(U+00DF)) does not take you back to U+00DF. The
> other 'strange' issue with uc(U+00DF) is that you go from 1 character
> to 2: uc(U+00DF) == SS (2 capital 'S' characters). It is defined that
> way on Unicode.org, and john's Unicode conversion handles this one
> just fine.

The problem is, the upper case version of ß is not part of the German
alphabet, because ß never appears at the beginning of a word, and
uppercase letters are only necessary for nouns and at the beginning of
sentences.
Converting ß to SS for SAP codvn B is definitely wrong, since this hash
algorithm treats all non-ASCII characters like the character '^'.
If you converted it to "SS", you'd definitely get the wrong hash.
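
To illustrate the difference, here is a minimal Python sketch. It only
models the one property stated above (every non-ASCII character is
treated like '^'); the function name caret_non_ascii is made up for
this example, and this is not the actual SAP codvn B preprocessing
from john.

```python
def caret_non_ascii(password: str) -> str:
    """Hypothetical helper: replace every non-ASCII character with '^'."""
    return "".join(ch if ord(ch) < 128 else "^" for ch in password)


plain = "Straße"

# Full Unicode upper casing expands ß to "SS", which changes the length.
full_upper = plain.upper()

# The substitution described above keeps one character per input character.
caret_form = caret_non_ascii(plain)

print(full_upper, len(full_upper))
print(caret_form, len(caret_form))
```

The two results differ in both content and length, so feeding the
"SS" form into a hash that expects the '^' form cannot produce the
same hash.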
I am not sure about LM, and have no time right now to test it (that
is: find a Windows system, use a password containing the letters ä, ö,
ü, and ß, extract the resulting hash, and find out which conversion
needs to be done when processing the password).
But I am quite sure that 'ß' will not be converted. Maybe 'ä', 'ö',
'ü' are converted to 'Ä', 'Ö', 'Ü', respectively.

When applying the u rule, converting ß to SS might be the best option,
even if there are cases when you'd prefer converting ß to SZ, to avoid
ambiguity.
More than you ever wanted to know about the German letter ß can be found
here:
http://en.wikipedia.org/wiki/%C3%9F
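
A small Python sketch of the ambiguity mentioned above. Python's
str.upper() maps ß to "SS"; the upper_sz helper is a hypothetical
alternative written only for this example, not an existing john rule.

```python
def upper_ss(word: str) -> str:
    """Upper case with ß -> SS (what Python's str.upper() does)."""
    return word.upper()


def upper_sz(word: str) -> str:
    """Hypothetical variant: map ß to SZ before upper casing."""
    return word.replace("ß", "SZ").upper()


# "Maße" (measures) and "Masse" (mass) are different words.
for word in ("Maße", "Masse"):
    print(word, upper_ss(word), upper_sz(word))
```

With ß -> SS both words end up as "MASSE", while ß -> SZ keeps them
apart ("MASZE" vs. "MASSE").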
>> Even though recently (unicode version 5.1) ẞ (U+1E9E, latin capital
>> letter sharp s) has been added,
>
> I was not aware of this. If you look at this document
> http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt , which lists
> 1 to many character casing, you find this line
>
> 00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
>
> This tells you that for character U+00DF, U+00DF itself is the lower
> case form. To convert to 'title' case, you use Ss (53 73), and to
> convert to upper case you use SS (53 53). However,
> lc(uc(U+00DF)) != U+00DF; instead, lc(uc(U+00DF)) == ss. This was
> what I originally meant in step 1 about 'uppercase only' or
> 'lowercase only' letters. U+00DF is a lowercase only character.
>
> There are also other 'one-way' characters. Here are 2 examples.
>
> 0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN
> CAPITAL LETTER I DOT;;;0069;
> 0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
>
> In this case, we have U+0130. It is Latin capital I with dot. The
> lower case is 'i', U+0069.
> The other is U+0131, small letter dotless i (I am pretty sure the
> name should be read as a dotless 'i'). The upper case in this case is
> 'I', or U+0049.
>
> So for code pages that contain U+0130 or U+0131, we will have to take
> special care to make sure things function as well as we possibly can
> get them. I believe this is the correct behavior in these cases:
> uc(U+0131) == 'I', lc(uc(U+0131)) == 'i' (not U+0131), and finally
> uc(lc(uc(U+0131))) == 'I'. As long as we can model that behavior
> properly, john's behavior should be correct.
>
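
The one-way chain for U+0131 described above can be checked with
Python's built-in, locale-independent case conversion. This is only
an illustration using Python's str methods, not john's conversion
code.

```python
dotless_i = "\u0131"          # LATIN SMALL LETTER DOTLESS I

step1 = dotless_i.upper()     # uc(U+0131)
step2 = step1.lower()         # lc(uc(U+0131))
step3 = step2.upper()         # uc(lc(uc(U+0131)))

for label, value in (("uc", step1), ("lc(uc)", step2), ("uc(lc(uc))", step3)):
    print(label, value, [f"U+{ord(ch):04X}" for ch in value])

# After the first upper casing the dotless i is gone for good.
round_trips = (step2 == dotless_i)
```
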
> note the 'simple' casing in Unicode is documented here:
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>
> Here are lines showing how to lowcase a couple letters:
> 0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
> 0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
>
> Here is how to upcase a few
> 0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
> 0062;LATIN SMALL LETTER B;Ll;0;L;;;;;N;;;0042;;0042
>
> NOTE that both of these documents were mechanically processed into
> the UnicodeData.h file that is now part of john. That data file
> extracts only the casing information for all of Unicode (minus some
> locale-specific multi-char conversions) and warehouses the data. The
> UnicodeInit() function places this data into 'usable' conversion
> arrays of Unicode character data.
>
>
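
For illustration, a short Python sketch of how the lines quoted above
can be parsed into casing tables. The function and table names are
invented for this example; this is not how UnicodeData.h was generated
and not what UnicodeInit() does internally.

```python
def parse_simple(line: str):
    """Parse one UnicodeData.txt line -> (code point, simple upper, simple lower)."""
    fields = line.split(";")
    cp = int(fields[0], 16)
    upper = int(fields[12], 16) if fields[12] else cp   # field 12: simple uppercase
    lower = int(fields[13], 16) if fields[13] else cp   # field 13: simple lowercase
    return cp, upper, lower


def parse_special(line: str):
    """Parse one SpecialCasing.txt line -> (code point, lower, title, upper)."""
    data = line.split("#", 1)[0]                        # drop the trailing comment
    fields = [f.strip() for f in data.split(";")]

    def to_str(hex_seq: str) -> str:
        return "".join(chr(int(h, 16)) for h in hex_seq.split())

    return int(fields[0], 16), to_str(fields[1]), to_str(fields[2]), to_str(fields[3])


unicode_data = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049",
]
special_casing = ["00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S"]

to_upper, to_lower = {}, {}
for entry in unicode_data:
    cp, up, lo = parse_simple(entry)
    to_upper[cp], to_lower[cp] = up, lo

multi_upper = {}
for entry in special_casing:
    cp, lo, title, up = parse_special(entry)
    multi_upper[cp] = up

print(hex(to_lower[0x41]), hex(to_upper[0x131]), multi_upper[0xDF])
```
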
>> hardly any user knows that this letter
>> exists, let alone how to enter such a character.
>> As far as I know, this character is meant either for small caps fonts,
>> or for writing EVERYTHING IN UPPER CASE...
>> (With a German keybord layout, you cannot enter this character by
>> pressing <shift>-<ß>.)
>
> I believe this one is actually used for the Nazi SS symbol (low case)
>

I think you got that wrong.
'ß' is a normal German letter (used in Germany and Austria, but not in
Switzerland or Liechtenstein).
On a German keyboard, 'ß' is a normal key, just to the right of '0'.


Frank