Message-ID: <BLU0-SMTP40E2CA29B2B94E00F14951FD320@phx.gbl>
Date: Tue, 26 Jul 2011 23:13:59 +0200
From: Frank Dittrich <frank_dittrich@...mail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Character encoding 'how-to' and patch 0009

On 25.07.2011 23:50, jfoug wrote:
>> From: Frank Dittrich [mailto:frank_dittrich@...mail.com]
>> Another example: ß (U+00DF, latin small letter sharp s, aka German
>> Eszett) is a lower case character which doesn't have an upper case
>> version.
>
> Oh, but if you dig into the Unicode documents, that one does have an
> upcase. This is also one which is not cyclical. You can compute
> uc(U+00DF), but lc(uc(U+00DF)) does not get you back to U+00DF. The
> other 'strange' issue with uc(U+00DF) is that you go from 1 character
> to 2: uc(U+00DF) == SS (2 capital 'S' characters). It is defined that
> way on Unicode.org, and john's Unicode conversion handles this one
> just fine.

The problem is that the upper case version of ß is not part of the German
alphabet: ß never appears at the beginning of a word, and upper case
letters are only needed for nouns and at the beginning of sentences.
Converting ß to SS for SAP codvn B is definitely wrong, since this hash
algorithm treats all non-ASCII characters like the character '^'.
If you converted it to "SS", you would definitely get the wrong hash.
I am not sure about LM, and I have no time right now to test it (that is:
find a Windows system, use a password containing the letters ä, ö, ü, and
ß, extract the resulting hash, and find out which conversion needs to be
done when processing the password).
But I am quite sure that 'ß' will not be converted. Maybe 'ä', 'ö', and
'ü' are converted to 'Ä', 'Ö', and 'Ü', respectively.
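
To illustrate why the two conversions cannot be interchanged, here is a
minimal Python sketch. The function names are made up for this example,
and codvn_b_style_upper() only models the one property described above
(every non-ASCII character is treated like '^'); it is not the actual
SAP codvn B preprocessing and not john's code.

```python
def unicode_upper(password):
    # Full Unicode upper-casing as implemented by Python: 'ß' becomes 'SS'.
    return password.upper()


def codvn_b_style_upper(password):
    # Assumption for illustration only: ASCII characters are upper-cased,
    # every non-ASCII character is replaced by '^'.
    return "".join(ch.upper() if ord(ch) < 128 else "^" for ch in password)


if __name__ == "__main__":
    word = "straße"
    print(unicode_upper(word))        # STRASSE
    print(codvn_b_style_upper(word))  # STRA^E
```

The two results differ in both content and length, so feeding the "SS"
form into a hash that expects the '^' form cannot produce a match.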

When applying the u rule, converting ß to SS might be the best option,
even if there are cases when you'd prefer converting ß to SZ, to avoid
ambiguity.
More than you ever wanted to know about the German letter ß can be found
here:
http://en.wikipedia.org/wiki/%C3%9F
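
A rough Python sketch of the SS versus SZ choice for an upper-casing
rule follows. upper_candidates() is a hypothetical helper, not john's
u rule; it simply emits both spellings whenever the word contains ß.

```python
def upper_candidates(word):
    # Hypothetical helper: return the upper-case forms worth trying.
    # Words without 'ß' have a single candidate.
    if "ß" not in word:
        return [word.upper()]
    # 'SS' is the Unicode default; 'SZ' avoids the ambiguity with words
    # that already contain a real "ss".
    return [word.replace("ß", "SS").upper(),
            word.replace("ß", "SZ").upper()]


if __name__ == "__main__":
    print(upper_candidates("maße"))   # ['MASSE', 'MASZE']
    print(upper_candidates("masse"))  # ['MASSE']
```

The ambiguity shows up here: with SS, "maße" and "masse" collapse to the
same upper-case string, while the SZ form keeps them apart.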
>> Even though recently (unicode version 5.1) ẞ (U+1E9E, latin capital
>> letter sharp s) has been added,
>
> I was not aware of this. If you look at this document
> http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt , which lists
> 1 to many character casing, you find this line
>
> 00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
>
> This tells you that for the character U+00DF, U+00DF is lower case. To
> convert to 'title' case, you use Ss (53 73) and to convert to upper
> case you use SS (53 53). However, lc(uc(U+00DF)) != U+00DF;
> lc(uc(U+00DF)) == ss. This was what I originally meant in step 1 about
> 'uppercase only', or 'lowercase only' letters. U+00DF is a lowercase
> only character.
>
> There are also other 'one-way' characters. Here are 2 examples.
>
> 0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN
> CAPITAL LETTER I DOT;;;0069;
> 0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
>
> In this case, we have U+0130. It is Latin Cap I with dot. The lower
> case is 'i' U+0069.
> The other is U+0131, small letter dotless (I am pretty sure it should
> list this as a dotless i). The upper case in this case is 'I' or U+0049.
>
> So for these types of code pages (ones that contain U+0130 or U+0131),
> we will have to take special care to make sure things function as well
> as we possibly can get them. I believe this is the correct behavior in
> these cases: uc(U+0131) = 'I', lc(uc(U+0131)) = 'i' (not U+0131), and
> finally uc(lc(uc(U+0131))) == 'I'. As long as we can model that
> behavior properly, then john's behavior should be correct.
>
> note the 'simple' casing in Unicode is documented here:
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>
> Here are lines showing how to lowcase a couple letters:
> 0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
> 0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
>
> Here is how to upcase a few
> 0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
> 0062;LATIN SMALL LETTER B;Ll;0;L;;;;;N;;;0042;;0042
>
> NOTE that both of these documents were mechanically processed, into
> the UnicodeData.h file that is now part of john. That data file strips
> out only the casing information for all of Unicode (minus some
> localization specific multi-char conversions), and warehouses the
> data. The UnicodeInit() function places this data into 'usable'
> conversion arrays of Unicode character data.
>
>
>> hardly any user knows that this letter
>> exists, let alone how to enter such a character.
>> As far as I know, this character is meant either for small caps fonts,
>> or for writing EVERYTHING IN UPPER CASE...
>> (With a German keybord layout, you cannot enter this character by
>> pressing <shift>-<ß>.)
>
> I believe this one is actually used for the Nazi SS symbol (low case)
>

I think you got that wrong.
'ß' is a normal German letter (used in Germany and Austria, but not in
Switzerland or Liechtenstein).
On a German keyboard, 'ß' is a normal key, just to the right of '0'.


Frank
