Message-ID: <BLU0-SMTP40E2CA29B2B94E00F14951FD320@phx.gbl>
Date: Tue, 26 Jul 2011 23:13:59 +0200
From: Frank Dittrich <frank_dittrich@...mail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Character encoding 'how-to' and patch 0009

On 25.07.2011 23:50, jfoug wrote:
>> From: Frank Dittrich [mailto:frank_dittrich@...mail.com]
>> Another example: ß (U+00DF, latin small letter sharp s, aka German
>> Eszett), is a lower case character, which doesn't have an upper case
>> version.
>
> Oh, but if you dig into Unicode document, that one does have an
> upcase. This is also one which is not cyclical. You can go from
> uc(U+00DF), but you cannot do lc(uc(U+00DF)). The other 'strange'
> issue with uc(U+00DF), is that you go from 1 character to 2.
> uc(U+00DF) == SS (2 capital 'S' characters). It is defined that way on
> Unicode.org, and john's Unicode conversion handles this one just fine.

The problem is that an upper case version of ß is not part of the German
alphabet: ß never appears at the beginning of a word, and upper case
letters are only needed for nouns and at the beginning of sentences.

Converting ß to SS for SAP codvn B is definitely wrong, since this hash
algorithm treats all non-ASCII characters like the character '^'.
If you converted it to "SS", you'd definitely get the wrong hash.

I am not sure about LM, and have no time right now to test it (that is:
find a Windows system, use a password containing the letters ä, ö, ü,
and ß, extract the resulting hash, and find out which conversion needs
to be done when processing the password).
But I am quite sure that 'ß' will not be converted.
Maybe 'ä', 'ö', 'ü' are converted to 'Ä', 'Ö', 'Ü', respectively.

When applying the u rule, converting ß to SS might be the best option,
even if there are cases when you'd prefer converting ß to SZ, to avoid
ambiguity.

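To make the difference concrete, here is a minimal Python sketch (not
john's code). It contrasts full Unicode upper-casing, where ß expands to
SS, with the '^' substitution described above for SAP codvn B. The helper
names naive_upper and caret_fold are made up for illustration, and the
final ASCII upper-casing step in caret_fold is an assumption, not
something stated in this thread.

```python
def naive_upper(password: str) -> str:
    """Full Unicode upper-casing; 'ß' expands to the two letters 'SS'."""
    return password.upper()


def caret_fold(password: str) -> str:
    """Hypothetical helper: replace every non-ASCII character with '^',
    as described in this message for SAP codvn B, then upper-case the
    remaining ASCII letters (the upper-casing step is an assumption)."""
    folded = "".join(c if ord(c) < 128 else "^" for c in password)
    return folded.upper()


word = "straße"
print(naive_upper(word))          # STRASSE (one character became two)
print(naive_upper(word).lower())  # strasse (the ß is gone: no round trip)
print(caret_fold(word))           # STRA^E  (a different candidate password)
```

The two helpers produce different candidate strings for the same input,
which is why applying the ß -> SS conversion before such a hash would
give the wrong result.
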
More than you ever wanted to know about the German letter ß can be found
here:
http://en.wikipedia.org/wiki/%C3%9F

>> Even though recently (unicode version 5.1) ẞ (U+1E9E, latin capital
>> letter sharp s) has been added,
>
> I was not aware of this. If you look at this document
> http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt , which lists
> 1 to many character casing, you find this line
>
> 00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
>
> This tells you that for the character U+00DF, U+00DF is lower case. To
> convert to 'title' case, you use Ss (53 73) and to convert to upper
> case you use SS (53 53). However, lc(uc(U+00DF)) != U+00DF;
> lc(uc(U+00DF)) == ss. This was what I originally meant in step 1 about
> 'uppercase only', or 'lowercase only' letters. U+00DF is a lowercase
> only character.
>
> There are also other 'one-way' characters. Here are 2 examples.
>
> 0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
> 0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
>
> In this case, we have U+0130. It is Latin Cap I with dot. The lower
> case is 'i' U+0069.
> The other is U+0131. Small letter dotless (I am pretty sure it should
> list this as a dotless i). The upper case in this case is 'I' or U+0049
>
> So for these types of code pages (ones that contain U+0130 or U+0131),
> we will have to take special care, to make sure things function as
> well as we possibly can get them. I believe this is the correct
> behavior in these cases: uc(U+0131) = 'I', lc(uc(U+0131)) = 'i' (not
> U+0131), and finally uc(lc(uc(U+0131))) == 'I'. As long as we can
> model that behavior properly, then john's behavior should be correct.
>
> note the 'simple' casing in Unicode is documented here:
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>
> Here are lines showing how to lowcase a couple letters:
> 0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
> 0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
>
> Here is how to upcase a few
> 0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
> 0062;LATIN SMALL LETTER B;Ll;0;L;;;;;N;;;0042;;0042
>
> NOTE that both of these documents were mechanically processed, into
> the UnicodeData.h file that is now part of john. That data file strips
> out only the casing information for all of Unicode (minus some
> localization specific multi-char conversions), and warehouses the
> data. The UnicodeInit() function places this data into 'usable'
> conversion arrays of Unicode character data.
>
>
>> hardly any user knows that this letter
>> exists, let alone how to enter such a character.
>> As far as I know, this character is meant either for small caps fonts,
>> or for writing EVERYTHING IN UPPER CASE...
>> (With a German keyboard layout, you cannot enter this character by
>> pressing <shift>-<ß>.)
>
> I believe this one is actually used for the Nazi SS symbol (low case)
>

I think you got that wrong. 'ß' is a normal German letter (used in
Germany and Austria, but not in Switzerland and Liechtenstein).
On a German keyboard, 'ß' is a normal key, just right of '0'.

Frank
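
For reference, the simple-casing records quoted above can be turned into
lookup tables roughly as follows. This is only a Python sketch under
stated assumptions, not the generator behind john's UnicodeData.h. The
names build_simple_case_tables, uc and lc are made up for illustration,
and the records for plain 'I' and 'i' are taken from UnicodeData.txt
rather than from this thread. It covers one-to-one mappings only, so the
one-to-many entries from SpecialCasing.txt (such as ß -> SS) are not
handled.

```python
# Records in UnicodeData.txt format: 15 semicolon-separated fields.
# Field 12 is the simple upper case mapping, field 13 the simple lower
# case mapping (both hexadecimal code points, possibly empty).
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
"""


def build_simple_case_tables(text):
    """Return (upper, lower) dicts mapping code point -> code point."""
    upper, lower = {}, {}
    for line in text.splitlines():
        fields = line.split(";")
        if len(fields) < 15:
            continue  # skip blank or malformed lines
        cp = int(fields[0], 16)
        if fields[12]:
            upper[cp] = int(fields[12], 16)
        if fields[13]:
            lower[cp] = int(fields[13], 16)
    return upper, lower


UPPER, LOWER = build_simple_case_tables(SAMPLE)


def uc(cp):
    """Simple upper-casing; code points without a mapping are unchanged."""
    return UPPER.get(cp, cp)


def lc(cp):
    """Simple lower-casing; code points without a mapping are unchanged."""
    return LOWER.get(cp, cp)


# The 'one-way' behaviour described for U+0131 (dotless i):
print(hex(uc(0x0131)))          # 0x49, 'I'
print(hex(lc(uc(0x0131))))      # 0x69, 'i' (not U+0131)
print(hex(uc(lc(uc(0x0131)))))  # 0x49, 'I' again
```
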