john-dev - RE: Character encoding 'how-to' and patch 0009

Message-ID: <01e001cc4b14$d4995730$7dcc0590$@net>
Date: Mon, 25 Jul 2011 16:50:11 -0500
From: "jfoug" <jfoug@....net>
To: <john-dev@...ts.openwall.com>
Subject: RE: Character encoding 'how-to' and patch 0009

>From: Frank Dittrich [mailto:frank_dittrich@...mail.com]
>
>Am 25.07.2011 16:26, schrieb JimF:
>> If simple '8-bit' fixed size character encoding (wide char encodings
>> are not listed in this howto).
>>
>> 1. Build arrays of to-upper and to-lower values in rules.c. These
>> arrays have to be the upper and matching lower case values, listed in
>> the same order. If there are upper case only, or lower case only
>> letters, then build a separate array for them.
>
>I assume you mean characters which don't have a corresponding upper or
>lower case character within the code page in question.

Correct.

>E.g., Ÿ (Unicode code point U+0178) is the upper case character for ÿ
>(Unicode code point U+00FF), but only ÿ (latin small letter y with
>diaeresis) is part of iso-latin1.
>For me, it is not clear whether or not ÿ should be converted to Ÿ when
>applying rule u.

I agree; I do not (yet) know how to handle U+00FF <--> U+0178.  I was not aware which code page contained this character, but you have pointed to iso-latin1.  It will likely have to be handled like ISO-8859-1's 0xDF character (which is U+00DF, explained below).

>Another example: ß (U+00DF, latin small letter sharp s, aka German
>Eszett) is a lower case character which doesn't have an upper case
>version.

Oh, but if you dig into the Unicode documents, that one does have an upcase.  It is also one which does not round-trip: you can compute uc(U+00DF), but lc(uc(U+00DF)) does not give back U+00DF.  The other 'strange' issue with uc(U+00DF) is that you go from 1 character to 2: uc(U+00DF) == SS (2 capital 'S' characters).  It is defined that way on Unicode.org, and john's Unicode conversion handles this one just fine.
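
As a quick cross-check outside of john, here is a small Python sketch. Python's built-in str.upper() and str.lower() apply the full Unicode case mappings, so they show the same one-way behavior. This only illustrates the Unicode rule; it is not how john's C code does it.

```python
# Python's str.upper()/str.lower() use the full Unicode case mappings,
# including the one-to-many entries from SpecialCasing.txt.
sharp_s = "\u00df"              # LATIN SMALL LETTER SHARP S

upper = sharp_s.upper()         # one character becomes two
round_trip = upper.lower()      # lower-casing does not restore U+00DF

print(repr(upper), len(upper))
print(repr(round_trip), round_trip == sharp_s)
```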

As for code pages, the ones I have worked on do not have any cases where there is an upper case character without a matching lower case character (or vice versa), with the exception of 0xDF in ISO-8859-1.

This casing of ISO-8859-1 0xDF to SS is NOT handled within the rules.  It is handled within the formats which do upcase, and would use this character set, i.e. the code in Unicode.c handles this letter's upcasing properly.

>Even though recently (unicode version 5.1) ẞ (U+1E9E, latin capital
>letter sharp s) has been added, 

I was not aware of this.  If you look at http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt , which lists the one-to-many character casings, you find this line:

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S

This tells you that for character U+00DF, the lower case is U+00DF itself.  To convert to 'title' case, you use Ss (53 73), and to convert to upper case you use SS (53 53).  However, lc(uc(U+00DF)) != U+00DF; instead, lc(uc(U+00DF)) == ss.  This is what I originally meant in step 1 by 'uppercase only' or 'lowercase only' letters.  U+00DF is a lowercase only character.
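
To make the field layout concrete, here is a minimal Python sketch that parses that one SpecialCasing.txt line. The function name parse_special_casing is made up for this example, and it only handles unconditional entries of the form code; lower; title; upper; # comment. It is not the script that was used to generate john's data.

```python
# Parse one unconditional SpecialCasing.txt entry:
#   <code>; <lower>; <title>; <upper>; # <comment>
LINE = "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S"

def parse_special_casing(line):
    """Return (char, lower, title, upper) as Python strings."""
    data = line.split("#", 1)[0]                    # drop the trailing comment
    fields = [f.strip() for f in data.split(";")]

    def to_str(field):
        # Each field is a space-separated list of hex code points.
        return "".join(chr(int(cp, 16)) for cp in field.split())

    code, lower, title, upper = fields[:4]
    return to_str(code), to_str(lower), to_str(title), to_str(upper)

char, lower, title, upper = parse_special_casing(LINE)
print(repr(char), repr(lower), repr(title), repr(upper))
```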

There are also other 'one-way' characters.  Here are 2 examples.

0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049

In this case, we have U+0130.  It is Latin capital I with dot above.  The lower case is 'i', U+0069.
The other is U+0131, Latin small letter dotless i.  The upper case in this case is 'I', or U+0049.

So for these types of code pages (ones that contain U+0130 or U+0131), we will have to take special care to make sure things function as well as we can possibly get them.  I believe this is the correct behavior in these cases: uc(U+0131) == 'I', lc(uc(U+0131)) == 'i' (not U+0131), and finally uc(lc(uc(U+0131))) == 'I'.  As long as we can model that behavior properly, john's behavior should be correct.
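
A minimal Python sketch of that one-way behavior, using small hand-built tables that contain only the simple mappings from the two data lines above plus the ordinary ASCII I/i pair. The names TO_UPPER, TO_LOWER, uc and lc are invented for this example; john stores this data differently.

```python
# Simple (one-to-one) case mappings for the dotted/dotless i family.
# Only these few entries are modelled; everything else maps to itself.
TO_UPPER = {"i": "I", "\u0131": "I"}    # U+0131 upcases to 'I'
TO_LOWER = {"I": "i", "\u0130": "i"}    # U+0130 lowcases to 'i'

def uc(ch):
    return TO_UPPER.get(ch, ch)

def lc(ch):
    return TO_LOWER.get(ch, ch)

dotless = "\u0131"
step1 = uc(dotless)     # 'I'
step2 = lc(step1)       # 'i', not U+0131
step3 = uc(step2)       # 'I' again

print(step1, step2, step3, step2 == dotless)
```

The same tables show the other one-way character: lc(U+0130) gives 'i', and upcasing that gives 'I', not U+0130.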

Note that the 'simple' casing in Unicode is documented here:  http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

Here are lines showing how to lowcase a couple of letters:
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;

Here is how to upcase a few:
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0062;LATIN SMALL LETTER B;Ll;0;L;;;;;N;;;0042;;0042

NOTE that both of these documents were mechanically processed into the UnicodeData.h file that is now part of john.  That data file extracts only the casing information for all of Unicode (minus some localization-specific multi-char conversions) and warehouses the data.  The UnicodeInit() function places this data into 'usable' conversion arrays of Unicode character data.
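
For anyone who wants to see what such mechanical processing could look like, here is a rough Python sketch that reads the four sample UnicodeData.txt lines quoted above and builds simple upper/lower maps. In UnicodeData.txt the simple uppercase mapping is field 12 and the simple lowercase mapping is field 13 (counting from 0). The function build_case_maps is hypothetical and is not the tool that produced UnicodeData.h.

```python
# The four sample lines quoted above from UnicodeData.txt.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0062;LATIN SMALL LETTER B;Ll;0;L;;;;;N;;;0042;;0042
"""

UPPER_FIELD, LOWER_FIELD = 12, 13   # simple uppercase / lowercase mapping

def build_case_maps(text):
    """Return (to_upper, to_lower) dicts keyed by code point."""
    to_upper, to_lower = {}, {}
    for line in text.splitlines():
        fields = line.split(";")
        code = int(fields[0], 16)
        if fields[UPPER_FIELD]:
            to_upper[code] = int(fields[UPPER_FIELD], 16)
        if fields[LOWER_FIELD]:
            to_lower[code] = int(fields[LOWER_FIELD], 16)
    return to_upper, to_lower

to_upper, to_lower = build_case_maps(SAMPLE)
print(to_upper, to_lower)
```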


>hardly any user knows that this letter
>exists, let alone how to enter such a character.
>As far as I know, this character is meant either for small caps fonts,
>or for writing EVERYTHING IN UPPER CASE...
>(With a German keyboard layout, you cannot enter this character by
>pressing <shift>-<ß>.)

I believe this one is actually used for the Nazi SS symbol (lower case), and it does show up in password cracking databases.

>
>> 5. within unicode.c, add code into utf16toplain() to handle the
>> conversion from utf16 back into the 8 bit character set.
>>
>
>What about Unicode characters which don't have a representation in the
>single-byte code page?

Then the conversion was bogus.  The code 'will' proceed.  The choice made was to truncate the 16-bit value down to 8 bits, just as the original non-UTF-8 code did.

In the way john works, this 'should' not be an issue.  John goes into a single code page mode and stays there, so you are not able to convert data from one code page into Unicode and then later call utf16toplain() to convert it into a different code page.  Performing a conversion where there are characters invalid for the code page is undefined (at least for that char), so how it is handled really does not matter (that is what undefined means).
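
To illustrate what truncation does to an out-of-range character, here is a small Python sketch. It assumes 'truncate' means keeping only the low 8 bits of each 16-bit unit, and it uses Latin-1, where code points U+0000 through U+00FF coincide with the byte values. The helper truncate_to_8bit is hypothetical; the real utf16toplain() may differ in detail.

```python
def truncate_to_8bit(utf16_units):
    """Keep only the low byte of each 16-bit unit (hypothetical helper)."""
    return bytes(unit & 0xFF for unit in utf16_units)

in_range = [ord(c) for c in "caf\u00e9"]    # every unit fits in Latin-1
out_of_range = [0x0178]                     # U+0178 has no Latin-1 byte

print(truncate_to_8bit(in_range))
print(truncate_to_8bit(out_of_range))
```

The in-range text survives unchanged, while U+0178 collapses to the unrelated byte 0x78 ('x'), which is the kind of 'bogus' result described above.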

Also, I have been able to do the upcasing and lowcasing of 'code page' data without having to switch into and then back out of Unicode.  I have code that does this inline, just like was done with ISO-8859-1 (well, 'similar').  Actually, at this time, the upper casing of ISO-8859-1 is the most complex (due to handling of U+00DF).
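
As a rough idea of what inline casing with parallel upper/lower arrays looks like (step 1 of the howto), here is a Python sketch for CP1251. It assumes only ASCII and the basic Cyrillic block, where the capitals sit at 0xC0-0xDF and the matching small letters at 0xE0-0xFF; other cased CP1251 letters are left out for brevity. The table names are invented and this is not the code in rules.c.

```python
# Parallel upper/lower byte lists for CP1251, listed in the same order.
UPPER = bytes(range(0x41, 0x5B)) + bytes(range(0xC0, 0xE0))
LOWER = bytes(range(0x61, 0x7B)) + bytes(range(0xE0, 0x100))

# 256-entry lookup tables; bytes not listed above map to themselves.
TO_LOWER = bytes.maketrans(UPPER, LOWER)
TO_UPPER = bytes.maketrans(LOWER, UPPER)

word = "Привет".encode("cp1251")
lowered = word.translate(TO_LOWER)      # case conversion stays in 8-bit data
uppered = word.translate(TO_UPPER)

print(lowered.decode("cp1251"), uppered.decode("cp1251"))
```

No round trip through Unicode is needed for the conversion itself; the decode calls are only there to print the result.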

>(May be I would find out by just reading the source code...)

I will have a code patch done shortly.  However, before I do that, I think I will update the test suite to handle koi8r and cp1251, so that I am 'sure' I have them handled properly.  I believe I do, but I want to make SURE prior to releasing code.


Also, I will update the list I started in this email chain, since I now have a better idea of exactly what steps need to be done to fully support a new 8-bit (256-character) code page.

Jim.