Message-ID: <BLU0-SMTP208F90679775B80CB02C612FD330@phx.gbl>
Date: Mon, 25 Jul 2011 22:39:11 +0200
From: Frank Dittrich <frank_dittrich@...mail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Character encoding 'how-to' and patch 0009

On 25.07.2011 16:26, JimF wrote:
> If simple '8-bit' fixed size character encoding (wide char encodings
> are not listed in this howto).
>
> 1. Build arrays of to-upper and to-lower values in rules.c. These
> arrays have to be the upper and matching lower case values, listed in
> the same order. If there are upper case only, or lower case only
> letters, then build a separate array for them.

I assume you mean characters which don't have a corresponding upper or
lower case character within the code page in question.

E.g., Ÿ (Unicode code point U+0178) is the upper case character for
ÿ (Unicode code point U+00FF), but only ÿ (latin small letter y with
diaeresis) is part of iso-latin1.
It is not clear to me whether or not ÿ should be converted to Ÿ when
applying rule u.

Another example: ß (U+00DF, latin small letter sharp s, aka German
Eszett) is a lower case character which doesn't have an upper case
version.
Even though ẞ (U+1E9E, latin capital letter sharp s) was added recently
(Unicode version 5.1), hardly any user knows that this letter exists,
let alone how to enter such a character.
As far as I know, this character is meant either for small caps fonts,
or for writing EVERYTHING IN UPPER CASE...
(With a German keyboard layout, you cannot enter this character by
pressing <shift>-<ß>.)

> 5. within unicode.c, add code into utf16toplain() to handle the
> conversion from utf16 back into the 8 bit character set.
>

What about Unicode characters which don't have a representation in the
single-byte code page?
(Maybe I would find out by just reading the source code...)

Frank
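
To make the two open questions concrete, here is a minimal sketch in C. It is not the code in rules.c or unicode.c: the table names, cp_toupper(), cp_is_lower_only() and utf16_to_latin1() are hypothetical, and the policies shown (leave unpaired letters unchanged, replace unrepresentable characters with '?') are assumptions, not what John the Ripper actually does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, shortened ISO-8859-1 tables: letters that have both
 * cases inside the code page, listed in the same order. */
const unsigned char cp_upper[] = { 0xC4, 0xD6, 0xDC, 0 }; /* Ä Ö Ü */
const unsigned char cp_lower[] = { 0xE4, 0xF6, 0xFC, 0 }; /* ä ö ü */

/* Lower case letters with no upper case partner in ISO-8859-1. */
const unsigned char cp_lower_only[] = { 0xDF, 0xFF, 0 }; /* ß ÿ */

/* Return 1 if c is a lower case letter without an in-code-page partner. */
int cp_is_lower_only(unsigned char c)
{
    size_t i;

    for (i = 0; cp_lower_only[i] != 0; i++)
        if (cp_lower_only[i] == c)
            return 1;
    return 0;
}

/* Upper-case one byte. ASCII is handled arithmetically, other letters
 * via the paired tables. Assumed policy: a letter without a partner in
 * the code page (ß, ÿ) is returned unchanged. */
unsigned char cp_toupper(unsigned char c)
{
    size_t i;

    if (c >= 'a' && c <= 'z')
        return (unsigned char)(c - 'a' + 'A');
    for (i = 0; cp_lower[i] != 0; i++)
        if (cp_lower[i] == c)
            return cp_upper[i];
    return c;
}

/* Convert a NUL-terminated UTF-16 string to ISO-8859-1. Each code unit
 * is treated independently (no surrogate pair handling). Code units
 * above U+00FF have no single-byte representation; assumed policy is to
 * write '?' and count them. Returns the number of replaced code units. */
size_t utf16_to_latin1(const uint16_t *src, unsigned char *dst,
                       size_t dst_size)
{
    size_t n = 0, lost = 0;

    if (dst_size == 0)
        return 0;
    while (*src != 0 && n + 1 < dst_size) {
        uint16_t u = *src++;

        if (u <= 0xFF) {
            dst[n++] = (unsigned char)u;
        } else {
            dst[n++] = '?';
            lost++;
        }
    }
    dst[n] = 0;
    return lost;
}
```

With these assumptions, cp_toupper(0xFF) leaves ÿ as it is because Ÿ (U+0178) lies outside the code page, and converting the UTF-16 sequence U+0041 U+0178 yields "A?" with one replaced character reported. Other policies, such as dropping the character or rejecting the whole candidate, would be just as easy to write; which one is right is exactly the question raised above.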