Message-ID: <5129919.jY9Djz4Zq0@nimes>
Date: Tue, 18 Apr 2023 17:22:20 +0200
From: Bruno Haible <bruno@...sp.org>
To: musl@...ts.openwall.com
Subject: wmemcmp and wcscmp returns incorrect results for some inputs, on most architectures

Hi,

---- Test program ----

==================================== foo.c ====================================
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int
main ()
{
  printf (" wchar_t is %s.\n", (wchar_t)-1 < 0 ? "signed" : "unsigned");

  wchar_t a[2] = { (wchar_t) 0x76543210, 0 };
  wchar_t b[2] = { (wchar_t) 0x9abcdef1, 0 };
  int cmp1 = wmemcmp (a, b, 1);
  int cmp2 = wcscmp (a, b);
  cmp1 = (cmp1 > 0 ? 1 : cmp1 < 0 ? -1 : 0);
  cmp2 = (cmp2 > 0 ? 1 : cmp2 < 0 ? -1 : 0);
  printf (" wmemcmp (a, b, 1) = %d\n", cmp1);
  printf (" wcscmp (a, b) = %d\n", cmp2);

  return 0;
}
===============================================================================

$ gcc -Wall foo.c
$ ./a.out

This program has two possible correct results (for why, see below):

  wchar_t is unsigned.
  wmemcmp (a, b, 1) = -1
  wcscmp (a, b) = -1

and

  wchar_t is signed.
  wmemcmp (a, b, 1) = 1
  wcscmp (a, b) = 1

---- Results on musl libc ----

On arm64, this program prints:

  wchar_t is unsigned.
  wmemcmp (a, b, 1) = -1
  wcscmp (a, b) = -1

Which is correct.

On x86_64, i686, s390x, powerpc64le, it prints:

  wchar_t is signed.
  wmemcmp (a, b, 1) = -1
  wcscmp (a, b) = -1

Which is incorrect.

Version: On x86_64 I tested musl libc 1.2.3 (in Alpine Linux); for the
other architectures some older versions of musl libc.

---- About wmemcmp ----

ISO C 17 describes wmemcmp (§ 7.29.4.4.5) like this:
  "The wmemcmp function compares the first n wide characters of the object
   pointed to by s1 to the first n wide characters of the object pointed to
   by s2."

So, it has to compare "wide characters". § 3.7.3 defines a "wide character" as
  "value representable by an object of type wchar_t, capable of representing
   any character in the current locale".
The second part of this sentence is merely an explanation of what wchar_t is,
a wording similar to the one in § 7.19 paragraph 2. So, it is *not* a
requirement that the value actually represents a character in the current
locale. Any wchar_t value is a "wide character".

(Note that this definition of wide character is broader than the one in
POSIX:2018:
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html
section 3.443 defines it as
  "An integer value corresponding to a single graphic symbol or control code".
But, in an apparent attempt to align with ISO C, the description of wmemcmp
in POSIX:2018
https://pubs.opengroup.org/onlinepubs/9699919799/functions/wmemcmp.html
has this wording:
  "This function shall not be affected by locale and all wchar_t values
   shall be treated identically. The null wide character and wchar_t values
   not corresponding to valid characters shall not be treated specially."
)

So, wmemcmp has to compare the array elements by comparing wchar_t values.
I.e. if wchar_t is unsigned, by an unsigned comparison; if wchar_t is signed,
by a signed comparison.

---- About wcscmp ----

Similarly, ISO C 17 describes wcscmp (§ 7.29.4.4.1) as
  "The wcscmp function compares the wide string pointed to by s1 to the wide
   string pointed to by s2."

The term "wide string" is defined in § 7.1.1 paragraph 4:
  "A wide string is a contiguous sequence of wide characters terminated by
   and including the first null wide character."

Regarding the term "wide character", see above.

So, wcscmp as well has to compare the array elements by comparing wchar_t
values. I.e. if wchar_t is unsigned, by an unsigned comparison; if wchar_t is
signed, by a signed comparison.

Bruno
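
---- Illustration ----

The comparison rule argued for above could be sketched roughly as follows.
This is only a minimal sketch, not musl's code and not a proposed patch: the
names ref_wmemcmp, ref_wcscmp and ref_selfcheck are hypothetical and chosen
for illustration. The sketch assumes nothing about the signedness of wchar_t;
it lets the compiler pick the signed or unsigned comparison by comparing the
elements directly as wchar_t values.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical reference implementations (illustrative names, not musl's).
   Elements are compared as wchar_t values, so the comparison is signed
   where wchar_t is signed and unsigned where wchar_t is unsigned.
   No subtraction is used, so no overflow can affect the sign of the
   result.  */

int
ref_wmemcmp (const wchar_t *s1, const wchar_t *s2, size_t n)
{
  for (; n > 0; s1++, s2++, n--)
    if (*s1 != *s2)
      return *s1 < *s2 ? -1 : 1;
  return 0;
}

int
ref_wcscmp (const wchar_t *s1, const wchar_t *s2)
{
  /* Stop at the first difference or at the terminating null wide
     character.  */
  for (; *s1 == *s2 && *s1 != 0; s1++, s2++)
    ;
  return *s1 < *s2 ? -1 : *s1 > *s2 ? 1 : 0;
}

/* Self-check using the inputs from the test program in this report.
   The expected sign depends only on whether wchar_t is signed.  */
void
ref_selfcheck (void)
{
  const wchar_t a[2] = { (wchar_t) 0x76543210, 0 };
  const wchar_t b[2] = { (wchar_t) 0x9abcdef1, 0 };
  int expected = (wchar_t)-1 < 0 ? 1 : -1;

  assert (ref_wmemcmp (a, b, 1) == expected);
  assert (ref_wcscmp (a, b) == expected);
}
```

Calling ref_selfcheck () from a main function should pass on any platform,
whichever of the two correct results listed in the report applies there.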