Message-ID: <5129919.jY9Djz4Zq0@nimes>
Date: Tue, 18 Apr 2023 17:22:20 +0200
From: Bruno Haible <bruno@...sp.org>
To: musl@...ts.openwall.com
Subject: wmemcmp and wcscmp return incorrect results for some inputs, on most architectures

Hi,

 ---- Test program ----

==================================== foo.c ====================================
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main ()
{
  printf ("  wchar_t is %s.\n", (wchar_t)-1 < 0 ? "signed" : "unsigned");
  wchar_t a[2] = { (wchar_t) 0x76543210, 0 };
  wchar_t b[2] = { (wchar_t) 0x9abcdef1, 0 };
  int cmp1 = wmemcmp (a, b, 1);
  int cmp2 = wcscmp (a, b);
  cmp1 = (cmp1 > 0 ? 1 : cmp1 < 0 ? -1 : 0);
  cmp2 = (cmp2 > 0 ? 1 : cmp2 < 0 ? -1 : 0);
  printf ("  wmemcmp (a, b, 1) = %d\n", cmp1);
  printf ("  wcscmp (a, b) = %d\n", cmp2);
  return 0;
}
===============================================================================
$ gcc -Wall foo.c
$ ./a.out

This program has two possible correct results (for why, see below):

  wchar_t is unsigned.
  wmemcmp (a, b, 1) = -1
  wcscmp (a, b) = -1

and

  wchar_t is signed.
  wmemcmp (a, b, 1) = 1
  wcscmp (a, b) = 1
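
To see why both outcomes are legitimate, here is a minimal sketch that
compares the two test values under each interpretation. It assumes a 32-bit
wchar_t with two's complement representation; the helper names
cmp_as_unsigned and cmp_as_signed are hypothetical and exist only for this
illustration.

```c
#include <stdint.h>

/* Hypothetical helper: three-way comparison of two 32-bit patterns
   read as unsigned values. Returns -1, 0, or 1. */
int cmp_as_unsigned (uint32_t x, uint32_t y)
{
  return (x > y) - (x < y);
}

/* Hypothetical helper: three-way comparison of the same bit patterns
   read as two's complement signed values. Flipping the sign bit maps
   the signed order onto the unsigned order, which avoids any
   implementation-defined conversion to int32_t. */
int cmp_as_signed (uint32_t x, uint32_t y)
{
  uint32_t fx = x ^ UINT32_C (0x80000000);
  uint32_t fy = y ^ UINT32_C (0x80000000);
  return (fx > fy) - (fx < fy);
}
```

With x = 0x76543210 and y = 0x9abcdef1, cmp_as_unsigned yields -1 (y is the
larger unsigned value), while cmp_as_signed yields 1 (y has its sign bit set
and is therefore negative).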

 ---- Results on musl libc ----

On arm64, this program prints:

  wchar_t is unsigned.
  wmemcmp (a, b, 1) = -1
  wcscmp (a, b) = -1

Which is correct.

On x86_64, i686, s390x, powerpc64le, it prints:

  wchar_t is signed.
  wmemcmp (a, b, 1) = -1
  wcscmp (a, b) = -1

Which is incorrect.

Version: On x86_64 I tested musl libc 1.2.3 (in Alpine Linux); on the other
architectures I tested some older versions of musl libc.

 ---- About wmemcmp ----

ISO C 17 describes wmemcmp (§ 7.29.4.4.5) like this:
  "The wmemcmp function compares the first n wide characters of
   the object pointed to by s1 to the first n wide characters of
   the object pointed to by s2."

So, it has to compare "wide characters". § 3.7.3 defines a "wide character"
as "value representable by an object of type wchar_t, capable of
    representing any character in the current locale".
The second part of this sentence is merely an explanation of what wchar_t
is, a wording similar to the one in § 7.19 paragraph 2.
So, it is *not* a requirement that the value actually represents a
character in the current locale. Any wchar_t value is a "wide character".

(Note that this definition of wide character is broader than the one in
POSIX:2018:
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html
section 3.443 defines it as "An integer value corresponding to a
single graphic symbol or control code".
But, in an apparent attempt to align with ISO C, the description of
wmemcmp in POSIX:2018
https://pubs.opengroup.org/onlinepubs/9699919799/functions/wmemcmp.html
has this wording:
  "This function shall not be affected by locale and all wchar_t values
   shall be treated identically. The null wide character and wchar_t
   values not corresponding to valid characters shall not be treated
   specially."
)

So, wmemcmp has to compare the array elements by comparing wchar_t
values. I.e. if wchar_t is unsigned, by an unsigned comparison; if
wchar_t is signed, by a signed comparison.
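
A minimal sketch of a comparison that meets this requirement could look as
follows. The name sketch_wmemcmp is hypothetical, and this is not musl's
code, only an illustration: by comparing the elements with the relational
operators on wchar_t itself, the signedness of wchar_t on the platform
determines the result.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical illustration, not taken from any libc.
   Compares the first n elements using the native ordering of
   wchar_t: a signed comparison where wchar_t is signed, an
   unsigned one where it is unsigned. */
int sketch_wmemcmp (const wchar_t *s1, const wchar_t *s2, size_t n)
{
  for (; n > 0; s1++, s2++, n--)
    if (*s1 != *s2)
      /* Use < rather than a subtraction, whose result might not
         fit in the return type. */
      return *s1 < *s2 ? -1 : 1;
  return 0;
}
```

For the arrays a and b of the test program, sketch_wmemcmp (a, b, 1) is
expected to return 1 where wchar_t is signed and -1 where it is unsigned.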

 ---- About wcscmp ----

Similarly, ISO C 17 describes wcscmp (§ 7.29.4.4.1) as
  "The wcscmp function compares the wide string pointed to by s1
   to the wide string pointed to by s2."

The term "wide string" is defined in § 7.1.1 paragraph 4:
  "A wide string is a contiguous sequence of wide characters
   terminated by and including the first null wide character."

Regarding the term "wide character", see above.

So, wcscmp as well has to compare the array elements by comparing
wchar_t values. I.e. if wchar_t is unsigned, by an unsigned comparison;
if wchar_t is signed, by a signed comparison.
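
The same idea can be sketched for wide strings. Again, the name
sketch_wcscmp is hypothetical and the code is only an illustration of the
requirement, not an existing implementation: it walks both strings up to the
first difference or the terminating null wide character and then compares
the differing elements as wchar_t values.

```c
#include <wchar.h>

/* Hypothetical illustration, not taken from any libc.
   Compares two wide strings element by element, using the native
   ordering of wchar_t for the first pair of differing elements. */
int sketch_wcscmp (const wchar_t *s1, const wchar_t *s2)
{
  for (; *s1 == *s2; s1++, s2++)
    if (*s1 == L'\0')
      return 0;               /* both strings ended: equal */
  /* Use < rather than a subtraction, whose result might not fit
     in the return type. */
  return *s1 < *s2 ? -1 : 1;
}
```

For the arrays a and b of the test program, sketch_wcscmp (a, b) is expected
to return 1 where wchar_t is signed and -1 where it is unsigned.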

Bruno


