musl - Re: Hexagon DSP support

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200917013236.GV3265@brightrain.aerifal.cx>
Date: Wed, 16 Sep 2020 21:32:36 -0400
From: 'Rich Felker' <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Hexagon DSP support

On Wed, Sep 16, 2020 at 03:49:28PM -0500, sidneym@...eaurora.org wrote:
> 
> 
> > -----Original Message-----
> > From: sidneym@...eaurora.org <sidneym@...eaurora.org>
> > Sent: Friday, July 24, 2020 12:50 PM
> > To: 'Szabolcs Nagy' <nsz@...t70.net>
> > Cc: 'Rich Felker' <dalias@...c.org>; 'musl@...ts.openwall.com'
> > <musl@...ts.openwall.com>
> > Subject: RE: [musl] Hexagon DSP support
> > 
> > 
> > 
> > > -----Original Message-----
> > > From: Szabolcs Nagy <nsz@...t70.net>
> > > Sent: Thursday, July 23, 2020 4:56 PM
> > > To: sidneym@...eaurora.org
> > > Cc: 'Rich Felker' <dalias@...c.org>; musl@...ts.openwall.com
> > > Subject: Re: [musl] Hexagon DSP support
> > >
> > > * sidneym@...eaurora.org <sidneym@...eaurora.org> [2020-07-20
> > > 16:26:58 -0500]:
> > > > I removed fma/fmal/fmax/fmin/fabs from compiler-rt-builtins,
> > > > https://reviews.llvm.org/D82263
> > > > The comparison with musl can be found here:
> > > > https://github.com/quic/musl/compare/hexagon but I've also attached
> > > > the patch.
> > > >
> > > > An assert in clang when building both musl and libc-test for hexagon
> > > > was fixed by, https://reviews.llvm.org/D80952 prior to this change
> > > > -frounding-math had to be used.
> > > >
> > > > The test-results are also attached.  Everything is built with the
> > > > tip-of-tree llvm so sometimes results vary but these are the results
> > > > I got from this morning's clone.  The only notable difference in the
> > > > results would be that both fma and fmal fail and this is because of
> > > > the compiler-rt change.  I didn't add fma to musl because it require
> > > > more complex assembly, along the lines you saw in an earlier version
> > > > with
> > > sqrt.
> > >
> > >
> > > the fma and sqrt failures are still not fully explained, e.g. this looks
> wrong:
> > >
> > > src/math/special/fma.h:42: RN fma(0x1p+0,0x1p+0,-0x1p-1074) want
> > > 0x1p+0 got -0x1.fffffp-43 ulperr -4503599627370496.000 = -0x1p+52 +
> > > 0x0p+0
> > >
> > > the only target specific bit in fma is a_clz_64 so i would check that.
> > >
> > > e.g. a_clz_64(1ULL << 42) should give 21 (this computation happens
> > > during the fma test case above).
> > 
> > Hexagon didn't have a_clz_64 implemented however I added this morning it
> > and noticed no differences.  I will update the patch with that routine
> > included.
> > 
> > I did notice a compiler regression in how it compiled fma and have asked a
> > compiler person to take a look.  An older version of our internally
> maintained
> > compiler does produce the expected results for the values I used but later
> > versions do not.  Unfortunately changing optimization levels will produce
> > different results as well.
> 
> I've attached updated test results and patch.  The patch doesn't change much
> other than adding the above mentioned a_clz_64.  The only other change was
> an update to pthread_arch.h for an api update so hexagon conforms with the
> rest of musl.
> 
> Between updates to llvm and musl both fma and sqrt issues are resolved
> provided I compile the library without optimization enabled.  No new tests
> fail.
> 
> I guess I also need to know what the thoughts are about adding hexagon to
> the mainline base.  There are no issues adding from this end.

I'll post some review with the hope that this can move forward
upstream in musl soon. I might need some help figuring out how to get
a cross build environment to check things, but I'll follow up when I
do.

The review that follows is not 100% thorough but I think it's more
detailed than I've done for hexagon so far. Most of it's open to
discussion if you think anything I say is wrong.

> diff --git a/arch/hexagon/atomic_arch.h b/arch/hexagon/atomic_arch.h
> new file mode 100644
> index 00000000..ede55956
> --- /dev/null
> +++ b/arch/hexagon/atomic_arch.h
> @@ -0,0 +1,194 @@
> +#define a_ctz_32 a_ctz_32
> +static inline int a_ctz_32(unsigned long x)
> +{
> +	__asm__(
> +		"%0 = ct0(%0)\n\t"
> +		: "+r"(x));
> +	return x;
> +}
> +
> +#define a_ctz_64 a_ctz_64
> +static inline int a_ctz_64(uint64_t x)
> +{
> +	int count;
> +	__asm__(
> +		"%0 = ct0(%1)\n\t"
> +		: "=r"(count) : "r"(x));
> +	return count;
> +}
> +#define a_clz_64 a_clz_64
> +static inline int a_clz_64(uint64_t x)
> +{
> +        int count;
> +        __asm__(
> +                "%1 = brev(%1)\n\t"
> +                "%0 = ct0(%1)\n\t"
> +                : "=r"(count) : "r"(x));
> +        return count;
> +}

This should probably do just the brev in asm then return
a_ctz_64(result) so that the compiler has the freedom to schedule the
insns independently, unless there's a reason not to want it to do
that.

> +#define a_cas a_cas
> +static inline int a_cas(volatile int *p, int t, int s)
> +{
> +	int dummy;
> +	__asm__ __volatile__(
> +		"1:	%0 = memw_locked(%1)\n\t"
> +		"	{ p0 = cmp.eq(%0, %2)\n\t"
> +		"	  if (!p0.new) jump:nt 2f }\n\t"
> +		"	memw_locked(%1, p0) = %3\n\t"
> +		"	if (!p0) jump 1b\n\t"
> +		"2:	\n\t"
> +		: "=&r"(dummy)
> +		: "r"(p), "r"(t), "r"(s)
> +		: "p0", "memory" );
> +        return dummy;
> +}

I don't know the hexagon atomic model, but as far as I can tell these
at least "look right" in the sense of having right asm constraints.

> [...]
> +#define a_barrier a_barrier
> +static inline void a_barrier()
> +{
> +	__asm__ __volatile__ ("barrier" ::: "memory");
> +}

Is the barrier implied in memw_locked? If not, there need to be
explicit barriers in all the atomic functions.

> diff --git a/arch/hexagon/bits/alltypes.h.in b/arch/hexagon/bits/alltypes.h.in
> new file mode 100644
> index 00000000..9d770c7e
> --- /dev/null
> +++ b/arch/hexagon/bits/alltypes.h.in
> @@ -0,0 +1,26 @@
> +#define _Addr int
> +#define _Int64 long long
> +#define _Reg int
> +
> +#define __BYTE_ORDER 1234
> +#define __LONG_MAX 0x7fffffffL
> +
> +#ifndef __cplusplus
> +#ifdef __WCHAR_TYPE__
> +TYPEDEF __WCHAR_TYPE__ wchar_t;
> +#else
> +TYPEDEF long wchar_t;
> +#endif
> +#endif
> +
> +TYPEDEF float float_t;
> +TYPEDEF double double_t;
> +
> +#if !defined(__cplusplus)
> +TYPEDEF struct { _Alignas(8) long long __ll; long double __ld; } max_align_t;
> +#elif defined(__GNUC__)
> +TYPEDEF struct { __attribute__((__aligned__(8))) long long __ll; long double __ld; } max_align_t;
> +#else
> +TYPEDEF struct { alignas(8) long long __ll; long double __ld; } max_align_t;
> +#endif

As I understand, the Hexagon ABI has all scalar types aligned to their
size, so I don't think the alignas mess is needed here. It should just
work as unconditional:

TYPEDEF struct { long long __ll; long double __ld; } max_align_t;

> diff --git a/arch/hexagon/bits/setjmp.h b/arch/hexagon/bits/setjmp.h
> new file mode 100644
> index 00000000..5ee5e49f
> --- /dev/null
> +++ b/arch/hexagon/bits/setjmp.h
> @@ -0,0 +1,4 @@
> +
> +typedef struct {
> +	long regs[16];
> +} __jmp_buf[1] __attribute__((aligned (8)));

Use long long so no extension is needed here for alignment, and no
wrapper struct (which is not needed, and the member name regs is a
namespace violation):

typedef long long __jmp_buf[8];

> diff --git a/arch/hexagon/bits/signal.h b/arch/hexagon/bits/signal.h
> new file mode 100644
> index 00000000..7a3b36d0
> --- /dev/null
> +++ b/arch/hexagon/bits/signal.h
> @@ -0,0 +1,105 @@
> +#if defined(_POSIX_SOURCE) || defined(_POSIX_C_SOURCE) \
> + || defined(_XOPEN_SOURCE) || defined(_GNU_SOURCE) || defined(_BSD_SOURCE)
> +
> +#if defined(_XOPEN_SOURCE) || defined(_GNU_SOURCE) || defined(_BSD_SOURCE)
> +#define MINSIGSTKSZ 2048
> +#define SIGSTKSZ 8192
> +#endif
> +
> +#if defined(_GNU_SOURCE) || defined(_BSD_SOURCE)
> +typedef int greg_t, gregset_t[18];
> +typedef struct sigcontext
> +{
> +	unsigned long  r0,  r1,  r2,  r3;
> +	unsigned long  r4,  r5,  r6,  r7;
> +	unsigned long  r8,  r9, r10, r11;
> +	unsigned long r12, r13, r14, r15;
> +	unsigned long r16, r17, r18, r19;
> +	unsigned long r20, r21, r22, r23;
> +	unsigned long r24, r25, r26, r27;
> +	unsigned long r28, r29, r30, r31;
> +	unsigned long sa0;
> +	unsigned long lc0;
> +	unsigned long sa1;
> +	unsigned long lc1;
> +	unsigned long m0;
> +	unsigned long m1;
> +	unsigned long usr;
> +	unsigned long p3_0;
> +	unsigned long gp;
> +	unsigned long ugp;
> +	unsigned long pc;
> +	unsigned long cause;
> +	unsigned long badva;
> +	unsigned long pad1;
> +	unsigned long pad2;
> +	unsigned long pad3;
> +} __attribute__((__aligned__(8))) mcontext_t;
> +#else
> +typedef struct {
> +	unsigned long __regs[48];
> +} __attribute__((__aligned__(8))) mcontext_t;
> +#endif

If the pad members are really padding, can the last two just be
replaced by a long long so there's natural alignment with no
extension?

> [...]
> +typedef struct __ucontext {
> +	unsigned long uc_flags;
> +	struct __ucontext *uc_link;
> +	stack_t uc_stack;
> +	mcontext_t uc_mcontext;
> +	sigset_t uc_sigmask;
> +//	unsigned long long uc_regspace[64];
> +} ucontext_t;

Should the commented uc_regspace be removed?

> diff --git a/arch/hexagon/bits/stat.h b/arch/hexagon/bits/stat.h
> new file mode 100644
> index 00000000..55e81fd9
> --- /dev/null
> +++ b/arch/hexagon/bits/stat.h
> @@ -0,0 +1,20 @@
> +/* copied from kernel definition, but with padding replaced
> + * by the corresponding correctly-sized userspace types. */
> +struct stat {
> +	dev_t st_dev;
> +	ino_t st_ino;
> +	mode_t st_mode;
> +	nlink_t st_nlink;
> +	uid_t st_uid;
> +	gid_t st_gid;
> +	dev_t st_rdev;
> +	unsigned long __pad;
> +	off_t st_size;
> +	blksize_t st_blksize;
> +	int __pad2;
> +	blkcnt_t st_blocks;
> +	struct timespec st_atim;
> +	struct timespec st_mtim;
> +	struct timespec st_ctim;
> +	unsigned __unused[2];
> +};

No objection to doing this as-is, but since this looks like the modern
"standard" layout and we could probably deduplicate it by making
arch/generic provide it. That'd be a later patch to remove the other
existing duplicates (aarch64, riscv64, maybe others).

> diff --git a/arch/hexagon/bits/syscall.h.in b/arch/hexagon/bits/syscall.h.in
> new file mode 100644
> index 00000000..77ca3fa0
> --- /dev/null
> +++ b/arch/hexagon/bits/syscall.h.in
> @@ -0,0 +1,317 @@
> [...]
> +#define __NR_timer_create 107
> +#define __NR_timer_gettime 108
> +#define __NR_timer_getoverrun 109
> +#define __NR_timer_settime 110
> +#define __NR_timer_delete 111
> +#define __NR_clock_settime 112
> +#define __NR_clock_gettime 113

Some of these, e.g. clock_gettime, need to be renamed as in commit
5a105f19b5aae79dd302899e634b6b18b3dcd0d6.

> [...]
> +#define __NR_sched_rr_get_interval_time64 423
> +#define __NR_syscalls (__NR_sched_rr_get_interval_time64+1)
> +#define __NR_fcntl  __NR3264_fcntl
> +#define __NR_fstatfs  __NR3264_fstatfs
> +#define __NR_truncate  __NR3264_truncate
> +#define __NR_ftruncate  __NR3264_ftruncate
> +#define __NR_lseek  __NR3264_lseek
> +#define __NR_sendfile  __NR3264_sendfile
> +#define __NR_newfstatat  __NR3264_fstatat
> +#define __NR_fcntl64  __NR3264_fcntl
> +#define __NR_statfs64  __NR3264_statfs
> +#define __NR_fstatfs64  __NR3264_fstatfs
> +#define __NR_truncate64  __NR3264_truncate
> +#define __NR_ftruncate64  __NR3264_ftruncate
> +#define __NR__llseek  __NR3264_lseek
> +#define __NR_sendfile64  __NR3264_sendfile
> +#define __NR_fstatat64  __NR3264_fstatat
> +#define __NR_fstat64  __NR3264_fstat
> +#define __NR_mmap2  __NR3264_mmap
> +#define __NR_fadvise64_64  __NR3264_fadvise64

Is there a reason for these NR3264 redirections rather than just
defining directly with the public names?

> diff --git a/arch/hexagon/crt_arch.h b/arch/hexagon/crt_arch.h
> new file mode 100644
> index 00000000..331a797e
> --- /dev/null
> +++ b/arch/hexagon/crt_arch.h
> @@ -0,0 +1,35 @@
> +__asm__(
> +".weak _DYNAMIC \n"
> +".hidden _DYNAMIC \n"
> +".text \n"
> +".global " START " \n"
> +".type " START ", %function \n"
> +START ": \n"
> +"                                       // Find _DYNAMIC\n"
> +"       jump 1f\n"
> +".word  _DYNAMIC - .\n"
> +"1:     r2 = pc\n"
> +"       r2 = add(r2, #-4)\n"
> +"       r1 = memw(r2)\n"
> +"       r1 = add(r2, r1)\n"
> +"	r30 = #0			// Signals the end of backtrace\n"
> +"	r0 = r29			// Pointer to argc/argv\n"
> +"	r29 = and(r29, #-16)		// Align\n"
> +"	memw(r29+#-8) = r29\n"
> +"	r29 = add(r29, #-8)\n"
> +"	call " START "_c \n"
> +".size " START ", .-" START "\n"
> +);
> +
> +__asm__(
> +".section \".note.ABI-tag\", \"a\" \n"
> +".align 4 \n"
> +".long 1f - 0f		/* name length */ \n"
> +".long 3f - 2f		/* data length */ \n"
> +".long  1		/* note type */ \n"
> +"0:	.asciz \"GNU\"	/* vendor name seems like this should be MUSL but lldb doesn't agree.*/ \n"
> +"1:	.align 4		 \n"
> +"2:	.long 0	/* linux */ \n"
> +"	.long 3,0,0 \n"
> +"3:	.align 4 \n"
> +);

Is there a reason this needs to be here at all? Shouldn't the tooling
generate it if it's actually wanted/needed?

> diff --git a/arch/hexagon/reloc.h b/arch/hexagon/reloc.h
> new file mode 100644
> index 00000000..14085872
> --- /dev/null
> +++ b/arch/hexagon/reloc.h
> @@ -0,0 +1,25 @@
> +#include <endian.h>
> +
> +#if __BYTE_ORDER == __LITTLE_ENDIAN
> +#define ENDIAN_SUFFIX "el"
> +#else
> +#define ENDIAN_SUFFIX ""
> +#endif

bits/alltypes.h.in defined __BYTE_ORDER as 1234 (unconditionally
little endian). Does Hexagon (hw and abi) actually support both byte
orders? If not, I don't think there should be an endian suffix here at
all.

> diff --git a/include/elf.h b/include/elf.h
> index 549f92c1..54251c24 100644
> --- a/include/elf.h
> +++ b/include/elf.h
> @@ -3284,6 +3284,107 @@ enum
>  #define R_RISCV_SET32           56
>  #define R_RISCV_32_PCREL        57
>  
> +#define R_HEX_NONE               0
> [...]

I'd like to merge this separately first since it's independent of
whether hexagon is a supported host.

> diff --git a/src/fenv/hexagon/fenv.S b/src/fenv/hexagon/fenv.S
> new file mode 100644
> index 00000000..07b89764
> --- /dev/null
> +++ b/src/fenv/hexagon/fenv.S
> @@ -0,0 +1,143 @@
> +/*
> +The Hexagon user status register includes five status fields which work as
> +sticky flags for the five IEEE-defined exception conditions: inexact, overflow, underflow,
> +divide by zero, and invalid. A sticky flag is set when the corresponding exception occurs,
> +and remains set until explicitly cleared.

Please format for at most 80 columns. (Some source files have a few
longer lines, but if you're flowing text it should be flowed to 80 or
fewer.)

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.