Message-ID: <20130702074920.GF29800@brightrain.aerifal.cx>
Date: Tue, 2 Jul 2013 03:49:20 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Request for volunteers

On Tue, Jul 02, 2013 at 04:19:37AM +0200, Szabolcs Nagy wrote:
> > My naive feeling would be that deciding "how much can go in one test"
> > is not a simple rule we can follow, but requires considering what's
> > being tested, how "low-level" it is, and whether the expected failures
> > might interfere with other tests. For instance a test that's looking
> > for out-of-bounds accesses would not be a candidate for doing a lot in
> > a single test file, but a test that's merely looking for correct
> > parsing could possibly get away with testing lots of assertions in a
> > single file.
> 
> yes the boundary is not clear, but eg the current pthread
> test does too many kinds of things in one file

I agree completely. By the way, your mentioning pthread tests reminds
me that we need a reliable way to fail tests that have deadlocked or
otherwise hung. The standard "let it run for N seconds then kill it"
approach is rather uncivilized. I wonder if we could come up with a
nice way with a mix of realtime and cputime timers to observe complete
lack of forward progress.
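
Roughly what I have in mind, as an untested sketch: run each test
under a parent that samples the child's CPU clock at wall-clock
intervals, and if the child is still alive but its CPU time has
stopped advancing, treat it as deadlocked. The clock_getcpuclockid()
approach and the fixed sampling interval are just one possible
mechanism, and it assumes a hung test blocks rather than spins:

  #include <signal.h>
  #include <stdio.h>
  #include <sys/wait.h>
  #include <time.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
      if (argc < 2) return 1;
      pid_t pid = fork();
      if (pid == 0) {
          execv(argv[1], argv+1);  /* the actual test binary */
          _exit(127);
      }
      clockid_t cpuclock;
      if (pid < 0 || clock_getcpuclockid(pid, &cpuclock)) return 1;

      struct timespec prev = {0}, cur;
      for (;;) {
          sleep(5);  /* arbitrary wall-clock sampling interval */
          int status;
          if (waitpid(pid, &status, WNOHANG) == pid)
              return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
          if (clock_gettime(cpuclock, &cur)) {
              /* child exited between the two checks; reap it */
              waitpid(pid, &status, 0);
              return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
          }
          if (cur.tv_sec == prev.tv_sec && cur.tv_nsec == prev.tv_nsec) {
              dprintf(2, "no forward progress, killing test\n");
              kill(pid, SIGKILL);
              waitpid(pid, &status, 0);
              return 1;
          }
          prev = cur;
      }
  }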

> if the 'hundreds of test cases' can be represented as
> a simple array of test vectors then that should go into
> one file
> 
> if many functions want to use the same test vectors then
> at some point it's worth moving the vectors out to a
> header file and write separate tests for the different
> functions

Indeed, that is probably the way I should have factored my scanf
tests, but there is something to be said for getting the 4 errors for
the 4 functions with the same vector collated together in the output.
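
For concreteness, the shared-header factoring would look roughly like
this (file names and vectors are made up, just to show the shape; there
would be one such test file per function in the scanf family):

  /* scanf_vectors.h - hypothetical shared header */
  struct scanf_vec {
      const char *input;  /* text to parse */
      const char *fmt;    /* format string with one %lf conversion */
      int ret;            /* expected return value */
      double val;         /* expected converted value when ret == 1 */
  };
  static const struct scanf_vec scanf_vecs[] = {
      { "100",    "%lf", 1, 100.0 },
      { "0x1p-2", "%lf", 1, 0.25 },
      { "bogus",  "%lf", 0, 0 },
  };

  /* test_sscanf.c - one of several per-function tests sharing the header */
  #include <stdio.h>
  #include "scanf_vectors.h"

  int main(void)
  {
      int err = 0;
      for (size_t i = 0; i < sizeof scanf_vecs / sizeof *scanf_vecs; i++) {
          const struct scanf_vec *v = scanf_vecs + i;
          double x = 0;
          int r = sscanf(v->input, v->fmt, &x);
          if (r != v->ret || (r == 1 && x != v->val)) {
              dprintf(1, "sscanf(\"%s\"): got %d/%g, want %d/%g\n",
                      v->input, r, x, v->ret, v->val);
              err = 1;
          }
      }
      return err;
  }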

> > I think any fancy "framework" stuff could be purely in the controlling
> > and reporting layer, outside the address space of the actual tests. We
> > may however need a good way for the test to communicate its results to
> > the framework...
> 
> the simple approach is to make each test a standalone process that
> exits with 0 on success
> 
> in the failure case it can use dprintf to print error messages to
> stdout and the test system collects the exit status and the messages

Agreed. And if any test is trying to avoid stdio entirely, it can use
write() directly to generate the output.
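
So in the degenerate case a test could be as small as something like
this (purely illustrative):

  /* minimal shape of a standalone test: report via exit status, and
     use write() directly if stdio itself is under suspicion */
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      if (strlen("hello") != 5) {
          static const char msg[] = "FAIL: strlen(\"hello\") != 5\n";
          write(1, msg, sizeof msg - 1);
          return 1;
      }
      return 0;
  }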

> > One thing that comes to mind where tests may need a lot of "build
> > system" help is testing the dynamic linker.
> 
> yes
> 
> and we need to compile with -lpthread -lm -lrt -l...
> if the tests should work on other libcs

Indeed. If we were testing other libcs, we might even want to run some
non-multithreaded tests with and without -lpthread in case the
override symbols in libpthread break something. Of course that can't
happen for musl; the equivalent test for musl would be static linking
and including or excluding references to certain otherwise-irrelevant
functions that might affect which version of another function gets
linked.

BTW, whether or not to use -static is another big area where we need
build system help.

> my current solution is using wildcard rules for building
> *_dso.c into .so and *.c into executables and then
> add extra rules and target specific make variables:
> 
> foo: LDFLAGS+=-ldl -rdynamic
> foo: foo_dso.so

I wasn't aware of this makefile trick to customize flags for
different files. This could be very useful for customized optimization
levels in musl:

  ifdef OPTIMIZE
  $(OPTIMIZE_OBJS) $(OPTIMIZE_OBJS:%.o=%.lo): CFLAGS+=-O3
  endif

> the other solution i've seen is to put all the build commands
> into the .c file as comments:
> 
> //RUN cc -c -o $name.o $name.c
> //RUN cc -o $name $name.o
> ....
> 
> and use simple shell scripts as the build system
> (dependencies are harder to track this way, but the tests
> are more self-contained)

What about a mix? Have the makefile include another makefile fragment
with a rule to generate that fragment, where the fragment is generated
from comments in the source files. Then you have full dependency
tracking via make, and self-contained tests.

> > > - i looked at the bug history and many bugs are in hard to
> > > trigger cornercases (eg various races) or internally invoke ub
> > > in a way that may be hard to verify in a robust way
> > 
> > Test cases for race conditions make one of the most interesting types
> > of test writing. :-) The main key is that you need to have around a
> > copy of the buggy version to test against. Such tests would not have
> > FAILED or PASSED as possible results, but rather FAILED, or FAILED TO
> > FAIL. :-)
> 
> hm we can introduce a third result for tests that try to trigger
> some bug but are not guaranteed to do so
> (eg failed,passed,inconclusive)
> but probably that's more confusing than useful

Are you aware of any such cases?

> > > - some tests may need significant support code to achieve good
> > > coverage (printf, math, string handling close to 2G,..)
> > > (in such cases we can go with simple self-contained tests without
> > > much coverage, but easy maintenance, or with something
> > > sophisticated)
> > 
> > I don't follow.
> 
> i mean for many small functions there is not much difference between
> a simple sanity check and full coverage (eg basename can be thoroughly
> tested by about 10 input-output pairs)
> 
> but there can be a huge difference: eg detailed testing of getaddrinfo
> requires non-trivial setup with dns server etc, it's much easier to do
> some sanity checks like gnulib would do, or a different example is
> rand: a real test would be like the diehard test suite while the sanity
> check is trivial

By the way, getaddrinfo (the dns resolver core of it) had a nasty bug
at one point in the past that randomly smashed the stack based on the
timing of dns responses. This would be a particularly hard thing to
test, but if we do eventually want to have regression tests for
timing-based bugs, it might make sense to use debuglib
(https://github.com/rofl0r/debuglib) and set breakpoints at key
functions to control the timing.

> so i'm not sure how much engineering should go into the tests:
> go for a small maintainable set that touch as many areas in libc
> as possible, or go for extensive coverage and develop various tools
> and libs that help setting up the environment or generate large set
> of test cases (eg my current math tests are closer to this latter one)

I think something in between is what we should aim for, tuned for
where we expect to find bugs that matter. For functions like printf,
scanf, strtol, etc. that have a lot of complex logic and exact
behavior they must deliver or risk introducing serious application
bugs, high coverage is critical to delivering a libc we can be
confident in. But for other functions, a simple sanity check might
suffice. Sanity checks are very useful for new ports, since failure to
pass can quickly show that we have syscall conventions or struct
definitions wrong, alignment bugs, bad asm, etc. They would also
probably have caught the recent embarrassing mbsrtowcs bug I fixed.

Here are some exhaustive tests we could easily perform:

- rand_r: period and bias
- all multibyte to wide operations: each valid UTF-8 character and
  each invalid prefix. for functions that behave differently based on
  whether the output pointer is null, testing both ways. (a rough
  sketch of the valid-character half follows this list.)
- all wide to multibyte functions: each valid and invalid wchar_t.
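
For the multibyte-to-wide case, the valid-character half could look
something like this (untested sketch; the locale names are
assumptions, and the invalid-prefix half plus the string-oriented
mbs* functions would need more code):

  #include <locale.h>
  #include <stdio.h>
  #include <wchar.h>

  /* encode one code point as UTF-8; caller guarantees c <= 0x10ffff */
  static size_t utf8_enc(unsigned c, char *s)
  {
      if (c < 0x80) { s[0] = c; return 1; }
      if (c < 0x800) { s[0] = 0xc0|c>>6; s[1] = 0x80|(c&0x3f); return 2; }
      if (c < 0x10000) { s[0] = 0xe0|c>>12; s[1] = 0x80|(c>>6&0x3f);
                         s[2] = 0x80|(c&0x3f); return 3; }
      s[0] = 0xf0|c>>18; s[1] = 0x80|(c>>12&0x3f);
      s[2] = 0x80|(c>>6&0x3f); s[3] = 0x80|(c&0x3f); return 4;
  }

  int main(void)
  {
      if (!setlocale(LC_CTYPE, "C.UTF-8") &&
          !setlocale(LC_CTYPE, "en_US.UTF-8")) return 1;
      int err = 0;
      for (unsigned c = 0; c <= 0x10ffff; c++) {
          if (c >= 0xd800 && c <= 0xdfff) continue;  /* surrogates */
          char buf[4];
          size_t n = utf8_enc(c, buf);
          size_t want = c ? n : 0;  /* mbrtowc returns 0 for the null char */
          wchar_t wc = 0;
          mbstate_t st = {0}, st2 = {0};
          size_t r = mbrtowc(&wc, buf, n, &st);
          size_t r2 = mbrtowc(0, buf, n, &st2);  /* null output pointer path */
          if (r != want || r2 != want || (unsigned)wc != c) {
              printf("mbrtowc failed for U+%04X (ret %ld, wc %x)\n",
                     c, (long)r, (unsigned)wc);
              if (++err > 20) break;  /* don't flood the log */
          }
      }
      return !!err;
  }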

And some functions that would probably be fine with just sanity
checks:

- dirent interfaces
- network address conversions
- basename/dirname
- signal operations
- search interfaces

And things that can't be tested exhaustively but which I would think
need serious tests:

- stdio (various combinations of buffering, use of unget buffer, scanf
  pushback, seeking, file position, flushing, switching
  reading/writing, eof and error flags, ...; see the sketch after this
  list for the flavor of test I mean)
- AIO (it will probably fail now tho)
- threads (synchronization primitives, cancellation, TSD dtors, ...)
- regex (sanity-check all features, longest-match rule, ...)
- fnmatch, glob, and wordexp
- string functions
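
For stdio, the kind of corner-case test I have in mind would be
something like this (untested sketch of just one combination: ungetc
interacting with seeking, file position, and write-to-read switching):

  #include <stdio.h>

  #define CHECK(x) do { if (!(x)) { \
      dprintf(1, "FAIL line %d: %s\n", __LINE__, #x); err = 1; } } while (0)

  int main(void)
  {
      int err = 0;
      FILE *f = tmpfile();
      CHECK(f != 0);
      if (!f) return 1;

      CHECK(fputs("hello world", f) >= 0);
      /* switching from writing to reading needs a positioning call */
      CHECK(fseek(f, 0, SEEK_SET) == 0);
      CHECK(fgetc(f) == 'h');
      CHECK(ungetc('H', f) == 'H');
      /* pushback moves the file position back by one */
      CHECK(ftell(f) == 0);
      CHECK(fgetc(f) == 'H');
      /* a successful seek must discard pushed-back characters */
      CHECK(ungetc('X', f) == 'X');
      CHECK(fseek(f, 6, SEEK_SET) == 0);
      CHECK(fgetc(f) == 'w');
      /* reading past the end sets eof, not error */
      CHECK(fseek(f, 0, SEEK_END) == 0);
      CHECK(fgetc(f) == EOF);
      CHECK(feof(f) && !ferror(f));
      fclose(f);
      return err;
  }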

> if the goal is to execute the test suite as a post-commit hook

I think that's a little too resource-heavy for a full test, but
perhaps reasonable for a subset of tests.

> then there should be a reasonable limit on resource usage, build and
> execution time etc and this limit affects how the code may be
> organized, how errors are reported..
> (most test systems i've seen are for simple unit tests: they allow
> checking a few constraints and then report errors in a nice way,
> however in case of libc i'd assume that you want to enumerate the
> weird corner-cases to find bugs more effectively)

Yes, I think so.

Rich
