Message-ID: <20130702074920.GF29800@brightrain.aerifal.cx>
Date: Tue, 2 Jul 2013 03:49:20 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Request for volunteers

On Tue, Jul 02, 2013 at 04:19:37AM +0200, Szabolcs Nagy wrote:
> > My naive feeling would be that deciding "how much can go in one test"
> > is not a simple rule we can follow, but requires considering what's
> > being tested, how "low-level" it is, and whether the expected failures
> > might interfere with other tests. For instance a test that's looking
> > for out-of-bounds accesses would not be a candidate for doing a lot in
> > a single test file, but a test that's merely looking for correct
> > parsing could possibly get away with testing lots of assertions in a
> > single file.
>
> yes the boundary is not clear, but eg the current pthread
> test does too many kinds of things in one file

I agree completely. By the way, your mentioning pthread tests reminds
me that we need a reliable way to fail tests that have deadlocked or
otherwise hung. The standard "let it run for N seconds then kill it"
approach is rather uncivilized. I wonder if we could come up with a
nicer way, using a mix of realtime and cputime timers, to observe
complete lack of forward progress. (A rough sketch of what I have in
mind is below, after the build-system discussion.)

> if the 'hundreds of test cases' can be represented as
> a simple array of test vectors then that should go into
> one file
>
> if many functions want to use the same test vectors then
> at some point it's worth moving the vectors out to a
> header file and write separate tests for the different
> functions

Indeed, that is probably the way I should have factored my scanf
tests, but there is something to be said for getting the 4 errors for
the 4 functions with the same vector collated together in the output.

> > I think any fancy "framework" stuff could be purely in the controlling
> > and reporting layer, outside the address space of the actual tests. We
> > may however need a good way for the test to communicate its results to
> > the framework...
>
> the simple approach is to make each test a standalone process that
> exits with 0 on success
>
> in the failure case it can use dprintf to print error messages to
> stdout and the test system collects the exit status and the messages

Agreed. And if any test is trying to avoid stdio entirely, it can use
write() directly to generate the output.

> > One thing that comes to mind where tests may need a lot of "build
> > system" help is testing the dynamic linker.
>
> yes
>
> and we need to compile with -lpthread -lm -lrt -l...
> if the tests should work on other libcs

Indeed. If we were testing other libcs, we might even want to run
some non-multithreaded tests with and without -lpthread, in case the
override symbols in libpthread break something. Of course that can't
happen for musl; the equivalent test for musl would be static linking
while including or excluding references to certain otherwise-irrelevant
functions that might affect which version of another function gets
linked.

BTW, whether or not to use -static is also a big place where we need
build system help.

> my current solution is using wildcard rules for building
> *_dso.c into .so and *.c into executables and then
> add extra rules and target specific make variables:
>
> foo: LDFLAGS+=-ldl -rdynamic
> foo: foo_dso.so

I wasn't aware of this makefile trick to customize flags for
different files. This could be very useful for customized
optimization levels in musl:

ifdef OPTIMIZE
$(OPTIMIZE_OBJS) $(OPTIMIZE_OBJS:%.o=%.lo): CFLAGS+=-O3
endif
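
Coming back to the watchdog idea above, here is roughly what I have
in mind -- an untested sketch with arbitrary numbers: the controlling
process polls once per second of realtime and only declares the test
hung once its CPU clock has stopped advancing for a while, instead of
killing it after a fixed wall-clock budget. A test legitimately
blocked on slow I/O would also trip this, so it's only a heuristic:

/* untested sketch: run a test binary and kill it only when its CPU
 * clock stops advancing across several realtime polls; the 1s poll
 * interval and the threshold of 10 polls are arbitrary, and if the
 * CPU clock can't be obtained this degrades to a plain 10s timeout */
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) return 2;

    pid_t pid = fork();
    if (pid < 0) return 2;
    if (pid == 0) {
        execv(argv[1], argv+1);
        _exit(127);
    }

    clockid_t cpu;
    int have_cpu = !clock_getcpuclockid(pid, &cpu);

    struct timespec prev = {0,0}, now;
    int stalled = 0, status;
    for (;;) {
        sleep(1);
        if (waitpid(pid, &status, WNOHANG) == pid)
            return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
        if (have_cpu && !clock_gettime(cpu, &now)
            && (now.tv_sec != prev.tv_sec || now.tv_nsec != prev.tv_nsec)) {
            stalled = 0;   /* cputime advanced: still making progress */
            prev = now;
        } else {
            stalled++;
        }
        if (stalled >= 10) {
            kill(pid, SIGKILL);
            waitpid(pid, &status, 0);
            dprintf(1, "HUNG: %s\n", argv[1]);
            return 1;
        }
    }
}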

> the other solution i've seen is to put all the build commands
> into the .c file as comments:
>
> //RUN cc -c -o $name.o $name.c
> //RUN cc -o $name $name.o
> ....
>
> and use simple shell scripts as the build system
> (dependencies are harder to track this way, but the tests
> are more self-contained)

What about a mix? Have the makefile include another makefile
fragment, with a rule to generate that fragment from comments in the
source files. Then you have full dependency tracking via make, and
self-contained tests.

> > > - i looked at the bug history and many bugs are in hard to
> > > trigger cornercases (eg various races) or internally invoke ub
> > > in a way that may be hard to verify in a robust way
> >
> > Test cases for race conditions make one of the most interesting types
> > of test writing. :-) The main key is that you need to have around a
> > copy of the buggy version to test against. Such tests would not have
> > FAILED or PASSED as possible results, but rather FAILED, or FAILED TO
> > FAIL. :-)
>
> hm we can introduce a third result for tests that try to trigger
> some bug but are not guaranteed to do so
> (eg failed,passed,inconclusive)
> but probably that's more confusing than useful

Are you aware of any such cases?

> > > - some tests may need significant support code to achieve good
> > > coverage (printf, math, string handling close to 2G,..)
> > > (in such cases we can go with simple self-contained tests without
> > > much coverage, but easy maintenance, or with something
> > > sophisticated)
> >
> > I don't follow.
>
> i mean for many small functions there is not much difference between
> a simple sanity check and full coverage (eg basename can be thoroughly
> tested by about 10 input-output pairs)
>
> but there can be a huge difference: eg detailed testing of getaddrinfo
> requires non-trivial setup with dns server etc, it's much easier to do
> some sanity checks like gnulib would do, or a different example is
> rand: a real test would be like the diehard test suite while the sanity
> check is trivial

By the way, getaddrinfo (the dns resolver core of it) had a nasty bug
at one point in the past that randomly smashed the stack based on the
timing of dns responses. This would be a particularly hard thing to
test, but if we do eventually want to have regression tests for
timing-based bugs, it might make sense to use debuglib
(https://github.com/rofl0r/debuglib) and set breakpoints at key
functions to control the timing.

> so i'm not sure how much engineering should go into the tests:
> go for a small maintainable set that touch as many areas in libc
> as possible, or go for extensive coverage and develop various tools
> and libs that help setting up the environment or generate large set
> of test cases (eg my current math tests are closer to this latter one)

I think something in between is what we should aim for, tuned for
where we expect to find bugs that matter. For functions like printf,
scanf, strtol, etc. that have a lot of complex logic and exact
behavior they must deliver, or else risk introducing serious
application bugs, high coverage is critical to delivering a libc we
can be confident in. But for other functions, a simple sanity check
might suffice.
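
To make that concrete, a sanity check along the lines nagy described
(standalone process, exit status 0 on success, dprintf to stdout on
failure) could look something like the sketch below. The vectors are
just a handful of cases off the top of my head, not a serious
basename test:

/* minimal sanity-check sketch in the convention discussed above */
#include <libgen.h>
#include <stdio.h>
#include <string.h>

static const struct { const char *in, *want; } t[] = {
    { "/usr/lib", "lib" },
    { "/usr/",    "usr" },
    { "usr",      "usr" },
    { "/",        "/" },
    { "",         "." },
};

int main(void)
{
    int err = 0;
    for (size_t i = 0; i < sizeof t / sizeof t[0]; i++) {
        char buf[64];
        strcpy(buf, t[i].in);  /* POSIX basename may modify its argument */
        char *got = basename(buf);
        if (strcmp(got, t[i].want)) {
            dprintf(1, "basename(\"%s\") = \"%s\", want \"%s\"\n",
                    t[i].in, got, t[i].want);
            err = 1;
        }
    }
    return err;
}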

Sanity checks are very useful for new ports, since failure to pass
can quickly show that we have syscall conventions or struct
definitions wrong, alignment bugs, bad asm, etc. They would also
probably have caught the recent embarrassing mbsrtowcs bug I fixed.

Here are some exhaustive tests we could easily perform:

- rand_r: period and bias
- all multibyte to wide operations: each valid UTF-8 character and
  each invalid prefix; for functions that behave differently based on
  whether the output pointer is null, testing both ways.
- all wide to multibyte functions: each valid and invalid wchar_t.

And some functions that would probably be fine with just sanity
checks:

- dirent interfaces
- network address conversions
- basename/dirname
- signal operations
- search interfaces

And things that can't be tested exhaustively but which I would think
need serious tests:

- stdio (various combinations of buffering, use of unget buffer,
  scanf pushback, seeking, file position, flushing, switching
  reading/writing, eof and error flags, ...)
- AIO (it will probably fail now tho)
- threads (synchronization primitives, cancellation, TSD dtors, ...)
- regex (sanity-check all features, longest-match rule, ...)
- fnmatch, glob, and wordexp
- string functions

> if the goal is to execute the test suite as a post-commit hook

I think that's a little too resource-heavy for a full test run, but
perhaps reasonable for a subset of tests.

> then there should be a reasonable limit on resource usage, build and
> execution time etc and this limit affects how the code may be
> organized, how errors are reported..
> (most test systems i've seen are for simple unit tests: they allow
> checking a few constraints and then report errors in a nice way,
> however in case of libc i'd assume that you want to enumerate the
> weird corner-cases to find bugs more effectively)

Yes, I think so.

Rich
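
P.S. To make the "each valid UTF-8 character" item a little more
concrete, here is an untested sketch of what an exhaustive mbrtowc
round-trip could look like. It carries its own UTF-8 encoder so the
libc is not being checked against itself; the locale names, the exit
code 77 for "skipped", and the error cap are arbitrary choices, and
invalid-prefix coverage is left out entirely:

/* untested sketch: decode every valid code point with mbrtowc and
 * check the round trip; surrogates are skipped since their UTF-8
 * encodings are not valid input */
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* independent UTF-8 encoder for code point c; returns byte count */
static int utf8_enc(unsigned c, unsigned char *s)
{
    if (c < 0x80) { s[0] = c; return 1; }
    if (c < 0x800) { s[0] = 0xc0|c>>6; s[1] = 0x80|(c&0x3f); return 2; }
    if (c < 0x10000) {
        s[0] = 0xe0|c>>12; s[1] = 0x80|(c>>6&0x3f); s[2] = 0x80|(c&0x3f);
        return 3;
    }
    s[0] = 0xf0|c>>18; s[1] = 0x80|(c>>12&0x3f);
    s[2] = 0x80|(c>>6&0x3f); s[3] = 0x80|(c&0x3f);
    return 4;
}

int main(void)
{
    if (!setlocale(LC_CTYPE, "C.UTF-8") && !setlocale(LC_CTYPE, "en_US.UTF-8"))
        return 77; /* arbitrary "skipped" status: no UTF-8 locale */
    int err = 0;
    for (unsigned c = 0; c <= 0x10ffff; c++) {
        if (c >= 0xd800 && c <= 0xdfff) continue;
        unsigned char buf[4];
        int n = utf8_enc(c, buf);
        wchar_t wc = 0;
        mbstate_t st;
        memset(&st, 0, sizeof st);
        size_t r = mbrtowc(&wc, (char *)buf, n, &st);
        if (r != (c ? (size_t)n : 0) || (unsigned)wc != c) {
            dprintf(1, "U+%04X: mbrtowc returned %zu, wc=%#x\n",
                    c, r, (unsigned)wc);
            if (++err > 10) return 1; /* cap the noise */
        }
    }
    return !!err;
}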