Date: Thu, 4 Aug 2022 00:43:42 +0200
From: Szabolcs Nagy <nsz@...t70.net>
To: Mike Beattie <mike@...ernal.org>
Cc: musl@...ts.openwall.com
Subject: Re: Bug: BOL/EOL anchors in regex capture groups won't match EOL

* Mike Beattie <mike@...ernal.org> [2022-07-21 18:08:19 +1200]:
> FRRouting uses musl-libc in its docker container build, and it also appears
> to be in use in the GNS3 appliances for frr available online.
> 
> BGP as-path matching is regex powered, and usage of a special token of '_'
> allows for the easy matching of the boundary of an ASN in an as-path.
> Internally, it's translated into the regex capture group of:
> 
>    (^|[,{}() ]|$)
> 
> A valid as-path is a sequence of integers such as:
> 
>    100 200 300
> 
> A BGP as-path filter might be specified as so:
> 
>    bgp as-path access-list foo seq 20 permit _300_
> 
> which would get expanded to:
> 
>    (^|[,{}() ]|$)300(^|[,{}() ]|$)
> 
> when checking for a match. The usage of the pattern "(^|$)" in musl's regex
> implementation will never match EOL, but it does match BOL. Removal of the
> circumflex will let the match succeed.

thanks for the report.

it seems to me regcomp does not handle assertions correctly if there is
a union (|) of multiple subexpressions that match the empty string.

it simply takes the assertion of the leftmost subexpression so e.g.

'(|$)a' matches 'a' but
'($|)a' does not because it matches as '$a' and the $ assertion fails.

since posix does not allow an empty alternative such as '(|' in the
syntax, a conforming example is e.g.

'(b*|$)a' vs '($|b*)a'

all supported assertions are affected (^, $, \b, \B, \<, \>).
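
a minimal sketch for checking the two conforming patterns above against
whatever libc the program is linked with. it only uses the posix
regcomp/regexec/regfree calls; the helper name ere_match is made up for
this example and is not part of musl.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * compile pattern as a posix extended regex and search subject.
 * returns 1 on a match, 0 on no match (or a regexec error),
 * -1 if regcomp rejects the pattern.
 */
static int ere_match(const char *pattern, const char *subject)
{
	regex_t re;
	int r;

	assert(pattern && subject);
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	r = regexec(&re, subject, 0, NULL, 0);
	regfree(&re);
	return r == 0;
}
```

going by the analysis above, ere_match("(b*|$)a", "a") should return 1
everywhere, while ere_match("($|b*)a", "a") is expected to return 0 on
an affected musl and 1 on a libc without this bug.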

the fix is not obvious: there is a regcomp step like

	tags, assertions = leftmost_empty_match(subexpr)
	process(tags, assertions)

which should be

	list = all_empty_match(subexpr)
	for tags, assertions in list:
		if assertions are weaker than previous ones:
			process(tags, assertions)

i think this can increase storage and computation requirements
significantly unless the algorithm is further optimized.
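
a rough sketch of the "weaker than previous ones" filter, under some
assumptions that are mine and not taken from the musl source: assertions
are modelled as a bitmask, tags are left out, and an alternative is
skipped when an earlier (higher priority) kept one needs no assertion
beyond its own. the names AS_BOL, AS_EOL, AS_WB and keep_unsubsumed are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* hypothetical assertion bits, not musl's internal names */
enum {
	AS_BOL = 1 << 0, /* ^ */
	AS_EOL = 1 << 1, /* $ */
	AS_WB  = 1 << 2  /* \b */
};

/*
 * in[0..n-1]: assertion sets of the alternatives that can match the
 * empty string, leftmost (highest priority) first.
 * out: receives the sets that still need processing, room for n entries.
 * returns the number of sets written to out.
 */
static size_t keep_unsubsumed(const unsigned *in, size_t n, unsigned *out)
{
	size_t kept = 0;

	assert(in && out);
	for (size_t i = 0; i < n; i++) {
		int subsumed = 0;
		for (size_t j = 0; j < kept; j++) {
			/* out[j] requires nothing beyond what in[i] requires */
			if ((out[j] & ~in[i]) == 0) {
				subsumed = 1;
				break;
			}
		}
		if (!subsumed)
			out[kept++] = in[i];
	}
	return kept;
}
```

with this model '(b*|$)' keeps only the empty set, so the current
leftmost-only behaviour is already enough there, while '($|b*)' and
'(^|$)' each keep two sets, which is where the extra storage and work
would come from.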


> 
> Here is the output of a test program I've written to confirm this:
> 
>    $ musl-gcc -o r r.c
> 
>    $ ./r "_300_" "100 200 300"
>    regex: (^|[,{}() ]|$)300(^|[,{}() ]|$)
>    regexec on [100 200 300]: NOT Found
> 
> Removal of "^|" from the beginning of the trailing capture group:
> 
>    $ ./r "(^|[,{}() ]|$)300([,{}() ]|$)" "0000 1111 2222"
>    regex: (^|[,{}() ]|$)300([,{}() ]|$)
>    regexec on [100 200 300]: Found
> 
> Thanks,
> Mike.
> -- 
> Mike Beattie <mike@...ernal.org>
