Message-ID: <6D612B6AC5DCDA4580AF97B1068118AD2DC49A@DGGEML501-MBX.china.huawei.com>
Date: Sat, 18 Apr 2020 08:44:50 +0000
From: "liheng (P)" <liheng40@...wei.com>
To: Rich Felker <dalias@...c.org>
CC: "musl@...ts.openwall.com" <musl@...ts.openwall.com>, "Xiangrui (Euler)"
	<rui.xiang@...wei.com>, Lizefan <lizefan@...wei.com>
Subject: regex Back reference matching result not same as glibc and
 tre. 

Rich Felker:

Hello, I've noticed that musl's regex matching result differs from that of glibc and tre.
Back references do not seem to be well supported in the latest version.

Here is a simple test case:

#include <regex.h>
#include <stdio.h>
#include <string.h>

#define str "aba"
#define N 2
static const char *expected[N] =
{
        str, "a"
};

static const char pat[] = "(.?).?\\1";

int test_regex(void)
{
        regex_t rbuf;

        int err = regcomp(&rbuf, pat, REG_EXTENDED);
        if (err != 0) {
                char errstr[300];
                regerror(err, &rbuf, errstr, sizeof (errstr));
                puts (errstr);
                return err;
        }

        regmatch_t m[N];
        err = regexec(&rbuf, str, N, m, 0);
        if (err != 0) {
                puts ("regexec failed");
                regfree(&rbuf);  /* release the compiled pattern on the failure path */
                return 1;
        }

        int result = 0;
        int i;
        for (i = 0; i < N; ++i) {
                if (m[i].rm_so == -1) {
                        printf ("m[%d] unused\n", i);
                        result = 1;
                }
                else {
                        int len = m[i].rm_eo - m[i].rm_so;
                        printf ("m[%d] = \"%.*s\"\n", i, len, str + m[i].rm_so);
                        if (strlen (expected[i]) != len
                                || memcmp (expected[i], str + m[i].rm_so, len) != 0)
                                result = 1;
                }
        }

        regfree(&rbuf);  /* release the compiled pattern */
        return result;
}

int main (void)
{
        int result = 0;

        result = test_regex();

        if (result != 0) {
                printf("test regex failed\n");
        } else {
                printf("test regex success\n");
        }

        return result;
}

musl: 
# ./test
regexec failed
test regex failed

glibc:
# ./test
m[0] = "aba"
m[1] = "a"
m[2] = ""
test regex success

tre:
# ./test
m[0] = "aba"
m[1] = "a"
m[2] = ""
test regex success


I noticed that Rich Felker changed back reference handling in the commit below, which suppresses back reference processing in ERE regcomp.

commit 7c8c86f6308c7e0816b9638465a5917b12159e8f
Author: Rich Felker <dalias@...ifal.cx>
Date:   Fri Mar 20 18:25:01 2015 -0400

    suppress backref processing in ERE regcomp

    one of the features of ERE is that it's actually a regular language
    and does not admit expressions which cannot be matched in linear time.
    introduction of \n backref support into regcomp's ERE parsing was
    unintentional.

diff --git a/src/regex/regcomp.c b/src/regex/regcomp.c
index bce6bc15..4d80cb1c 100644
--- a/src/regex/regcomp.c
+++ b/src/regex/regcomp.c
@@ -839,7 +839,7 @@ static reg_errcode_t parse_atom(tre_parse_ctx_t *ctx, const char *s)
                        break;
                default:
-                       if (isdigit(*s)) {
+                       if (!ere && isdigit(*s)) {
                                /* back reference */


This commit suggests that to use back references I should not pass REG_EXTENDED, but the test case still fails to match without that flag.

I then tried enabling back references in ERE regcomp with the change below. With it, musl's regex matches this test case the same way glibc and tre do.

--- a/src/regex/regcomp.c
+++ b/src/regex/regcomp.c
                default:
-                       if (!ere && isdigit(*s)) {
+                       if (ere && isdigit(*s)) {
                                /* back reference */


Thank you for considering this.

Li Heng
