正则表达式性能缓慢

本文关键字：缓慢性能正则表达式 | 更新日期: 2023-09-27 18:27:51

下面的代码包含一个正则表达式，旨在提取 C# 字符串文字，但对于多个字符的输入字符串的正则表达式匹配性能很差。

class Program
{
   private static void StringMatch(string s)
    {
        // regex: quote, zero-or-more-(zero-or-more-non-backslash-quote, optional-backslash-anychar), quote
        Match m = Regex.Match(s, "'"(([^'''''"]*)(''''.)?)*'"");
        if (m.Success)
            Trace.WriteLine(m.Value);
        else
            Trace.WriteLine("no match");
    }
    public static void Main()
    {
        // this first string is unterminated (so the match fails), but it returns instantly
        StringMatch("'"OK");
        // this string is terminated (the match succeeds)
        StringMatch("'"This is a longer terminated string - it matches and returns instantly'"");
        // this string is unterminated (so the match will fail), but it never returns
        StringMatch("'"This is another unterminated string and takes FOREVER to match");
    }
}

我可以将正则表达式重构为不同的形式，但任何人都可以解释为什么性能如此糟糕？

正则表达式性能缓慢

你遇到了灾难性的回溯：

让我们稍微简化一下正则表达式(没有转义的引号，也没有第二个可选组，因为正如您的评论一样，它与测试的字符串无关(：

"(([^''"]*))*"

([^''"]*)匹配除引号或反斜杠以外的任何字符串。这再次包含在一个可选组中，该组可以重复任意次数。

现在对于字符串"ABC，正则表达式引擎需要尝试以下排列：

" ， ABC
"、ABC、<empty string>
"、AB、C
" ， AB ， C ， <empty string>
"、AB、<empty string>、C
"、AB、<empty string>、C、<empty string>
"、<empty string>、AB、C
"、<empty string>、AB、C、<empty string>
"、<empty string>、AB、<empty string>、C、<empty string>
"、<empty string>、AB、<empty string>、C
"、A、BC
"、A、BC、<empty string>
"、A、<empty string>、BC
" ， <empty string> ， A ， BC
等。
"、A、B、C
" ， A ， B ， C ， <empty string>
"、A、B、<empty string>、C
等等。

然后，每个都失败，因为没有以下"。

此外，您只是在测试子字符串，而不是强制正则表达式匹配整个字符串。而且您通常希望对正则表达式使用逐字字符串来减少所需的反斜杠数量。这个怎么样：

foundMatch = Regex.IsMatch(subjectString, 
    @"'A     # Start of the string
    ""       # Match a quote
    (?:      # Either match...
     ''.     # an escaped character
    |        # or
     [^''""] # any character except backslash or quote
    )*       # any number of times
    ""       # Match a quote
    'Z       # End of the string", 
    RegexOptions.IgnorePatternWhitespace);

编辑

来了："'"([^'''''"]|''''.)*'""

解释一下，在 C# 转义字符串后，你会得到这个正则表达式： "([^''"]|''.)*"

意义：

"                #start with a quote
(
    [^''"]       #match a non backslash or quote
    |            #or
    ''.          #backslash something
)                
*                #And repeat
"                #end with a quote

通过不嵌套您的*，您不会获得指数或无限循环，它会立即为我返回。

我认为@Tim Pietzcker对回溯给出了最好的解释。

通过周围的各种基准测试(包括我自己的基准测试(，这些是快速的方法：

方法一、展开

" [^"'']* (?: ''. [^"'']* )* "

方法二、交替

" (?: ''. | [^"'']+ )* "

方法 1，可以大幅优于 2。

信息

我认为很难解释灾难性的回溯。即使认识到它有时也很难，除非它在时间上非常明显。然后，在时间关键型应用中，有时做一些基准测试是有益的。

在这个引用主题上，我喜欢在基准模板化的perl(5.10引擎(脚本中添加新的方法，看看它是如何做到的。每个引擎都有点不同。如果你关心，这里有一个示例。

正则表达式示例与使用大量引号和转义字符串
的时间每次迭代 100,000 次。

(?x-ism:" ( (?: ''?. )*? ) ")代码取：14.7031 挂钟秒(14.58 USR + 0.00 系统 = 14.58 CPU(

(?x-ism:" (.*? (?<!'') (?:''{2})* ) ")代码取：12.8435 挂钟秒(12.75 USR + 0.00 系统 = 12.75 CPU(

(?x-ism:" ( (?: [^''"] | ''. )* ) ")代码采用：10.3123 挂钟秒(10.27 USR + 0.00 系统 = 10.27 CPU(

(?x-ism: " ( (?: [^"'']+ | (?:''.)+ )* ) " )代码取：8.39063 挂钟秒(8.39 USR + 0.00 系统 = 8.39 CPU(

(?x-ism: " ( (?: [^"'']+ | ''. )* ) " )代码采用：8.7498 挂钟秒(8.75 USR + 0.00 系统 = 8.75 CPU(

(?x-ism: " ( (?: ''. | [^"'']+ )* ) " )代码占用：8.5623 挂钟秒(8.44 USR + 0.00 系统 = 8.44 CPU(

(?x-ism: " ( [^"'']* (?: ''. [^"'']* )* ) " )代码取：7.79661 挂钟秒 ( 7.80 USR + 0.00 系统 = 7.80 CPU(

(?x-ism: (?> " ( (?: [^"''] | ''. )* " ) ) )代码取：10.5156 挂钟秒 (10.52 USR + 0.00 系统 = 10.52 CPU(

尝试

Match m = Regex.Match(s, @"'.*?(?<=[^'']('''')*)'".Replace("'", "'""));

这将"智能地"忽略偶数'.这是因为"关闭一个字符串，'"不会，''"关闭(因为第一个反斜杠转义第二个反斜杠(，'''"不会......

.*?是一个惰性量词。您甚至可以使用标准.*量词。我会说也许你应该用^和$锚定你的正则表达式。

我正在使用替换，因为转义双引号让我头疼：-(

我要补充一点，使用 4k 文本，它在我的计算机上是即时的，无论是匹配还是不匹配。

作为替代方案：

Match m = Regex.Match(s, @"^'(?>([^''']|''.)*)'$".Replace("'", "'""));

解释：

(?> ) disables backtracking
^ begin of the string

然后你有一个交替构造(0 次或更多次，*(：

[^'''] any non-quote and non backslash
''. or a backslash followed by another character (that is escaped)
$ end of the string

这也非常非常快，而且很容易阅读。

这是我将使用的：

"[^'n"'']*(?:''.[^'n"'']*)*"

@sln关于展开循环方法是最快的是正确的，但我会通过排除换行来进一步完善它，这在老式字符串文字中是不允许的。 ''.部分还可以，但[^"'']需要更改为[^'n"'']。此外，如果我们谈论的是提取字符串文字，我们不能用 'A 和 'Z 锚定正则表达式。

我使用正则表达式Buddy来比较你的正则表达式的性能，没有锚点的Tim正则表达式，以及这个。我将光标放在每个示例字符串的开头引号之前，并使用"在此处调试"，结果如下：

original regex        :  "(([^''"'n]*)(''.)?)*"
"OK                   :  failed in 101 steps
"This is a longer...  :  matched in 12 steps
"This is another...   :  gave up after 1,000,000 steps

Tim's regex           :   "(?:''.|[^''"'n])*"
"OK                   :  failed in 17 steps
"This is a longer...  :  matched in 211 steps
"This is another...   :  failed in 253 steps

unrolled loop         :  "[^''"'n]*(?:''.[^''"'n]*)*"
"OK                   :  failed in 5 steps
"This is a longer...  :  matched in 5 steps
"This is another...   :  failed in 5 steps

将其作为逐字字符串插入代码中，您将获得：

Match m = Regex.Match(s, @"""[^'n""'']*(?:''.[^'n""'']*)*""");

编辑：顺便说一下，我并不是说你必须使用这个正则表达式，因为它更快;其他解决方案几乎可以肯定足够快。但是，如果您确实需要最大性能(同时仍在使用正则表达式(，这可能是实现它的方法。之所以如此之快，是因为正则表达式总是向前移动：没有回溯引用，没有环顾四周，最重要的是，没有回溯。

回溯问题之所以引起，是因为所有内容都是可选的量化
在嵌套量词组中。此块被文字的 .
包围由于永远不会达到最终文本，因此引擎尝试每个内部
无限序列。

解决它的唯一方法是在块中放置一个字面停止点.
这是停止回溯的点。

除此之外，您可以将文字放在两个变量术语中，并使用 clever
群集以获得最佳结果：

"  [^''"]*  (?:  ''.    [^''"]*  )*  "
     ^^^         ^^^      ^^^
     var       literal    var