Strange behavior of Regex.Split()

Keywords: Split, Regex | Updated: 2023-09-27 18:20:37

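I am trying to use a regex to split data in a text file, but during testing I found a strange bug: the output for a very simple file is obviously incorrect. Sample code that illustrates the behavior: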

        const string line = "511525,3122,9,39,2007,9,39,3127,9,39,\" -49,368.11 \",\"-32,724.16\",2,1,\" 2,347.91 \", -   ,\" 2,234.17 \", -   ,2.2,1.143,2,1.24,FALSE,1,2,0,311,511625";
        const string pattern = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";
        Console.WriteLine();
        Console.WriteLine("SPLIT");
        var splitted = Regex.Split(line, pattern, RegexOptions.Compiled);
        foreach (var s in splitted)
        {
            Console.WriteLine(s);
        }
        Console.WriteLine();
        Console.WriteLine("REPLACE");
        var replaced = Regex.Replace(line, pattern, "!" , RegexOptions.Compiled);
        Console.WriteLine(replaced);
        Console.WriteLine();
        Console.WriteLine("MATCH");
        var matches = Regex.Matches(line, pattern);
        foreach (Match match in matches)
        {
            Console.WriteLine(match.Index);
        }

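So, as you can see, Split is the only method that produces unexpected results (it splits at invalid positions!). Both Regex.Replace and Regex.Matches give absolutely correct results. I even tested the regex in RegexBuddy, and it showed the same matches as Regex.Matches. Am I missing something, or does this look like a bug in the Split method?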

Console output

SPLIT
511525
, -   ," 2,234.17 "
3122
, -   ," 2,234.17 "
9
, -   ," 2,234.17 "
39
, -   ," 2,234.17 "
2007
, -   ," 2,234.17 "
9
, -   ," 2,234.17 "
39
, -   ," 2,234.17 "
3127
, -   ," 2,234.17 "
9
, -   ," 2,234.17 "
39
, -   ," 2,234.17 "
" -49,368.11 "
, -   ," 2,234.17 "
"-32,724.16"
, -   ," 2,234.17 "
2
, -   ," 2,234.17 "
1
, -   ," 2,234.17 "
" 2,347.91 "
 -   ," 2,234.17 "
 -
" 2,234.17 "
" 2,234.17 "
 -
2.2
1.143
2
1.24
FALSE
1
2
0
311
511625
REPLACE
511525!3122!9!39!2007!9!39!3127!9!39!" -49,368.11 "!"-32,724.16"!2!1!" 2,347.91 "! -   !" 2,234.17 "! -   !2.2!1.143!2!1.24!FALSE!1!2!0!311!511625
MATCH
6
11
13
16
21
23
26
31
33
36
51
64
66
68
81
87
100
106
110
116
118
123
129
131
133
135
139


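Judging by Microsoft's reply (add ExplicitCapture), the problem appears to be the capturing group. The ExplicitCapture option turns that capturing group into a non-capturing group.

You can do the same thing without the option by making the group explicitly non-capturing: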

const string pattern = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

Tested in LINQPad, and it seems to produce the result we want.
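To see the difference on a smaller input, here is a minimal sketch that runs both versions of the pattern against a short line. The input `1,"a,b",2` and the names `CapturingVsNonCapturing`, `SplitCapturing` and `SplitNonCapturing` are made up for illustration; they are not part of the original question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names below are illustrative only.
public static class CapturingVsNonCapturing
{
    // Pattern from the question: the group inside the lookahead is capturing.
    private const string Capturing = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    // Same pattern with the group made non-capturing via (?: ... ).
    private const string NonCapturing = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    public static string[] SplitCapturing(string line) => Regex.Split(line, Capturing);

    public static string[] SplitNonCapturing(string line) => Regex.Split(line, NonCapturing);

    public static void Main()
    {
        const string line = "1,\"a,b\",2";

        // Only the three fields come back.
        Console.WriteLine(string.Join("|", SplitNonCapturing(line))); // 1|"a,b"|2

        // Captured text is inserted between the fields, so there are more elements.
        Console.WriteLine(SplitCapturing(line).Length > SplitNonCapturing(line).Length); // True
    }
}
```

Both patterns match the same commas; only the array returned by Split differs, which is consistent with Regex.Matches and Regex.Replace looking correct in the question.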

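Whether the pattern contains any capturing group makes a difference, as described in the Regex.Split documentation:

If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, if you split the string "plum-pear" on a hyphen placed within capturing parentheses, the returned array includes a string element that contains the hyphen.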
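The documented "plum-pear" case can be reproduced with a short sketch. The class and method names (`SplitCaptureDemo`, `SplitPlain`, `SplitCaptured`) are assumptions chosen for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names below are illustrative only.
public static class SplitCaptureDemo
{
    // Bare hyphen: no capturing group, so only the pieces are returned.
    public static string[] SplitPlain(string input) => Regex.Split(input, "-");

    // Hyphen inside capturing parentheses: the captured hyphen is returned as well.
    public static string[] SplitCaptured(string input) => Regex.Split(input, "(-)");

    public static void Main()
    {
        Console.WriteLine(string.Join("|", SplitPlain("plum-pear")));    // plum|pear
        Console.WriteLine(string.Join("|", SplitCaptured("plum-pear"))); // plum|-|pear
    }
}
```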

Workaround from MS

(add the ExplicitCapture regex option)
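A minimal sketch of that workaround, assuming the pattern from the question is kept unchanged and only the option is added. The helper name `SplitCsvLine` and the short test input are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names below are illustrative only.
public static class ExplicitCaptureDemo
{
    // Original pattern from the question, with an unnamed (capturing) group.
    private const string Pattern = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    // With ExplicitCapture, unnamed parentheses do not capture,
    // so Split returns only the fields.
    public static string[] SplitCsvLine(string line) =>
        Regex.Split(line, Pattern, RegexOptions.ExplicitCapture);

    public static void Main()
    {
        foreach (var field in SplitCsvLine("1,\"a,b\",2"))
        {
            Console.WriteLine(field); // prints 1, then "a,b", then 2
        }
    }
}
```

The option can be combined with others, for example `RegexOptions.Compiled | RegexOptions.ExplicitCapture`, if the compiled flag from the question should be kept.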