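Regex matching (greedy / non-greedy?)
Keywords: greedy, regular expression | Updated: 2023-09-27 17:54:20
I'm having some trouble picking this data apart. Helper functions and the like are an option, but I'd really like to solve this with a regex alone (and process the match groups after matching).
This is (part of) the data I have: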
Belgium
Belgium M_Foo
Belgium A_Bar
Belgium M_FooBar
Belgium S_Whooptee Doo
Belgium Xxx
Belgium S_Foo Bar
United Kingdom
United Kingdom W_Foo-Bar
United Kingdom M_Yay
United Kingdom Xxx
United Kingdom S_Derp
United Kingdom F_Doh Lorem
United Kingdom S_Ipsum Dolor
United States of America L_Foo
Macedonia F.Y.R. Xxx
Macedonia F.Y.R. S_Foo Bar
Cyprus (Greek) M_Foo
Congo (Democratic Republic of)
Congo (Democratic Republic of) Q_Yolo
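Essentially this is an array of "key/value"-style strings. Each one contains a country name (these are not normalized, so I can't use hard-coded country names or a lookup; it could also be some other string rather than a country name), optionally followed by the keyword Xxx or by <random_upcase_char>_<random_text>.
I came up with the following regex:
^(.+?)(?:\s+(Xxx|[A-Z]_.*)?)
Or, with a small difference in the first capture group:
^(.*?)(?:\s+(Xxx|[A-Z]_.*)?)
This works fine for the first strings, the ones starting with Belgium. For those records it returns the following results: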
Group 1     Group 2
================================
Belgium
Belgium     M_Foo
Belgium     A_Bar
Belgium     M_FooBar
Belgium     S_Whooptee Doo
Belgium     Xxx
Belgium     S_Foo Bar
However, the following lines cause problems:
Group 1 Group 2
================================
United
United
United
United
United
United
United
United
Macedonia
Macedonia
Cyprus
Congo
Congo
What I want the regex to produce is:
Group 1                          Group 2
================================================
United Kingdom
United Kingdom                   W_Foo-Bar
United Kingdom                   M_Yay
United Kingdom                   Xxx
United Kingdom                   S_Derp
United Kingdom                   F_Doh Lorem
United Kingdom                   S_Ipsum Dolor
United States of America         L_Foo
Macedonia F.Y.R.                 Xxx
Macedonia F.Y.R.                 S_Foo Bar
Cyprus (Greek)                   M_Foo
Congo (Democratic Republic of)
Congo (Democratic Republic of)   Q_Yolo
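But I can't get the first part to match. I'm fairly sure this has to do with the greedy/non-greedy options, but after fiddling with it for a while I can't get it to work...
I don't care whether extra/other/more match groups are returned. The regex is intended for use in a .NET C# application (in case you're wondering which "dialect" this is).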
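Sometimes anchoring is very important for non-greedy matches. Without an end anchor, the lazy first group stops at the first position where the rest of the pattern can succeed, which here is right after "United". Anchoring to the end of the line solves the problem. Your regexp should be:
^(.+?)(?:\s+(Xxx|[A-Z]_.*))?$
Note that I also moved the optional quantifier (?) out one grouping level, so the whitespace is optional as well.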
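To see the effect of the end anchor, here is a minimal sketch in Python rather than C#. It assumes the constructs involved (lazy quantifiers, \s, non-capturing groups, $) behave the same way in both engines, which I believe they do for this pattern. The helper name split_line and the sample list are made up for illustration, and each line is matched on its own, so no multiline option is needed.

```python
import re

# Pattern from the question (no end anchor) and the anchored fix from this answer.
UNANCHORED = re.compile(r"^(.+?)(?:\s+(Xxx|[A-Z]_.*)?)")
ANCHORED = re.compile(r"^(.+?)(?:\s+(Xxx|[A-Z]_.*))?$")


def split_line(pattern, line):
    """Return (group 1, group 2) for a single line, or None if it does not match.

    Group 2 is None when the optional key part is absent.
    """
    m = pattern.match(line)
    return (m.group(1), m.group(2)) if m else None


# A few of the lines from the question (illustrative subset).
samples = [
    "Belgium",
    "Belgium S_Whooptee Doo",
    "United Kingdom W_Foo-Bar",
    "Macedonia F.Y.R. Xxx",
    "Congo (Democratic Republic of)",
]

for line in samples:
    # Print both results side by side to compare lazy matching with and without "$".
    print(line, "|", split_line(UNANCHORED, line), "|", split_line(ANCHORED, line))
```

With the anchored pattern the lazy group has to keep expanding until the remainder is either a valid key or the end of the line. In .NET an unmatched group is reported as an empty string with Success set to false, rather than as None.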
I managed to get what you want with this regex (run in multiline mode):
^((?:.+?| )+?)(?:\s+(Xxx|[A-Z]_.*)|\s)?$
With your input it gives me this result:
1: Belgium 2:
1: Belgium 2: M_Foo
1: Belgium 2: A_Bar
1: Belgium 2: M_FooBar
1: Belgium 2: S_Whooptee Doo
1: Belgium 2: Xxx
1: Belgium 2: S_Foo Bar
1: United Kingdom 2:
1: United Kingdom 2: W_Foo-Bar
1: United Kingdom 2: M_Yay
1: United Kingdom 2: Xxx
1: United Kingdom 2: S_Derp
1: United Kingdom 2: F_Doh Lorem
1: United Kingdom 2: S_Ipsum Dolor
1: United States of America 2: L_Foo
1: Macedonia F.Y.R. 2: Xxx
1: Macedonia F.Y.R. 2: S_Foo Bar
1: Cyprus (Greek) 2: M_Foo
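/(?:^(.+)\s+(Xxx|[A-Z]_.+)$|^(.+)$)/gm
will match all of your strings. However, any line that contains only a country ends up in the third group, so check for that when you read the results.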
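Checking the third group could look roughly like this. It is a sketch in Python, not C#, and the function name split_line is hypothetical. Each line is matched separately, so the g and m flags from the literal above are not needed here.

```python
import re

# Same alternation as above: "country + key" in groups 1 and 2,
# or a country-only line in group 3.
PATTERN = re.compile(r"(?:^(.+)\s+(Xxx|[A-Z]_.+)$|^(.+)$)")


def split_line(line):
    """Return (country, key) for one line; key is None for country-only lines."""
    m = PATTERN.match(line)
    if m is None:
        return None  # e.g. an empty line
    if m.group(3) is not None:
        # Second alternative matched: the whole line is the country.
        return (m.group(3), None)
    return (m.group(1), m.group(2))


for line in ["United Kingdom", "United Kingdom F_Doh Lorem", "Macedonia F.Y.R. Xxx"]:
    print(split_line(line))
```

Folding group 3 back into the first slot means the caller never needs to know which alternative matched.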
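Try this (case-insensitive):
^([A-Z]+(?:\s+(?!Xxx)[A-Z]+)*(?:\s+\([^)]+\))?)(?:\s+(Xxx|(?:[-A-Z_.]+(?:\s+[-A-Z_.]+)*)))?$
It matches all of your examples, although, as the demo output shows, the Macedonia F.Y.R. lines are split after "Macedonia", because the first group only accepts letters and an optional parenthesized suffix. Frankly, though, you should delimit your data properly.
Demo:
$ perl -ne '/^([A-Z]+(?:\s+(?!Xxx)[A-Z]+)*(?:\s+\([^)]+\))?)(?:\s+(Xxx|(?:[-A-Z_.]+(?:\s+[-A-Z_.]+)*)))?$/i and print "MATCH: group 1 is \"$1\", group 2 is \"$2\"\n"' <<EOF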
> Belgium
> Belgium M_Foo
> Belgium A_Bar
> Belgium M_FooBar
> Belgium S_Whooptee Doo
> Belgium Xxx
> Belgium S_Foo Bar
> United Kingdom
> United Kingdom W_Foo-Bar
> United Kingdom M_Yay
> United Kingdom Xxx
> United Kingdom S_Derp
> United Kingdom F_Doh Lorem
> United Kingdom S_Ipsum Dolor
> United States of America L_Foo
> Macedonia F.Y.R. Xxx
> Macedonia F.Y.R. S_Foo Bar
> Cyprus (Greek) M_Foo
> Congo (Democratic Republic of)
> Congo (Democratic Republic of) Q_Yolo
> EOF
MATCH: group 1 is "Belgium", group 2 is ""
MATCH: group 1 is "Belgium", group 2 is "M_Foo"
MATCH: group 1 is "Belgium", group 2 is "A_Bar"
MATCH: group 1 is "Belgium", group 2 is "M_FooBar"
MATCH: group 1 is "Belgium", group 2 is "S_Whooptee Doo"
MATCH: group 1 is "Belgium", group 2 is "Xxx"
MATCH: group 1 is "Belgium", group 2 is "S_Foo Bar"
MATCH: group 1 is "United Kingdom", group 2 is ""
MATCH: group 1 is "United Kingdom", group 2 is "W_Foo-Bar"
MATCH: group 1 is "United Kingdom", group 2 is "M_Yay"
MATCH: group 1 is "United Kingdom", group 2 is "Xxx"
MATCH: group 1 is "United Kingdom", group 2 is "S_Derp"
MATCH: group 1 is "United Kingdom", group 2 is "F_Doh Lorem"
MATCH: group 1 is "United Kingdom", group 2 is "S_Ipsum Dolor"
MATCH: group 1 is "United States of America", group 2 is "L_Foo"
MATCH: group 1 is "Macedonia", group 2 is "F.Y.R. Xxx"
MATCH: group 1 is "Macedonia", group 2 is "F.Y.R. S_Foo Bar"
MATCH: group 1 is "Cyprus (Greek)", group 2 is "M_Foo"
MATCH: group 1 is "Congo (Democratic Republic of)", group 2 is ""
MATCH: group 1 is "Congo (Democratic Republic of)", group 2 is "Q_Yolo"