Regex matching (greedy / non-greedy?)

Keywords: greedy, regular expression | Updated: 2023-09-27 17:54:20

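I'm having some trouble picking this data apart. Helper functions and the like are an option, but I'd really like to solve this with a regular expression alone (and process the match groups after matching).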

Here is (part of) the data I have:

Belgium
Belgium M_Foo
Belgium A_Bar
Belgium M_FooBar
Belgium S_Whooptee Doo
Belgium Xxx
Belgium S_Foo Bar
United Kingdom
United Kingdom W_Foo-Bar
United Kingdom M_Yay
United Kingdom Xxx
United Kingdom S_Derp
United Kingdom F_Doh Lorem
United Kingdom S_Ipsum Dolor
United States of America L_Foo
Macedonia F.Y.R. Xxx
Macedonia F.Y.R. S_Foo Bar
Cyprus (Greek) M_Foo
Congo (Democratic Republic of)
Congo (Democratic Republic of) Q_Yolo

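Essentially this is an array of "key/value"-style strings. Each one contains a country name (these are not standardized, so I can't use hard-coded country names or a lookup; it could also be some other string rather than a country name), optionally followed by either the keyword Xxx or <random_upcase_char>_<random_text>.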

I came up with the following regex:

^(.+?)(?:\s+(Xxx|[A-Z]_.*)?)

Or, with a small difference in the first capture group:

^(.*?)(?:\s+(Xxx|[A-Z]_.*)?)

This works fine for the first strings, the ones starting with Belgium. For those records it returns the following:

Group 1     Group 2
================================
Belgium
Belgium     M_Foo
Belgium     A_Bar
Belgium     M_FooBar
Belgium     S_Whooptee Doo
Belgium     Xxx
Belgium     S_Foo Bar

However, the following lines cause problems:

Group 1     Group 2
================================
United
United
United
United
United
United
United
United
Macedonia
Macedonia
Cyprus
Congo
Congo

What I'd like the regex to produce is:

Group 1                         Group 2
================================================
United Kingdom
United Kingdom                  W_Foo-Bar
United Kingdom                  M_Yay
United Kingdom                  Xxx
United Kingdom                  S_Derp
United Kingdom                  F_Doh Lorem
United Kingdom                  S_Ipsum Dolor
United States of America        L_Foo
Macedonia F.Y.R.                Xxx
Macedonia F.Y.R.                S_Foo Bar
Cyprus (Greek)                  M_Foo
Congo (Democratic Republic of)
Congo (Democratic Republic of)  Q_Yolo

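But I can't get the first part to match. I'm fairly sure it has to do with the greedy/non-greedy options, but after fiddling with it for a while I can't get it to work…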

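I don't mind if extra/other/more match groups are returned. The regex is intended for use in a .NET C# application (in case you're wondering which "dialect" this is).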


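Sometimes anchoring is very important for non-greedy matching. Without an end anchor, the lazy (.+?) stops at the first whitespace, because everything after that whitespace is optional. In this case, anchoring to the end of the line solves the problem. Your regex should be: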

^(.+?)(?:\s+(Xxx|[A-Z]_.*))?$

Note that I also moved the optional quantifier (?) out one grouping level, so the whitespace is optional as well.
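To make the effect of the end anchor concrete, here is a minimal sketch. It is written in Python purely for illustration (the question targets .NET, but the lazy quantifier, the non-capturing group and the $ anchor used here behave the same way in both engines). The helper name split_line and the sample lines are assumptions made for this demo.

```python
import re

# The question's attempt: lazy group 1, only the inner capture is optional,
# and nothing forces the match to reach the end of the line.
UNANCHORED = re.compile(r"^(.+?)(?:\s+(Xxx|[A-Z]_.*)?)")
# The fix: the whole tail is optional and the match must end at end of line.
ANCHORED = re.compile(r"^(.+?)(?:\s+(Xxx|[A-Z]_.*))?$")


def split_line(pattern, line):
    """Return (group 1, group 2) for a line, or None if it does not match."""
    m = pattern.match(line)
    return (m.group(1), m.group(2)) if m else None


samples = [
    "Belgium",
    "United Kingdom W_Foo-Bar",
    "Macedonia F.Y.R. Xxx",
    "Congo (Democratic Republic of)",
]
for line in samples:
    # With the anchor, group 1 keeps growing until the rest fits up to "$".
    print(split_line(ANCHORED, line))
```

With the unanchored pattern, "United Kingdom W_Foo-Bar" yields "United" for group 1 and nothing for group 2, which is the behaviour described in the question.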

I got what you want with this regex (run in multiline mode):

^((?:.+?| )+?)(?:\s+(Xxx|[A-Z]_.*)|\s)?$

With your input it gives me this result:

1: Belgium                  2: 
1: Belgium                  2: M_Foo
1: Belgium                  2: A_Bar
1: Belgium                  2: M_FooBar
1: Belgium                  2: S_Whooptee Doo
1: Belgium                  2: Xxx
1: Belgium                  2: S_Foo Bar
1: United Kingdom           2: 
1: United Kingdom           2: W_Foo-Bar
1: United Kingdom           2: M_Yay
1: United Kingdom           2: Xxx
1: United Kingdom           2: S_Derp
1: United Kingdom           2: F_Doh Lorem
1: United Kingdom           2: S_Ipsum Dolor
1: United States of America 2: L_Foo
1: Macedonia F.Y.R.         2: Xxx
1: Macedonia F.Y.R.         2: S_Foo Bar
1: Cyprus (Greek)           2: M_Foo

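/(?:^(.+)\s+(Xxx|[A-Z]_.+)$|^(.+)$)/gm will match all of your strings. However, any line that contains only a country ends up in the third group, so check for that when you read the results.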
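As a rough sketch of how that third-group check might look, here is the same pattern applied one line at a time, so no global or multiline flag is needed. It is in Python for illustration only; split_line is a made-up helper name, not something from the answer.

```python
import re

# Same alternation as above: groups 1 and 2 for "name value" lines,
# group 3 for lines that hold only a name.
PATTERN = re.compile(r"(?:^(.+)\s+(Xxx|[A-Z]_.+)$|^(.+)$)")


def split_line(line):
    """Return (name, value); value is None when the line holds only a name."""
    m = PATTERN.match(line)
    if m is None:
        return None
    if m.group(3) is not None:
        # Second alternative matched: the whole line is the name.
        return (m.group(3), None)
    return (m.group(1), m.group(2))


for line in ["United Kingdom", "United Kingdom F_Doh Lorem", "Cyprus (Greek) M_Foo"]:
    print(split_line(line))
```

Folding group 3 back into the first slot keeps the calling code from having to know which alternative matched.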

Try this (case-insensitive):

^([A-Z]+(?:\s+(?!Xxx)[A-Z]+)*(?:\s+\([^)]+\))?)(?:\s+(Xxx|(?:[-A-Z_.]+(?:\s+[-A-Z_.]+)*)))?$

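It matches all of your examples, although, as the demo output shows, the two Macedonia F.Y.R. lines are split after Macedonia, because group 1 only allows letters in each word. Frankly, though, you should delimit your data properly.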

Demo:

$ perl -ne '/^([A-Z]+(?:\s+(?!Xxx)[A-Z]+)*(?:\s+\([^)]+\))?)(?:\s+(Xxx|(?:[-A-Z_.]+(?:\s+[-A-Z_.]+)*)))?$/i and print "MATCH: group 1 is \"$1\", group 2 is \"$2\"\n"' <<EOF
> Belgium
> Belgium M_Foo
> Belgium A_Bar
> Belgium M_FooBar
> Belgium S_Whooptee Doo
> Belgium Xxx
> Belgium S_Foo Bar
> United Kingdom
> United Kingdom W_Foo-Bar
> United Kingdom M_Yay
> United Kingdom Xxx
> United Kingdom S_Derp
> United Kingdom F_Doh Lorem
> United Kingdom S_Ipsum Dolor
> United States of America L_Foo
> Macedonia F.Y.R. Xxx
> Macedonia F.Y.R. S_Foo Bar
> Cyprus (Greek) M_Foo
> Congo (Democratic Republic of)
> Congo (Democratic Republic of) Q_Yolo
> EOF
MATCH: group 1 is "Belgium", group 2 is ""
MATCH: group 1 is "Belgium", group 2 is "M_Foo"
MATCH: group 1 is "Belgium", group 2 is "A_Bar"
MATCH: group 1 is "Belgium", group 2 is "M_FooBar"
MATCH: group 1 is "Belgium", group 2 is "S_Whooptee Doo"
MATCH: group 1 is "Belgium", group 2 is "Xxx"
MATCH: group 1 is "Belgium", group 2 is "S_Foo Bar"
MATCH: group 1 is "United Kingdom", group 2 is ""
MATCH: group 1 is "United Kingdom", group 2 is "W_Foo-Bar"
MATCH: group 1 is "United Kingdom", group 2 is "M_Yay"
MATCH: group 1 is "United Kingdom", group 2 is "Xxx"
MATCH: group 1 is "United Kingdom", group 2 is "S_Derp"
MATCH: group 1 is "United Kingdom", group 2 is "F_Doh Lorem"
MATCH: group 1 is "United Kingdom", group 2 is "S_Ipsum Dolor"
MATCH: group 1 is "United States of America", group 2 is "L_Foo"
MATCH: group 1 is "Macedonia", group 2 is "F.Y.R. Xxx"
MATCH: group 1 is "Macedonia", group 2 is "F.Y.R. S_Foo Bar"
MATCH: group 1 is "Cyprus (Greek)", group 2 is "M_Foo"
MATCH: group 1 is "Congo (Democratic Republic of)", group 2 is ""
MATCH: group 1 is "Congo (Democratic Republic of)", group 2 is "Q_Yolo"
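
In the output above, the two Macedonia lines put F.Y.R. into group 2, because the words in group 1 may contain letters only. One possible adjustment, which is not part of the original answer, is to allow dots in those words as well. The sketch below is in Python rather than Perl, and the names TWEAKED and split_line are assumptions made for illustration.

```python
import re

# Shared tail: optional whitespace plus either Xxx or one or more value words.
TAIL = r"(?:\s+(Xxx|(?:[-A-Z_.]+(?:\s+[-A-Z_.]+)*)))?$"
# Pattern from the answer: name words consist of letters only.
ORIGINAL = re.compile(
    r"^([A-Z]+(?:\s+(?!Xxx)[A-Z]+)*(?:\s+\([^)]+\))?)" + TAIL, re.IGNORECASE
)
# Assumed tweak: also allow dots in name words, so "F.Y.R." stays in group 1.
TWEAKED = re.compile(
    r"^([A-Z.]+(?:\s+(?!Xxx)[A-Z.]+)*(?:\s+\([^)]+\))?)" + TAIL, re.IGNORECASE
)


def split_line(pattern, line):
    """Return (group 1, group 2) for a line, or None if it does not match."""
    m = pattern.match(line)
    return (m.group(1), m.group(2)) if m else None


line = "Macedonia F.Y.R. S_Foo Bar"
print(split_line(ORIGINAL, line))  # name is cut off after "Macedonia"
print(split_line(TWEAKED, line))   # name keeps "F.Y.R."
```

This only covers the sample data shown here; names with other punctuation would need further changes, which is another argument for properly delimited input.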