How to make balancing groups capture
Updated: 2023-09-27 18:33:03
Suppose I have this text input:
tes{}tR{R{abc}aD{mnoR{xyz}}}
I want to extract the following output:
R{abc}
R{xyz}
D{mnoR{xyz}}
R{R{abc}aD{mnoR{xyz}}}
Currently, I can only extract the content inside the {} groups, using the balancing group approach from MSDN. Here is the pattern:
^[^{}]*(((?'Open'{)[^{}]*)+((?'Target-Open'})[^{}]*)+)*(?(Open)(?!))$
Does anyone know how to include the R{} and D{} in the output?
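I think a different approach is needed here. Once you match the first, larger group R{R{abc}aD{mnoR{xyz}}} (see my comment about the possible typo), you won't be able to get the subgroups, because the regex has already consumed them and cannot capture the individual R{ ... } groups.
So there has to be some way to capture without consuming, and the obvious way to do that is a positive lookahead. From there, you can reuse the expression you had, with a few changes to suit the new focus. I came up with: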
(?=([A-Z](?:(?:(?'O'{)[^{}]*)+(?:(?'-O'})[^{}]*?)+)+(?(O)(?!))))
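[I also renamed "Open" to "O" and removed the named capture for the closing brace, to make it shorter and to avoid noise in the matches]
On regexhero.net (the only free .NET regex tester I know of so far), I got the following capture groups: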
1: R{R{abc}aD{mnoR{xyz}}}
1: R{abc}
1: D{mnoR{xyz}}
1: R{xyz}
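The "capture without consuming" idea does not depend on balancing groups, so it can be shown on its own. The sketch below is in Java rather than C#, and the class name, method name, and sample input are made up for illustration. It runs the same capture group once as a normal consuming match and once wrapped in a lookahead:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapDemo {
    // Collects group 1 of every match of the given pattern in the input.
    static List<String> captures(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Consuming match: each letter can belong to at most one match.
        System.out.println(captures("([A-Z]{2})", "ABCD"));     // [AB, CD]
        // Lookahead: the match itself is empty, so the scan advances one
        // character at a time and the captured text may overlap.
        System.out.println(captures("(?=([A-Z]{2}))", "ABCD")); // [AB, BC, CD]
    }
}
```

Because the lookahead match has zero length, the engine retries at every position, which is what lets the nested R{...} and D{...} groups show up as separate captures.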
Breakdown of the regex:
(?= # Opening positive lookahead
([A-Z] # Opening capture group and any uppercase letter (to match R & D)
(?: # First non-capture group opening
(?: # Second non-capture group opening
(?'O'{) # Get the named opening brace
[^{}]* # Any non-brace
)+ # Close of second non-capture group and repeat over as many times as necessary
(?: # Third non-capture group opening
(?'-O'}) # Removal of named opening brace when encountered
[^{}]*? # Any other non-brace characters in case there are more nested braces
)+ # Close of third non-capture group and repeat over as many times as necessary
)+ # Close of first non-capture group and repeat as many times as necessary for multiple side by side nested braces
(?(O)(?!)) # Condition to prevent unbalanced braces
) # Close capture group
) # Close positive lookahead
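The following will not work in C#
I actually wanted to try how this would work on the PCRE engine, since it offers recursive regex. I find that easier because I'm more familiar with it, and it yields a shorter regex :)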
(?=([A-Z]{(?:[^{}]|(?1))+}))
regex101 demo
(?= # Opening positive lookahead
([A-Z] # Opening capture group and any uppercase letter (to match R & D)
{ # Opening brace
(?: # Opening non-capture group
[^{}] # Matches non braces
| # OR
(?1) # Recurse first capture group
)+ # Close non-capture group and repeat as many times as necessary
} # Closing brace
) # Close of capture group
) # Close of positive lookahead
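For engines with neither recursion nor balancing groups (java.util.regex is one), the recursive pattern can be mirrored by a small hand-written recursive matcher. This is only a sketch of that idea, not part of the PCRE answer; the class and method names are hypothetical. matchGroup plays the role of capture group 1, and the loop in findAll plays the role of the lookahead by trying every start position:

```java
import java.util.ArrayList;
import java.util.List;

class RecursiveGroups {
    // Mirrors [A-Z]{(?:[^{}]|(?1))+}: returns the index just past the group
    // that starts at pos, or -1 if no balanced group starts there.
    static int matchGroup(String s, int pos) {
        if (pos + 1 >= s.length()) return -1;
        char c = s.charAt(pos);
        if (c < 'A' || c > 'Z' || s.charAt(pos + 1) != '{') return -1;
        int i = pos + 2;
        int items = 0; // the regex uses +, so at least one item is required
        while (i < s.length() && s.charAt(i) != '}') {
            int nested = matchGroup(s, i);   // the (?1) branch
            if (nested >= 0) {
                i = nested;
            } else if (s.charAt(i) != '{') { // the [^{}] branch
                i++;
            } else {
                return -1;                   // a bare '{' fits neither branch
            }
            items++;
        }
        return (i < s.length() && items > 0) ? i + 1 : -1;
    }

    // Tries every start position, so nested groups are reported as well.
    static List<String> findAll(String s) {
        List<String> out = new ArrayList<>();
        for (int pos = 0; pos < s.length(); pos++) {
            int end = matchGroup(s, pos);
            if (end >= 0) out.add(s.substring(pos, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints the four groups, outermost first (ordered by start position).
        findAll("tes{}tR{R{abc}aD{mnoR{xyz}}}").forEach(System.out::println);
    }
}
```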
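I'm not sure a single regex can do what you need: those nested substrings always mess it up.
One solution could be the following algorithm (written in Java, but I guess converting it to C# won't be that hard):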
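import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**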
* Finds all matches (i.e. including sub/nested matches) of the regex in the input string.
*
* @param input
* The input string.
* @param regex
* The regex pattern. It has to target the most nested substrings. For example, given the following input string
* <code>A{01B{23}45C{67}89}</code>, if you want to catch every <code>X{*}</code> substrings (where <code>X</code> is a capital letter),
* you have to use <code>[A-Z][{][^{]+?[}]</code> or <code>[A-Z][{][^{}]+[}]</code> instead of <code>[A-Z][{].+?[}]</code>.
* @param format
* The format must follow the <a href= "http://docs.oracle.com/javase/7/docs/api/java/util/Formatter.html#syntax" >format string
* syntax</a>. It will be given one single integer as argument, so it has to contain (and to contain only) a <code>%d</code> flag. The
* formatted replacement strings must not occur anywhere in the input string. If <code>null</code>, <code>ééé%dèèè</code> will be used.
* @return The list of all the matches of the regex in the input string.
*/
public static List<String> findAllMatches(String input, String regex, String format) {
if (format == null) {
format = "ééé%dèèè";
}
int counter = 0;
Map<String, String> matches = new LinkedHashMap<String, String>();
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
// if a substring has been found
while (matcher.find()) {
// create a unique replacement string using the counter
String replace = String.format(format, counter++);
// store the relation "replacement string --> initial substring" in a queue
matches.put(replace, matcher.group());
String end = input.substring(matcher.end(), input.length());
String start = input.substring(0, matcher.start());
// replace the found substring by the created unique replacement string
input = start + replace + end;
// reiterate on the new input string (faking the original matcher.find() implementation)
matcher = pattern.matcher(input);
}
List<Entry<String, String>> entries = new LinkedList<Entry<String, String>>(matches.entrySet());
// for each relation "replacement string --> initial substring" of the queue
for (int i = 0; i < entries.size(); i++) {
Entry<String, String> current = entries.get(i);
// for each relation that could have been found before the current one (i.e. more nested)
for (int j = 0; j < i; j++) {
Entry<String, String> previous = entries.get(j);
// if the current initial substring contains the previous replacement string
if (current.getValue().contains(previous.getKey())) {
// replace the previous replacement string by the previous initial substring in the current initial substring
current.setValue(current.getValue().replace(previous.getKey(), previous.getValue()));
}
}
}
return new LinkedList<String>(matches.values());
}
So, in your case:
String input = "tes{}tR{R{abc}aD{mnoR{xyz}}}";
String regex = "[A-Z][{][^{}]+[}]";
findAllMatches(input, regex, null);
returns:
R{abc}
R{xyz}
D{mnoR{xyz}}
R{R{abc}aD{mnoR{xyz}}}
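Balancing groups let you control exactly what gets captured, and the .NET regex engine keeps the full history of all captures of a group (unlike most other flavors, which only keep the last occurrence of each group).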
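For contrast, here is a small Java sketch of the "last occurrence only" behavior; Java is used purely as an example of such a flavor, and the class and method names are made up. A capture group inside a repetition reports only its final iteration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastCaptureDemo {
    // Matches the whole input with a repeated capture group and returns
    // what group 1 holds afterwards (null if the input does not match).
    static String lastCapture(String input) {
        Matcher m = Pattern.compile("(\\w)+").matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Group 1 matched 'a', then 'b', then 'c'; only the last one is kept.
        System.out.println(lastCapture("abc")); // c
    }
}
```

In .NET the same group would expose every iteration through its capture collection, which is what the rest of this answer relies on.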
The MSDN example is a bit too complicated. A simpler way to match nested structures is:
(?>
(?<O>)\p{Lu}\{ # Push to the O stack, and match an upper-case letter and {
| # OR
\}(?<-O>) # Match } and pop from the stack
| # OR
\p{Ll} # Match a lower-case letter
)+
(?(O)(?!)) # Make sure the stack is empty
Or in one line:
(?>(?<O>)\p{Lu}\{|\}(?<-O>)|\p{Ll})+(?(O)(?!))
Working example on Regex Storm
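In your example it also matches "tes" at the start of the string, but don't worry, we're not done yet.
With a small correction, we can also capture the occurrences between the R{ ... } pairs:
(?>(?<O>)\p{Lu}\{|\}(?<Target-O>)|\p{Ll})+(?(O)(?!))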
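Each Match has a Group called "Target", and each such Group has one Capture per occurrence - those captures are the only thing you care about.
Working example on Regex Storm - click the "Table" tab and check the 4 captures of ${Target}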
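The push and pop that the O stack performs can be imitated with an explicit stack, which may help when porting to an engine without balancing groups. The Java sketch below is an approximation, not a translation of the pattern: the class and method names are hypothetical, and it assumes that a '{' not preceded by an upper-case letter (such as the one in "tes{}") should be tracked but not reported.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class StackGroups {
    // Pushes the start index at every "X{" and pops at every '}', recording
    // the text from the pushed index through the closing brace.
    static List<String> extract(String s) {
        List<String> out = new ArrayList<>();
        Deque<Integer> open = new ArrayDeque<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '{') {
                boolean labelled = i > 0 && Character.isUpperCase(s.charAt(i - 1));
                // -1 marks a brace with no upper-case letter in front of it.
                open.push(labelled ? i - 1 : -1);
            } else if (c == '}' && !open.isEmpty()) {
                int start = open.pop();
                if (start >= 0) {
                    out.add(s.substring(start, i + 1));
                }
            }
        }
        // Unlike (?(O)(?!)), leftover open braces are silently ignored here.
        return out;
    }

    public static void main(String[] args) {
        // Groups are reported in the order in which they close.
        extract("tes{}tR{R{abc}aD{mnoR{xyz}}}").forEach(System.out::println);
    }
}
```

For the input from the question this yields R{abc}, R{xyz}, D{mnoR{xyz}} and R{R{abc}aD{mnoR{xyz}}}, in that order, because each group is recorded at the moment its closing brace is seen.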
See also:
- What are regular expression balancing groups?