How to write a regular expression that matches all occurrences except the first

Keywords: regular expression, how to write, first occurrence | Last updated: 2024-06-14 07:37:21

I am trying to write a regular expression that matches every image tag in an HTML file except the first one. For example:

<html><body><img src="foo"><span><img src="bar"></span><img src="foobar"></body></html>

So far I have only managed to create an expression that matches all image tags:

<img[^>]*>


Just use a real HTML parser such as HtmlAgilityPack to parse the HTML:

var html = @"<html><body><img src=""foo""><span><img src=""bar""></span><img src=""foobar""></body></html>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var imgLinks = doc.DocumentNode
                    .Descendants("img")
                    .Skip(1)
                    .Select(x => x.Attributes["src"])
                    .ToList();

Avoid doing it like this:

var pattern = @"<img[^>]*>"; // the pattern from your question
var imgs = Regex.Matches(html, pattern)
                .Cast<Match>()
                .Skip(1)
                .Select(m => m.Value)
                .ToList();
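
If a single pattern is really required, one option specific to .NET is a lookbehind, because the .NET regex engine accepts variable-length lookbehind assertions. The following is only a sketch and not part of the answers above: the names `ImgMatcher` and `AllButFirst` are invented for this illustration, and the pattern inherits every weakness of `<img[^>]*>` on real-world HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, named for this example only.
public static class ImgMatcher
{
    // (?<=<img\b[^>]*>.*) succeeds only when a complete <img ...> tag
    // appears somewhere earlier in the input, so the first tag is never matched.
    // Singleline lets '.' cross line breaks inside the lookbehind.
    private static readonly Regex AllButFirstImg = new Regex(
        @"(?<=<img\b[^>]*>.*)<img\b[^>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the text of every <img> tag except the first one.
    public static List<string> AllButFirst(string html)
    {
        return AllButFirstImg.Matches(html)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }
}
```

For the sample document in the question, `ImgMatcher.AllButFirst(html)` returns the `bar` and `foobar` tags. The lookbehind rescans the preceding text for every candidate, so on large inputs the `Skip(1)` version is likely to be cheaper.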

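In this answer I will demonstrate that tags can be matched with a regular expression, contrary to some of the comments, which hold that tags cannot be recognized without a full HTML/XML parser.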

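To demonstrate, I will use a subset of the XML grammar rules from the XML 1.1 specification at www.w3.org, extended to all rules reachable from STag and EmptyElemTag, which are the tags we want to match. Since none of these rules is recursive, I will show that this set of rules can be converted into a regular expression that parses start tags and empty-element tags.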

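Because XML uses UTF character encodings and allows characters beyond the range \u0000-\uffff, I have to choose some notation for character classes over the extended range. I will use a nonstandard extension of the \u notation that takes five hexadecimal digits instead of four, which simplifies converting this grammar to a regular expression (and covers the allowed characters in the range 0x10000-0xeffff).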

Borrowed from the XML 1.1 specification, the grammar for start tags and empty-element tags is:

STag ::= '<' Name (S Attribute)* S? '>'
EmptyElemTag ::= '<' Name (S Attribute)* S? '/>'
Name ::= (NameStartChar NameChar*)
NameChar ::= (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f-\u02040])
NameStartChar ::= ([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])
S ::= ([\u00020\u00009\u0000d\u0000a]+)
Attribute ::= (Name Eq AttValue)
Eq ::= (S? '=' S?)
AttValue ::= ( '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'" )
Reference ::= (EntityRef | CharRef)
EntityRef ::= ('&' Name ';')
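CharRef ::= ('&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')

To construct a regular expression that accepts both start tags and empty-element tags, I begin with the grammar above and build a simple start rule that accepts either kind of tag: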

Start ::= STag | EmptyElemTag

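Then I replace every nonterminal with the right-hand side of its rule (properly parenthesized) until only terminal elements and regular-expression operators remain on the right-hand side: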

Start ::= '<' Name (S Attribute)* S? '>' | '<' Name (S Attribute)* S? '/>'

With a little manipulation I can group the common terms and obtain:

Start ::= '<' Name (S Attribute)* S? '/'?'>'

Now substitute Attribute:

Start ::= '<' Name (S Name Eq AttValue)* S? '/'? '>'

Now substitute AttValue:

Start ::= '<' Name (S Name Eq ('"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'" ))* S? '/'? '>'

Now substitute Reference:

Start ::= '<' Name (S Name Eq ('"' ([^<&"] | EntityRef | CharRef)* '"' | "'" ([^<&'] | EntityRef | CharRef)* "'" ))* S? '/'? '>'

Now substitute EntityRef:

Start ::= '<' Name (S Name Eq ('"' ([^<&"] | '&' Name ';' | CharRef)* '"' | "'" ([^<&'] | '&' Name ';' | CharRef)* "'" ))* S? '/'? '>'

Now substitute CharRef:

Start ::= '<' Name (S Name Eq ('"' ([^<&"] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* S? '/'? '>'

Now Eq:

Start ::= '<' Name (S Name S? '=' S? ('"' ([^<&"] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* S? '/'? '>'

Next, S:

Start ::= '<' Name (([\u00020\u00009\u0000d\u0000a]+) Name ([\u00020\u00009\u0000d\u0000a]+)? '=' ([\u00020\u00009\u0000d\u0000a]+)? ('"' ([^<&"] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* ([\u00020\u00009\u0000d\u0000a]+)? '/'? '>'

Now substitute Name:

Start ::= '<' (NameStartChar NameChar*) (([\u00020\u00009\u0000d\u0000a]+) (NameStartChar NameChar*) ([\u00020\u00009\u0000d\u0000a]+)? '=' ([\u00020\u00009\u0000d\u0000a]+)? ('"' ([^<&"] | '&' (NameStartChar NameChar*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' (NameStartChar NameChar*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* ([\u00020\u00009\u0000d\u0000a]+)? '/'? '>'

Now substitute NameChar:

Start ::= '<' (NameStartChar (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f-\u02040])*) (([\u00020\u00009\u0000d\u0000a]+) (NameStartChar (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f-\u02040])*) ([\u00020\u00009\u0000d\u0000a]+)? '=' ([\u00020\u00009\u0000d\u0000a]+)? ('"' ([^<&"] | '&' (NameStartChar (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f-\u02040])*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' (NameStartChar (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f-\u02040])*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* ([\u00020\u00009\u0000d\u0000a]+)? '/'? '>'

Finally, substitute NameStartChar:

Start ::= '<' (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) | [-.0-9\u000b7\u00300-\u0036f\u0203f-\u02040])*) (([\u00020\u00009\u0000d\u0000a]+) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) | [-.0-9\u000b7\u00300-\u0036f\u0203f-\u02040])*) ([\u00020\u00009\u0000d\u0000a]+)? '=' ([\u00020\u00009\u0000d\u0000a]+)?
('"' ([^<&"] | '&' (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) | [-.0-9\u000b7\u00300-\u0036f\u0203f-\u02040])*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) | [-.0-9\u000b7\u00300-\u0036f\u0203f-\u02040])*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* ([\u00020\u00009\u0000d\u0000a]+)? '/'? '>'

Finally, after replacing each quoted terminal 'c' with c and removing the unneeded whitespace, the resulting regular expression is:

<(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])|[-.0-9\u000b7\u00300-\u0036f\u0203f-\u02040])*)(([\u00020\u00009\u0000d\u0000a]+)(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])|[-.0-9\u000b7\u00300-\u0036f\u0203f-\u02040])*)([\u00020\u00009\u0000d\u0000a]+)?=([\u00020\u00009\u0000d\u0000a]+)?(\"([^<&\"]|&(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])|[-.0-9\u000b7\u00300-\u0036f\u0203f-\u02040])*);|&#[0-9]+;|&#x[0-9a-fA-F]+;)*\"|\'([^<&\']|&(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])|[-.0-9\u000b7\u00300-\u0036f\u0203f-\u02040])*);|&#[0-9]+;|&#x[0-9a-fA-F]+;)*\'))*([\u00020\u00009\u0000d\u0000a]+)?/?>
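
The hand substitution above can also be left to the compiler by composing the pattern from one constant per grammar rule. This is a rough sketch under a few assumptions, not the answer's own code: the class name `XmlTagRegex` is invented, groups are made non-capturing, and because .NET patterns operate on UTF-16 code units the nonstandard five-digit range \u10000-\ueffff is written as the surrogate pairs [\uD800-\uDB7F][\uDC00-\uDFFF].

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical container, named for this example only.
public static class XmlTagRegex
{
    // NameStartChar: BMP ranges, plus U+10000-U+EFFFF as UTF-16 surrogate pairs.
    private const string NameStartChar =
        @"(?:[:A-Za-z_\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF" +
        @"\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]" +
        @"|[\uD800-\uDB7F][\uDC00-\uDFFF])";
    private const string NameChar =
        "(?:" + NameStartChar + @"|[-.0-9\u00B7\u0300-\u036F\u203F-\u2040])";
    private const string Name = NameStartChar + NameChar + "*";
    private const string S = @"[\u0020\u0009\u000D\u000A]+";
    private const string Eq = "(?:" + S + ")?=(?:" + S + ")?";
    // Reference ::= EntityRef | CharRef, with the shared '&' and ';' factored out.
    private const string Reference = "&(?:" + Name + "|#[0-9]+|#x[0-9a-fA-F]+);";
    private const string AttValue =
        "(?:\"(?:[^<&\"]|" + Reference + ")*\"|'(?:[^<&']|" + Reference + ")*')";
    private const string Attribute = Name + Eq + AttValue;

    // Start ::= '<' Name (S Attribute)* S? '/'? '>'
    public const string Start =
        "<" + Name + "(?:" + S + Attribute + ")*(?:" + S + ")?/?>";

    public static readonly Regex Tag = new Regex(Start, RegexOptions.CultureInvariant);
}
```

`XmlTagRegex.Tag` accepts a tag such as `<img src="foo" alt='a&amp;b' />`, but because it follows the XML grammar it rejects HTML-only forms such as the unquoted attribute in `<img src=foo>`.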

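Of course, other regular expressions can also match start and empty-element tags, but this is one of the simplest I was able to derive to handle the scenario pointed out in the comments.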

A simpler one might be:

<[iI][mM][gG][ \t\n\r]+([^>"']|"[^"]*"|'[^']*')*>

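provided you are not dealing with UTF characters outside the range \u0000-\u007f (the ASCII range) and you know the HTML file is valid. (This last pattern may be wrong; use it with care, since I built it in my head and it may mishandle some odd cases.)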
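
To see what the quote-aware alternatives in the simpler pattern buy, it can be compared with the question's `<img[^>]*>` on an attribute value that contains `>`. A small sketch; the class and field names (`SimpleImgRegex`, `Naive`, `QuoteAware`) are made up for the comparison.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical container, named for this example only.
public static class SimpleImgRegex
{
    // The pattern from the question: stops at the first '>' it sees.
    public static readonly Regex Naive = new Regex(@"<img[^>]*>");

    // The simpler pattern above: quoted attribute values are consumed whole,
    // so a '>' inside quotes does not end the tag.
    public static readonly Regex QuoteAware = new Regex(
        @"<[iI][mM][gG][ \t\n\r]+([^>""']|""[^""]*""|'[^']*')*>");
}
```

On the input `<img src="a>b" alt='x'>`, `Naive` stops at `<img src="a>` while `QuoteAware` matches the whole tag. Note that `QuoteAware` requires at least one whitespace character after the tag name, so a bare `<img>` is not matched.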