How to write a regex to split an HTML string into parts

Last updated: 2023-09-27 17:55:06

I have this string:

This is sample <p id="short"> the value of short </p> <p id="medium"> the value of medium </p> <p id="large"> the value of large</p>

I want to split it into these pieces:

string before the p tags: This is sample
short: the value of short
medium: the value of medium
large: the value of large

If you don't mind a non-regex solution (HTML is not a regular language, after all), you can use this:

// Requires: using System.Linq; using System.Xml.Linq;
string input = @"This is sample <p id=""short""> the value of short </p> <p id=""medium""> the value of medium </p> <p id=""large""> the value of large</p>";

// Everything before the first tag is the leading text.
string before = input.Substring(0, input.IndexOf("<"));
// Wrap the remaining markup in a root element so it parses as XML.
string xmlWrapper = "<html>" + input.Substring(input.IndexOf("<")) + "</html>";
XElement xElement = XElement.Parse(xmlWrapper);
// The (string) cast yields null instead of throwing when a <p> has no id attribute.
var shortElement =
    xElement.Elements().SingleOrDefault(p => p.Name == "p" && (string)p.Attribute("id") == "short");
var shortValue = shortElement != null ? shortElement.Value : string.Empty;
var mediumElement =
    xElement.Elements().SingleOrDefault(p => p.Name == "p" && (string)p.Attribute("id") == "medium");
var mediumValue = mediumElement != null ? mediumElement.Value : string.Empty;
var largeElement =
    xElement.Elements().SingleOrDefault(p => p.Name == "p" && (string)p.Attribute("id") == "large");
var largeValue = largeElement != null ? largeElement.Value : string.Empty;

Here is my attempt:

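// Requires: using System.Text.RegularExpressions;
// Lazy .*? between the paragraphs avoids the backtracking caused by greedy .* runs.
var regex = new Regex("(?<text>.*?)<p.*?>(?<small>.*?)</p>.*?<p.*?>(?<medium>.*?)</p>.*?<p.*?>(?<large>.*?)</p>");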
var htmlsnip = @"This is sample <p id=""short""> the value of short </p> <p id=""medium""> the value of medium </p> <p id=""large""> the value of large</p>";
var match = regex.Match(htmlsnip);
var text = match.Groups["text"].Value;
var small = match.Groups["small"].Value;
var medium = match.Groups["medium"].Value;
var large = match.Groups["large"].Value;

(?<string_before_p_tags>[^<]*)<p id="short">(?<short>.*)</p>\s*<p id="medium">(?<medium>.*)</p>\s*<p id="large">(?<large>.*)</p>

This returns the named capture groups:

string_before_p_tags: This is sample
short: the value of short
medium: the value of medium
large: the value of large

Building on Bala R's answer, here is a more concise XPath approach:

// Requires: using System.Linq; using System.Xml.Linq; using System.Xml.XPath;
string input = @"This is sample <p id=""short""> the value of short </p> <p id=""medium""> the value of medium </p> <p id=""large""> the value of large</p>";
var xmlWrapper = "<html>" + input + "</html>";
var elements = XElement.Parse(xmlWrapper).XPathSelectElements("/*").ToList();
var text = elements[0].PreviousNode.ToString();
var small = elements[0].Value;
var medium = elements[1].Value;
var large = elements[2].Value;

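First of all, as has been said here many times, you should not use regex to parse HTML, for several reasons (mainly that HTML is not a regular language). You should use an HTML parser.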

However, if some constraint prevents you from using an HTML parser, you can do the following:

1. string before p tags - ^[^<]*
2. short - <p id="short">([\w\s]*)
3. medium - <p id="medium">([\w\s]*)
4. large - <p id="large">([\w\s]*)
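
As a rough sketch of how the patterns above could be applied in C#, the three per-id patterns can be folded into one pattern that captures each paragraph's id and text. The class name ParagraphSplitter, the method Split, and the "before" key are hypothetical names chosen for illustration. The sketch assumes double-quoted id attributes and no nested tags inside the paragraphs, so it is not a general HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ParagraphSplitter
{
    // Hypothetical helper: returns the text before the first tag under the key
    // "before", plus one entry per <p id="..."> element keyed by its id.
    public static Dictionary<string, string> Split(string html)
    {
        var parts = new Dictionary<string, string>();

        // Leading text: everything up to the first '<'.
        parts["before"] = Regex.Match(html, @"^[^<]*").Value.Trim();

        // One match per paragraph. Assumes double-quoted ids and no nested tags.
        var paragraph = new Regex(@"<p\s+id=""(?<id>[^""]*)""\s*>(?<value>[^<]*)</p>");
        foreach (Match m in paragraph.Matches(html))
        {
            // Trim the whitespace that surrounds the text inside each paragraph.
            parts[m.Groups["id"].Value] = m.Groups["value"].Value.Trim();
        }
        return parts;
    }
}
```

Calling ParagraphSplitter.Split on the sample string from the question should give "This is sample" under "before" and the trimmed paragraph text under "short", "medium" and "large". Note that a paragraph whose id is literally "before" would overwrite the leading text in this sketch.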

Cheers.

It's very simple with HtmlAgilityPack:

// Requires the HtmlAgilityPack package, plus:
// using System; using System.Collections.Specialized; using HtmlAgilityPack;
string html = "This is sample <p id=\"short\"> the value of short </p> <p id=\"medium\"> the value of medium </p> <p id=\"large\"> the value of large</p>";
NameValueCollection output = new NameValueCollection();
string[] pIds = { "short", "medium", "large" };
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (string id in pIds)
{
    // Map each paragraph id to that paragraph's inner HTML.
    output.Add(id, doc.GetElementbyId(id).InnerHtml);
}
// Each key in output is a paragraph id; its value is that paragraph's content. Example:
Console.WriteLine(output["short"]);

Output: the value of short