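Using a regular expression to get the text between multiple HTML tags

Keywords: get text between, HTML, regular expression | Updated: 2023-09-27 18:09:02

Using regex, I want to be able to get the text between multiple div tags. For example: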

<div>first html tag</div>
<div>another tag</div>

Would output:

first html tag
another tag

The regex pattern I am using only matches my last div tag and misses the first one. Code:

    static void Main(string[] args)
    {
        string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
        string pattern = "(<div.*>)(.*)(<\\/div>)";
        MatchCollection matches = Regex.Matches(input, pattern);
        Console.WriteLine("Matches found: {0}", matches.Count);
        if (matches.Count > 0)
            foreach (Match m in matches)
                Console.WriteLine("Inner DIV: {0}", m.Groups[2]);
        Console.ReadLine();
    }

Output:

Matches found: 1

Inner DIV: This is ANOTHER test


Replace your pattern with a non-greedy (lazy) one:

static void Main(string[] args)
{
    string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
    string pattern = "<div.*?>(.*?)<\\/div>";
    MatchCollection matches = Regex.Matches(input, pattern);
    Console.WriteLine("Matches found: {0}", matches.Count);
    if (matches.Count > 0)
        foreach (Match m in matches)
            Console.WriteLine("Inner DIV: {0}", m.Groups[1]);
    Console.ReadLine();
}
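
To see why the lazy quantifier matters, here is a minimal sketch that runs a greedy and a lazy pattern through the same helper. The class name `GreedyVsLazy` and the method `InnerTexts` are hypothetical names made up for this illustration; only `Regex.Matches` and `Match.Groups` come from .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Hypothetical helper: collects one capture group from every match.
    public static List<string> InnerTexts(string input, string pattern, int group)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            result.Add(m.Groups[group].Value);
        }
        return result;
    }
}
```

Called with the input from the question, `InnerTexts(input, "(<div.*>)(.*)(</div>)", 2)` should return a single item, because the greedy `.*` inside the opening tag swallows the whole first div. `InnerTexts(input, "<div.*?>(.*?)</div>", 1)` should return both texts, since each `.*?` stops at the earliest position that lets the rest of the pattern match.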

Since nobody else has mentioned HTML tags with attributes, here is my solution:

// <TAG(.*?)>(.*?)</TAG>
// Example
var regex = new System.Text.RegularExpressions.Regex("<h1(.*?)>(.*?)</h1>");
var m = regex.Match("Hello <h1 style='color: red;'>World</h1> !!");
Console.Write(m.Groups[2].Value); // will print -> World

I think this code should work:

string htmlSource = "<div>first html tag</div><div>another tag</div>";
string pattern = @"<div[^>]*?>(.*?)</div>";
MatchCollection matches = Regex.Matches(htmlSource, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
ArrayList l = new ArrayList();
foreach (Match match in matches)
{
    l.Add(match.Groups[1].Value);
}

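First, keep in mind that a real HTML file will contain newline characters ("\n"), which you did not include in the string you are using to test your regex.

Second, take your regex: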

((<div.*>)(.*)(<\/div>))+ // This regex looks for any number of div tags, but it must see at least one.
((<div.*>)(.*)(<\/div>))* // This regex looks for any number of div tags, and it will not complain if there are none at all.

This is also a good place to look up this kind of information:

http://www.regular-expressions.info/reference.html

http://www.regular-expressions.info/refadv.html

Mayman
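
The newline point above can be sketched as follows. By default `.` does not match "\n" in .NET, and `RegexOptions.Singleline` changes that. The class name `MultilineDivs`, the method `InnerTexts`, and the pattern `<div[^>]*>(.*?)</div>` are assumptions chosen for this illustration, not something the answer prescribes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MultilineDivs
{
    // Hypothetical helper: extracts div contents using the given regex options.
    public static List<string> InnerTexts(string html, RegexOptions options)
    {
        var result = new List<string>();
        // [^>]* skips any attributes; (.*?) lazily captures the content.
        foreach (Match m in Regex.Matches(html, "<div[^>]*>(.*?)</div>", options))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```

For the input `"<div>first\nhtml tag</div>\n<div>another tag</div>"`, calling the helper with `RegexOptions.None` should find only the second div, because the first one contains a newline that `.` cannot cross. With `RegexOptions.Singleline` both divs should be returned.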

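In short, you cannot do this correctly in all cases. There will always be valid HTML from which a regular expression fails to extract the information you want.

The reason is that HTML is described by a context-free grammar, which is a more complex class than the one regular expressions can handle.

Here is an example: what if you have multiple nested divs?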

<div><div>stuff</div><div>stuff2</div></div>

The regular expressions listed in the other answers will grab:

<div><div>stuff</div>
<div>stuff</div>
<div>stuff</div><div>stuff2</div>
<div>stuff</div><div>stuff2</div></div>
<div>stuff2</div>
<div>stuff2</div></div>

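Because that is what regular expressions do when they are used to parse HTML.

You cannot write a regular expression that understands how to interpret every case, because regular expressions are not capable of that. If you are dealing with a very specific, constrained set of HTML it may be possible, but you should keep this limitation in mind.

More info: https://stackoverflow.com/a/1732454/2022565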
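
As a rough illustration of the nesting problem, the sketch below applies one lazy pattern to the nested example. Which spans get grabbed depends on the pattern and on the engine, so this shows only what .NET does with this particular pattern. The class name `NestedDivs` and the method `Captures` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedDivs
{
    // Hypothetical helper: returns what a lazy div pattern captures.
    public static List<string> Captures(string html)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(html, "<div[^>]*>(.*?)</div>"))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```

For `<div><div>stuff</div><div>stuff2</div></div>` this pattern should yield the captures `<div>stuff` and `stuff2`. The outer opening tag gets paired with the first inner closing tag, and the outer div is never seen as a unit, because the pattern has no way to count nesting depth.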

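Have you looked at the Html Agility Pack (see https://stackoverflow.com/a/857926/618649)?

CsQuery also looks useful (it basically uses CSS-selector-style syntax to get elements). See https://stackoverflow.com/a/11090816/618649.

CsQuery basically means "jQuery for C#", which is almost the exact search term I used to find it.

If you can do this in a web browser, you could easily use jQuery, with syntax similar to $("div").each(function(idx){ alert(idx + ": " + $(this).text()); }); (only you would obviously send the result to a log, or to the screen, or use it in a web service call, or whatever you need to do with it).

I hope the following regex will work: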

<div.*?>(.*?)<*.div>

You will get the output you want:

This is a testThis is ANOTHER test