How to stop at the first occurrence of a string that appears multiple times

Keywords: string, first occurrence | Updated: 2023-09-27 17:49:41

I am currently writing a script that parses content out of an HTML document.

Here is a sample of the markup I am parsing:

<div class="tab-content">
<div class="tab-pane fade in active" id="how-to-take">
<div class="panel-body">
<h3>What is Pantoprazole?</h3>
Pantoprazole is a generic drug used to treat certain conditions where there is too much acid in the stomach. It is
used to treat gastric and duodenal ulcers, erosive esophagitis, and gastroesophageal reflux disease (GERD). GERD is
a condition where the acid in the stomach washes back up into the esophagus. <br/> Pantoprazole is a proton pump
inhibitor (PPI). It works by decreasing the amount of acid produced by the stomach.
<h3>How To Take</h3>
Take the tablets 1 hour before a meal without chewing or breaking them and swallow them whole with some water
</div>
</div>
<div class="tab-pane fade" id="alternative-treatments">
<div class="panel-body">
<h3>Alternatives</h3>
Antacids taken as required Antacids are alkali liquids or tablets
that can neutralise the stomach acid. A dose may give quick relief.
There are many brands which you can buy. You can also get some on
prescription. If you have mild or infrequent bouts of dyspepsia you
may find that antacids used as required are all that you need.<br/>
</div>
</div>
<div class="tab-pane fade" id="side-effects">
<div class="panel-body">
<p>Most people who take acid reflux medication do not have any side-effects.
However, side-effects occur in a small number of users. The most
common side-effects are:</p>
<ul>

I am trying to extract the content of the following element:

<div class="tab-pane fade in active" id="how-to-take">
<div class="panel-body">

</div>

I wrote the following regular expressions:

<div class="tab-pane fade in active" id="how-to-take">\n<div class="panel-body">\n(.*?[\s\S]+)\n(?:<\/div>)

<div class="tab-pane fade in active" id="how-to-take">\n<div class="panel-body">\n(.*?[\s\S]+)\n<\/div>

However, the match does not seem to stop at the first </div>; it continues all the way to the last </div> in the document.


Don't use regular expressions to parse HTML. You can use HtmlAgilityPack instead:

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(File.ReadAllText("Path"));
var divPanelBody = doc.DocumentNode.SelectSingleNode("//div[@class='panel-body']");
string text = divPanelBody.InnerText.Trim();  // null check omitted
Result:

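What is Pantoprazole?
Pantoprazole is a generic drug used to treat certain conditions where there is too much acid in the stomach. It is
used to treat gastric and duodenal ulcers, erosive esophagitis, and gastroesophageal reflux disease (GERD). GERD is
a condition where the acid in the stomach washes back up into the esophagus.  Pantoprazole is a proton pump
inhibitor (PPI). It works by decreasing the amount of acid produced by the stomach.
How To Take
Take the tablets 1 hour before a meal without chewing or breaking them and swallow them whole with some water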

Here is an alternative approach using LINQ, which I prefer:

var divPanelBody = doc.DocumentNode.Descendants("div")
    .FirstOrDefault(d => d.GetAttributeValue("class", "") == "panel-body");

Note that both approaches are case-sensitive, so neither will find Panel-Body. You can make the LINQ version case-insensitive:

var divPanelBody = doc.DocumentNode.Descendants("div")
    .FirstOrDefault(d => d.GetAttributeValue("class", "").Equals("panel-body", StringComparison.InvariantCultureIgnoreCase));
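The snippets above pick the first panel-body in the document, which happens to be the one inside how-to-take in the sample. If the tab order might change, the lookup can be tied to the id instead. The following is only a sketch under that assumption; the class and method names are hypothetical and not part of the original answer.

```csharp
using System;
using HtmlAgilityPack;

public static class HowToTakeExtractor
{
    // Hypothetical helper: returns the trimmed text of the panel-body
    // that is a direct child of the element with id="how-to-take".
    public static string GetHowToTakeText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Anchor on the id first, then step into the child div.
        HtmlNode panelBody = doc.DocumentNode.SelectSingleNode(
            "//div[@id='how-to-take']/div[@class='panel-body']");

        // SelectSingleNode returns null when nothing matches.
        return panelBody == null ? string.Empty : panelBody.InnerText.Trim();
    }
}
```

Like the XPath version above, this comparison on the class attribute is exact and case-sensitive.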

You can do this easily with HtmlAgilityPack:

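using System.Text;
using HtmlAgilityPack;

// Concatenates the inner HTML of every <div class="panel-body"> in the input.
public string GetInnerHtml(string html)
{
      HtmlDocument doc = new HtmlDocument();
      doc.LoadHtml(html);
      var nodes = doc.DocumentNode.SelectNodes("//div[@class=\"panel-body\"]");
      StringBuilder sb = new StringBuilder();
      if (nodes == null) return sb.ToString();  // SelectNodes returns null when nothing matches
      foreach (var n in nodes)
      {
            sb.Append(n.InnerHtml);
      }
      return sb.ToString();
}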
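
As for why the regular expression in the question runs too far: in (.*?[\s\S]+) the .*? part is lazy, but the [\s\S]+ that follows it is greedy. It consumes everything up to the end of the input and then backtracks only as far as the last place where \n</div> can match. Making that part lazy stops the match at the first closing tag. The sketch below illustrates this with System.Text.RegularExpressions; the class name, method name, and shortened sample input are made up for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PanelBodyRegex
{
    // Hypothetical helper, not from the original post.
    public static string ExtractFirstPanelBody(string html)
    {
        // [\s\S]*? is lazy: it grows one character at a time and stops at the
        // first position where optional whitespace followed by </div> matches.
        const string pattern =
            @"<div class=""tab-pane fade in active"" id=""how-to-take"">\s*" +
            @"<div class=""panel-body"">\s*([\s\S]*?)\s*</div>";

        Match m = Regex.Match(html, pattern);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        // Shortened stand-in for the markup in the question.
        string html =
            "<div class=\"tab-pane fade in active\" id=\"how-to-take\">\n" +
            "<div class=\"panel-body\">\n" +
            "<h3>How To Take</h3>\n" +
            "Take the tablets 1 hour before a meal\n" +
            "</div>\n" +
            "</div>\n" +
            "<div class=\"tab-pane fade\" id=\"side-effects\">\n" +
            "<div class=\"panel-body\">\n" +
            "<p>Most people do not have any side-effects.</p>\n" +
            "</div>\n" +
            "</div>\n";

        // Prints only the heading and the sentence from the first panel.
        Console.WriteLine(ExtractFirstPanelBody(html));
    }
}
```

A lazy match still ends at the first </div> it meets, so it would cut the content short if the panel-body ever contained a nested div. That limitation is the practical reason the answers above recommend an HTML parser.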