HtmlAgilityPack: extracting extra text from an HTML block

Keywords: text, extraction, html, HtmlAgilityPack | Updated: 2023-09-27 18:12:03

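I am trying to extract certain parts of a web page, but I am running into some trouble. I am very new to web parsing, so please assume I know nothing and keep answers very detailed.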

I have this piece of HTML:

<div id="playerStats">
  <div id="hp"><span class="title">HP:</span>"12213"</div>
  <div id="mp"><span class="title">MP:</span></div>
  <div id="magicResist"><span class="title">Magic Resist</span>"4618"</div>
  <div id="physicalDefend"><span class="title">Physical Defence</span>"1725"</div>
  <div id="phyCriticalReduceRate"><span class="title">Strike Resist</span>"1518"</div>
  <div id="phyCriticalDamageReduce"><span class="title">Strike fortitude</span>"392"</div>
  <div id="physicalRight"><span class="title">Main Hand Attack</span>"201"</div>
  <div id="accuracyRight"><span class="title">Main Hand Accuracy</span>"201"</div>
  <div id="criticalRight"><span class="title">Main Hand Critical</span>"201"</div>
  <div id="physicalLeft"><span class="title">Off Hand Attack</span>"201"</div>
  <div id="accuracyLeft"><span class="title">Off Hand Accuracy</span>"201"</div>
  <div id="criticalLeft"><span class="title">Off Hand Critical</span>"201"</div>
  <div id="attackSpeed"><span class="title">Attack Speed</span>"201"</div>
  <div id="magicalBoost"><span class="title">Magic Boost</span>"201"</div>
  <div id="magicalAccuracy"><span class="title">Magic Accuracy</span>"201"</div>
  <div id="magicalCriticalRight"><span class="title">Crit Spell</span>"201"</div>
  <div id="castingTimeRatio"><span class="title">Casting Speed</span>"201"</div>
  <div id="block"><span class="title">Block</span>"201"</div>
  <div id="dodge"><span class="title">Evasion</span>"201"</div>
</div>

The output is:

HP:
MP:
Magic Resist
Physical Defence
Strike Resist
Strike fortitude
Main Hand Attack
Main Hand Accuracy
Main Hand Critical
Off Hand Attack
Off Hand Accuracy
Off Hand Critical
Attack Speed
Magic Boost
Magic Accuracy
Crit Spell
Casting Speed
Block
Evasion
Movement Speed

using this code:

var browser = document.DocumentNode.SelectNodes("//*[@id='playerStats']");
if (browser != null) {
  foreach (var b in browser) {
    output.AppendLine(b.InnerHtml);
  }
} else {
  output.AppendLine("Oops!  I'm broken!");
}

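However, I would also like to include the number "12213", that is, whatever text sits in the position of "xxx" in

</span>"xxx"</div>

and comes right after "HP:".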

How can I retrieve this text as well, while still using the code I have already implemented?


You can do it like this (shown here as a console application example):

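HtmlDocument doc = new HtmlDocument();
doc.Load(MyTestFile);
// Select the title <span> of every stat <div> inside #playerStats.
foreach (var node in doc.DocumentNode.SelectNodes("//div[@id='playerStats']/div/span"))
{
    // The value is the text node right after the <span>, if there is one.
    Console.WriteLine(node.InnerText + " " + (node.NextSibling != null ? node.NextSibling.InnerText : null));
}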

NextSibling is the next node after the given node that shares the same parent. It may not exist (it is null) if the current node is the last child of its parent.
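To make the missing-sibling case concrete, here is a minimal sketch using two entries from the question's HTML: the HP entry has a text node after its span, while the MP entry does not. The names NextSiblingDemo and TextAfter are made up for this illustration, and stripping the literal double quotes around the value is an assumption about what the asker wants.

```csharp
using System;
using HtmlAgilityPack;

public static class NextSiblingDemo
{
    // Hypothetical helper: returns the text that directly follows the node,
    // or an empty string when the node is the last child of its parent.
    public static string TextAfter(HtmlNode node)
    {
        HtmlNode next = node.NextSibling;
        if (next == null)
        {
            return string.Empty;
        }
        // Assumption: the surrounding double quotes in the sample are unwanted.
        return next.InnerText.Trim().Trim('"');
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div id=\"playerStats\">" +
            "<div id=\"hp\"><span class=\"title\">HP:</span>\"12213\"</div>" +
            "<div id=\"mp\"><span class=\"title\">MP:</span></div>" +
            "</div>");

        // SelectNodes returns null rather than an empty collection on no match.
        var spans = doc.DocumentNode.SelectNodes("//div[@id='playerStats']/div/span");
        if (spans == null)
        {
            Console.WriteLine("No stats found.");
            return;
        }

        foreach (HtmlNode span in spans)
        {
            Console.WriteLine(span.InnerText + " " + TextAfter(span));
        }
    }
}
```

The MP span has no following sibling, so TextAfter falls back to an empty string instead of throwing.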

Note that for the initial selection I explicitly set the element type to DIV, because this performs better (* matches any element).
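Putting this together with the code from the question, one possible arrangement keeps the StringBuilder named output and the fallback message, but walks the individual stat divs. This is only a sketch: the class PlayerStatsDemo and the method Describe are hypothetical names, the HTML is a shortened copy of the sample, and removing the double quotes around each value is an assumption.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

public static class PlayerStatsDemo
{
    // Hypothetical helper: builds one "Title value" line per stat <div>.
    public static string Describe(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var output = new StringBuilder();
        // Name the element type (div) instead of using * for the selection.
        var stats = document.DocumentNode.SelectNodes("//div[@id='playerStats']/div");
        if (stats == null)
        {
            // SelectNodes yields null when nothing matches.
            output.AppendLine("Oops!  I'm broken!");
            return output.ToString();
        }

        foreach (HtmlNode stat in stats)
        {
            // "./span" is evaluated relative to the current stat <div>.
            HtmlNode title = stat.SelectSingleNode("./span");
            if (title == null)
            {
                continue;
            }
            HtmlNode value = title.NextSibling;
            // Assumption: quotes around the number are not wanted in the output.
            string text = value != null ? value.InnerText.Trim().Trim('"') : string.Empty;
            output.AppendLine((title.InnerText.Trim() + " " + text).TrimEnd());
        }
        return output.ToString();
    }

    public static void Main()
    {
        string html =
            "<div id=\"playerStats\">" +
            "<div id=\"hp\"><span class=\"title\">HP:</span>\"12213\"</div>" +
            "<div id=\"mp\"><span class=\"title\">MP:</span></div>" +
            "<div id=\"block\"><span class=\"title\">Block</span>\"201\"</div>" +
            "</div>";
        Console.Write(Describe(html));
    }
}
```

Selecting the divs first and then the span inside each one keeps the title and its value tied to the same entry, which is convenient if the div id (hp, mp, and so on) is needed later.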