Parsing malformed HTML with HtmlAgilityPack

Keywords: malformed HTML, HtmlAgilityPack | Updated: 2023-09-27 18:15:01

I am trying to parse an HTML page, but the source is malformed:

<div class="item column-1">
  <h2><a href="/index.php">Bridgestone recebe um Volkswagen Group Award</a></h2>
  <dl class="article-info"><dt class="article-info-term">Detalhes</dt><dd class="create">04-08-2015</dd></dl>
<p style="color: #000000; font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 11px; line-height: normal;">
  <img src="images/Bridgestone-VWGroup.jpg" width="600" height="400" alt="Bridgestone-VWGroup">
</p>
<p style="color: #000000; font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 11px; line-height: normal;">
  A Bridgestone acaba de receber um Volkswagen Group Award como reconhecimento pelo trabalho desenvolvido enquanto fornecedor daquele grupo, um prémio atribuído a um lote restrito de fornecedores internacionais
<div class="css_buttons1" style="min-height:40px;display: inline-block;width: 425px;">
<div class="css_fb_share" style="display:inline-block;"><fb:share-button href="http://anossaoficina.com/index.php" type="button_count"></fb:share-button></div></div>
<p class="readmore">
  <a href="/index.php?">Continuar... Bridgestone recebe um Volkswagen Group Award</a></p>
<div class="item-separator"></div>
</div>
<span class="row-separator"></span>
</div>

I need to extract the InnerText of the second <p> ("A Bridgestone ..."), but HtmlAgilityPack returns "" because this tag has an opening <p> but no </p> to close it:

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://anossaoficina.com/index.php?option=com_content&view=category&layout=blog&id=78&Itemid=474");
foreach (var matchingDiv in doc.DocumentNode.SelectNodes("//*[contains(@class,'item column-1')]"))
{
    var DescritionShort = matchingDiv.SelectSingleNode("./p[1]").InnerText;
}


var web = new HtmlAgilityPack.HtmlWeb();
var doc = web.Load("http://anossaoficina.com/index.php?option=com_content&view=category&layout=blog&id=78&Itemid=474");
var DescritionShort = doc.DocumentNode
                      .SelectSingleNode("//div[@class='item column-1']//p[2]")
                      .NextSibling.InnerText;

This returns:

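A Bridgestone acaba de receber um Volkswagen Group Award como reconhecimento pelo trabalho desenvolvido enquanto fornecedor daquele grupo, um prémio atribuído a um lote restrito de fornecedores internacionais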
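
The NextSibling approach relies on the parser leaving the paragraph's text next to the unclosed <p> rather than inside it. Whether that happens may depend on the HtmlAgilityPack version and its options, so a slightly more defensive variant could check both places. The following is only a sketch under that assumption: `LooseParagraph` and `FirstText` are hypothetical names, not part of HtmlAgilityPack, and the sample markup is a shortened stand-in for the page above.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class LooseParagraph
{
    // Returns the first non-blank text belonging to the given <p>,
    // whether the parser stored it inside the element or as a following sibling.
    public static string FirstText(HtmlNode p)
    {
        if (p == null) return null;

        // Case 1: the parser kept the text as a child of the <p>.
        var inside = p.ChildNodes.FirstOrDefault(n => IsNonBlankText(n));
        if (inside != null) return Clean(inside);

        // Case 2: the <p> came out empty and the text follows it as a sibling.
        for (var n = p.NextSibling; n != null; n = n.NextSibling)
        {
            if (IsNonBlankText(n)) return Clean(n);
            if (n.NodeType == HtmlNodeType.Element) break; // reached the next element, stop
        }
        return null;
    }

    private static bool IsNonBlankText(HtmlNode n) =>
        n.NodeType == HtmlNodeType.Text && !string.IsNullOrWhiteSpace(n.InnerText);

    // Decode entities such as &amp; and trim the surrounding whitespace.
    private static string Clean(HtmlNode n) =>
        HtmlEntity.DeEntitize(n.InnerText).Trim();
}

public static class Demo
{
    public static void Main()
    {
        // Shortened, assumed sample: the second <p> is never closed.
        const string html =
            "<div class=\"item column-1\"><p><img src=\"a.jpg\"></p><p>\n" +
            "  A Bridgestone acaba de receber um Volkswagen Group Award\n" +
            "<div class=\"css_buttons1\"></div></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var p = doc.DocumentNode.SelectSingleNode("//div[@class='item column-1']//p[2]");
        Console.WriteLine(LooseParagraph.FirstText(p) ?? "(no text found)");
    }
}
```

The helper also returns null instead of throwing when SelectSingleNode finds nothing, which the one-line version does not guard against.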