HtmlAgilityPack: searching for a text string across nodes

Keywords: text, string, search, node, HtmlAgilityPack | Updated: 2023-09-27 18:35:52

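I want to be able to search an HTML document fetched from a URL and verify that the page contains a specific piece of text. Both the text and the URL are supplied by the user and can vary. I fetch the URL with HttpWebRequest: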

string quote = txtQuote.Text;
string sourceURL = txtURL.Text;
string data = string.Empty; // declared outside the if so later code can use it
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(sourceURL);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
if (response.StatusCode == HttpStatusCode.OK)
{
    Stream receiveStream = response.GetResponseStream();
    StreamReader readStream = null;
    if (string.IsNullOrEmpty(response.CharacterSet))
    {
        // No charset in the response: fall back to the StreamReader default (UTF-8)
        readStream = new StreamReader(receiveStream);
    }
    else
    {
        readStream = new StreamReader(receiveStream,
            Encoding.GetEncoding(response.CharacterSet));
    }
    data = readStream.ReadToEnd();

    readStream.Close();
    response.Close();
}

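I also keep a list of HTML entities and their various possible encodings in a database. I load it into a DataTable so that I can convert any encoding to the standard HTML entity and replace non-breaking spaces with regular spaces: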

DataTable encodings = new DataTable();
string getEncodings = "select * from htmlentities";
SqlCommand cmdGetEncodings = new SqlCommand(getEncodings, dbcon);
encodings.Load(cmdGetEncodings.ExecuteReader());
dbcon.Close();
foreach (DataRow row in encodings.Rows)
{
    string htmlentity = row[1].ToString();
    string deccode = row[2].ToString();
    string hexcode = row[3].ToString();
    // Rewrite decimal and hex character references as the named entity
    data = data.Replace(deccode, htmlentity);
    data = data.Replace(hexcode, htmlentity);
}
// Replace non-breaking spaces with regular spaces (once, after the loop)
data = data.Replace("\u00A0", " ");

I then use HtmlAgilityPack to load the fetched and modified HTML into a new document and retrieve its inner text:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(data);

HtmlNode root = doc.DocumentNode;
string innerText = root.InnerText;

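Now I would like to know: what is the best way to accurately verify that the quote is contained in innerText? One approach I tried is:

if (innerText.IndexOf(quote) != -1)
{
    label1.Text = "Found";
}
else
{
    label1.Text = "Not found";
}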

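But this is not accurate: it does not find text in innerText that spans nodes (for example, text that runs across several <p> elements). Here is an example quote and URL for which it returns "Not found":

"His once-nimble cover-point fielding had been reduced to standing still and stopping only those balls hit straight at him," says Charlie Connolly in Gilbert, his fine novel about Grace's life. "During the Australians' first innings he was all too aware of the jeers of the crowd whenever the ball flew past him." At the end of the match, which England drew thanks to Ranjitsinhji's 93, Grace told Jackson: "It's all over, Jacker, I shan't play again."
Then there is Don Bradman. The story is so famous it hardly needs retelling. "I dearly wanted to do well," Bradman admitted. He was bowled second ball by Eric Hollies, "a perfect length googly" that just touched the inside edge of the bat and dislodged the bail. Had he scored just four runs, his average would have been an even hundred.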

URL: http://www.theguardian.com/sport/2016/feb/23/test-cricket-farewells-brendon-mccullum

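However, if I search for only the first paragraph:

"His once-nimble cover-point fielding had been reduced to standing still and stopping only those balls hit straight at him," says Charlie Connolly in Gilbert, his fine novel about Grace's life. "During the Australians' first innings he was all too aware of the jeers of the crowd whenever the ball flew past him." At the end of the match, which England drew thanks to Ranjitsinhji's 93, Grace told Jackson: "It's all over, Jacker, I shan't play again."

it returns "Found". Is there a way to check for text across nodes?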


If you only intend to scrape http://www.theguardian.com, there is a simple solution, because the Guardian's HTML is very tidy.

var hdoc = new HtmlDocument();
hdoc.LoadHtml(data); // or hdoc.Load(...), depending on what you get from your request
var articleNodes = hdoc.DocumentNode.SelectNodes("//p"); // the 'p' nodes contain the article text
var quote = "my quote";
var article = string.Empty;
if (articleNodes != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode node in articleNodes)
    {
        article += node.InnerText + " "; // add a space so adjacent paragraphs don't run together
    }
}
return article.Contains(quote);

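Now, if you intend to make this work for any given URL, there is trouble ahead.
Since you don't know how the HTML behind those URLs is structured, the "best" solution, and by best I mean the simplest and most cringeworthy one, is the following: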

var hdoc = new HtmlDocument();
hdoc.LoadHtml(data); // or hdoc.Load(...), depending on what you get from your request
var articleNodes = hdoc.DocumentNode;
var quote = "my quote";
var text = string.Empty;
// Walk three levels of child nodes. Text of nested nodes is appended more
// than once, which is harmless for a Contains check but wasteful.
foreach (var htmlNode in articleNodes.ChildNodes)
{
    text += htmlNode.InnerText + " "; // add a space so adjacent nodes don't run together
    foreach (var childNode in htmlNode.ChildNodes)
    {
        text += childNode.InnerText + " ";
        foreach (var childrensChildren in childNode.ChildNodes)
        {
            text += childrensChildren.InnerText + " ";
        }
    }
}
return text.Contains(quote);

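Ultimately, because you don't know the HTML behind a given URL, the number of nested foreach statements may need to grow or shrink. And of course, you should do some null checks on the nodes before running any foreach statement.
There may be better solutions; this is my two cents.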
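
One way to avoid a fixed nesting depth might be to collect every text node in the document, whatever its depth, and join the pieces with a space before searching. The following is only a minimal sketch under a few assumptions: the HtmlAgilityPack NuGet package is referenced, the class and method names (QuoteFinder, Normalize, ContainsQuote) are made up for illustration, and both the page text and the quote are whitespace-normalized so that line breaks and non-breaking spaces do not cause false negatives.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // assumed: HtmlAgilityPack NuGet package is referenced

// Hypothetical helper, not part of HtmlAgilityPack.
public static class QuoteFinder
{
    // Turns non-breaking spaces into regular spaces, then collapses every
    // run of whitespace into a single space and trims the ends.
    public static string Normalize(string s)
    {
        return Regex.Replace(s.Replace('\u00A0', ' '), @"\s+", " ").Trim();
    }

    // Returns true if the quote occurs in the visible text of the HTML,
    // even when the quote spans several elements.
    public static bool ContainsQuote(string html, string quote)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Every text node at any depth, skipping script and style content,
        // with entities such as &amp; decoded to plain characters.
        var pieces = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text
                        && n.ParentNode.Name != "script"
                        && n.ParentNode.Name != "style")
            .Select(n => HtmlEntity.DeEntitize(n.InnerText));

        // Joining with a space keeps neighbouring paragraphs apart.
        string pageText = Normalize(string.Join(" ", pieces));
        return pageText.IndexOf(Normalize(quote), StringComparison.Ordinal) >= 0;
    }
}
```

With this, a call such as QuoteFinder.ContainsQuote(data, quote) would replace the nested loops. One trade-off to be aware of: joining with a space also inserts a space wherever an inline tag ends right before punctuation (for example `</a>,`), so a stricter variant could strip all whitespace from both strings before comparing.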

Working example: this returns true. I copied and pasted part of the article into the quote variable and checked whether our article string contains it.

        string urlAddress = "http://www.theguardian.com/sport/2016/feb/23/test-cricket-farewells-brendon-mccullum";
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        string data = string.Empty;
        if (response.StatusCode == HttpStatusCode.OK)
        {
            Stream receiveStream = response.GetResponseStream();
            StreamReader readStream = null;
            if (response.CharacterSet == null)
            {
                readStream = new StreamReader(receiveStream);
            }
            else
            {
                readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));
            }
            data = readStream.ReadToEnd();
            response.Close();
            readStream.Close();
        }
        var hdoc = new HtmlDocument();
        hdoc.LoadHtml(data); 
        var articleNodes = hdoc.DocumentNode.SelectNodes(@"//p"); // the 'p' nodes contains the article text
        var quote ="Sinatra couldn’t stand the song. His daughter Tina once said that her father thought it was “self-serving and self-indulgent”. By the end of the ’70s he was in the habit of introducing it by explaining how little he liked it. “I hate this song. I hate this song!” he said before performing it at Atlantic City in 1979. “I got it up to here, this goddamn song.” Of course when Sinatra died, pretty much every single TV and radio news show played him out with My Way, “the most obvious, ";
        var article = string.Empty;
        if (articleNodes != null) // SelectNodes returns null when nothing matches
        {
            foreach (HtmlNode node in articleNodes)
            {
                article += node.InnerText + " "; // add a space so adjacent paragraphs don't run together
            }
        }
        bool containsQuote = article.Contains(quote); // true if the quote is in the article