How to avoid duplicate values when extracting data from an HTML source with HtmlAgilityPack

Keywords: extract, data, HTML, HtmlAgilityPack | Updated: 2023-09-27 18:13:35

I am using HtmlAgilityPack to extract data from an HTML source. Here is a sample of the HTML:

<div class="enum-container">
    <div class="enum">
        <span class="field-key">MD5</span> a4188cf2b9189f82b855350233a307eb
    </div>
    <div class="enum">
        <span class="field-key">SHA1</span> c3eedd67a14810b8c639eb77ed2731e574245b2a
    </div>
    <div class="enum">
        <span class="field-key">File size</span>
        3.8 KB ( 3854 bytes )
    </div>
</div>

I am using the following code:

    Dim Table2 As New DataTable()
    Table2.Columns.Add("Value1", GetType(String))
    Table2.Columns.Add("Value2", GetType(String))
    For Each row1 As HtmlNode In doc.DocumentNode.SelectNodes("//div[@id='file-details']//div[@class='enum-container']//div[@class='enum']")
        Dim MyValue1 As HtmlNode = row1.SelectSingleNode("//span[@class='field-key']")
        Dim MyValue2 As String = row1.InnerText
        Table2.Rows.Add(MyValue1.InnerText, MyValue2)
    Next
    DataGridView3.DataSource = Table2

This is the result:

https://i.stack.imgur.com/vPriY.png

As you can see, the first column shows the same value (MD5) in every row.


What I want is this:

https://i.stack.imgur.com/jlsk5.png

Thanks.

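Your XPath starts with '//', so it is evaluated from the document root even though you call it on row1, and SelectSingleNode returns the first matching span in the whole document every time. Remove the '//' from the XPath so that it selects the span that is a direct child of the current node.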
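
To make the difference between the absolute and relative forms concrete, here is a minimal, self-contained C# sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the class name XPathContextDemo and the shortened placeholder values are made up for illustration and are not from the original question.

```csharp
using System;
using HtmlAgilityPack;

class XPathContextDemo
{
    static void Main()
    {
        // Sample markup modelled on the question's HTML (values shortened, hypothetical).
        const string html = @"
<div class='enum-container'>
  <div class='enum'><span class='field-key'>MD5</span> aaa</div>
  <div class='enum'><span class='field-key'>SHA1</span> bbb</div>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//div[@class='enum']"))
        {
            // Absolute path: searched from the document root, regardless of `row`.
            string absolute = row.SelectSingleNode("//span[@class='field-key']").InnerText;
            // Relative path: only direct children of `row`.
            string child = row.SelectSingleNode("span[@class='field-key']").InnerText;
            // Relative path: any descendant of `row`.
            string descendant = row.SelectSingleNode(".//span[@class='field-key']").InnerText;

            Console.WriteLine($"{absolute} | {child} | {descendant}");
        }
    }
}
```

For each row, the first column should repeat the first key in the document (MD5), while the other two columns follow the row's own key. Use the `.//` form instead of the bare child form if the span might be nested more deeply inside the enum div.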

C#:

using System.Data;
using HtmlAgilityPack;

// `document` is an already loaded HtmlDocument (called `doc` in the question).
DataTable fileDetailsTable = new DataTable();
fileDetailsTable.Columns.Add("Key", typeof(string));
fileDetailsTable.Columns.Add("Value", typeof(string));
HtmlNodeCollection enumNodes = document.DocumentNode.SelectNodes("//div[@id='file-details']//div[@class='enum-container']//div[@class='enum']");
// SelectNodes returns null, not an empty collection, when nothing matches.
if (enumNodes != null)
{
    foreach (HtmlNode enumNode in enumNodes)
    {
        // Relative XPath: select the child span of this enum node only.
        HtmlNode fieldKeyNode = enumNode.SelectSingleNode("span[@class='field-key']");
        if (fieldKeyNode != null && fieldKeyNode.NextSibling != null)
        {
            // Grab the key.
            string fieldKey = fieldKeyNode.InnerText.Trim();
            // Grab the value: the text node that follows the span, minus surrounding whitespace.
            string fieldValue = fieldKeyNode.NextSibling.InnerText.Trim();
            fileDetailsTable.Rows.Add(fieldKey, fieldValue);
        }
    }
}

VB.NET:

Imports System.Data
Imports HtmlAgilityPack

' `document` is an already loaded HtmlDocument (called `doc` in the question).
Dim fileDetailsTable As New DataTable()
fileDetailsTable.Columns.Add("Key", GetType(String))
fileDetailsTable.Columns.Add("Value", GetType(String))
Dim enumNodes As HtmlNodeCollection = document.DocumentNode.SelectNodes("//div[@id='file-details']//div[@class='enum-container']//div[@class='enum']")
' SelectNodes returns Nothing, not an empty collection, when nothing matches.
If enumNodes IsNot Nothing Then
    For Each enumNode As HtmlNode In enumNodes
        ' Relative XPath: select the child span of this enum node only.
        Dim fieldKeyNode As HtmlNode = enumNode.SelectSingleNode("span[@class='field-key']")
        If fieldKeyNode IsNot Nothing AndAlso fieldKeyNode.NextSibling IsNot Nothing Then
            ' Grab the key.
            Dim fieldKey As String = fieldKeyNode.InnerText.Trim()
            ' Grab the value: the text node that follows the span, minus surrounding whitespace.
            Dim fieldValue As String = fieldKeyNode.NextSibling.InnerText.Trim()
            fileDetailsTable.Rows.Add(fieldKey, fieldValue)
        End If
    Next
End If