Skipping <!DOCTYPE html> with HtmlAgilityPack

Keywords: HtmlAgilityPack, html, DOCTYPE, skip | Updated: 2023-09-27 18:24:44

I'm building an app for WP 8.1, and I have to parse a page like this:

<!DOCTYPE html>
<html>
<body>
    <table cellspacing="0" cellpadding="0" border="0" style="border-style:none; padding:0; margin:0;" id="ctl00_ContentPlaceHolder1_ListView1_groupPlaceholderContainer">               
         <tbody>
             <tr style="border-style:none;padding:0; margin:0; background-image:none; vertical-align:top;" id="ctl00_ContentPlaceHolder1_ListView1_ctrl0_itemPlaceholderContainer">         
                 <td style="border-style:none;padding:0; margin:0; width:22%;" id="ctl00_ContentPlaceHolder1_ListView1_ctrl0_ctl01_Td3">
                    <div class="photo">
                        <a target="_self" title="PH1" href="fumetto.aspx?Fumetto=279277">PH1_1</a>
                    </div>
                </td>
            </tr>
            <tr style="border-style:none;padding:0; margin:0; background-image:none; vertical-align:top;" id="ctl00_ContentPlaceHolder1_ListView1_ctrl0_itemPlaceholderContainer">          
                 <td style="border-style:none;padding:0; margin:0; width:22%;" id="ctl00_ContentPlaceHolder1_ListView1_ctrl0_ctl01_Td3">
                    <div class="photo">
                        <a target="_self" title="PH2" href="fumetto.aspx?Fumetto=279277">PH2_1</a>
                    </div>
                </td>
            </tr>
            <tr style="border-style:none;padding:0; margin:0; background-image:none; vertical-align:top;" id="ctl00_ContentPlaceHolder1_ListView1_ctrl0_itemPlaceholderContainer">          
                 <td style="border-style:none;padding:0; margin:0; width:22%;" id="ctl00_ContentPlaceHolder1_ListView1_ctrl0_ctl01_Td3">
                    <div class="photo">
                        <a target="_self" title="PH3" href="fumetto.aspx?Fumetto=279277">PH3_1</a>
                    </div>
                </td>
            </tr>
        </tbody>
    </table>
</body>  
</html>

When I use this code, I only ever get the first node inside htmlDoc.DocumentNode (the doctype node), and the html node is lost. Is there a way to skip the doctype node?

string filePath = "...";
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(filePath);

Skipping <!DOCTYPE html> with HtmlAgilityPack

String html = "<!DOCTYPE html><html><body><table cellspacing='0' cellpadding='0' border='0' style='border-style:none; padding:0; margin:0;' id='ctl00_ContentPlaceHolder1_ListView1_groupPlaceholderContainer'><tbody><tr style='border-style:none;padding:0; margin:0; background-image:none; vertical-align:top;' id='ctl00_ContentPlaceHolder1_ListView1_ctrl0_itemPlaceholderContainer'>         <td style='border-style:none;padding:0; margin:0; width:22%;' id='ctl00_ContentPlaceHolder1_ListView1_ctrl0_ctl01_Td3'><div class='photo'><a target='_self' title='PH1' href='fumetto.aspx?Fumetto=279277'>PH1_1</a></div></td></tr><tr style='border-style:none;padding:0; margin:0; background-image:none; vertical-align:top;' id='ctl00_ContentPlaceHolder1_ListView1_ctrl0_itemPlaceholderContainer'><td style='border-style:none;padding:0; margin:0; width:22%;' id='ctl00_ContentPlaceHolder1_ListView1_ctrl0_ctl01_Td3'><div class='photo'><a target='_self' title='PH2' href='fumetto.aspx?Fumetto=279277'>PH2_1</a></div></td></tr><tr style='border-style:none;padding:0; margin:0; background-image:none; vertical-align:top;' id='ctl00_ContentPlaceHolder1_ListView1_ctrl0_itemPlaceholderContainer'><td style='border-style:none;padding:0; margin:0; width:22%;' id='ctl00_ContentPlaceHolder1_ListView1_ctrl0_ctl01_Td3'><div class='photo'><a target='_self' title='PH3' href='fumetto.aspx?Fumetto=279277'>PH3_1</a></div></td></tr></tbody></table></body></html>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// Element("html") returns the first child element named "html", so the doctype node is skipped.
HtmlNode htmlnode = doc.DocumentNode.Element("html");
System.Diagnostics.Debug.WriteLine(htmlnode.OuterHtml);

This works for me and prints only the content of the html tag.
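Building on that, here is a minimal sketch of how the links inside the photo divs might be collected once the html element has been picked out. It assumes the markup has already been read into a string, since LoadHtml expects markup rather than a file path. It also assumes the HtmlAgilityPack build you target exposes the LINQ-style Element, Elements and Descendants methods. The class and method names (DoctypeSkipSketch, GetHtmlElement, GetPhotoLinks) are hypothetical and only there for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class DoctypeSkipSketch
{
    // Returns the <html> element, or null when the markup has none.
    // Hypothetical helper: the doctype is simply never selected.
    public static HtmlNode GetHtmlElement(string markup)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(markup); // takes markup, not a file path
        return doc.DocumentNode.Element("html");
    }

    // Collects (title, href) pairs for <a> elements that are direct
    // children of <div class="photo">.
    public static List<KeyValuePair<string, string>> GetPhotoLinks(string markup)
    {
        var result = new List<KeyValuePair<string, string>>();
        HtmlNode root = GetHtmlElement(markup);
        if (root == null)
        {
            return result; // no <html> element found
        }

        IEnumerable<HtmlNode> photoDivs = root
            .Descendants("div")
            .Where(d => d.GetAttributeValue("class", "") == "photo");

        foreach (HtmlNode div in photoDivs)
        {
            foreach (HtmlNode a in div.Elements("a"))
            {
                result.Add(new KeyValuePair<string, string>(
                    a.GetAttributeValue("title", ""),
                    a.GetAttributeValue("href", "")));
            }
        }
        return result;
    }
}
```

Calling DoctypeSkipSketch.GetPhotoLinks(html) with the string from the snippet above should give one pair per photo div. I went with Descendants plus LINQ rather than XPath because, as far as I know, the XPath helpers are not present in every portable build of the library, so treat that choice as an assumption about your target.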