How do I extract links from a string containing HTML using HtmlAgilityPack?

Keywords: html, string, htmlagilitypack, extract links | Updated: 2023-09-27 18:15:53

for (int i = 0; i < numberoflinks; i++)
{
    string downloadString = client.DownloadString(mainlink+i+".html");
    var document = new HtmlWeb().Load(url);
    var urls = document.DocumentNode.Descendants("img")
                        .Select(e => e.GetAttributeValue("src", null))
                        .Where(s => !String.IsNullOrEmpty(s));
}

The problem is that HtmlWeb().Load expects a URL, but I want to load the HTML content that is already in the string downloadString.

Update:

I tried:

for (int i = 0; i < numberoflinks; i++)
{
    string downloadString = client.DownloadString(mainlink+i+".html");
    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    document.Load(downloadString);
    var urls = document.DocumentNode.Descendants("img")
                        .Select(e => e.GetAttributeValue("src", null))
                        .Where(s => !String.IsNullOrEmpty(s));
}

But I get an exception on this line:

document.Load(downloadString);

Illegal characters in path

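What I want to do is download all the .jpg images from each link. I don't want to save each page to disk first. Instead, I want to download the page content into a string, extract every image link in the HTML that ends with .jpg, and then download those JPG files.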


You should be able to use the LoadHtml() method of HtmlDocument to parse an HTML string.

From the source code:

public void LoadHtml(string html)

Loads the HTML document from the specified string.

param name="html"
String containing the HTML document to load. May not be null.

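The Load(string) method expects a file name and treats its argument as a path, which is why passing the HTML text itself produces the "illegal characters in path" message.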
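
As a rough illustration of the difference, here is a minimal sketch that parses an in-memory string with LoadHtml() and keeps only the image sources ending in .jpg. The hard-coded HTML is a made-up stand-in for the result of client.DownloadString(...), and the .jpg filter is an assumption about what the question is after; adapt both to your pages.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical stand-in for the string returned by client.DownloadString(...).
        string downloadString =
            "<html><body><img src=\"a.jpg\"><img src=\"b.png\"><img></body></html>";

        var document = new HtmlDocument();
        // LoadHtml parses the string itself; Load(string) would treat it as a file path.
        document.LoadHtml(downloadString);

        var urls = document.DocumentNode.Descendants("img")
            .Select(e => e.GetAttributeValue("src", null))
            .Where(s => !String.IsNullOrEmpty(s))
            // Assumed filter: keep only sources that end in .jpg, ignoring case.
            .Where(s => s.EndsWith(".jpg", StringComparison.OrdinalIgnoreCase))
            .ToList();

        foreach (var u in urls)
            Console.WriteLine(u);
    }
}
```

If the src values are relative, they would probably need to be resolved against the page address first, for example with new Uri(baseUri, src), before being handed to something like client.DownloadFile.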