Extracting URL links from a downloaded txt file

Keywords: extract, url, links, file download, txt | Last updated: 2023-09-27 18:28:33

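I'm currently working on a URL extractor. I'm trying to extract all http/href links from a downloaded HTML file and print them to a separate txt file. So far I can download a page's entire HTML; the problem is extracting the links from it with Regex and printing them. Could anyone help?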

    // Requires: using System; using System.IO; using System.Linq;
    //           using System.Net; using System.Text.RegularExpressions;
    private void button2_Click(object sender, EventArgs e)
    {
        // The @ prefix makes this a verbatim string, so backslashes need no escaping.
        const string folder = @"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester";
        string htmlPath = Path.Combine(folder, "response1.htm");
        string linksPath = Path.Combine(folder, "Links.htm");

        // Download the page and keep a copy of the raw HTML on disk.
        Uri fileURI = new Uri(URLbox2.Text);
        WebRequest request = WebRequest.Create(fileURI);
        request.Credentials = CredentialCache.DefaultCredentials;
        string responseFromServer;
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            Console.WriteLine(((HttpWebResponse)response).StatusDescription);
            responseFromServer = reader.ReadToEnd();
        }
        File.WriteAllText(htmlPath, responseFromServer);

        // The Regex constructor takes (pattern, options); the input text is passed to Matches().
        Regex regx = new Regex(@"https?://[\w.-]+(?::\d+)?(?:/[\w~!@#$%^&*()\-=+/?.:;',]*)?", RegexOptions.IgnoreCase);
        MatchCollection matches = regx.Matches(responseFromServer);

        // Write each distinct matched URL on its own line.
        File.WriteAllLines(linksPath, matches.Cast<Match>().Select(m => m.Value).Distinct());
    }


In case you didn't know, this is very easy to do with one of the HTML parser NuGet packages that are available.

I personally use HtmlAgilityPack (together with ScrapySharp, another package) and AngleSharp.

With just the three lines below, HtmlAgilityPack gives you every href in a document, whether it is loaded from disk or through an HTTP GET request:

    /* do not forget to include the usings:
       using System.Linq;
       using HtmlAgilityPack;
       using ScrapySharp.Extensions; */

    // Since you have your HTML stored locally, load it from disk.
    // P.S.: prefixing a file path string with @ spares you from escaping the backslashes.
    var doc = new HtmlDocument();
    doc.Load(@"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\response1.htm");

    // For an HTTP GET request instead:
    // HtmlWeb w = new HtmlWeb();
    // var doc = w.Load("yourAddressHere");

    // CssSelect comes from ScrapySharp; the second argument is the value returned when href is missing.
    var hrefs = doc.DocumentNode.CssSelect("a").Select(a => a.GetAttributeValue("href", string.Empty));
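
The question also asks for the links to end up in a separate txt file, which the snippet above stops short of. One way to finish the job might look like the sketch below. It uses only the .NET base class library and assumes you already have the raw href strings. The names `LinkExport`, `ToAbsoluteLinks` and `WriteLinks` are made up for this illustration and are not part of HtmlAgilityPack or ScrapySharp. Relative hrefs are resolved against the page address, and anything that is not http or https (for example `mailto:` links) is skipped.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration only; not part of any library.
public static class LinkExport
{
    // Resolves each href against baseUri and returns the distinct
    // http/https links in their original order.
    public static List<string> ToAbsoluteLinks(IEnumerable<string> hrefs, Uri baseUri)
    {
        var result = new List<string>();
        var seen = new HashSet<string>(StringComparer.Ordinal);

        foreach (string href in hrefs)
        {
            // Skip anchors that had no href or an empty one.
            if (string.IsNullOrWhiteSpace(href))
                continue;

            // Uri.TryCreate(Uri, string, out Uri) resolves relative references
            // and returns false instead of throwing on malformed input.
            if (!Uri.TryCreate(baseUri, href.Trim(), out Uri absolute))
                continue;

            // Keep web links only; drops mailto:, javascript:, file: and so on.
            if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
                continue;

            // HashSet.Add returns false for duplicates.
            if (seen.Add(absolute.AbsoluteUri))
                result.Add(absolute.AbsoluteUri);
        }

        return result;
    }

    // Writes one link per line to a plain text file, overwriting any existing file.
    public static void WriteLinks(string outputPath, IEnumerable<string> links)
    {
        File.WriteAllLines(outputPath, links);
    }
}
```

With the `hrefs` sequence from the answer, a call could look like `LinkExport.WriteLinks("Links.txt", LinkExport.ToAbsoluteLinks(hrefs, new Uri(URLbox2.Text)));`, where `Links.txt` is just a placeholder output path.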