Simple web crawler in C#

Keywords: web crawler, simple | Updated: 2023-09-27 18:33:07

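I have created a simple web crawler, but I want to add a recursive function so that for every page I open I can collect the URLs on that page. I have no idea how to do that. I would also like to use threads to make it faster. Here is my code: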

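using System;
using System.IO;
using System.Net;
using System.Windows.Forms;
using mshtml; // COM reference "Microsoft HTML Object Library" (HTMLDocument, IHTMLDocument2)

namespace Crawler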
{
    public partial class Form1 : Form
    {
        String Rstring;
        public Form1()
        {
            InitializeComponent();
        }
        private void button1_Click(object sender, EventArgs e)
        {
            
            String URL = textBox1.Text;
            WebRequest myWebRequest = WebRequest.Create(URL);
            // The using blocks close the response, stream and reader even if reading fails
            using (WebResponse myWebResponse = myWebRequest.GetResponse()) // returns a response from an Internet resource
            using (Stream streamResponse = myWebResponse.GetResponseStream()) // the data stream from the Internet
            using (StreamReader sreader = new StreamReader(streamResponse)) // reads the data stream
            {
                Rstring = sreader.ReadToEnd(); // reads it to the end
            }
            String Links = GetContent(Rstring); // gets the links only

            textBox2.Text = Rstring;
            textBox3.Text = Links;


        }
        private String GetContent(String Rstring)
        {
            String sString="";
            HTMLDocument d = new HTMLDocument();
            IHTMLDocument2 doc = (IHTMLDocument2)d;
            doc.write(Rstring);
            
            IHTMLElementCollection L = doc.links;
           
            foreach (IHTMLElement links in  L)
            {
                sString += links.getAttribute("href", 0);
                sString += Environment.NewLine; // was "/n", which appended the literal characters '/' and 'n'
            }
            return sString;
        }
    }
}


I fixed your GetContent method as follows, so that it returns the new links found on a crawled page:

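// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
public ISet<string> GetNewLinks(string content)
{
    // Matches the quoted href value when href directly follows <a
    Regex regexLink = new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");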
    ISet<string> newLinks = new HashSet<string>();    
    foreach (var match in regexLink.Matches(content))
    {
        if (!newLinks.Contains(match.ToString()))
            newLinks.Add(match.ToString());
    }
    return newLinks;
}
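
The links returned by GetNewLinks are raw href values, so many of them will be relative ("/about", "page2.html") or not web pages at all ("mailto:..."). Before following them recursively they need to be resolved against the URL of the page they came from. Below is a minimal sketch of that step; LinkResolver and ResolveLinks are hypothetical names that are not part of the answer above.

```csharp
using System;
using System.Collections.Generic;

public static class LinkResolver
{
    // Hypothetical helper: turns raw href values into absolute http/https URIs.
    public static ISet<Uri> ResolveLinks(Uri pageUri, IEnumerable<string> hrefs)
    {
        var resolved = new HashSet<Uri>();
        foreach (string href in hrefs)
        {
            // Resolves relative hrefs against the page URL; skips values that cannot be parsed
            if (!Uri.TryCreate(pageUri, href, out Uri absolute))
                continue;

            // Keep only links a crawler can download over HTTP(S)
            if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
                continue;

            // Drop the #fragment part, since it points into the same page
            resolved.Add(new Uri(absolute.GetLeftPart(UriPartial.Query)));
        }
        return resolved;
    }
}
```

With this in place, a call such as LinkResolver.ResolveLinks(new Uri(URL), GetNewLinks(Rstring)) would give a set of absolute URLs that can be fed back into the download step.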

Update

Fixed: the variable regex should have been regexLink. Thanks to @shashlearner for pointing this out (my typo).

I created something similar using Reactive Extensions.

https://github.com/Misterhex/WebCrawler

Hope it helps.

Crawler crawler = new Crawler();
IObservable<Uri> observable = crawler.Crawl(new Uri("http://www.codinghorror.com/"));
observable.Subscribe(onNext: Console.WriteLine, 
onCompleted: () => Console.WriteLine("Crawling completed"));
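
For comparison, the "make it faster" part of the question can also be sketched without Reactive Extensions, using plain tasks. The sketch below is not taken from the linked repository. It crawls level by level, downloads all pages of one level concurrently, and limits the number of simultaneous downloads with a SemaphoreSlim. ParallelCrawler, CrawlAsync and the getLinksAsync callback (which is assumed to download a page and return its absolute links) are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelCrawler
{
    // Returns every URL discovered while fetching pages up to maxDepth levels from start.
    public static async Task<ISet<Uri>> CrawlAsync(
        Uri start,
        Func<Uri, Task<IEnumerable<Uri>>> getLinksAsync, // assumed: downloads a page, returns its links
        int maxDepth,
        int maxParallel)
    {
        var seen = new HashSet<Uri> { start };
        var level = new List<Uri> { start };

        using (var throttle = new SemaphoreSlim(maxParallel))
        {
            for (int depth = 0; depth <= maxDepth && level.Count > 0; depth++)
            {
                // Start one task per page of the current level; the semaphore caps concurrency
                var tasks = level.Select(async uri =>
                {
                    await throttle.WaitAsync();
                    try { return await getLinksAsync(uri); }
                    finally { throttle.Release(); }
                }).ToList();

                IEnumerable<Uri>[] results = await Task.WhenAll(tasks);

                // The next level consists of links that were not seen before.
                // This runs after all downloads finished, so a plain HashSet is safe here.
                level = results.SelectMany(links => links)
                               .Where(link => seen.Add(link))
                               .ToList();
            }
        }
        return seen;
    }
}
```

Links found on the deepest level are recorded but not downloaded, which keeps the crawl bounded. Error handling (timeouts, failed requests) is left out to keep the sketch short.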

The following includes an answer/suggestion.

I believe you should use a dataGridView instead of a textBox, because the links (URLs) that were found are easier to read in the GUI that way.

You can change:

textBox3.Text = Links;

to

 dataGridView.DataSource = Links;
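
One caveat, as far as I can tell: DataSource expects a list-like object (for example an IList), and the grid builds its columns from the public properties of the row type, so a single newline-separated string will not show up as one row per link. A small sketch of converting the Links string into bindable rows follows; LinkRow and LinkRows.FromText are hypothetical names introduced only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type: the grid would show one "Url" column for it.
public class LinkRow
{
    public string Url { get; set; }
}

public static class LinkRows
{
    // Splits the newline-separated link text into one row object per link.
    public static List<LinkRow> FromText(string links)
    {
        return links
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => new LinkRow { Url = line.Trim() })
            .ToList();
    }
}
```

In the form, the assignment would then read dataGridView.DataSource = LinkRows.FromText(Links);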

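As for the question itself, you have not included the "using System..." directives that your code relies on. I would appreciate it if you could post them, because I cannot figure out which ones were used.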

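From a design point of view: I have written a few web crawlers. Basically, you want to implement a depth-first search using a stack data structure. You can also use a breadth-first search, but its list of pending URLs can grow very large, so you may run into memory problems. Good luck.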
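
The stack-based design described above can be sketched roughly as follows. This is only an illustration under stated assumptions: FrontierCrawler, Crawl and the getLinks callback (which is assumed to download a page and return its absolute links) are hypothetical names, and maxPages is an added safety limit so the crawl cannot run forever.

```csharp
using System;
using System.Collections.Generic;

public static class FrontierCrawler
{
    // Returns the pages in the order they were visited.
    public static List<Uri> Crawl(Uri start, Func<Uri, IEnumerable<Uri>> getLinks, int maxPages)
    {
        var visited = new HashSet<Uri>();
        var order = new List<Uri>();

        // A Stack gives depth-first order; replacing it with a Queue<Uri>
        // (Enqueue/Dequeue) would give breadth-first order instead.
        var frontier = new Stack<Uri>();
        frontier.Push(start);

        while (frontier.Count > 0 && order.Count < maxPages)
        {
            Uri current = frontier.Pop();
            if (!visited.Add(current))
                continue; // already crawled via another link

            order.Add(current);

            foreach (Uri link in getLinks(current))
            {
                if (!visited.Contains(link))
                    frontier.Push(link);
            }
        }
        return order;
    }
}
```

Using an explicit stack instead of a recursive method keeps the depth of the crawl from being limited by the call stack, and the visited set prevents loops when pages link back to each other.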