C#: how to keep login information while scraping web pages

Keywords: store login information, web scraping | Updated: 2023-09-27 18:08:59

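I am writing a web crawler in C#. So far, my program can scan a website's source code. For the site I am targeting, I need to log in before I can access a static page. My code logs in just fine and can scan the source, but when I navigate to the download page I get an error. I assume this is because I need to tell the website somehow that I am still logged in. How do I do that?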

Current code:

using System;
using System.Net;
using System.IO;
using System.Text;
namespace WebCraler
{
    class MainClass
    {
        static string username = "john" ;
        static string password = "123"; 
        public static void Main (string[] args)
        {
            Console.WriteLine ("Test login");
            String Page = GetWebText("http://localhost/PHP/Login/userStats.php");
            Console.WriteLine (Page);
            Console.WriteLine ("Test Login");
            String response = loginResponse(); 
            Console.WriteLine (response);
        }
        public static string GetWebText(string url)
        {
            HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
            request.UserAgent = "A .NET Web Crawler";
            WebResponse response = request.GetResponse();
            Stream stream = response.GetResponseStream();
            StreamReader reader = new StreamReader(stream);
            string htmlText="";
            string line;
            while ((line = reader.ReadLine()) != null){
                if(line.Contains("<td>"))
                {
                    //htmlText += "\n *****Found Andrew Kralovec****** \n";
                }
                htmlText += line+"\n";
            }
            //string htmlText = reader.ReadToEnd();
            return htmlText;
        }
        private static String loginResponse()
        {
            try{
                ASCIIEncoding encoding = new ASCIIEncoding();
                string postData = "myusername=" + username + "&mypassword=" + password;
                byte[] data = encoding.GetBytes(postData);
                WebRequest request = WebRequest.Create("http://localhost/PHP/Login/check_login.php");
                request.Method = "POST";
                request.ContentType = "application/x-www-form-urlencoded";
                request.ContentLength = data.Length;
                Stream stream = request.GetRequestStream();
                stream.Write(data, 0, data.Length);
                stream.Close();
                WebResponse response = request.GetResponse();
                stream = response.GetResponseStream();
                StreamReader steamReader = new StreamReader(stream);
                String htmlRespones = steamReader.ReadToEnd();
                steamReader.Close();
                stream.Close();
                return htmlRespones ; 

            }catch{
                String htmlRespones = "Catch Error"; 
                return htmlRespones ; 
            }
        }
    }
}

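When you log in to a website, the server issues a cookie. That cookie must be sent back with every subsequent request so the server knows you are logged in (otherwise you are redirected to the login page or get some other error).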

By default, HttpWebRequest does not keep cookies between requests, so you need to manage them yourself:

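private CookieContainer sessionCookies = new CookieContainer();
public void MakeRequest(string url) {
    // WebRequest.Create needs a URL and returns a WebRequest, hence the cast.
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    // Share one container across requests so the login cookie is sent again.
    request.CookieContainer = this.sessionCookies;
    // your code here
    using (WebResponse response = request.GetResponse()) {
        // read the response here
    }
}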
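
Applied to the crawler from the question, the idea might look like the sketch below. It assumes the URLs and form field names from the question's code. SessionCrawler, BuildLoginBody, CreateRequest and ReadBody are names made up for this illustration, and the sketch has not been run against the asker's site. Main needs the question's local PHP server to be running.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class SessionCrawler
{
    // One container shared by every request, so the cookie set at login is sent again later.
    static readonly CookieContainer SessionCookies = new CookieContainer();

    // Builds the form body; the field names are taken from the question's code.
    public static string BuildLoginBody(string username, string password)
    {
        return "myusername=" + Uri.EscapeDataString(username)
             + "&mypassword=" + Uri.EscapeDataString(password);
    }

    // Hypothetical helper: every request created here uses the shared container.
    public static HttpWebRequest CreateRequest(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "A .NET Web Crawler";
        request.CookieContainer = SessionCookies;
        return request;
    }

    // Sends the request and returns the response body as text.
    static string ReadBody(HttpWebRequest request)
    {
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    public static string Login(string loginUrl, string username, string password)
    {
        byte[] data = Encoding.UTF8.GetBytes(BuildLoginBody(username, password));
        HttpWebRequest request = CreateRequest(loginUrl);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.ContentLength = data.Length;
        using (Stream stream = request.GetRequestStream())
        {
            stream.Write(data, 0, data.Length);
        }
        return ReadBody(request);
    }

    public static string GetWebText(string url)
    {
        return ReadBody(CreateRequest(url));
    }

    static void Main()
    {
        // Log in first, then request the protected page with the same cookies.
        Console.WriteLine(Login("http://localhost/PHP/Login/check_login.php", "john", "123"));
        Console.WriteLine(GetWebText("http://localhost/PHP/Login/userStats.php"));
    }
}
```

Note that the question's Main fetches userStats.php before logging in; with a shared container the login has to come first for the session cookie to exist.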

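Dai is right about using cookies. However, besides loading the cookies into the request, you also need to make sure that newly arrived cookies end up in the CookieContainer variable: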

private static CookieContainer sessionCookies = new CookieContainer();
public static string GetWebText(string url) {
   HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
   request.CookieContainer = sessionCookies; // loading cookies in
   using (HttpWebResponse response = (HttpWebResponse)request.GetResponse()) {
      // Cookies received from the server are exposed as response.Cookies.
      // A container attached to the request normally picks them up by itself,
      // so adding them explicitly is redundant but harmless.
      sessionCookies.Add(response.Cookies);
      ...
      return htmlText;
   }
}

// your code here: this probably just stands for whatever goes between the request and the response (nothing). I have removed it.
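
To check whether the login cookie really ended up in the container, it can help to list what the container would send to a given URL. The sketch below does this offline with CookieContainer.GetCookies. CookieInspection and DescribeCookies are hypothetical names, and PHPSESSID with the value abc123 is only a stand-in for whatever cookie the server actually sets.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

static class CookieInspection
{
    // Returns the "name=value" pairs the container would send to the given URL.
    public static string DescribeCookies(CookieContainer container, string url)
    {
        var parts = new List<string>();
        foreach (Cookie cookie in container.GetCookies(new Uri(url)))
        {
            parts.Add(cookie.Name + "=" + cookie.Value);
        }
        return string.Join("; ", parts);
    }

    static void Main()
    {
        var container = new CookieContainer();
        // Stand-in for a cookie the server would set at login (assumed name and value).
        container.Add(new Uri("http://localhost/"), new Cookie("PHPSESSID", "abc123", "/"));
        Console.WriteLine(DescribeCookies(container, "http://localhost/PHP/Login/userStats.php"));
    }
}
```

In the crawler, calling DescribeCookies on the shared container right after the login request shows whether a session cookie was received. An empty result suggests the login failed or the container was not attached to the request.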