C#: How to store login information when scraping web pages
Keywords: store, login, information, scraping, web page | Updated: 2023-09-27 18:08:59
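I am writing a web crawler in C#. So far, my program can fetch and scan a site's source code. For the site I am targeting, I need to be logged in to access a static page. The login itself works fine and I can scan the source of the response, but when I then navigate to the download page I get an error. I assume this is because I need to tell the site somehow that I am still logged in. How do I do that?
Current code: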
using System;
using System.Net;
using System.IO;
using System.Text;
namespace WebCraler
{
    class MainClass
    {
        static string username = "john" ;
        static string password = "123";
        public static void Main (string[] args)
        {
            Console.WriteLine ("Test login");
            String Page = GetWebText("http://localhost/PHP/Login/userStats.php");
            Console.WriteLine (Page);
            Console.WriteLine ("Test Login");
            String response = loginResponse();
            Console.WriteLine (response);
        }
        public static string GetWebText(string url)
        {
            HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
            request.UserAgent = "A .NET Web Crawler";
            WebResponse response = request.GetResponse();
            Stream stream = response.GetResponseStream();
            StreamReader reader = new StreamReader(stream);
            string htmlText="";
            string line;
            while ((line = reader.ReadLine()) != null){
                if(line.Contains("<td>"))
                {
                    //htmlText += "\n *****Found Andrew Kralovec****** \n";
                }
                htmlText += line+"\n";
            }
            //string htmlText = reader.ReadToEnd();
            return htmlText;
        }
        private static String loginResponse()
        {
            try{
                ASCIIEncoding encoding = new ASCIIEncoding();
                string postData = "myusername=" + username + "&mypassword=" + password;
                byte[] data = encoding.GetBytes(postData);
                WebRequest request = WebRequest.Create("http://localhost/PHP/Login/check_login.php");
                request.Method = "POST";
                request.ContentType = "application/x-www-form-urlencoded";
                request.ContentLength = data.Length;
                Stream stream = request.GetRequestStream();
                stream.Write(data, 0, data.Length);
                stream.Close();
                WebResponse response = request.GetResponse();
                stream = response.GetResponseStream();
                StreamReader steamReader = new StreamReader(stream);
                String htmlRespones = steamReader.ReadToEnd();
                steamReader.Close();
                stream.Close();
                return htmlRespones ;
            }catch{
                String htmlRespones = "Catch Error";
                return htmlRespones ;
            }
        }
    }
}
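When you log in to a website, the server issues a cookie. That cookie must be sent back with every subsequent request so the server knows you are logged in (otherwise you get redirected to the login page or receive some other error).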
By default, HttpWebRequest does not keep cookies between requests, so you need to manage them yourself:
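private CookieContainer sessionCookies = new CookieContainer();
// The url parameter is an addition: WebRequest.Create needs a target address.
public void MakeRequest(string url) {
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    // Attach the shared container so stored cookies are sent with this request.
    request.CookieContainer = this.sessionCookies;
    // your code here
    request.GetResponse();
}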
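Applied to the code in the question, the idea might look like the sketch below. It is only an illustration under a few assumptions: the class name `SessionCrawler` and the helpers `CreateRequest`, `Login`, and `ReadBody` are hypothetical, and the form field names `myusername` and `mypassword` are taken from the question. The key point is that the login request and every later request share one `CookieContainer`.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper class; the names are illustrative, not from the original posts.
public static class SessionCrawler
{
    // One container shared by every request, so cookies set at login
    // are sent again on later requests.
    public static readonly CookieContainer SessionCookies = new CookieContainer();

    // Creates a request wired to the shared cookie container.
    public static HttpWebRequest CreateRequest(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "A .NET Web Crawler";
        request.CookieContainer = SessionCookies;
        return request;
    }

    // Posts the login form; cookies returned by the server are kept in SessionCookies.
    public static string Login(string loginUrl, string username, string password)
    {
        string postData = "myusername=" + Uri.EscapeDataString(username)
                        + "&mypassword=" + Uri.EscapeDataString(password);
        byte[] data = Encoding.ASCII.GetBytes(postData);

        HttpWebRequest request = CreateRequest(loginUrl);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.ContentLength = data.Length;
        using (Stream body = request.GetRequestStream())
        {
            body.Write(data, 0, data.Length);
        }
        return ReadBody(request);
    }

    // Fetches a page, sending whatever cookies the container holds for that URL.
    public static string GetWebText(string url)
    {
        return ReadBody(CreateRequest(url));
    }

    // Sends the request and returns the whole response body as text.
    private static string ReadBody(HttpWebRequest request)
    {
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With this in place, the order of calls matters: calling `SessionCrawler.Login("http://localhost/PHP/Login/check_login.php", "john", "123")` first and `SessionCrawler.GetWebText("http://localhost/PHP/Login/userStats.php")` afterwards lets the second request carry the session cookie. The question's `Main` fetches the protected page before logging in, which would fail even with cookies handled correctly.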
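Dai correctly describes how to make use of cookies. However, besides loading the cookies into the request, you also need to store newly received cookies in the CookieContainer variable: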
private static CookieContainer sessionCookies = new CookieContainer();
public static string GetWebText(string url) {
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.CookieContainer = sessionCookies; // loading cookies in
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    // Store the cookies received from the server in sessionCookies.
    // HttpWebRequest already does this when CookieContainer is set,
    // so the explicit Add is harmless but redundant.
    sessionCookies.Add(response.Cookies);
    ...
    return htmlText;
}
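To check what a container would actually send to a given URL, it can be inspected without any network traffic. The following is a small sketch only: the class `CookieDemo`, its methods, and the cookie name `PHPSESSID` with value `abc123` are placeholders chosen for illustration (the real cookie name depends on the server).

```csharp
using System;
using System.Net;

// Hypothetical demo class; names and cookie values are placeholders.
public static class CookieDemo
{
    // Builds a container holding one made-up session cookie for localhost.
    public static CookieContainer BuildContainer()
    {
        CookieContainer container = new CookieContainer();
        // Path "/" makes the cookie apply to every page on the host.
        container.Add(new Uri("http://localhost/"), new Cookie("PHPSESSID", "abc123", "/"));
        return container;
    }

    // Returns the Cookie header value the container would send for the given URL.
    public static string CookieHeaderFor(CookieContainer container, string url)
    {
        return container.GetCookieHeader(new Uri(url));
    }
}
```

Calling `CookieDemo.CookieHeaderFor(container, url)` for a page on the same host should include the stored cookie, while a URL on a different host should yield nothing. Logging this value right after the login request is a quick way to confirm that the session cookie was captured.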
"// your code here" probably just stands for whatever goes between the request and the response (nothing). I have removed it.