Multithreaded web requests are executing sequentially
Keywords: sequential, execution, web, requests, multithreading | Updated: 2023-09-27 18:27:27
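I'm building a web scraper in C# that works through proxies and makes a large number of requests. Pages are loaded through a ConnectionManager class, which picks a proxy and keeps retrying with random proxies until the page loads correctly.
On average a task needs 100 to 300 requests, so to speed things up I wrote a method that uses multiple threads to download pages concurrently.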
    public Review[] getReviewsMultithreaded(int reviewCount)
    {
        ArrayList reviewList = new ArrayList();
        int currentIndex = 0;
        int currentPage = 1;
        int totalPages = (reviewCount / 10) + 1;
        bool threadHasMoreWork = true;
        Object pageLock = new Object();
        Thread[] threads = new Thread[Program.maxScraperThreads];
        for (int i = 0; i < Program.maxScraperThreads; i++)
        {
            threads[i] = (new Thread(() =>
            {
                while (threadHasMoreWork)
                {
                    HtmlDocument doc;
                    lock (pageLock)
                    {
                        if (currentPage <= totalPages)
                        {
                            string builtString = "http://www.example.com/reviews/" + _ID + "?pageNumber=" + currentPage;
                            //Log.WriteLine(builtString);
                            currentPage++;
                            doc = Program.conManager.loadDocument(builtString);
                        }
                        else
                        {
                            threadHasMoreWork = false;
                            continue;
                        }
                    }
                    try
                    {
                        //Get info from page and add to list
                        reviewList.Add(cRev);
                        Log.WriteLine(_asin + " reviews scraped: " + reviewList.Count);
                    }
                    catch (Exception ex) { continue; }
                }
            }));
            threads[i].Start();
        }
        bool threadsAreRunning = true;
        while (threadsAreRunning) //this is in a separate thread itself, so as not to interrupt the GUI
        {
            threadsAreRunning = false;
            foreach (Thread t in threads)
                if (t.IsAlive)
                {
                    threadsAreRunning = true;
                    Thread.Sleep(2000);
                }
        }
        //flatten the arraylist to a primitive
        return reviewArray;
    }
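However, I've noticed that the requests are still mostly processed one at a time, so the method is not much faster than before. Is the lock causing the problem? Or is it the fact that ConnectionManager is instantiated as a single object and every thread calls loadDocument on that same object?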
Ah, never mind. I noticed that the lock encloses the call to the method that loads the page, so only one page is being loaded at a time.
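For anyone who runs into the same thing, here is a minimal sketch of one way to restructure the loop so the lock only covers claiming the next page number and the slow load happens outside it. It is not the original scraper: ScraperSketch, LoadPages and LoadDocument are hypothetical names, LoadDocument just sleeps to stand in for Program.conManager.loadDocument, and the "ID" in the URL is a placeholder. A second lock guards the shared list, on the assumption that the list type is not safe for concurrent Add calls (List<T> and ArrayList are not).

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

public static class ScraperSketch
{
    // Hypothetical stand-in for Program.conManager.loadDocument:
    // sleeps to mimic network latency and returns a dummy document.
    public static string LoadDocument(string url)
    {
        Thread.Sleep(100);
        return "<html>" + url + "</html>";
    }

    public static List<string> LoadPages(int totalPages, int threadCount)
    {
        var pages = new List<string>();
        var pageLock = new object();
        var listLock = new object();
        int currentPage = 1;

        var threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(() =>
            {
                while (true)
                {
                    int page;
                    // Hold the lock only long enough to claim a page number.
                    lock (pageLock)
                    {
                        if (currentPage > totalPages) return;
                        page = currentPage++;
                    }

                    // The slow call runs outside the lock, so threads can overlap.
                    // "ID" is a placeholder for the real item identifier.
                    string doc = LoadDocument(
                        "http://www.example.com/reviews/ID?pageNumber=" + page);

                    // List<T> is not thread-safe, so guard the shared list too.
                    lock (listLock)
                    {
                        pages.Add(doc);
                    }
                }
            });
            threads[i].Start();
        }

        // Join blocks until each worker has finished; no polling loop is needed.
        foreach (Thread t in threads) t.Join();
        return pages;
    }

    public static void Main()
    {
        Stopwatch watch = Stopwatch.StartNew();
        List<string> pages = LoadPages(20, 5);
        watch.Stop();
        Console.WriteLine(pages.Count + " pages in " + watch.ElapsedMilliseconds + " ms");
    }
}
```

With the simulated 100 ms delay, 20 pages on 5 threads should finish in well under the roughly 2 seconds a fully serialized run would need. Putting the LoadDocument call back inside the pageLock block reproduces the one-at-a-time behaviour described in the question.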