StreamWriter.WriteLine() is not writing everything; strange output
Keywords: output, WriteLine, does not, StreamWriter | Last updated: 2023-09-27 18:30:34
I am writing a program to scrape the links to the faculty bio pages at my university. I am using HtmlAgilityPack. Here is my code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;
using System.IO;

namespace Get_Professor_Data
{
    class Program
    {
        static void Main(string[] args)
        {
            FileStream fs = new FileStream("Links.txt", FileMode.OpenOrCreate, FileAccess.ReadWrite);
            string url, previousurl = "";
            char c = '@';
            StreamWriter writer = new StreamWriter(fs);
            HtmlWeb web = new HtmlWeb();
            for (int i = 0; i < 26; i++)
            {
                HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + c);
                foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
                {
                    c++;
                    url = link.Attributes["href"].Value.ToString();
                    //if (url == previousurl)
                    //    continue;
                    try
                    {
                        if (url.Substring(0, 25).Equals(@"/facultybios/profile.php?", StringComparison.Ordinal))
                        {
                            writer.WriteLine(@"https://www2.aus.edu" + url);
                            writer.Flush();
                        }
                    }
                    catch (Exception ex)
                    {
                    }
                    previousurl = url;
                }
            }
            writer.Close();
        }
    }
}
This is my output:
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=jabdalla
https://www2.aus.edu/facultybios/profile.php?faculty=jsater
https://www2.aus.edu/facultybios/profile.php?faculty=jgriffin
https://www2.aus.edu/facultybios/profile.php?faculty=jfedtke
https://www2.aus.edu/facultybios/profile.php?faculty=jyounas
https://www2.aus.edu/facultybios/profile.php?faculty=jsqualli
https://www2.aus.edu/facultybios/profile.php?faculty=jboisvert
https://www2.aus.edu/facultybios/profile.php?faculty=jvinke
https://www2.aus.edu/facultybios/profile.php?faculty=jbaker
https://www2.aus.edu/facultybios/profile.php?faculty=jhassan
https://www2.aus.edu/facultybios/profile.php?faculty=jpalmer
https://www2.aus.edu/facultybios/profile.php?faculty=jkolo
https://www2.aus.edu/facultybios/profile.php?faculty=jmarch
https://www2.aus.edu/facultybios/profile.php?faculty=jinhyuk
https://www2.aus.edu/facultybios/profile.php?faculty=giesen
https://www2.aus.edu/facultybios/profile.php?faculty=jvangorp
https://www2.aus.edu/facultybios/profile.php?faculty=jswanstrom
https://www2.aus.edu/facultybios/profile.php?faculty=jking
https://www2.aus.edu/facultybios/profile.php?faculty=jmontague
https://www2.aus.edu/facultybios/profile.php?faculty=jallee
https://www2.aus.edu/facultybios/profile.php?faculty=jkatsos
https://www2.aus.edu/facultybios/profile.php?faculty=jbley
https://www2.aus.edu/facultybios/profile.php?faculty=jwallis
https://www2.aus.edu/facultybios/profile.php?faculty=jgibbs
https://www2.aus.edu/facultybios/profile.php?faculty=jroldan
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https
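For some strange reason, only the links from the J page are printed. Some of the links are empty. The last line contains only https (which is why I think the problem lies with the writer rather than with my code logic). I have been trying to solve this for a while, with no luck.
These are the pages I am scraping: https://www2.aus.edu/facultybios/
Any help would be appreciated.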
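I agree 100% with Jon's observations: you should not need to catch an exception at all (instead, just check the length before calling Substring()!), and in any case you should catch only the exceptions you expect to get. You should also use using to handle disposal of the FileStream object and the StreamWriter object (technically, the latter disposes the former for you, but IMHO it is nice to be explicit).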
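As a rough illustration of those two points (check instead of catch, and using for disposal), here is a minimal sketch that needs no network access. The names LinkFilter, IsProfileLink and WriteProfileLinks are hypothetical helpers, not part of the question's code. It also swaps in String.StartsWith for the Substring-plus-length check; StartsWith simply returns false for a string shorter than the prefix, so there is nothing to catch. Note that StreamWriter(path, false) overwrites the target file, which differs from the question's FileMode.OpenOrCreate.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LinkFilter
{
    // Prefix taken from the question's code.
    const string ProfilePrefix = "/facultybios/profile.php?";

    // True when href begins with the profile prefix.
    // StartsWith does not throw for short strings, so no try/catch is needed.
    public static bool IsProfileLink(string href)
    {
        return href != null && href.StartsWith(ProfilePrefix, StringComparison.Ordinal);
    }

    // Hypothetical helper: writes the matching links to path and returns how many were written.
    public static int WriteProfileLinks(IEnumerable<string> hrefs, string path)
    {
        int written = 0;
        // append: false means an existing file is overwritten.
        using (StreamWriter writer = new StreamWriter(path, false))
        {
            foreach (string href in hrefs)
            {
                if (IsProfileLink(href))
                {
                    writer.WriteLine("https://www2.aus.edu" + href);
                    written++;
                }
            }
        } // Leaving the using block flushes and closes the writer.
        return written;
    }

    static void Main()
    {
        // Sample hrefs made up for illustration.
        string[] hrefs = { "#", "/facultybios/profile.php?faculty=jking", "/facultybios/index.php?sort=A" };
        string path = Path.GetTempFileName();
        Console.WriteLine(WriteProfileLinks(hrefs, path) + " link(s) written to " + path);
    }
}
```

One small difference from a strict "longer than 25 characters" check: StartsWith also accepts an href that is exactly the prefix with nothing after it. Whether that matters depends on what the pages actually contain.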
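As for the actual problem, it looks to me like there is one definite bug and one possible bug:
- The definite bug is that you increment c (the variable you use to select the page to scrape) in the wrong scope. That is, you increment its value once for every URL you process. Presumably, you actually want to increment that variable before the inner loop rather than inside it.
That is, instead of this: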
HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + c);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    c++;
you probably want to write this:
HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + c);
c++;
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
or maybe even this:
HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + (c++));
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
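- The possible bug is that you initialize c to the character @. I do not see anything on that page indicating that this is a valid value; it looks like links are shown only when the sort parameter is set to a letter from A to Z (case-insensitive).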
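To make the effect of the misplaced increment concrete, here is a small simulation that runs without any network access. It only tracks which character would be sent as the sort parameter. The fixed linksPerPage count is a made-up assumption for illustration (real pages contain varying numbers of links), and SortKeyDemo and its methods are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

static class SortKeyDemo
{
    // Pattern from the question: c is incremented once per link, inside the inner loop.
    // linksPerPage is a hypothetical, fixed link count used only for illustration.
    public static List<char> KeysWithInnerIncrement(char start, int pages, int linksPerPage)
    {
        var keys = new List<char>();
        char c = start;
        for (int i = 0; i < pages; i++)
        {
            keys.Add(c); // the value that would be sent as sort=
            for (int link = 0; link < linksPerPage; link++)
            {
                c++; // advances once per link, not once per page
            }
        }
        return keys;
    }

    // Alternative: derive the key from the loop index, so it is always 'A', 'B', 'C', ...
    public static List<char> KeysFromLoopIndex(int pages)
    {
        var keys = new List<char>();
        for (int i = 0; i < pages; i++)
        {
            keys.Add((char)('A' + i));
        }
        return keys;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(" ", KeysWithInnerIncrement('@', 4, 5))); // prints "@ E J O"
        Console.WriteLine(string.Join(" ", KeysFromLoopIndex(4)));              // prints "A B C D"
    }
}
```

With five links per page, the inner increment requests @, E, J, O and skips everything in between. With realistic link counts the character would move past Z after the first page or two, so most requests would not use a letter at all.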
With all of that in mind, IMHO a better way to write this code would be something like this:
using (FileStream fs = new FileStream("Links.txt", FileMode.OpenOrCreate, FileAccess.ReadWrite))
using (StreamWriter writer = new StreamWriter(fs))
{
    string url;
    HtmlWeb web = new HtmlWeb();
    for (int i = 0; i < 26; i++)
    {
        // Derive the sort letter from the loop index: 'A' through 'Z'.
        char c = (char)('A' + i);
        HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + c);
        foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
        {
            url = link.Attributes["href"].Value.ToString();
            // Check the length first so that Substring() cannot throw on short hrefs.
            if (url.Length > 25 &&
                url.Substring(0, 25).Equals(@"/facultybios/profile.php?", StringComparison.Ordinal))
            {
                writer.WriteLine(@"https://www2.aus.edu" + url);
                writer.Flush();
            }
        }
    }
}
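One detail the code above carries over from the question is FileMode.OpenOrCreate, which opens an existing Links.txt without truncating it. If a previous run left a longer file behind, the old tail stays in place after the newly written text. This is not necessarily what happened in the question, but it is one way a results file can end up with stale or cut-off lines. The sketch below shows the difference using a temporary file; FileModeDemo and WriteTwice are hypothetical names.

```csharp
using System;
using System.IO;

static class FileModeDemo
{
    // Writes a long line, then reopens the file with secondMode and writes a short one.
    // Returns the final file content.
    public static string WriteTwice(string path, FileMode secondMode)
    {
        File.WriteAllText(path, "first run: a long line of text\n");
        using (FileStream fs = new FileStream(path, secondMode, FileAccess.ReadWrite))
        using (StreamWriter writer = new StreamWriter(fs))
        {
            writer.Write("short"); // written starting at position 0
        }
        return File.ReadAllText(path);
    }

    static void Main()
    {
        string path = Path.GetTempFileName();
        // OpenOrCreate keeps the old bytes beyond what the second write covers.
        Console.Write(WriteTwice(path, FileMode.OpenOrCreate)); // prints "short run: a long line of text"
        // Create truncates the file first, so only the new text remains.
        Console.WriteLine(WriteTwice(path, FileMode.Create));   // prints "short"
    }
}
```

If each run is meant to start from an empty file, FileMode.Create is the usual choice in place of FileMode.OpenOrCreate.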