StreamWriter.WriteLine() is not writing everything; strange output

Keywords: output, WriteLine, StreamWriter | Updated: 2023-09-27 18:30:34

I am writing a program to scrape the links to my university's faculty bio pages. I am using HtmlAgilityPack. Here is my code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;
using System.IO;
namespace Get_Professor_Data
{
    class Program
    {
        static void Main(string[] args)
        {
            FileStream fs = new FileStream("Links.txt", FileMode.OpenOrCreate, FileAccess.ReadWrite);
            string url, previousurl = "";
            char c = '@';
            StreamWriter writer = new StreamWriter(fs);
            HtmlWeb web = new HtmlWeb();
            for (int i = 0; i < 26; i++)
            {
                HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + c);
                foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
                {
                    c++;
                    url = link.Attributes["href"].Value.ToString();
                    //if (url == previousurl)
                    //    continue;
                    try
                    {
                        if (url.Substring(0, 25).Equals(@"/facultybios/profile.php?", StringComparison.Ordinal))
                        {
                            writer.WriteLine(@"https://www2.aus.edu" + url);
                            writer.Flush();
                        }
                    }
                    catch (Exception ex)
                    {
                    }
                    previousurl = url;
                }
            }
            writer.Close();
        }
    }
}

Here is my output:

https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=jabdalla
https://www2.aus.edu/facultybios/profile.php?faculty=jsater
https://www2.aus.edu/facultybios/profile.php?faculty=jgriffin
https://www2.aus.edu/facultybios/profile.php?faculty=jfedtke
https://www2.aus.edu/facultybios/profile.php?faculty=jyounas
https://www2.aus.edu/facultybios/profile.php?faculty=jsqualli
https://www2.aus.edu/facultybios/profile.php?faculty=jboisvert
https://www2.aus.edu/facultybios/profile.php?faculty=jvinke
https://www2.aus.edu/facultybios/profile.php?faculty=jbaker
https://www2.aus.edu/facultybios/profile.php?faculty=jhassan
https://www2.aus.edu/facultybios/profile.php?faculty=jpalmer
https://www2.aus.edu/facultybios/profile.php?faculty=jkolo
https://www2.aus.edu/facultybios/profile.php?faculty=jmarch
https://www2.aus.edu/facultybios/profile.php?faculty=jinhyuk
https://www2.aus.edu/facultybios/profile.php?faculty=giesen
https://www2.aus.edu/facultybios/profile.php?faculty=jvangorp
https://www2.aus.edu/facultybios/profile.php?faculty=jswanstrom
https://www2.aus.edu/facultybios/profile.php?faculty=jking
https://www2.aus.edu/facultybios/profile.php?faculty=jmontague
https://www2.aus.edu/facultybios/profile.php?faculty=jallee
https://www2.aus.edu/facultybios/profile.php?faculty=jkatsos
https://www2.aus.edu/facultybios/profile.php?faculty=jbley
https://www2.aus.edu/facultybios/profile.php?faculty=jwallis
https://www2.aus.edu/facultybios/profile.php?faculty=jgibbs
https://www2.aus.edu/facultybios/profile.php?faculty=jroldan
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https://www2.aus.edu/facultybios/profile.php?faculty=
https

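For some strange reason, only the links from the J page are printed. Some of the links are empty. The last line contains only https (which is why I think the problem lies in the writer rather than in my code logic). I have been trying to solve this for a while, with no luck.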

These are the pages I am scraping: https://www2.aus.edu/facultybios/

Any help would be greatly appreciated.


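I agree 100% with Jon's observation: you do not need to catch exceptions at all (instead, just check the length before calling Substring()!), and in any case you should only catch exceptions you actually expect to get. You should also use using to handle disposal of the FileStream and StreamWriter objects (technically the latter disposes the former for you, but IMHO it is good to be explicit).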
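As a minimal sketch of the "check instead of catch" advice, the prefix test can be done with String.StartsWith, which simply returns false for strings shorter than the prefix. The class and method names below are hypothetical; only the prefix string comes from the question. Note that, unlike a Length > 25 check, this also accepts a URL that consists of the prefix alone.

```csharp
using System;

public static class ProfileLinkFilter
{
    // Prefix copied from the question's code.
    private const string ProfilePrefix = "/facultybios/profile.php?";

    // Returns true when href begins with the profile prefix.
    // StartsWith does not throw for short strings, so no try/catch is needed.
    public static bool IsProfileLink(string href)
    {
        return href != null && href.StartsWith(ProfilePrefix, StringComparison.Ordinal);
    }

    public static void Main()
    {
        Console.WriteLine(IsProfileLink("/facultybios/profile.php?faculty=jking")); // True
        Console.WriteLine(IsProfileLink("/a"));                                     // False
        Console.WriteLine(IsProfileLink(null));                                     // False
    }
}
```

With a helper like this, the empty catch block in the question is no longer needed, so any exception that does occur stays visible instead of being silently swallowed.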

As for the actual problem, it seems to me there is one definite bug and one possible bug:

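  • The definite bug is that you are incrementing c (the variable you use to select which page to scrape) in the wrong scope. That is, you increment its value once for every URL you process, so the letter requested on the next pass depends on how many links the previous page contained. Presumably, you actually want to increment the variable before the inner loop rather than inside it.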

That is, instead of this:

HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + c);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    c++;

you probably want to write this:

HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + c);
c++;
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{

or possibly even this:

HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + (c++));
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
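{

  • The possible bug is that you initialize c to the character @. I don't see anything on that page indicating that this is a valid value; it looks like the page only displays links when the sort parameter is set to a letter from A to Z (case-insensitive).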
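To make the effect of the misplaced increment concrete, here is a small simulation that does no network access. It only tracks which sort value would be requested on each pass of the outer loop. The names and the fixed linksPerPage count are hypothetical; the real pages contain varying numbers of links.

```csharp
using System;
using System.Collections.Generic;

public static class SortLetterDemo
{
    // Sort values requested when c is incremented once per link, as in the question.
    public static List<char> IncrementPerLink(char start, int pages, int linksPerPage)
    {
        var requested = new List<char>();
        char c = start;
        for (int i = 0; i < pages; i++)
        {
            requested.Add(c);                      // page loaded with sort=c
            for (int j = 0; j < linksPerPage; j++)
            {
                c++;                               // bug: advances once per link
            }
        }
        return requested;
    }

    // Sort values requested when c is incremented once per page.
    public static List<char> IncrementPerPage(char start, int pages)
    {
        var requested = new List<char>();
        char c = start;
        for (int i = 0; i < pages; i++)
        {
            requested.Add(c);                      // page loaded with sort=c
            c++;                                   // advances once per page
        }
        return requested;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", IncrementPerLink('@', 4, 3))); // @,C,F,I
        Console.WriteLine(string.Join(",", IncrementPerPage('A', 4)));    // A,B,C,D
    }
}
```

With real pages holding dozens of links each, the per-link version quickly moves past Z, which would be consistent with most requests returning no usable faculty links.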

With all that in mind, IMHO a better way to write this code would be something like this:

// Both objects are disposed automatically, even if an exception is thrown.
using (FileStream fs = new FileStream("Links.txt", FileMode.OpenOrCreate, FileAccess.ReadWrite))
using (StreamWriter writer = new StreamWriter(fs))
{
    string url;
    HtmlWeb web = new HtmlWeb();
    for (int i = 0; i < 26; i++)
    {
        // One index page per letter, 'A' through 'Z'.
        char c = (char)('A' + i);
        HtmlDocument doc = web.Load(@"https://www2.aus.edu/facultybios/index.php?sort=" + c);
        foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
        {
            url = link.Attributes["href"].Value.ToString();
            // Check the length first so Substring() cannot throw on short URLs.
            if (url.Length > 25 &&
                url.Substring(0, 25).Equals(@"/facultybios/profile.php?", StringComparison.Ordinal))
            {
                writer.WriteLine(@"https://www2.aus.edu" + url);
                writer.Flush();
            }
        }
    }
}
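
One further detail that may be worth checking, though it is not part of the diagnosis above: FileMode.OpenOrCreate, used in both versions of the code, does not truncate an existing file. If Links.txt is left over from an earlier, longer run, the old bytes past the end of the new output remain in the file. The sketch below demonstrates this with a hypothetical temporary file name and contrasts it with FileMode.Create, which does truncate.

```csharp
using System;
using System.IO;

public static class FileModeDemo
{
    // Writes text to path using the given FileMode and returns the resulting file content.
    public static string WriteWith(string path, FileMode mode, string text)
    {
        using (FileStream fs = new FileStream(path, mode, FileAccess.ReadWrite))
        using (StreamWriter writer = new StreamWriter(fs))
        {
            writer.Write(text);
        }
        return File.ReadAllText(path);
    }

    public static void Main()
    {
        // Hypothetical scratch file used only for this demonstration.
        string path = Path.Combine(Path.GetTempPath(), "filemode-demo.txt");

        File.WriteAllText(path, "0123456789");
        // OpenOrCreate overwrites from the start but keeps the old tail.
        Console.WriteLine(WriteWith(path, FileMode.OpenOrCreate, "abc")); // abc3456789

        // Create truncates the existing file before writing.
        Console.WriteLine(WriteWith(path, FileMode.Create, "abc"));       // abc

        File.Delete(path);
    }
}
```

If stale content turns out to be a factor, switching to FileMode.Create (or deleting the file before each run) would rule it out.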