Read text from a web page

Keywords: read text from web page | Updated: 2023-09-27 18:07:07

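Please note: I do not want to read the HTML content of the page. Instead, I want to read the text from the web page. Imagine the following example, if you will:

A PHP script echoes "Hello User X" to the current page, so the page the user now sees (mostly blank) has "Hello User X" printed in its top-left corner. From my C# application, I want to read that text into a string.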

String strPageData = functionToReadPageData("http://www.myURL.com/file.php");
Console.WriteLine(strPageData); // Outputs "Hello User X" to the Console.

In VB6, I was able to do this with the following APIs:

InternetOpen
InternetOpenURL
InternetReadFile
InternetCloseHandle

I have tried to port my VB6 code to C#, but I have had no luck, so I would greatly appreciate a C# method that accomplishes the task above.

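I am not aware of any part of the .NET Framework that automatically extracts all the text from an HTML file, and I very much doubt one exists.

You can try HtmlAgilityPack (third party) to access the text elements and other nodes in an HTML document.

You still need to write the logic that finds the right HTML elements. Take an HTML page like this: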

<html>
     <body>Some text</body>
</html>

You would then use XPath to locate the body tag and read its contents.

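// doc is an HtmlAgilityPack.HtmlDocument that has already been loaded.
// SelectSingleNode returns one node; SelectNodes would return a collection.
HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
string bodyContent = body.InnerText;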

Following this pattern, you can read every element on the page. You may need some post-processing to remove line breaks, comments, and so on.

http://htmlagilitypack.codeplex.com/wikipage?title=Examples
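
To make the pattern above concrete, here is a minimal sketch that parses an HTML string and returns the text of its body. It assumes the HtmlAgilityPack NuGet package is referenced; the class name BodyTextExample and the method name ExtractBodyText are made up for illustration and are not part of the library.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

public static class BodyTextExample
{
    // Hypothetical helper: returns the text inside <body>,
    // or an empty string when the document has no <body> element.
    public static string ExtractBodyText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        if (body == null)
        {
            return string.Empty;
        }

        // InnerText keeps entities such as &amp; as written,
        // so decode them before trimming surrounding whitespace.
        return HtmlEntity.DeEntitize(body.InnerText).Trim();
    }

    public static void Main()
    {
        Console.WriteLine(ExtractBodyText("<html><body>Some text</body></html>"));
    }
}
```

InnerText also includes the contents of script and style elements, so a real page would probably need those nodes removed first.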

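I know this is an older post, but I am surprised nobody has mentioned Microsoft.mshtml, which works well for this kind of thing. You need to add a reference to Microsoft.mshtml:

Right-click References under your project in Solution Explorer, then click Add Reference.... Under Assemblies, type "HTML" and you will see Microsoft.mshtml.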

using System.Net;
using mshtml;
using (var client = new WebClient())
{
    var s = client.DownloadString(@"https://stackoverflow.com/questions/7264659/read-text-from-web-page");
    var htmldoc2 = (IHTMLDocument2)new HTMLDocument();
    htmldoc2.write(s);
    var plainText = htmldoc2.body.outerText;
    Console.WriteLine(plainText);
}

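This returns the "outerText" of the web page, which is essentially the text displayed when you visit it in a web browser.

You should use the WebClient class to do this.

The following code may help you.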

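string result = "";
try
{
    // Read a saved HTML file from disk (IOParams.ConfigPath is defined elsewhere).
    using (StreamReader sr = new StreamReader(IOParams.ConfigPath + "SUCCESSEMPTMP.HTML"))
    {
        result = sr.ReadToEnd();
        // Turn every body tag into the same "<body>" marker so a single Split isolates the body.
        result = result.Replace("<body/>", "<body>");
        result = result.Replace("</body>", "<body>");
        List<string> body = new List<string>(result.Split(new string[] { "<body>" }, StringSplitOptions.None));
        // Three parts are expected: before the body, the body itself, after the body.
        if (body.Count > 2)
        {
            result = body[1];
        }
    }
}
catch (Exception)
{
    throw; // rethrow without resetting the stack trace
}
return result;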
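
For the case described in the question, where the PHP script echoes only plain text with no markup, the response body is already the text, so downloading it as a string may be all that is needed. The sketch below uses WebClient.DownloadString and borrows the name functionToReadPageData from the question's pseudocode. The class name PageReader and the UTF-8 encoding are assumptions, and the URL is the question's placeholder.

```csharp
using System;
using System.Net;
using System.Text;

public static class PageReader
{
    // Name taken from the question's pseudocode.
    public static string functionToReadPageData(string url)
    {
        using (var client = new WebClient())
        {
            // Assumption: the page is served as UTF-8.
            client.Encoding = Encoding.UTF8;
            // Downloads the whole response body as one string.
            return client.DownloadString(url);
        }
    }

    public static void Main()
    {
        // Placeholder URL from the question; replace it with a real address.
        string strPageData = functionToReadPageData("http://www.myURL.com/file.php");
        Console.WriteLine(strPageData);
    }
}
```

If the page does contain markup, the returned string is raw HTML and still has to go through one of the text-extraction approaches above. On .NET 6 and later, WebClient is marked obsolete in favor of HttpClient, so treat this as a sketch for the .NET Framework setting of the original question.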