C#, Internet Explorer, and stripping HTML tags

Keywords: HTML, tag stripping, Explorer, C#, Internet | Updated: 2023-09-27 18:25:10

Is there any way to start an Internet Explorer process from C#, send HTML content to that browser, and capture the content as it is "displayed"?

I know about other HTML-stripping approaches (HtmlAgilityPack, for example), but I would like to explore the route described above.

Thanks, LG

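You can host IE inside your application with the WebBrowser control, which is available for both WinForms and WPF. You can then set the control's Source to your HTML, wait for the content to load (use the LayoutUpdated event rather than the Loaded event; Loaded is raised once the HTML has finished downloading, which does not necessarily mean it has been laid out or that all dynamic JS has run), and then access the Document property to get the HTML.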
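As a rough illustration of that idea, here is a minimal sketch using the WinForms flavour of the control. It is not the answerer's code: the names RenderedTextExtractor and GetDisplayedText are made up for this example, and it swaps in the WinForms DocumentText property and DocumentCompleted event in place of the Source and LayoutUpdated members mentioned above. It assumes a Windows machine and a reference to System.Windows.Forms. Text produced by scripts that run after the document has finished loading may not be captured.

```csharp
using System;
using System.Threading;
using System.Windows.Forms;

// Hypothetical helper, not part of any library.
static class RenderedTextExtractor
{
    // Loads the given HTML into an off-screen WebBrowser control and returns
    // the text of the rendered body (Document.Body.InnerText).
    public static string GetDisplayedText(string html)
    {
        string result = null;

        // The WebBrowser control wraps an ActiveX component, so it needs an
        // STA thread with a running message loop.
        var thread = new Thread(() =>
        {
            using (var browser = new WebBrowser())
            {
                browser.ScriptErrorsSuppressed = true;
                browser.DocumentCompleted += (sender, e) =>
                {
                    HtmlElement body = browser.Document != null ? browser.Document.Body : null;
                    result = body != null ? body.InnerText : string.Empty;
                    Application.ExitThread(); // stop the message loop started below
                };
                browser.DocumentText = html;  // starts loading the markup
                Application.Run();            // pump messages until ExitThread is called
            }
        });
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();
        return result;
    }
}
```

Calling RenderedTextExtractor.GetDisplayedText("<p>Hello <b>world</b></p>") should give back the visible text with the tags gone. Running the control on its own STA thread keeps the sketch usable from a console application as well as from a form.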

    using System;
    using System.Collections.Generic;
    using System.Net;
    using System.Text.RegularExpressions;

    // Hypothetical wrapper class so the snippet compiles on its own.
    public class LinkScraper
    {
        // Downloads the page at the given URL and returns every link found in it.
        // If anything fails (bad URL, network error), whatever was collected so far is returned.
        public List<LinkItem> getListOfLinksFromPage(string webpage)
        {
            List<LinkItem> list = new List<LinkItem>();
            try
            {
                using (WebClient w = new WebClient())
                {
                    string s = w.DownloadString(webpage);
                    foreach (LinkItem i in LinkFinder.Find(s))
                    {
                        //Debug.WriteLine(i);
                        //richTextBox1.AppendText(i.ToString() + "\n");
                        list.Add(i);
                    }
                }
                return list;
            }
            catch (Exception)
            {
                return list;
            }
        }
    }

    // One link: the target URL and the visible text of the anchor.
    public struct LinkItem
    {
        public string Href;
        public string Text;
        public override string ToString()
        {
            return Href;
        }
    }

    static class LinkFinder
    {
        public static List<LinkItem> Find(string file)
        {
            List<LinkItem> list = new List<LinkItem>();
            // 1. Find all <a ...>...</a> matches in the file.
            MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)", RegexOptions.Singleline);
            // 2. Loop over each match.
            foreach (Match m in m1)
            {
                string value = m.Groups[1].Value;
                LinkItem i = new LinkItem();
                // 3. Get the href attribute (double-quoted values only).
                Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                    RegexOptions.Singleline);
                if (m2.Success)
                {
                    i.Href = m2.Groups[1].Value;
                }
                // 4. Remove the tags (and the whitespace around them) from the text.
                string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                    RegexOptions.Singleline);
                i.Text = t;
                list.Add(i);
            }
            return list;
        }
    }

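Someone else wrote the regular expressions, so I can't take credit for them, but the code above opens a WebClient for the page you pass in and uses regular expressions to find all the links on that page. I'm not sure this is what you are after, but if you just want to "grab" all of the HTML content and save it to a file, you can simply write the string "s", created on the line "string s = w.DownloadString(webpage);", to a file.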
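For the save-to-file variant, something along these lines might do. This is only a sketch: the class name PageSaver and its two methods are invented for illustration, and the choice of UTF-8 for the output file is an assumption rather than anything the answer specifies.

```csharp
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper, not part of any library.
static class PageSaver
{
    // Writes the raw HTML string to disk. UTF-8 is an assumption made here.
    public static void SaveHtml(string html, string path)
    {
        File.WriteAllText(path, html, Encoding.UTF8);
    }

    // Downloads the page with WebClient, as in the answer, and saves it unchanged.
    public static void DownloadAndSave(string url, string path)
    {
        using (var client = new WebClient())
        {
            string s = client.DownloadString(url);
            SaveHtml(s, path);
        }
    }
}
```

Keeping the write in its own method means the same call can save a string you already have in memory, such as the "s" from the code above, without downloading the page a second time.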