C#: Convert HTML to text while preserving line breaks

Keywords: line, preserve, HTML, convert, text | Updated: 2023-09-27 18:03:33

I have some HTML saved in a file named a.txt that looks like this:

<HTML> <HEAD>      <TITLE></TITLE> </HEAD> 
<BODY STYLE="font: 10pt Times New Roman, Times, Serif">  <P STYLE="margin: 0"></P>  <P STYLE="font: 10pt Times New Roman, Times, Serif; margin: 0pt 0; text-align: center">UNITED STATES</P>  <P STYLE="font: 10pt Times New Roman, Times, Serif; margin: 0pt 0; text-align: center">SECURITIES AND EXCHANGE COMMISSION</P>  <P STYLE="font: 10pt Times New Roman, Times, Serif; margin: 0pt 0; text-align: center">WASHINGTON, D.C. 20549</P>  
<P STYLE="font: 10pt Times New Roman, Times, Serif; margin: 0pt 0; text-align: center">&nbsp;</P>  <P STYLE="font: 10pt Times New Roman, Times, Serif; margin: 0pt 0; text-align: center"></P>  <P STYLE="font: 10pt Times New Roman, Times, Serif; margin: 0pt 0; text-align: center"><B>&nbsp;</B></P>   
<TABLE CELLSPACING="0" CELLPADDING="0" STYLE="font: 10pt Times New Roman, Times, Serif; width: 100%; border-collapse: collapse"> <TR STYLE="vertical-align: top">     <TD STYLE="width: 5%; padding-right: 5.4pt; padding-left: 5.4pt"><FONT STYLE="font-size: 10pt">[X]</FONT></TD>     <TD STYLE="width: 95%; padding-right: 5.4pt; padding-left: 5.4pt"><FONT STYLE="font-size: 10pt">ANNUAL REPORT UNDER SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934</FONT></TD></TR> <TR STYLE="vertical-align: top">     
<TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt"></TD>     
<TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt">&nbsp;</TD></TR> <TR STYLE="vertical-align: top">     <TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt"></TD> 
<TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt; text-align: right"><FONT STYLE="font-size: 10pt">For the fiscal year ended <B><U>October 31, 2012</U></B></FONT></TD></TR> <TR STYLE="vertical-align: top">     <TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt"></TD>     <TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt">&nbsp;</TD></TR> <TR STYLE="vertical-align: top">     <TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt"><FONT STYLE="font-size: 10pt">[ ]</FONT></TD>     <TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt"><FONT STYLE="font-size: 10pt">TRANSITION REPORT UNDER SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934</FONT></TD></TR> <TR STYLE="vertical-align: top">    
<TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt"></TD>     <TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt">&nbsp;</TD></TR> <TR STYLE="vertical-align: top">    
 <TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt"></TD>     <TD STYLE="padding-right: 5.4pt; padding-left: 5.4pt; text-align: right"><FONT STYLE="font-size: 10pt">For the transition period from _________ to ________</FONT></TD></TR>

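I need the extracted text to keep its line breaks, but all of the text ends up merged into a single line. How can I handle this? Here is my C# code: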

string text = File.ReadAllText(@"C:\a.txt", Encoding.UTF8);
Regex regex = new Regex("<[^>]+>");
text = regex.Replace(text, " ").Replace("(&#160;)+", Environment.NewLine).Replace("&#32;", "").Replace("&#8217;", "'").Replace("\r\n\r\n(\r\n)+", Environment.NewLine);
text = HttpUtility.HtmlDecode(text);
Console.WriteLine(text);


I would never use a regex to parse HTML. Use HtmlAgilityPack instead; you can do a lot with simple XPath queries, for example:

        HtmlDocument doc = new HtmlDocument();
        doc.Load(@"C:\temp\stackoverflow\question23657841\question23657841\a.html");
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//p"))
        {
            Console.WriteLine(node.InnerHtml);
        }

The output is:

UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
WASHINGTON, D.C. 20549
&nbsp;
<b>&nbsp;</b>

Simply switch the XPath query to //font and you get this:

[X]
ANNUAL REPORT UNDER SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the fiscal year ended <b><u>October 31, 2012</u></b>
[ ]
TRANSITION REPORT UNDER SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
For the transition period from _________ to ________
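
If the goal is plain text with one line per paragraph or table cell, the same HtmlAgilityPack approach can be extended a little. What follows is only a sketch under a few assumptions: the names `HtmlToLines` and `Extract` are made up for illustration, the HtmlAgilityPack NuGet package is referenced, and treating every `<p>` and `<td>` as its own output line is just one reading of what "preserving line breaks" should mean for this document. It uses `InnerText` rather than `InnerHtml` so that nested tags such as `<b>` are dropped, and `HtmlEntity.DeEntitize` so that entities such as `&nbsp;` become real characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack itself.
static class HtmlToLines
{
    // Returns one text line per non-empty <p> or <td> element, in document order.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var lines = new List<string>();

        // HtmlAgilityPack exposes element names in lower case,
        // so "<P>" in the source is matched by "p" here.
        var blocks = doc.DocumentNode.Descendants()
                        .Where(n => n.Name == "p" || n.Name == "td");

        foreach (HtmlNode node in blocks)
        {
            // InnerText strips nested markup; DeEntitize decodes &nbsp; and friends.
            string text = HtmlEntity.DeEntitize(node.InnerText)
                                    .Replace('\u00A0', ' ')   // non-breaking space -> plain space
                                    .Trim();

            // Skip cells and paragraphs that only held whitespace or &nbsp;.
            if (text.Length > 0)
                lines.Add(text);
        }
        return lines;
    }
}
```

Calling `string.Join(Environment.NewLine, HtmlToLines.Extract(File.ReadAllText(path)))` (with `path` pointing at the saved HTML file) would then give a single string with one block per line. Note that nested tables would make the text of an inner cell appear twice, once for the outer `<td>` and once for the inner one; the sample above has no nested tables, so the sketch does not handle that case.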

Why not read the file line by line? File.ReadAllLines() does exactly that.
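
A rough sketch of that line-by-line idea, using only the base class library, might look like the following. The names `LineByLine`, `StripLine` and `StripLines` are invented for this example. One thing worth pointing out: `string.Replace` matches its argument literally, so patterns such as `(&#160;)+` in the question's code are never treated as regular expressions; the sketch uses `Regex.Replace` for anything that is meant to be a pattern. It also assumes that no tag is split across two source lines, which the simple `<[^>]+>` pattern would not cope with.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
static class LineByLine
{
    static readonly Regex Tag = new Regex("<[^>]+>");
    static readonly Regex Whitespace = new Regex(@"\s+");

    // Strips tags from a single source line and decodes HTML entities.
    public static string StripLine(string htmlLine)
    {
        string noTags = Tag.Replace(htmlLine, " ");
        string decoded = WebUtility.HtmlDecode(noTags);
        // \s also matches the non-breaking space produced by &nbsp;.
        return Whitespace.Replace(decoded, " ").Trim();
    }

    // Processes each source line separately, so source line breaks are kept;
    // lines that contain no text after stripping are dropped.
    public static IEnumerable<string> StripLines(IEnumerable<string> htmlLines)
    {
        foreach (string line in htmlLines)
        {
            string text = StripLine(line);
            if (text.Length > 0)
                yield return text;
        }
    }
}
```

It could be used as `foreach (string l in LineByLine.StripLines(File.ReadAllLines(@"C:\a.txt", Encoding.UTF8))) Console.WriteLine(l);`. Keep in mind that this preserves the line breaks of the HTML source, not of the rendered page: in the sample above several `<P>` elements share one source line, so their text would still come out on a single line.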