Remove HTML code and merge paragraphs

Keywords: merge, paragraphs, code, HTML, remove | Last updated: 2023-09-27 18:11:08

I have the following input:

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc a dignissim purus. Curabitur enim nibh, tempor id lobortis tincidunt, adipiscing ac felis. Nunc interdum ullamcorper tortor non elementum. Praesent felis mauris, volutpat eu cursus nec, luctus vel odio.</p>
<p>Morbi elementum nunc at nulla iaculis tincidunt. Vivamus sit amet sapien vel enim lacinia ultrices sit amet ac urna. Sed semper mauris id nulla consectetur viverra. Quisque eget leo nisl. Etiam et risus sapien. Aenean vitae ante et erat tincidunt ullamcorper vel a odio. Integer hendrerit turpis et enim convallis rhoncus pharetra enim ullamcorper. Suspendisse porta mollis purus, in lacinia nunc sollicitudin vel. Nam id ligula mi.</p>

How can I get output with the HTML code removed (that part is easy), but also with the paragraphs merged into one? Like this:

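Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc a dignissim purus. Curabitur enim nibh, tempor id lobortis tincidunt, adipiscing ac felis. Nunc interdum ullamcorper tortor non elementum. Praesent felis mauris, volutpat eu cursus nec, luctus vel odio. Morbi elementum nunc at nulla iaculis tincidunt. Vivamus sit amet sapien vel enim lacinia ultrices sit amet ac urna. Sed semper mauris id nulla consectetur viverra. Quisque eget leo nisl. Etiam et risus sapien. Aenean vitae ante et erat tincidunt ullamcorper vel a odio. Integer hendrerit turpis et enim convallis rhoncus pharetra enim ullamcorper. Suspendisse porta mollis purus, in lacinia nunc sollicitudin vel. Nam id ligula mi.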

Thanks

This is quite easy with an HTML parser such as HTML Agility Pack:

// requires the HtmlAgilityPack NuGet package
using HtmlAgilityPack;

// remove the HTML tags
var doc = new HtmlDocument();
doc.LoadHtml(htmlString);
string result = doc.DocumentNode.InnerText;
// remove the line breaks
result = result.Replace("\r", "");
result = result.Replace("\n", "");

Just read the HTML, replace <p> and </p> with "", and remove the line breaks (\r\n); I think that should get you going.
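A minimal sketch of that string-replacement idea, assuming the input contains only bare <p> and </p> tags with no attributes, as in the question's sample. The class and method names (ParagraphMerger, MergeParagraphs) are hypothetical. One deviation from the suggestion above: </p> is replaced with a space rather than "", so that the last sentence of one paragraph does not run into the first sentence of the next.

```csharp
using System;

public static class ParagraphMerger
{
    // Hypothetical helper: only handles plain <p>...</p> tags without attributes.
    public static string MergeParagraphs(string html)
    {
        return html
            .Replace("</p>", " ")  // keep a space between adjacent paragraphs
            .Replace("<p>", "")
            .Replace("\r", "")     // drop carriage returns
            .Replace("\n", "")     // drop line feeds
            .Trim();               // remove the trailing space left by the last </p>
    }
}
```

For anything beyond this simple markup (attributes, nested tags, entities), a parser-based approach is likely the safer choice.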

Once you have stripped the HTML, you can use a regular expression to collapse the extra whitespace:

using System.Text.RegularExpressions;

string input = "Lorem ipsum dolor sit amet, consectetur \r\n Morbi elementum nunc at nulla.";
string pattern = @"\s+"; // one or more whitespace characters, including line breaks
string replacement = " ";
string output = Regex.Replace(input, pattern, replacement);
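Putting the steps together without a third-party parser could look roughly like the sketch below. It is an illustration under assumptions, not a robust HTML stripper: the tag pattern <[^>]+> is a simplification that only suits simple, well-formed markup like the question's sample, and the names HtmlText and ToSingleParagraph are hypothetical. Tags are replaced with a space so adjacent paragraphs stay separated, entities are decoded with WebUtility.HtmlDecode, and the whitespace regex from above then collapses everything into single spaces.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: strips simple tags and merges the text into one paragraph.
    public static string ToSingleParagraph(string html)
    {
        // Replace each tag with a space (simplified pattern, an assumption).
        string noTags = Regex.Replace(html, @"<[^>]+>", " ");

        // Turn entities such as &amp; back into plain characters.
        string decoded = WebUtility.HtmlDecode(noTags);

        // Collapse runs of whitespace, including \r\n, into a single space.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

Calling HtmlText.ToSingleParagraph on the question's two <p> blocks should yield one continuous line of text with a single space between the paragraphs.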