Can you use Regex but keep the formatting?

Keywords: formatting, Regex | Updated: 2023-09-27 18:01:04

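I have this line of code that removes the HTML tags showing up in my text, but it also loses all formatting. Is there a way to strip the HTML tags while keeping the text's formatting, such as bold and italics?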

 report.Description = Regex.Replace(report.Description, "<.*?>|&nbsp;", string.Empty);

This is the line of code that displays the Description field:

        graphics.DrawString("" + report.Description, font2, XBrushes.Black, new XRect(margin, page.Height - (lineHeight * 35), page.Width, page.Height), XStringFormats.TopCenter);

I also have this public property in my reports.cs file:

  public string Description { get; set; }

I am using PDFsharp to display this in a PDF. Any suggestions or support would be appreciated. Thank you very much.

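This sounds a lot like the kind of filter used to prevent cross-site scripting attacks. The idea is to keep a subset of HTML elements that are considered safe or desirable and discard all the others.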

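Several overloads of Regex.Replace accept a MatchEvaluator delegate, which is invoked each time the regular expression finds a match. The logic for keeping certain elements can be implemented in that delegate.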

The following class may meet your needs.

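using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HtmlFilter
{
    // Tag names to keep; the comparison ignores case.
    private static HashSet<string> _keep;
    static HtmlFilter()
    {
        _keep = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        _keep.Add("b");
        _keep.Add("em");
        _keep.Add("i");
        _keep.Add("span");
        _keep.Add("strong");
        // Add other tags as needed.
    }
    // Called once per match: returns the tag unchanged if it is in the
    // keep list, otherwise removes it.
    private static string ElementFilter(Match match)
    {
        // Empty when the &nbsp; alternative matched, so &nbsp; is removed.
        string tag = match.Groups["tag"].Value;
        if (_keep.Contains(tag))
            return match.Value;
        else
            return string.Empty;
    }
    public static string Apply(string input)
    {
        // Matches an opening or closing tag (capturing its name) or &nbsp;.
        Regex regex = new Regex(@"</?(?<tag>\w*)[^>]*>|&nbsp;");
        return regex.Replace(input, new MatchEvaluator(ElementFilter));
    }
}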

You can then filter your report description with:

report.Description = HtmlFilter.Apply(report.Description);

Note that the regular expression preserves HTML attributes, so formatting elements such as <span style="..."> are kept intact.
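
To make the behaviour concrete, here is a minimal sketch of the same whitelist approach applied to a sample string. The class name HtmlFilterDemo, the sample input, and the compact lambda form are assumptions for illustration and are not taken from the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class (not part of the original answer): a compact
// whitelist filter using the same pattern, applied to a sample string.
public static class HtmlFilterDemo
{
    // Tag names to keep, compared case-insensitively.
    private static readonly HashSet<string> Keep =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "b", "em", "i", "span", "strong" };

    // Matches an opening or closing tag (capturing its name) or &nbsp;.
    private static readonly Regex TagOrNbsp = new Regex(@"</?(?<tag>\w*)[^>]*>|&nbsp;");

    public static string Apply(string input)
    {
        // The "tag" group is empty when the &nbsp; alternative matched,
        // so that match is dropped along with non-whitelisted tags.
        return TagOrNbsp.Replace(input,
            m => Keep.Contains(m.Groups["tag"].Value) ? m.Value : string.Empty);
    }

    public static void Main()
    {
        string sample = "<p>Hello&nbsp;<b>bold</b> and <span style=\"color:red\">red</span></p>";
        Console.WriteLine(Apply(sample));
        // prints: Hello<b>bold</b> and <span style="color:red">red</span>
    }
}
```

Two things are worth noticing in the output. The &nbsp; entity is removed rather than replaced with a space, so "Hello" runs directly into the bold text; replacing it with " " may be preferable. Also, the result is still a string containing markup, so turning the kept tags into actual bold or italic text is a separate rendering step.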