Transforming a string with a regular expression

Keywords: string, transformation, regular expression | Updated: 2023-09-27 18:36:22

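I have some HTML content that I need to modify using C#. It is conceptually simple, but I'm not sure how to do it efficiently. The content contains multiple occurrences of a delimited number followed by an empty anchor tag. I need to take the delimited number and insert it into a JavaScript function call in the anchor tag.

For example, the source string will contain something like this: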

%%1%%<a href="#"></a> 
<p>A bunch of HTML markup</p>
%%2%%<a href="#"></a>
<p>Some more HTML markup</p>

I need to transform it into:

<a href="#" onclick="DoSomething('1')"></a> 
<p>A bunch of HTML markup</p>
<a href="#" onclick="DoSomething('2')"></a>
<p>Some more HTML markup</p>

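There is no limit on the number of occurrences of %%\d+%%. I tried writing a regular expression in the hope of using the Replace method, but I'm not sure it can even be used when each group has multiple instances. This is what I have: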

%%(?<LinkID>\d+)%%(?<LinkStart><a[\s\S]*?)(?:(?<LinkEnd>>[\s\S]*?)(?=%%\d+|$))
// %%(?<LinkID>\d+)%%        Match a number surrounded by %% and put the number in a group named LinkID
// (?<LinkStart><a[\s\S]*?)  Match <a followed by any characters until next match (non greedy), in a group named LinkStart
// (?:                       Logical grouping that does not get captured
// (?<LinkEnd>>[\s\S]*?)     Match > followed by any characters until next match, in a group named LinkEnd
// (?=%%\d+%%|$)             Where the former LinkEnd group is followed by another instance of a delimited number or the end of the string. (I don't think this is working as I intended.)

Maybe some combination of several regex operations and String.Format would work. I'm no expert with regular expressions.

Transforming a string with a regular expression

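Using regular expressions to parse HTML has been covered extensively on SO. The consensus is that it should not be done.

If you need to parse your HTML, I'd suggest using something like the HTML Agility Pack. It lets you identify the HTML you want to process with something similar to XPath.

I'd say your regex is almost what you want; I changed it slightly. This will work if $ only matches at the end of the string (that is, RegexOptions.Multiline is not set). Note that in .NET the dot does not match a newline by default, so if the content spans several lines you will also need RegexOptions.Singleline for the (.*?) part: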

%%(\d+)%%(<a[^>]*)(></a>)(.*?)(?=%%\d|$)

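If you decide to use it, you can access the groups of each match and construct a new string from them, which may be easier than replacing content inside the existing string.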
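As a rough illustration of building the new string from the groups of each match, here is a minimal sketch. The class name AnchorRewriter and the method name Rewrite are made up for this example, and it assumes the anchors are exactly of the empty `<a ...></a>` form shown in the question.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class AnchorRewriter
{
    // Singleline lets '.' cross newlines; Multiline is left off so that
    // '$' does not match at the end of every line.
    private static readonly Regex Pattern = new Regex(
        @"%%(\d+)%%(<a[^>]*)(></a>)(.*?)(?=%%\d|$)",
        RegexOptions.Singleline);

    public static string Rewrite(string input)
    {
        var sb = new StringBuilder();
        int last = 0;
        foreach (Match m in Pattern.Matches(input))
        {
            // Copy any text between the previous match and this one unchanged.
            sb.Append(input, last, m.Index - last);
            // Group 2: "<a ..." up to the closing '>', group 1: the number,
            // group 3: "></a>", group 4: the markup that follows the anchor.
            sb.Append(m.Groups[2].Value)
              .Append(" onclick=\"DoSomething('")
              .Append(m.Groups[1].Value)
              .Append("')\"")
              .Append(m.Groups[3].Value)
              .Append(m.Groups[4].Value);
            last = m.Index + m.Length;
        }
        // Copy whatever is left after the last match.
        sb.Append(input, last, input.Length - last);
        return sb.ToString();
    }
}
```

Calling AnchorRewriter.Rewrite(src) on the sample input from the question should drop each %%n%% marker and add the onclick attribute to the anchor that follows it, leaving the rest of the markup as it was.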

I would use string.Split to solve this.

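using System;
using System.Text;

// GetData() and GetDataFromID() stand in for your own code.
string emptyAnchor = @"<a href=""#""></a>";
string src = GetData();
string[] splits = src.Split(new string[] { "%%" }, StringSplitOptions.None);
StringBuilder sb = new StringBuilder();
// splits[0] is whatever precedes the first %% marker (blank in the sample input)
sb.Append(splits[0]);
// after that the entries alternate: ID, then the text that follows it
int i = 1;
while (i + 1 < splits.Length)
{
    string id = splits[i];
    // increment for data string
    i++;
    // perhaps use a replace-first-occurrence function instead,
    // since Replace changes every empty anchor in this chunk
    sb.Append(splits[i].Replace(emptyAnchor, GetDataFromID(id)));
    i++;
}
string output = sb.ToString();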

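It turns out that Regex.Replace is already smart enough to handle multiple matches. I just changed my regular expression so that it no longer uses a lookahead. The idea is to find the number between the %% delimiters and capture it in a group, find the content of the next anchor tag and capture it in another group, then replace the whole match with a new version that inserts the text captured by both groups. The Replace method appears to handle the subsequent matches correctly on its own, without any extra help.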

// requires: using System.Diagnostics; using System.Text.RegularExpressions;
string originalText = "<h3>%%1%%<a href=\"#\">First Spot</a></h3><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>" +
                            "<h3>%%2%%<a href=\"#\">Second Spot</a></h3><p>Ut vulputate lobortis feugiat.</p>" +
                            "<p>Ut nunc diam, malesuada iaculis viverra nec, auctor eget velit.</p>";
Regex regex = new Regex(@"%%(\d+)%%[\s]*<a[\s\S]*?>([\s\S]*?)</a>");
string result = regex.Replace(originalText, "<a href=\"#\" onclick=\"DoSomething($1)\">$2</a>");
Debug.WriteLine("Original Text: \"" + originalText + "\"");
Debug.WriteLine("Result Text: \"" + result + "\"");

Output:

Original Text: "<h3>%%1%%<a href="#">First Spot</a></h3><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p><h3>%%2%%<a href="#">Second Spot</a></h3><p>Ut vulputate lobortis feugiat.</p><p>Ut nunc diam, malesuada iaculis viverra nec, auctor eget velit.</p>"
Result Text: "<h3><a href="#" onclick="DoSomething(1)">First Spot</a></h3><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p><h3><a href="#" onclick="DoSomething(2)">Second Spot</a></h3><p>Ut vulputate lobortis feugiat.</p><p>Ut nunc diam, malesuada iaculis viverra nec, auctor eget velit.</p>"
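The result above passes the ID unquoted and hard-codes href="#", whereas the question asked for DoSomething('1'). A possible variation, sketched here under the assumption that the anchor's existing attributes should be kept as they are, uses named groups and the ${name} substitution syntax. The class name LinkRewriter and the group names id, attrs and text are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LinkRewriter
{
    // id:    the digits between the %% delimiters
    // attrs: everything between "<a" and the closing '>' of the opening tag
    // text:  the anchor's inner content (may be empty), matched lazily
    private static readonly Regex Pattern = new Regex(
        @"%%(?<id>\d+)%%\s*<a\b(?<attrs>[^>]*)>(?<text>[\s\S]*?)</a>");

    public static string Rewrite(string input)
    {
        // ${name} inserts the text captured by the named group.
        return Pattern.Replace(
            input,
            "<a${attrs} onclick=\"DoSomething('${id}')\">${text}</a>");
    }
}
```

Like the version above, this relies on Regex.Replace visiting every match, so no loop is needed. It is still a regex over HTML, so it would likely misbehave on anchors whose attribute values contain a '>' character.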