Unable to find a substring in HTML after decoding/normalizing

Keywords: string, HTML, normalization, decoding | Updated: 2023-09-27 18:34:49

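I have an HTML fragment stored as a string "s". It is user-generated and may come from multiple sources, so I have no control over its character encoding and the like.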

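I also have a simple string, "comparison", and I need to check whether comparison occurs as a substring of "s". "comparison" does not contain any HTML tags or encoding.

I am decoding, normalizing, and using a regular expression to strip the HTML tags, but I still cannot find the substring, even though I know it is there...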

    string s = "<p>this is my string.</p><p>my string is html with tags and <a href=&quot;someurl&quot;>links</a>&nbsp;and&nbsp;encoding.</p><p>i want to&nbsp;find&nbsp;a&nbsp;substring but my comparison might not have tags &amp; encoding.";
    string comparison = "i want to find a substring";
    string decode = HttpUtility.HtmlDecode(s);
    string tagsreplaced = Regex.Replace(decode, "<.*?>", " ");
    string normalized = tagsreplaced.Normalize();

    Literal1.Text = normalized;
    if (normalized.IndexOf(comparison) != -1)
    {
        Label1.Text = "substring found";
    }
    else
    {
        Label1.Text = "substring not found";
    }

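This returns "substring not found". By viewing the page source I can see that the string sent to the Literal definitely contains the comparison string exactly as provided, so why is it not found?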

Is there another way to achieve this?


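The answer is that HTML entity decoding turns your &nbsp; into a non-breaking space (U+00A0, encoded in UTF-8 as the bytes 0xC2 0xA0), which is not the ordinary space character ' ' (0x20). You can verify this with the following program: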

using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Web;
namespace TestStuff
{
    class Program
    {
        static void Main(string[] args)
        {
            string s = "<p>this is my string.</p><p>my string is html with tags and <a href=&quot;someurl&quot;>links</a>&nbsp;and&nbsp;encoding.</p><p>i want to&nbsp;find&nbsp;a&nbsp;substring but my comparison might not have tags &amp; encoding.";
            // Narrow the input to just the part that should match, to keep the byte dump short.
            s = "i want to&nbsp;find&nbsp;a&nbsp;substring";
            string comparison = "i want to find a substring";
            string decode = HttpUtility.HtmlDecode(s);
            string tagsreplaced = Regex.Replace(decode, "<.*?>", " ");
            string normalized = tagsreplaced.Normalize();
            Console.WriteLine("Dumping first string");
            Console.WriteLine(normalized);
            Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(normalized)));
            Console.WriteLine("Dumping second string");
            Console.WriteLine(comparison);
            Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(comparison)));
            if (normalized.IndexOf(comparison) != -1)
                Console.WriteLine("substring found");
            else
                Console.WriteLine("substring not found");
            Console.ReadLine();
            return;
        }
    }
}

It dumps the UTF-8 encoding of both strings for you. You will see this output:

Dumping first string
i want to find a substring
69-20-77-61-6E-74-20-74-6F-C2-A0-66-69-6E-64-C2-A0-61-C2-A0-73-75-62-73-74-72-69-6E-67
Dumping second string
i want to find a substring
69-20-77-61-6E-74-20-74-6F-20-66-69-6E-64-20-61-20-73-75-62-73-74-72-69-6E-67
substring not found

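You can see that the byte arrays do not match, so the strings are not equal, and .IndexOf() is right to tell you that nothing was found. The call to Normalize() does not help either, because it defaults to Unicode normalization form C, which leaves U+00A0 unchanged.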

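So the problem lies in the HTML itself: it contains non-breaking space characters, which are not decoded into ordinary spaces. You can hack around it by using String.Replace() to replace the non-breaking space ("\u00A0") in the string with an ordinary space (" ").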
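
As a rough illustration of that workaround, here is a minimal sketch. The class and method names (HtmlSubstringSearch, ToPlainText, ContainsText) are made up for this example, and it uses System.Net.WebUtility.HtmlDecode in place of HttpUtility.HtmlDecode only so that it runs without a reference to System.Web. The final whitespace-collapsing step is an extra assumption beyond the plain Replace() call, added in case the source HTML mixes several kinds of whitespace.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlSubstringSearch
{
    // Hypothetical helper: reduce an HTML fragment to plain text that
    // contains only ordinary spaces between words.
    public static string ToPlainText(string html)
    {
        // Entity decoding turns &nbsp; into U+00A0, not into ' ' (0x20).
        string decoded = WebUtility.HtmlDecode(html);

        // Same tag stripping as in the question.
        string noTags = Regex.Replace(decoded, "<.*?>", " ");

        // The fix: map the non-breaking space to an ordinary space.
        string spaced = noTags.Replace('\u00A0', ' ');

        // Optional extra step (an assumption, not part of the original fix):
        // collapse runs of whitespace into a single space and trim the ends.
        return Regex.Replace(spaced, @"\s+", " ").Trim();
    }

    // Hypothetical helper: ordinal substring test on the cleaned-up text.
    public static bool ContainsText(string html, string comparison)
    {
        return ToPlainText(html).IndexOf(comparison, StringComparison.Ordinal) != -1;
    }

    public static void Main()
    {
        string s = "<p>this is my string.</p><p>i want to&nbsp;find&nbsp;a&nbsp;substring but my comparison might not have tags &amp; encoding.";
        string comparison = "i want to find a substring";

        Console.WriteLine(ContainsText(s, comparison)
            ? "substring found"
            : "substring not found");
    }
}
```

With the non-breaking spaces mapped to ordinary spaces, this sketch should report "substring found" for the sample input. If the comparison string itself can contain irregular whitespace, it would probably need the same clean-up before the search.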