Fast way to ensure a string does not contain specific characters

Updated: 2024-10-25 05:19:17

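I want to make sure that a C# string does not contain specific characters.

I am currently using string.IndexOfAny(char[]), and it seems to me that a regular expression would be slower for this task. Is there a better way to do this? Speed is critical in my application.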

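I ran a quick benchmark of IndexOf vs IndexOfAny vs Regex vs HashSet.
The haystack was 500 words of lorem ipsum, and the needle set contained two characters.
Three cases were tested: both needles in the haystack, one needle in the haystack, and neither needle in the haystack.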

    // Requires: using System.Collections.Generic; using System.Linq;
    //           using System.Text.RegularExpressions;

    // Times 1,000,000 IndexOfAny calls over the whole needle set.
    private long TestIndexOfAny(string haystack, char[] needles)
    {
        System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();
        for (int i = 0; i < 1000000; i++)
        {
            int x = haystack.IndexOfAny(needles);
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
    // Times 1,000,000 regex matches; building the Regex is included in the timing.
    private long TestRegex(string haystack, char[] needles)
    {
        System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();
        // Escape each needle so that characters such as '.' or '|' are matched literally.
        Regex regex = new Regex(string.Join("|", needles.Select(n => Regex.Escape(n.ToString()))));
        for (int i = 0; i < 1000000; i++)
        {
            Match m = regex.Match(haystack);
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
    // Times 1,000,000 IndexOf calls; only the first needle is searched for.
    private long TestIndexOf(string haystack, char[] needles)
    {
        System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();
        for (int i = 0; i < 1000000; i++)
        {
            int x = haystack.IndexOf(needles[0]);
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
    // Times 1,000,000 LINQ Any() scans against a HashSet built outside the timed region.
    private long TestHashset(string haystack, char[] needles)
    {
        HashSet<char> specificChars = new HashSet<char>(needles);
        System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();
        for (int i = 0; i < 1000000; i++)
        {
            bool notContainsSpecificChars = !haystack.Any(specificChars.Contains);
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

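Results for 1,000,000 iterations, in milliseconds (three figures per method, one for each of the test cases described above):

IndexOf: 28/2718/2711
IndexOfAny: 153/141/17561
Regex: 1068/1102/92324
HashSet: 939/891/111702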

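Notes:

A smaller haystack improves performance.
A larger needle set improves Regex performance.
A larger needle set reduces IndexOfAny performance.
If the needle is not in the haystack, all methods degrade in performance.

Overall, Regex is up to 10 times slower than IndexOfAny, depending on the haystack and needle sizes.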

You can use this concise and efficient LINQ query:

HashSet<char> specificChars = new HashSet<char>{ 'a', 'b', 'c'};
bool notContainsSpecificChars = !"test".Any(specificChars.Contains); // true

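I used a HashSet<char> because it is efficient for lookups and does not allow duplicates.

If you have an array as input, you can use the constructor to create the HashSet from it: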

char[] chars = new[] { 'a', 'b', 'c', 'c' };
specificChars = new HashSet<char>(chars); // c is removed since it was a duplicate

Another approach, without a HashSet, is to use Enumerable.Intersect + Enumerable.Any:

bool notContainsSpecificChars = !"test".Intersect(chars).Any();
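
As a minimal sketch, the two LINQ checks above can be wrapped in reusable methods. The class name CharSetCheck and the method names are hypothetical, chosen only for illustration; the calls themselves are the same HashSet<char>.Contains, Enumerable.Any and Enumerable.Intersect used above.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class, for illustration only.
public static class CharSetCheck
{
    // True when no character of 'text' is in the 'forbidden' set.
    // Any() stops at the first forbidden character it finds.
    public static bool ContainsNone(string text, HashSet<char> forbidden)
    {
        return !text.Any(forbidden.Contains);
    }

    // Same check via Intersect, for callers that only have a sequence of characters.
    public static bool ContainsNoneViaIntersect(string text, IEnumerable<char> forbidden)
    {
        return !text.Intersect(forbidden).Any();
    }
}
```

Building the HashSet once and passing it in keeps set construction out of the hot path, which matters if the check runs many times.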

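If you only have to look for a single character, it is best to call IndexOf(singleChar) or IndexOf(singleChar, startIndex, charCount).

Of course, a regular expression is far more expensive to evaluate!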
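
A short sketch of the single-character case follows, using the two String.IndexOf overloads mentioned above. The class and method names are hypothetical; a negative return value from IndexOf means the character was not found.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class SingleCharCheck
{
    // True when 'forbidden' does not occur anywhere in 'text'.
    public static bool LacksChar(string text, char forbidden)
    {
        return text.IndexOf(forbidden) < 0;
    }

    // True when 'forbidden' does not occur within the 'count' characters
    // of 'text' that start at 'startIndex'.
    public static bool LacksCharInRange(string text, char forbidden, int startIndex, int count)
    {
        return text.IndexOf(forbidden, startIndex, count) < 0;
    }
}
```

The ranged overload is useful when only part of the string needs validating, since it avoids creating a substring first.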

String.IndexOfAny(char[]) is implemented in the CLR itself, and String.IndexOf uses an extern call, so both are very fast. Both are much faster than using a regular expression.

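Whether IndexOf is better than IndexOfAny depends on how many characters you need to check for. Based on some very rough benchmarks, IndexOf appears to perform (slightly) better for 2 or fewer characters, while IndexOfAny performs better for 3 or more. The difference is small, however, and the advantage of IndexOfAny may be swamped by the cost of allocating the character array.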
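
One way to sidestep the array allocation cost mentioned above is to allocate the character array once and reuse it. The sketch below assumes the forbidden set is fixed and known up front; the class name and the particular characters ('<', '>', '&') are placeholders, not taken from the question.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class ForbiddenChars
{
    // Allocated once when the class is first used and shared by every call.
    private static readonly char[] Forbidden = { '<', '>', '&' };

    // True when 'text' contains none of the forbidden characters.
    public static bool IsClean(string text)
    {
        return text.IndexOfAny(Forbidden) < 0;
    }
}
```

Usage is a single call such as ForbiddenChars.IsClean(input); no array is created per call.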