Remove everything except letters from a string

Keywords: string removal | Last updated: 2023-09-27 18:29:06

I want to remove every character that is not a letter from a given string, and do it efficiently. Any suggestions?


var result = str.Where(c => char.IsLetter(c));

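I was intrigued by @KirillPolishchuk's answer, so I ran a small benchmark in LINQPad with a randomly built string. The full code is below (I had to change my original code slightly, because it returned an IEnumerable rather than a string):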

void Main()
{
    TimeSpan elapsed;
    string result;
    elapsed = TheLINQWay(buildString(1000000), out result);
    Console.WriteLine("LINQ way: {0}", elapsed);
    elapsed = TheRegExWay(buildString(1000000), out result);
    Console.WriteLine("RegEx way: {0}", elapsed);
}
TimeSpan TheRegExWay(string s, out string result)
{
    Stopwatch stopw = new Stopwatch();
    stopw.Start();
    result = Regex.Replace(s, @"\P{L}", string.Empty);
    stopw.Stop();
    return stopw.Elapsed;
}
TimeSpan TheLINQWay(string s, out string result)
{
    Stopwatch stopw = new Stopwatch();
    stopw.Start();
    result = new string(s.Where(c => char.IsLetter(c)).ToArray());
    stopw.Stop();
    return stopw.Elapsed;
}
string buildString(int len)
{
    byte[] buffer = new byte[len];
    Random r = new Random((int)DateTime.Now.Ticks);
    for(int i = 0; i < len; i++)
        buffer[i] = (byte)r.Next(256);
    // Note: Encoding.ASCII decodes every byte above 127 as '?', so roughly half of the string is '?'.
    return Encoding.ASCII.GetString(buffer);
}

The results:

LINQ way: 00:00:00.0150030
RegEx way: 00:00:00.2788130

One more thing is worth saying, though: as Servy pointed out in the comments, the regex approach is faster with shorter strings.

Use:

var result = Regex.Replace(input, @"\P{L}", string.Empty);
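
For context, \P{L} matches any character that is not in the Unicode letter category, so accented and non-Latin letters survive the replacement. A minimal sketch of that behaviour follows; the class name RegexLetters, the method name KeepLetters, and the cached Regex field are illustrative choices, not part of the original answer.

```csharp
using System.Text.RegularExpressions;

public static class RegexLetters
{
    // \P{L} = any character NOT in the Unicode "Letter" category.
    // The Regex instance is cached so the pattern is parsed only once.
    private static readonly Regex NonLetters = new Regex(@"\P{L}");

    public static string KeepLetters(string input)
    {
        return NonLetters.Replace(input, string.Empty);
    }
}
```

With this sketch, RegexLetters.KeepLetters("ABCD 13 ~") returns "ABCD", and letters outside ASCII (for example U+00E9, 'é') are kept as well.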

The most efficient way I can think of:

string input = "ABCD 13 ~";
// In the worst case every character is a letter, so size the buffer for the full input.
char[] output = new char[input.Length];
int numberOfAlphabeticals = 0;
for (int i = 0; i < input.Length; i++)
{
    char character = input[i];
    // ASCII letters only. Compare the char itself: a (byte) cast would truncate
    // code points above 255 and let some non-ASCII characters through.
    if ((character >= 'A' && character <= 'Z') || (character >= 'a' && character <= 'z'))
    {
        output[numberOfAlphabeticals] = character;
        ++numberOfAlphabeticals;
    }
}
string outputAsString = new string(output, 0, numberOfAlphabeticals);

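I think the fastest way (performance-wise) is to create a lookup array covering character codes 0 to 122, convert the given string to a byte array, and use a StringBuilder to build a new string with the unwanted characters left out: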

private static char[] alphabet = {
    // codes 0-64: not letters
    '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0',
    '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0',
    '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0',
    '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0',
    '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0',
    // codes 65-90: upper-case letters
    'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
    'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
    // codes 91-96: not letters
    '\0', '\0', '\0', '\0', '\0', '\0',
    // codes 97-122: lower-case letters
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
    'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
};

Here is the removal function (I have not compiled it, but it should give you the idea):

string RemoveNonAlpha(string value)
{
    byte[] asciiBytes = Encoding.ASCII.GetBytes(value);
    StringBuilder sb = new StringBuilder();
    for(int i = 0; i < asciiBytes.Length; i++)
    {
        if((asciiBytes[i] >= 65 && asciiBytes[i] <= 90) || (asciiBytes[i] >= 97 && asciiBytes[i] <= 122))
        {
            sb.Append(alphabet[asciiBytes[i]]);
        }
    }
    return sb.ToString();
}

Update

Based on Nikola's answer, here is an improved version:

private static string RemoveNonAlpha(string value)
{
    char[] output = new char[value.Length];
    int numAlpha = 0;
    for (int i = 0; i < value.Length; i++)
    {
        char c = value[i];
        // ASCII letters only; comparing the char avoids truncating non-ASCII code points.
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
        {
            output[numAlpha] = c;
            numAlpha++;
        }
    }
    return new string(output, 0, numAlpha);
}

Here are the results compared with the LINQ approach:

The LINQ way 100: 6.7935
The fast way 100: 0.4648
The LINQ way 1000: 0.0442
The fast way 1000: 0.0134
The LINQ way 10000: 0.2078
The fast way 10000: 0.143
The LINQ way 100000: 2.0617
The fast way 100000: 1.3864
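
Keep in mind that the two approaches timed above are not strictly equivalent: the ASCII range check keeps only A-Z and a-z, while char.IsLetter also accepts letters outside ASCII. A small sketch of the difference, using hypothetical helper names (LetterFilters, KeepAsciiLetters, KeepUnicodeLetters) and comparing char values directly rather than casting to byte:

```csharp
using System.Linq;

public static class LetterFilters
{
    // ASCII-only: keeps A-Z and a-z. Comparing the char itself avoids the
    // truncation a (byte) cast would cause for code points above 255.
    public static string KeepAsciiLetters(string value)
    {
        char[] output = new char[value.Length];
        int count = 0;
        foreach (char c in value)
        {
            if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
            {
                output[count++] = c;
            }
        }
        return new string(output, 0, count);
    }

    // Unicode-aware: keeps every character that char.IsLetter accepts.
    public static string KeepUnicodeLetters(string value)
    {
        return new string(value.Where(c => char.IsLetter(c)).ToArray());
    }
}
```

For the input "Caf\u00E9 42", KeepAsciiLetters returns "Caf" while KeepUnicodeLetters returns "Caf\u00E9". Which behaviour is wanted depends on what "letter" means for your data.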

Use

[^\w]

as the pattern passed to the regex Replace method.

http://msdn.microsoft.com/en-us/library/xwewhkd1.aspx
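
A caveat on the \w-based suggestion: in .NET, \w matches digits and the underscore as well as letters, so a negated \w class strips less than \P{L} does. A short sketch of the difference (the class and method names are made up for illustration):

```csharp
using System.Text.RegularExpressions;

public static class WordClassDemo
{
    // [^\w] removes everything that is not a "word" character;
    // digits and '_' count as word characters, so they are kept.
    public static string StripNonWord(string input)
    {
        return Regex.Replace(input, @"[^\w]", string.Empty);
    }

    // \P{L} removes everything that is not a Unicode letter.
    public static string StripNonLetters(string input)
    {
        return Regex.Replace(input, @"\P{L}", string.Empty);
    }
}
```

For the input "ab_12 cd!", StripNonWord returns "ab_12cd" and StripNonLetters returns "abcd".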