UTF-16 safe substring in C# .NET

Keywords: safe string, UTF-16, .NET | Updated: 2023-09-27 18:07:20

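I want to get a substring of a given length, say 150. However, I want to make sure I don't cut the string in the middle of a Unicode character.

For example, consider the following code: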

var str = "Hello😀 world!";
var substr = str.Substring(0, 6);

Here substr is an invalid string, because the smiley character is cut in half.
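To make the problem concrete, here is a minimal sketch that inspects the result of the naive call. It assumes the smiley is U+1F600, written as an escaped surrogate pair so the UTF-16 code units are visible, and it uses C# top-level statements.

```csharp
using System;

// Assumption: the smiley is U+1F600, i.e. the surrogate pair D83D DE00.
string str = "Hello\uD83D\uDE00 world!";
string substr = str.Substring(0, 6);

// The sixth code unit is the high surrogate of the emoji; its low surrogate was cut off.
char last = substr[substr.Length - 1];
bool endsWithLoneHighSurrogate = char.IsHighSurrogate(last);

Console.WriteLine(substr.Length);             // 6
Console.WriteLine(endsWithLoneHighSurrogate); // True
```

Substring counts UTF-16 code units, not characters, so the cut lands between the two halves of the pair.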

Instead, I want a function that does the following:

var str = "Hello😀 world!";
var substr = str.UnicodeSafeSubstring(0, 6);

where substr contains "Hello😀".

For reference, here is how I would do it in Objective-C using rangeOfComposedCharacterSequencesForRange:

NSString* str = @"Hello😀 world!";
NSRange range = [str rangeOfComposedCharacterSequencesForRange:NSMakeRange(0, 6)];
NSString* substr = [str substringWithRange:range];

What is the equivalent code in C#?


It looks like you want to split the string on graphemes, that is, on single displayed characters.

In that case there is a handy method, StringInfo.SubstringByTextElements:

var str = "Hello😀 world!";
var substr = new StringInfo(str).SubstringByTextElements(0, 6);
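As a rough illustration of what that call measures, the sketch below compares the number of text elements requested with the number of UTF-16 code units returned. It assumes the smiley is U+1F600 (escaped as a surrogate pair) and uses top-level statements.

```csharp
using System;
using System.Globalization;

// Assumption: the smiley is U+1F600, i.e. the surrogate pair D83D DE00.
string str = "Hello\uD83D\uDE00 world!";
var info = new StringInfo(str);

// Six text elements: 'H', 'e', 'l', 'l', 'o' and the emoji.
string substr = info.SubstringByTextElements(0, 6);

Console.WriteLine(substr == "Hello\uD83D\uDE00"); // True
Console.WriteLine(substr.Length);                 // 7 (UTF-16 code units)
```

The length argument is a count of text elements, so the returned string can hold more code units than the number you passed in.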

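Note that this is probably not what you asked for: SubstringByTextElements counts graphemes, whereas you seem to want to measure the length in UTF-16 code units (or you may want to include the last grapheme even if that makes the result longer than the length parameter).

The following version returns the longest substring of "complete" graphemes that starts at index startIndex and is at most length code units long. Initial or final "split" surrogate pairs are dropped, initial combining marks are dropped, and a final character that would lose its combining marks is dropped: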

using System;
using System.Globalization;
using System.Text;

public static class StringEx
{
    public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
    {
        if (str == null)
        {
            throw new ArgumentNullException("str");
        }
        if (startIndex < 0 || startIndex > str.Length)
        {
            throw new ArgumentOutOfRangeException("startIndex");
        }
        if (length < 0)
        {
            throw new ArgumentOutOfRangeException("length");
        }
        if (startIndex + length > str.Length)
        {
            throw new ArgumentOutOfRangeException("length");
        }
        if (length == 0)
        {
            return string.Empty;
        }
        var sb = new StringBuilder(length);
        int end = startIndex + length;
        var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);
        while (enumerator.MoveNext())
        {
            string grapheme = enumerator.GetTextElement();
            startIndex += grapheme.Length;
            // Compare against the absolute end index, not the raw length,
            // so that a non-zero startIndex is handled correctly.
            if (startIndex > end)
            {
                break;
            }
            // Skip initial Low Surrogates/Combining Marks
            if (sb.Length == 0)
            {
                if (char.IsLowSurrogate(grapheme[0]))
                {
                    continue;
                }
                UnicodeCategory cat = char.GetUnicodeCategory(grapheme, 0);
                if (cat == UnicodeCategory.NonSpacingMark || cat == UnicodeCategory.SpacingCombiningMark || cat == UnicodeCategory.EnclosingMark)
                {
                    continue;
                }
            }
            sb.Append(grapheme);
            if (startIndex == end)
            {
                break;
            }
        }
        return sb.ToString();
    }
}

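This variant instead includes "extra" characters at the end of the substring, but only when they are needed to keep a grapheme whole: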

using System;
using System.Globalization;
using System.Text;

public static class StringEx
{
    public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
    {
        if (str == null)
        {
            throw new ArgumentNullException("str");
        }
        if (startIndex < 0 || startIndex > str.Length)
        {
            throw new ArgumentOutOfRangeException("startIndex");
        }
        if (length < 0)
        {
            throw new ArgumentOutOfRangeException("length");
        }
        if (startIndex + length > str.Length)
        {
            throw new ArgumentOutOfRangeException("length");
        }
        if (length == 0)
        {
            return string.Empty;
        }
        var sb = new StringBuilder(length);
        int end = startIndex + length;
        var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);
        while (enumerator.MoveNext())
        {
            // Compare against the absolute end index so a non-zero startIndex works.
            if (startIndex >= end)
            {
                break;
            }
            string grapheme = enumerator.GetTextElement();
            startIndex += grapheme.Length;
            // Skip initial Low Surrogates/Combining Marks
            if (sb.Length == 0)
            {
                if (char.IsLowSurrogate(grapheme[0]))
                {
                    continue;
                }
                UnicodeCategory cat = char.GetUnicodeCategory(grapheme, 0);
                if (cat == UnicodeCategory.NonSpacingMark || cat == UnicodeCategory.SpacingCombiningMark || cat == UnicodeCategory.EnclosingMark)
                {
                    continue;
                }
            }
            sb.Append(grapheme);
        }
        return sb.ToString();
    }
}

This returns what you asked for: "Hello😀 world!".UnicodeSafeSubstring(0, 6) == "Hello😀".

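Note: it is worth pointing out that both solutions rely on StringInfo.GetTextElementEnumerator. This method did not work as expected until it was fixed in .NET 5, so on earlier versions of .NET it will split the more complex multi-character emoji.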
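If you are unsure how your runtime behaves, a quick check along these lines may help. It is only a sketch: the family emoji used here (three emoji joined by zero-width joiners) is my own choice of test input, and the reported count depends on the .NET version you run it on.

```csharp
using System;
using System.Globalization;

// Test input (an assumption for illustration): man + ZWJ + woman + ZWJ + girl.
string family = "\U0001F468\u200D\U0001F469\u200D\U0001F467";

int codeUnits = family.Length; // 8 UTF-16 code units
int textElements = new StringInfo(family).LengthInTextElements;

// On .NET 5 or later this is expected to be 1; older runtimes may report more,
// which means the enumerator would split the sequence.
Console.WriteLine($"{codeUnits} code units, {textElements} text element(s)");
```

If the count is greater than 1, both UnicodeSafeSubstring versions can cut inside such a sequence on that runtime.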

Here is a simple implementation for truncation (startIndex = 0):

string truncatedStr = (str.Length > maxLength)
    ? str.Substring(0, maxLength - (char.IsLowSurrogate(str[maxLength]) ? 1 : 0))
    : str;
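The same idea can be wrapped in a small helper with argument checks. This is a sketch, not a library API: the name TruncateUtf16 is hypothetical, and like the one-liner it only protects surrogate pairs, not combining marks or longer emoji sequences. The smiley is again assumed to be U+1F600.

```csharp
using System;

// Hypothetical helper; the name is illustrative only.
static string TruncateUtf16(string str, int maxLength)
{
    if (str == null) throw new ArgumentNullException(nameof(str));
    if (maxLength < 0) throw new ArgumentOutOfRangeException(nameof(maxLength));
    if (str.Length <= maxLength) return str;

    int end = maxLength;
    // If the cut would land between a high and a low surrogate, back up one code unit.
    if (end > 0 && char.IsLowSurrogate(str[end]) && char.IsHighSurrogate(str[end - 1]))
    {
        end--;
    }
    return str.Substring(0, end);
}

string sample = "Hello\uD83D\uDE00 world!";
Console.WriteLine(TruncateUtf16(sample, 6) == "Hello");             // True
Console.WriteLine(TruncateUtf16(sample, 7) == "Hello\uD83D\uDE00"); // True
```

Checking the preceding code unit as well means a stray low surrogate at the cut point does not shorten the result, and maxLength = 0 cannot produce a negative length.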