Why does T-SQL treat "ｓｏｆｉａ" the same as "sofia"? What is this string encoding?

Updated: 2023-09-27 17:50:24

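I ran into a situation where SQL Server can store "ｓｏｆｉａ" and "sofia" as two different strings, but when they are compared in T-SQL they are treated as the same no matter which COLLATE is used, even a binary COLLATE: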

CREATE TABLE #R (NAME NVARCHAR(255) COLLATE SQL_Latin1_General_CP1_CI_AS)
INSERT INTO #R VALUES (N'sofia')
INSERT INTO #R VALUES (N'ｓｏｆｉａ')
SELECT * FROM #R WHERE NAME = N'ｓｏｆｉａ'
sofia
ｓｏｆｉａ
(2 row(s) affected)
IF 'ｓｏｆｉａ' = 'sofia'  COLLATE SQL_Latin1_General_CP1_CI_AS 
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
-------------------
Values are the same
(1 row(s) affected)
IF 'ｓｏｆｉａ' = 'sofia'  COLLATE SQL_Latin1_General_CP437_BIN
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
-------------------
Values are the same
(1 row(s) affected)
I tried to find out the encoding of "ｓｏｆｉａ" using the approach from:
http://stackoverflow.com/questions/1025332/determine-a-strings-encoding-in-c-sharp
Its code comments say:
            // If all else fails, the encoding is probably (though certainly not
            // definitely) the user's local codepage! One might present to the user a
            // list of alternative encodings as shown here: http://stackoverflow.com/questions/8509339/what-is-the-most-common-encoding-of-each-language
            // A full list can be found using Encoding.GetEncodings();
I iterated through all the encodings returned from Encoding.GetEncodings(); none of them matched.
Looking into the binary, I found an interesting fact: "ｓｏｆｉａ" itself is encoded as UTF-16, but it can be generated from the UTF-16 bytes of "SOFIA" by setting the second byte of each character to 255 (all bits 1) instead of 0 (for example, for 'S': 83 255 vs. 83 0). It is nevertheless displayed as lower case. In C#, the bytes are:
"ｓｏｆｉａ"
    [0] 83     [1] 255
    [2] 79     [3] 255
    [4] 70     [5] 255
    [6] 73     [7] 255
    [8] 65     [9] 255
"SOFIA"
    [0] 83     [1] 0
    [2] 79     [3] 0
    [4] 70     [5] 0
    [6] 73     [7] 0
    [8] 65     [9] 0
"sofia"
    [0] 115    [1] 0
    [2] 111    [3] 0
    [4] 102    [5] 0
    [6] 105    [7] 0
    [8] 97     [9] 0
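The byte pattern above can be reproduced outside SQL Server. The following is a minimal Python sketch (Python is used purely for illustration, and the helper names to_fullwidth and utf16le_bytes are hypothetical). It shows that each character of "ｓｏｆｉａ" is a code point from the Unicode "Halfwidth and Fullwidth Forms" block, equal to the corresponding ASCII code point plus 0xFEE0. Since 's' is 0x73 and 0x73 + 0xFEE0 = 0xFF53, the low byte is 0x53 (83), which happens to also be the ASCII code of 'S'; that coincidence accounts for the "83 255" pair.

```python
import unicodedata

# Fullwidth forms U+FF01..U+FF5E mirror printable ASCII U+0021..U+007E.
FULLWIDTH_OFFSET = 0xFEE0


def to_fullwidth(text: str) -> str:
    """Map printable ASCII (except space) to its fullwidth counterpart."""
    return "".join(
        chr(ord(ch) + FULLWIDTH_OFFSET) if "!" <= ch <= "~" else ch
        for ch in text
    )


def utf16le_bytes(text: str) -> list:
    """Return the UTF-16 Little Endian bytes of text as a list of integers."""
    return list(text.encode("utf-16-le"))


if __name__ == "__main__":
    # Print code point, official Unicode name, and bytes for each character.
    for ch in "ｓｏｆｉａ":
        print(f"U+{ord(ch):04X}", unicodedata.name(ch), utf16le_bytes(ch))
    # The fullwidth string is just "sofia" shifted by the fixed offset.
    print(to_fullwidth("sofia") == "ｓｏｆｉａ")
```

So the string is ordinary UTF-16; it simply uses different code points (fullwidth Latin letters) rather than a different encoding.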
One can create two different directories or files with these names: C:\ｓｏｆｉａ and C:\sofia, or ｓｏｆｉａ.txt and sofia.txt.
Why does the SQL engine consider them the same while storing them with their original byte streams?
To get exactly the row I want, I had to convert to binary first:
SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'ｓｏｆｉａ')
ｓｏｆｉａ
(1 row(s) affected)
SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'sofia')
sofia
(1 row(s) affected)

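But this has many side effects, for example on culture and case handling. How can I teach the T-SQL engine that they are different without too much cost?

Is there a formal name for this kind of string encoding?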

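There are two issues here.

First: there is the collation issue. A collation defines how characters sort and when they compare as equal. As @Kazetsukai suggested, the specific collation property that helps here is width sensitivity. However, you cannot simply append _WS to any collation name and assume the result will be a valid collation. In fact, SQL_Latin1_General_CP1_CI_AS_WS is not a valid collation.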

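You can get the limited set of width-sensitive Latin collations via:
SELECT * FROM fn_helpcollations() WHERE [name] LIKE N'latin%[_]ws';
The results of that query indicate that the collation you probably want is Latin1_General_CI_AS_WS. Alternatively, any collation ending in _BIN2 will work (try not to use the ones ending in _BIN, since they are legacy and superseded by _BIN2, just as the collations starting with SQL_ are legacy).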

However, for some reason, even using these does not seem to work:

IF 'ｓｏｆｉａ' = 'sofia' COLLATE Latin1_General_CI_AS_WS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
IF 'ｓｏｆｉａ' = 'sofia' COLLATE Latin1_General_BIN2
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

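Both return "Values are the same". Which leads to:

Second: when dealing with NVARCHAR¹ data, you must prefix string literals with a capital N. Otherwise, the characters are first implicitly converted to their respective VARCHAR² characters (or to ? if no mapping is defined between the Unicode code point and a character in the code page specified by the collation of the field or operation).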

IF N'ｓｏｆｉａ' = N'sofia' COLLATE Latin1_General_CI_AS_WS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
IF N'ｓｏｆｉａ' = N'sofia' COLLATE Latin1_General_BIN2
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'

Prefixing these literals with N allows the expected behavior: both queries now return "Values are different".
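What happens to a literal without the N prefix can be modeled roughly outside SQL Server. The Python sketch below is only an approximation: it assumes that, for these characters, the code page conversion behaves like Unicode NFKC compatibility folding followed by encoding to code page 1252, which is not how SQL Server is actually implemented, and the function name to_varchar_like is hypothetical. It illustrates why the two non-prefixed literals are already identical before any collation is applied, and why unmappable characters turn into ?.

```python
import unicodedata


def to_varchar_like(text: str, code_page: str = "cp1252") -> bytes:
    """Approximate an NVARCHAR -> VARCHAR conversion (assumption, not SQL Server code).

    NFKC folding stands in for the mapping of fullwidth letters to ASCII;
    characters with no equivalent in the code page are replaced with b"?".
    """
    folded = unicodedata.normalize("NFKC", text)
    return folded.encode(code_page, errors="replace")


if __name__ == "__main__":
    print(to_varchar_like("ｓｏｆｉａ"))  # fullwidth letters collapse to ASCII
    print(to_varchar_like("sofia"))      # already ASCII, unchanged
    print(to_varchar_like("日本語"))      # no cp1252 equivalent for these
```

Once both literals have collapsed to the same bytes, no collation, binary or otherwise, can tell them apart, which matches the behavior seen in the non-prefixed queries.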

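¹ XML and the N-prefixed types store data as UTF-16 Little Endian. Default handling covers only UCS-2 / Basic Multilingual Plane (BMP) characters. However, if a collation ending in _SC is used, full UTF-16 with supplementary characters is handled correctly.

² The CHAR, VARCHAR, and TEXT types (but do not use the last one, since it is deprecated) use 8-bit, ASCII-extended code pages.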
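The distinction in footnote 1 between UCS-2/BMP handling and full UTF-16 comes down to how many 16-bit code units a character occupies. A small Python sketch (the helper name is hypothetical, and the emoji is just one example of a supplementary character):

```python
def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    print(utf16_code_units("ｓ"))   # BMP character: one code unit
    print(utf16_code_units("😀"))  # supplementary character: surrogate pair
```

The fullwidth letters in this question are all BMP characters, so an _SC collation is not needed to handle them.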

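I believe what you are looking for is the distinction between half-width and full-width characters. Depending on the collation the table uses, these are treated as either the same or different. In this case, the SQL_Latin1_General_CP1_CI_AS collation you are using is evidently width-insensitive.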

You can add width sensitivity with _WS, so changing the collation to SQL_Latin1_General_CP1_CI_AS_WS should treat them as unequal.

Edit: as @srutzky pointed out, you need to find a collation that includes _WS, rather than just appending _WS to the collation name.
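As a rough mental model of the width and case sensitivity discussed in both answers, the Python sketch below compares two strings with and without width and case folding. It is an illustration under stated assumptions, not SQL Server's comparison algorithm: NFKC normalization stands in for width insensitivity, str.casefold stands in for case insensitivity, and the function name and its parameters are hypothetical.

```python
import unicodedata


def collation_like_equal(a: str, b: str, *, case_sensitive: bool,
                         width_sensitive: bool) -> bool:
    """Compare two strings roughly the way a _CI/_CS, optional _WS collation would."""

    def key(text: str) -> str:
        if not width_sensitive:
            # Fold fullwidth/halfwidth forms (and other compatibility forms).
            text = unicodedata.normalize("NFKC", text)
        if not case_sensitive:
            text = text.casefold()
        return text

    return key(a) == key(b)


if __name__ == "__main__":
    full, half = "ｓｏｆｉａ", "sofia"
    # Width-insensitive, like Latin1_General_CI_AS: the strings match.
    print(collation_like_equal(full, half, case_sensitive=False, width_sensitive=False))
    # Width-sensitive, like Latin1_General_CI_AS_WS: the strings differ.
    print(collation_like_equal(full, half, case_sensitive=False, width_sensitive=True))
    # Case is still ignored when only width sensitivity is switched on.
    print(collation_like_equal("SOFIA", half, case_sensitive=False, width_sensitive=True))
```

This is why a width-sensitive, case-insensitive collation avoids the side effects of comparing raw binary values: it separates "ｓｏｆｉａ" from "sofia" while still treating "SOFIA" and "sofia" as equal.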