Trying to get libmecab.dll (MeCab) to work with C#

Keywords: MeCab, dll, libmecab | Updated: 2023-09-27 17:59:27

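I'm trying to use the Japanese morphological analyzer MeCab in a C# program (Visual Studio 2010 Express, Windows 7), but something is going wrong with the encoding. If my input (pasted into a text box) is this: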

一方、広義の「ネコ」は、ネコ類(ネコ科動物）の一部、あるいはその全ての獣を指す包括的名称を指す。

then my output (in another text box) looks like this:

？ågellicž*？ågellicž*？ågellicž*？ågellicž*？ågellicž*？ågellicž*？ågellicž*？ågellicž*？ågellicž*？ågellicž*？ågellicž*？ågellicž*？ågellicž*？ågellicž*？ågellicž**？ågellicž*？ågellicž*？ågellicž*？ågellicž*？ågellicž*)ågellicž*？ågellicž*？？？？？？？？？？？？？？？？？？？？？？？？？ågellicž*EOS

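I assume this is text in some other encoding being mistaken for UTF-8. But assuming it is EUC-JP and using Encoding.Convert to convert it to UTF-8 does not change the output, and assuming it is Shift JIS and doing the same produces different gibberish. Also, while MeCab is definitely processing the text (this is how MeCab output is supposed to be formatted), it does not appear to be interpreting the input as UTF-8 either. If it were, the output would not consist of all those identical lines, each starting with a single-character "compound" that it evidently could not identify.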

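When I run the same sentence through MeCab's command line, I get yet another, different-looking set of gibberish. But again, it is just a column of question marks and parentheses down the left side, so this is not merely a matter of the Windows command line lacking a font with Japanese characters; again, MeCab is simply not reading the input as UTF-8. (I did install MeCab in UTF-8 mode.)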

The relevant parts of the code look like this:

[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
private extern static IntPtr mecab_new2(string arg);
[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
[return: MarshalAs(UnmanagedType.AnsiBStr)]
private extern static string mecab_sparse_tostr(IntPtr m, string str);
[DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl)]
private extern static void mecab_destroy(IntPtr m);

private string meCabParse(string jpnText)
{
    IntPtr mecab = mecab_new2("");
    string parsedText = mecab_sparse_tostr(mecab, jpnText);
    mecab_destroy(mecab);
    return parsedText;
}

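(To fiddle with plausible-seeming things and see whether they make a difference, I tried switching "UnmanagedType.AnsiBStr" to "UnmanagedType.BStr", which produces the error "AccessViolationException was unhandled", and adding "CharSet = CharSet.Unicode" to the DllImport parameters, which turns the output into just "EOS".)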
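One plausible explanation for the bare "EOS" with CharSet.Unicode, offered as a guess rather than a confirmed diagnosis: that setting marshals the string as UTF-16, and the first character of the sample sentence, 一 (U+4E00), has 0x00 as its first UTF-16LE byte. A function expecting a NUL-terminated char* would then see an empty string. The sketch below only prints the bytes and does not call MeCab; the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class MarshalBytesDemo
{
    // Hex dump of a string's bytes in the given encoding.
    public static string Hex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        // First two characters of the sample sentence: U+4E00 U+65B9.
        string text = "\u4E00\u65B9";
        Console.WriteLine("UTF-16LE: " + Hex(text, Encoding.Unicode)); // 00-4E-B9-65
        Console.WriteLine("UTF-8:    " + Hex(text, Encoding.UTF8));    // E4-B8-80-E6-96-B9
    }
}
```

The leading 00 in the UTF-16LE dump is what a C string reader would treat as the end of the input.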

This is how I'm doing the conversion:

// 65001 = UTF-8 code page, 20932 = EUC-JP code page
private string convertEncoding(string sourceString, int sourceCodepage, int targetCodepage)
{
    Encoding sourceEncoding = Encoding.GetEncoding(sourceCodepage);
    Encoding targetEncoding = Encoding.GetEncoding(targetCodepage);
    // convert the source string to a byte array
    byte[] sourceBytes = sourceEncoding.GetBytes(sourceString);
    // convert those bytes to the target encoding
    byte[] targetBytes = Encoding.Convert(sourceEncoding, targetEncoding, sourceBytes);
    // byte array to char array
    char[] targetChars = new char[targetEncoding.GetCharCount(targetBytes, 0, targetBytes.Length)];
    // char array to target-encoded string
    targetEncoding.GetChars(targetBytes, 0, targetBytes.Length, targetChars, 0);
    string targetString = new string(targetChars);
    return targetString;
}

private string meCabParse(string jpnText)
{
    // convert the text in the string from UTF-8 to EUC-JP
    jpnText = convertEncoding(jpnText, 65001, 20932);
    IntPtr mecab = mecab_new2("");
    string parsedText = mecab_sparse_tostr(mecab, jpnText);
    // and convert back to UTF-8
    parsedText = convertEncoding(parsedText, 20932, 65001);
    mecab_destroy(mecab);
    return parsedText;
}
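A note on why a conversion like this may leave the output unchanged: a .NET string is always UTF-16 internally, so encoding it to bytes and then decoding those bytes with the matching encoding hands back an equal string. The sketch below illustrates this with a hypothetical RoundTrip helper that mirrors the steps of convertEncoding. It uses the built-in UTF-8 and UTF-16 encodings rather than code page 20932 so that it runs without extra encoding providers.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Mirrors convertEncoding: string -> source bytes -> target bytes -> string.
    public static string RoundTrip(string text, Encoding source, Encoding target)
    {
        byte[] sourceBytes = source.GetBytes(text);
        byte[] targetBytes = Encoding.Convert(source, target, sourceBytes);
        return target.GetString(targetBytes);
    }

    public static void Main()
    {
        string text = "\u30CD\u30B3"; // "ネコ"
        string result = RoundTrip(text, Encoding.UTF8, Encoding.Unicode);
        Console.WriteLine(result == text); // True
    }
}
```

To hand a native function bytes in a chosen encoding, the bytes themselves would have to cross the P/Invoke boundary, for example as a byte[] parameter, instead of being turned back into a string first.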

Suggestions/mockery?

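I stumbled upon this thread while looking for a way to do the same thing. I used your code as a starting point, along with this blog post to figure out how to marshal UTF-8 strings.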

The following code gives me correctly encoded output:

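using System;
using System.Runtime.InteropServices;
using System.Text;

public class Mecab
{
    // CharSet governs how string parameters are marshaled; the text to parse is
    // passed as raw bytes, so only the (empty) argument string is affected.
    [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Unicode)]
    private extern static IntPtr mecab_new2(string arg);
    [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Unicode)]
    private extern static IntPtr mecab_sparse_tostr(IntPtr m, byte[] str);
    [DllImport("libmecab.dll", CallingConvention = CallingConvention.Cdecl, CharSet = CharSet.Unicode)]
    private extern static void mecab_destroy(IntPtr m);

    public static String Parse(String input)
    {
        IntPtr mecab = mecab_new2("");
        // Pass UTF-8 bytes directly; the appended "\0" supplies the NUL
        // terminator that a C string needs.
        IntPtr nativeStr = mecab_sparse_tostr(mecab, Encoding.UTF8.GetBytes(input + "\0"));
        // Length up to the NUL, minus the final byte of the output.
        int size = nativeArraySize(nativeStr) - 1;
        byte[] data = new byte[size];
        // Copy the result out before destroying the MeCab instance.
        Marshal.Copy(nativeStr, data, 0, size);
        mecab_destroy(mecab);
        return Encoding.UTF8.GetString(data);
    }

    // Counts the bytes up to (not including) the terminating NUL.
    private static int nativeArraySize(IntPtr ptr)
    {
        int size = 0;
        while (Marshal.ReadByte(ptr, size) > 0)
            size++;
        return size;
    }
}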
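The two marshaling steps can also be tried in isolation, without libmecab.dll. The following is a minimal sketch under that assumption: Utf8Interop and its method names are hypothetical, and unmanaged memory from Marshal.AllocHGlobal stands in for a buffer that a native library would return.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

public static class Utf8Interop
{
    // Encode a managed string as UTF-8 with a trailing NUL, as a char* API expects.
    public static byte[] ToNullTerminatedUtf8(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        byte[] result = new byte[bytes.Length + 1]; // last element stays 0
        Buffer.BlockCopy(bytes, 0, result, 0, bytes.Length);
        return result;
    }

    // Decode a NUL-terminated UTF-8 buffer at ptr into a managed string.
    public static string FromNullTerminatedUtf8(IntPtr ptr)
    {
        if (ptr == IntPtr.Zero) return null;
        int length = 0;
        while (Marshal.ReadByte(ptr, length) != 0) length++;
        byte[] data = new byte[length];
        Marshal.Copy(ptr, data, 0, length);
        return Encoding.UTF8.GetString(data);
    }

    public static void Main()
    {
        // Simulate a native buffer with unmanaged memory instead of calling MeCab.
        byte[] utf8 = ToNullTerminatedUtf8("\u30CD\u30B3"); // "ネコ"
        IntPtr native = Marshal.AllocHGlobal(utf8.Length);
        try
        {
            Marshal.Copy(utf8, 0, native, utf8.Length);
            Console.WriteLine(FromNullTerminatedUtf8(native) == "\u30CD\u30B3"); // True
        }
        finally
        {
            Marshal.FreeHGlobal(native);
        }
    }
}
```

Keeping the encode and decode steps in small helpers like these makes it easier to check the byte handling separately from the P/Invoke declarations.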