LINQ: splitting a string into sentences while respecting quotation marks

Keywords: string, split, sentence, LINQ | Updated: 2023-09-27 18:32:18

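How can I split a text into sentences at periods, question marks, exclamation marks and so on? I want to get each sentence one at a time, except that punctuation inside quotation marks must not end a sentence.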

For example, split this:

Walked. Turned back. But why? And said "Hello world. Damn this string splitting things!" without a shame.

into this:

Walked. 
Turned back. 
But why? 
And said "Hello world. Damn this string splitting things!" without a shame.

I am using the following code:

 private List<String> FindSentencesWhichContainsWord(string text, string word)
        {
            string[] sentences = text.Split(new char[] { '.', '?', '!' }, StringSplitOptions.RemoveEmptyEntries);
            // Define the search terms. This list could also be dynamically populated at runtime.
            string[] wordsToMatch = { word };
            // Find sentences that contain all the terms in the wordsToMatch array.
            // Note that the number of terms to match is not specified at compile time.
            var sentenceQuery = from sentence in sentences
                                let w = sentence.Split(new char[] { '.', '?', '!', ' ', ';', ':', ',' },
                                                        StringSplitOptions.RemoveEmptyEntries)
                                where w.Distinct().Intersect(wordsToMatch).Count() == wordsToMatch.Count()
                                select sentence;
            // Execute the query. Note that you can explicitly type
            // the iteration variable here even though sentenceQuery
            // was implicitly typed. 
            List<String> rtn = new List<string>();
            foreach (string str in sentenceQuery)
            {
                rtn.Add(str);
            }
            return rtn;
        }

But it does not give the result I want; the quoted text is split as well:

Walked. 
Turned back. 
But why? 
And said "Hello world.
Damn this string splitting things!
" without a shame.


I think this problem can be solved in two steps:

  1. Use TextFieldParser to correctly identify the quoted words

    // Requires: using System.IO; using Microsoft.VisualBasic.FileIO;
    string str = "Walked. Turned back. But why? And said \"Hello world. Damn this string splitting things!\" without a shame.";
    string[] words = null;
    using (TextFieldParser parser = new TextFieldParser(new StringReader(str))){
        parser.Delimiters = new string[] { " " };
        parser.HasFieldsEnclosedInQuotes = true;
        words = parser.ReadFields();                
    }    
    
  2. Use the result of the previous step to build a new list of strings with the required special behavior.

    List<string> newWords = new List<string>();
    string accWord = "";
    foreach (string word in words) {
        if (word.Contains(" ")) //means this is multiple items
            accWord += (accWord.Length > 0 ? " " : "") + "\"" + word + "\"";
        else {
            accWord += (accWord.Length > 0 ? " " : "") + word;
            if (word.EndsWith(".") || word.EndsWith("!") || word.EndsWith("?")) {
                newWords.Add(accWord);
                accWord = "";
            }
        }
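    }
    // Flush any trailing words that did not end with sentence punctuation.
    if (accWord.Length > 0)
        newWords.Add(accWord);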
    

Resulting newWords:

[2016-01-28 08:29:48.534 UTC] Walked.
[2016-01-28 08:29:48.536 UTC] Turned back.
[2016-01-28 08:29:48.536 UTC] But why?
[2016-01-28 08:29:48.536 UTC] And said "Hello world. Damn this string splitting things!" without a shame.

If needed, you can simply wrap these two steps in a single method that returns a List<string>.
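A minimal sketch of such a wrapper, assuming the project references the Microsoft.VisualBasic assembly (where TextFieldParser lives) and that the input is a single line of text. The names QuoteAwareSplitter and SplitSentences are hypothetical. The sketch inherits the limitations of the two steps: a quoted section must be surrounded by spaces, and a quoted section containing only one word loses its quotation marks.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class QuoteAwareSplitter   // hypothetical name
{
    public static List<string> SplitSentences(string text)
    {
        // Step 1: tokenize on spaces, keeping "quoted text" as one field.
        string[] words;
        using (var parser = new TextFieldParser(new StringReader(text)))
        {
            parser.Delimiters = new[] { " " };
            parser.HasFieldsEnclosedInQuotes = true;
            // ReadFields returns null when there is nothing to read.
            words = parser.ReadFields() ?? new string[0];
        }

        // Step 2: accumulate words until one of them ends a sentence.
        var sentences = new List<string>();
        string acc = "";
        foreach (string word in words)
        {
            string sep = acc.Length > 0 ? " " : "";
            if (word.Contains(" "))
            {
                // A field containing spaces came from a quoted section; restore its quotes.
                acc += sep + "\"" + word + "\"";
            }
            else
            {
                acc += sep + word;
                if (word.EndsWith(".") || word.EndsWith("!") || word.EndsWith("?"))
                {
                    sentences.Add(acc);
                    acc = "";
                }
            }
        }
        // Keep trailing text that has no terminating punctuation.
        if (acc.Length > 0) sentences.Add(acc);
        return sentences;
    }
}
```

Calling QuoteAwareSplitter.SplitSentences(str) with the example string is intended to produce the same list as newWords above.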

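What you are looking for is called a "sentence splitter". This is not a trivial problem...

If you are interested in how to solve this kind of problem properly, I recommend the book Foundations of Statistical Natural Language Processing by Manning and Schütze.

To give you an idea of how complex this is, I will briefly describe the sentence splitter we use at Nubilosoft as part of our search component.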

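  • First, we split the text into paragraphs. This eliminates some obvious errors and makes the text we work on smaller. Most file formats, such as MS Word DOC(X) and HTML, already provide paragraph markers, so this is a good first step.
  • Next, we perform feature extraction on the text. Features include punctuation, some common abbreviations (such as "dr."), and some contextual information.
  • We determine candidate split points. Split points are punctuation marks and characters where the case changes. (People sometimes forget punctuation.)
  • Finally, we feed all of this to a perceptron neural network, which decides whether a given position is a "split" position.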

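All of this is trained and tested on a manually annotated corpus; I do not remember the exact number, but it was quite a lot of sentences.

Done this way, it is correct about 99% of the time, which is "good enough" for our purposes.

Note that licensing corpora is a very tricky business... In the past, I have found that the easiest way to get a working sentence splitter is simply to buy one that has already been trained.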
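To make the candidate split point and feature steps more concrete, here is a rough rule-based sketch. It is not the Nubilosoft implementation: a hand-written rule stands in for the trained perceptron, the abbreviation list is a tiny made-up sample, the name NaiveSentenceSplitter is hypothetical, and quotation marks are ignored entirely.

```csharp
using System;
using System.Collections.Generic;

public static class NaiveSentenceSplitter   // hypothetical name
{
    // Tiny illustrative abbreviation list; a real system would need far more.
    static readonly HashSet<string> Abbreviations =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "dr.", "mr.", "mrs.", "e.g.", "i.e." };

    public static List<string> Split(string paragraph)
    {
        var sentences = new List<string>();
        int start = 0;
        for (int i = 0; i < paragraph.Length; i++)
        {
            char c = paragraph[i];
            if (c != '.' && c != '?' && c != '!') continue;   // not a candidate split point

            // Feature 1: is the token ending at this mark a known abbreviation?
            int tokenStart = paragraph.LastIndexOf(' ', i) + 1;
            string token = paragraph.Substring(tokenStart, i + 1 - tokenStart);
            bool isAbbreviation = Abbreviations.Contains(token);

            // Feature 2: context, either end of text or whitespace plus an upper-case letter.
            bool atEnd = i == paragraph.Length - 1;
            bool nextStartsSentence = i + 2 < paragraph.Length
                && char.IsWhiteSpace(paragraph[i + 1])
                && char.IsUpper(paragraph[i + 2]);

            // Hand-written decision standing in for the trained classifier.
            if (!isAbbreviation && (atEnd || nextStartsSentence))
            {
                sentences.Add(paragraph.Substring(start, i + 1 - start).Trim());
                start = i + 1;
            }
        }
        // Keep trailing text that has no terminating punctuation.
        string rest = paragraph.Substring(start).Trim();
        if (rest.Length > 0) sentences.Add(rest);
        return sentences;
    }
}
```

For the input "Dr. Smith arrived. He sat down." this returns two sentences, "Dr. Smith arrived." and "He sat down.", because "Dr." is treated as an abbreviation. Rules like these break down quickly on real text, which is why the approach described above uses a trained classifier instead.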

I used TakeWhile: keep taking characters while the character is not a separator, or while it is inside quotes.

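// Requires: using System.Collections.Generic; using System.Linq;
var separator = new[] {'.', '?', '!'};
string str =
    @"Walked. Turned back. But why? And said ""Hello world. Damn this string splitting things!"" without a shame.";
List<string> result = new List<string>();
int index = 0;
bool quotes = false;
while (index < str.Length)
{
    // Take characters until a separator is reached outside quotes.
    // index also counts the separator that ends the run, so the
    // separator is consumed but not included in the sentence.
    var word = str.Skip(index).TakeWhile(ch =>
    {
        index++;
        if (ch == '"') quotes = !quotes;
        return quotes || !separator.Contains(ch);
    });
    result.Add(string.Join("", word).Trim());
}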

This is not a bulletproof solution, but it can be implemented like this. I did the sentence and quote recognition by hand.

void Main()
{
    var text = "Walked. Turned back. But why? And said \"Hello world. Damn this string splitting things!\" without a shame.";
    var result = SplitText(text);
}
private static List<String> SplitText(string text)
{
    var result = new List<string>();
    var sentenceEndings = new HashSet<char> { '.', '?', '!' };
    var startIndex = 0;
    var length = 0;
    var isQuote = false;
    for (var i = 0; i < text.Length; i++)
    {
        var c = text[i];
        if (c == '"' && !isQuote)
        {
            isQuote = true;
            continue;
        }
        if (c == '"' && isQuote)
        {
            isQuote = false;
            continue;
        }
        if (!isQuote && sentenceEndings.Contains(c))
        {
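            // Take the sentence including its terminating punctuation and
            // trim the whitespace left over from the previous split.
            length = i + 1 - startIndex;
            var part = text.Substring(startIndex, length).Trim();
            result.Add(part);
            startIndex = i + 1;
        }
    }
    // Keep any trailing text that has no terminating punctuation.
    var rest = text.Substring(startIndex).Trim();
    if (rest.Length > 0) result.Add(rest);
    return result;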
}