Comparing trigrams

Keywords: compare, trigram | Updated: 2023-09-27 18:22:18

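I'm fairly new to coding, so I may be missing an obvious answer. Apologies if this is a silly question, but I'm really stuck. I'm trying to compare two sets of trigrams taken from two different texts (A and B). If none of B's trigrams appear in A, then I consider the two texts different, at least for my current purposes. I'm using Nuve to extract the trigrams.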

This is what I have so far:

        var paragraph = "This is not a phrase. This is not a sentence.";
        var paragraph2 = "This is a phrase. This is a sentence. This have nothing to do with sentences.";
        ITokenizer tokenizer = new ClassicTokenizer(true);
        SentenceSegmenter segmenter = new TokenBasedSentenceSegmenter(tokenizer);
        var sentences = segmenter.GetSentences(paragraph);
        ITokenizer tokenizer2 = new ClassicTokenizer(true);
        SentenceSegmenter segmenter2 = new TokenBasedSentenceSegmenter(tokenizer2);
        var sentences2 = segmenter2.GetSentences(paragraph2);

        var extractor = new NGramExtractor(3);
        var grams1 = extractor.ExtractAsList(sentences);
        var grams2 = extractor.ExtractAsList(sentences2);
        var nonintersect = grams2.Except(grams1);

        foreach (var nGram in nonintersect)
        {
            var current = nGram;
            bool found = false;
            foreach (var n in grams2)
            {
                if (!found)
                {
                    if (n == current)
                    {
                        found = true;
                    }
                }
            }
            if (!found)
            {
                var result = current;
                string finalresult = Convert.ToString(result);
                textBox3.AppendText(finalresult + "\n");
            }
        }

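With this I expected to get the sentences in B that don't exist in A (in the example, all of B's sentences), but now I have to compare every trigram of B against every trigram of A to see whether the sentences really differ. I tried to do that with another nested foreach, but all I get is meaningless data. This is the attempt: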

        foreach (var sentence2 in sentences2)
        {
            var actual = sentence2;
            bool found1 = false;
            foreach (var sentence in sentences)
            {
                if (!found1)
                {
                    if (actual == sentence)
                    {
                        found1 = true;
                    }
                }
            }
            if (!found1)
            {
                string finalresult = Convert.ToString(actual);
                textBox3.AppendText(finalresult + "\n");
            }
        }

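What I'm trying to do here is check whether the trigrams of each sentence in B match the trigrams of each sentence in A. If they do, textBox3 should stay empty.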

Simply put, I'm trying to write something like Ferret for C#, just to compare two given plain texts. As far as I know, nothing like that has been done for C# yet.

Any help or hints would be appreciated. Thanks.

Comparing bodies of text

Comparing two bodies of text, and marking them as similar if they have at least one sentence-level trigram in common, is fairly straightforward:

public bool AreTextsSimilar(string a, string b)
{
    // We can reuse these objects - they could be stored in member fields:
    ITokenizer tokenizer = new ClassicTokenizer(true);
    SentenceSegmenter segmenter = new TokenBasedSentenceSegmenter(tokenizer);
    NGramExtractor trigramExtractor = new NGramExtractor(3);
    IEnumerable<string> sentencesA = segmenter.GetSentences(a);
    IEnumerable<string> sentencesB = segmenter.GetSentences(b);
    // The order of trigrams doesn't matter, so we'll fetch them as sets instead,
    // to make comparisons between their elements more efficient:
    ISet<NGram> trigramsA = trigramExtractor.ExtractAsSet(sentencesA);
    ISet<NGram> trigramsB = trigramExtractor.ExtractAsSet(sentencesB);
    // 'Intersect' returns all elements that are found in both collections:
    IEnumerable<NGram> sharedTrigrams = trigramsA.Intersect(trigramsB);
    // 'Any' only returns true if the collection isn't empty:
    return sharedTrigrams.Any();
}

Without the Linq methods (Intersect and Any), the last two lines could be implemented as a loop:

    foreach (NGram trigramA in trigramsA)
    {
        // As soon as we find a shared sentence trigram we can conclude that
        // the two bodies of text are indeed similar:
        if (trigramsB.Contains(trigramA))
            return true;
    }
    return false;
}
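
The "do these two sets share anything?" check can also be written with the Overlaps method that ISet<T> provides. Below is a minimal sketch that does not use Nuve at all: WordTrigrams and ShareTrigram are hypothetical helpers I made up for illustration, and the space-based splitting is only a stand-in for a real tokenizer.

```csharp
using System;
using System.Collections.Generic;

public static class TrigramOverlapDemo
{
    // Hypothetical helper (not part of Nuve): builds word trigrams by
    // splitting on spaces and joining every run of three consecutive words.
    public static HashSet<string> WordTrigrams(string text)
    {
        string[] words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        var trigrams = new HashSet<string>();
        for (int i = 0; i + 2 < words.Length; i++)
            trigrams.Add(string.Join(" ", words, i, 3));
        return trigrams;
    }

    // Returns true if the two texts have at least one word trigram in common.
    public static bool ShareTrigram(string a, string b)
    {
        // Overlaps is true when the set and the argument share at least one element.
        return WordTrigrams(a).Overlaps(WordTrigrams(b));
    }
}
```

Called as TrigramOverlapDemo.ShareTrigram(textA, textB), this plays the same role as the Intersect/Any pair above, just on plain strings instead of NGram objects.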

Sentences that share no word trigrams

Retrieving all sentences in B that share no word-level trigrams with any sentence in A takes a bit more work:

public IEnumerable<string> GetUniqueBSentences(string a, string b)
{
    // We can reuse these objects - they could be stored in member fields:
    ITokenizer tokenizer = new ClassicTokenizer(true);
    SentenceSegmenter segmenter = new TokenBasedSentenceSegmenter(tokenizer);
    NGramExtractor trigramExtractor = new NGramExtractor(3);
    IEnumerable<string> sentencesA = segmenter.GetSentences(a);
    IEnumerable<string> sentencesB = segmenter.GetSentences(b);
    ITokenizer wordTokenizer = new ClassicTokenizer(false);
    foreach (string sentenceB in sentencesB)
    {
        IList<string> wordsB = wordTokenizer.Tokenize(sentenceB);
        ISet<NGram> wordTrigramsB = trigramExtractor.ExtractAsSet(wordsB);
        bool foundMatchingSentence = false;
        foreach (string sentenceA in sentencesA)
        {
            // This will be repeated for every sentence in B. It would be more efficient
            // to generate trigrams for all sentences in A once, before we enter these loops:
            IList<string> wordsA = wordTokenizer.Tokenize(sentenceA);
            ISet<NGram> wordTrigramsA = trigramExtractor.ExtractAsSet(wordsA);
            if (wordTrigramsA.Intersect(wordTrigramsB).Any())
            {
                // We found a sentence in A that shares word-trigrams, so stop comparing:
                foundMatchingSentence = true;
                break;
            }
        }
        // No matching sentence in A? Then this sentence is unique to B:
        if (!foundMatchingSentence)
            yield return sentenceB;
    }
}

Apparently the segmenter also returns an extra empty sentence, which you may want to filter out (or find a way to stop the segmenter from producing it).
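
One way to drop such empty sentences is a Linq Where filter applied to the segmenter output before any trigrams are extracted. This is only a sketch: SentenceFilter and RemoveEmpty are names I made up, and it assumes the sentences arrive as plain strings.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SentenceFilter
{
    // Hypothetical helper: keeps only sentences that contain
    // something other than whitespace.
    public static List<string> RemoveEmpty(IEnumerable<string> sentences)
    {
        return sentences.Where(s => !string.IsNullOrWhiteSpace(s)).ToList();
    }
}
```

You would then pass SentenceFilter.RemoveEmpty(segmenter.GetSentences(a)) into the loops instead of the raw segmenter result.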

If performance is a concern, I'm sure the code above can be optimized, but I'll leave that up to you.
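
As a starting point, here is a rough sketch of the optimization mentioned in the code comment: the trigrams for all sentences in A are built once and merged into a single set, so each sentence in B needs only one overlap check. It does not use Nuve. UniqueSentenceFinder and WordTrigrams are hypothetical names, and the simple Split call is only a stand-in for a real word tokenizer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UniqueSentenceFinder
{
    // Hypothetical stand-in for a word tokenizer plus trigram extractor:
    // splits on spaces and basic punctuation, then joins runs of three words.
    static HashSet<string> WordTrigrams(string sentence)
    {
        string[] words = sentence.Split(new[] { ' ', '.', ',' }, StringSplitOptions.RemoveEmptyEntries);
        var trigrams = new HashSet<string>();
        for (int i = 0; i + 2 < words.Length; i++)
            trigrams.Add(string.Join(" ", words, i, 3));
        return trigrams;
    }

    public static List<string> GetUniqueBSentences(
        IEnumerable<string> sentencesA, IEnumerable<string> sentencesB)
    {
        // Build the trigrams of every sentence in A once and merge them.
        // Trigrams never span two sentences, so the merged set gives the
        // same answer as checking each sentence of A separately.
        var trigramsA = new HashSet<string>();
        foreach (string sentenceA in sentencesA)
            trigramsA.UnionWith(WordTrigrams(sentenceA));

        // A sentence in B is unique if none of its trigrams occur in A.
        return sentencesB
            .Where(sentenceB => !trigramsA.Overlaps(WordTrigrams(sentenceB)))
            .ToList();
    }
}
```

Note that a sentence with fewer than three words has no trigrams at all, so this sketch (like the loop version above) will always report it as unique to B.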