How to improve Lucene.net indexing speed

Keywords: indexing speed, Lucene.net | Updated: 2023-09-27 17:59:13

I am using Lucene.net to index my PDF files. Indexing 15,000 PDF files takes about 40 minutes, and the indexing time grows as the number of PDF files in the folder increases.

How can I improve indexing speed in Lucene.net?
Is there another indexing service with faster indexing performance?

I am using the latest version of Lucene.net (Lucene.net 3.0.3).

Here is my indexing code.

public void refreshIndexes() 
        {
            // Create Index Writer
            string strIndexDir = @"E:\LuceneTest\index";
            IndexWriter writer = new IndexWriter(Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir)), new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29), true, IndexWriter.MaxFieldLength.UNLIMITED);
            // Find all files in root folder create index on them
            List<string> lstFiles = searchFiles(@"E:\LuceneTest\PDFs");
            foreach (string strFile in lstFiles)
            {
                Document doc = new Document();
                string FileName = System.IO.Path.GetFileNameWithoutExtension(strFile);
                string Text = ExtractTextFromPdf(strFile);
                string Path = strFile;
                string ModifiedDate = Convert.ToString(File.GetLastWriteTime(strFile));
                string DocumentType = string.Empty;
                string Vault = string.Empty;
                string headerText = Text.Substring(0, Text.Length < 150 ? Text.Length : 150);
                foreach (var docs in ltDocumentTypes)
                {
                    if (headerText.ToUpper().Contains(docs.searchText.ToUpper()))
                    {
                        DocumentType = docs.DocumentType;
                        Vault = docs.VaultName;
                    }
                }
                if (string.IsNullOrEmpty(DocumentType))
                {
                    DocumentType = "Default";
                    Vault = "Default";
                }
                doc.Add(new Field("filename", FileName, Field.Store.YES, Field.Index.ANALYZED));
                doc.Add(new Field("text", Text, Field.Store.YES, Field.Index.ANALYZED));
                doc.Add(new Field("path", Path, Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.Add(new Field("modifieddate", ModifiedDate, Field.Store.YES, Field.Index.ANALYZED));
                doc.Add(new Field("documenttype", DocumentType, Field.Store.YES, Field.Index.ANALYZED));
                doc.Add(new Field("vault", Vault, Field.Store.YES, Field.Index.ANALYZED));
                writer.AddDocument(doc);
            }
            writer.Optimize();
            writer.Dispose();
        }


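The indexing part looks fine. Note that IndexWriter is thread-safe, so if you are on a multi-core machine, using Parallel.ForEach (with MaxDegreeOfParallelism set to the number of cores; experiment with this value) will probably help.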
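As a rough sketch of that idea, the per-file work can be handed to Parallel.ForEach with a capped degree of parallelism. The helper name `IndexFiles` and its `extractText` / `addDocument` delegates are hypothetical and only stand in for the question's `ExtractTextFromPdf` and `writer.AddDocument` calls; the sketch assumes both are safe to call from several threads at once.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelIndexing
{
    // Hypothetical helper: extracts text and adds a document for each file,
    // using at most maxParallelism worker threads.
    // addDocument must be thread-safe (IndexWriter.AddDocument is, per the answer);
    // extractText is ASSUMED to be thread-safe as well.
    public static void IndexFiles(
        IEnumerable<string> files,
        Func<string, string> extractText,
        Action<string, string> addDocument,
        int maxParallelism)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallelism };

        Parallel.ForEach(files, options, file =>
        {
            string text = extractText(file);   // e.g. ExtractTextFromPdf(file)
            addDocument(file, text);           // e.g. build a Document and call writer.AddDocument
        });
    }
}
```

A call might look like `ParallelIndexing.IndexFiles(lstFiles, ExtractTextFromPdf, (path, text) => writer.AddDocument(BuildDocument(path, text)), Environment.ProcessorCount);`, where `BuildDocument` is a hypothetical method wrapping the field setup from the question. `Environment.ProcessorCount` is only a starting point; if the PDF extraction is disk-bound, a different value may work better.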

The document type detection part, however, is driving the GC crazy. All those ToUpper() calls hurt.

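Outside the lstFiles loop, build an upper-cased copy of each ltDocumentTypes entry's searchText once, keeping DocumentType and VaultName alongside it:

```
var upperDocTypes = ltDocumentTypes
    .Select(x => new { SearchText = x.searchText.ToUpper(), x.DocumentType, x.VaultName })
    .ToList();
```

Inside the file loop, but outside the document type loop, upper-case the header once:

```
string headerTextUpper = headerText.ToUpper();
```

When a match is found, `break`. That exits the loop as soon as there is a match and skips all remaining iterations. Of course, this means the first match wins, whereas in your code the last match wins (if that matters to you).

```
string headerText = Text.Substring(0, Text.Length < 150 ? Text.Length : 150);
foreach (var docs in upperDocTypes)
{
    // SearchText is already upper-cased; headerTextUpper is computed once per file
    if (headerTextUpper.Contains(docs.SearchText))
    {
        DocumentType = docs.DocumentType;
        Vault = docs.VaultName;
        break;
    }
}
```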

Depending on the size of ltDocumentTypes, this may not give you much of an improvement.
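A different way to avoid the ToUpper() allocations, not part of the answer itself, is to skip upper-casing altogether and do a case-insensitive search restricted to the first 150 characters. The sketch below uses `String.IndexOf` with `StringComparison.OrdinalIgnoreCase`; the `DocTypeRule` class and `FindFirstMatch` method are hypothetical names modeled on the fields used in the question. Note that ordinal comparison can differ from culture-sensitive ToUpper() for a few characters (for example the Turkish dotted and dotless i).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the items stored in ltDocumentTypes.
public sealed class DocTypeRule
{
    public string SearchText;
    public string DocumentType;
    public string VaultName;
}

public static class DocTypeDetector
{
    // Returns the first rule whose SearchText occurs (ignoring case) within
    // the first 150 characters of text, or null when no rule matches.
    public static DocTypeRule FindFirstMatch(string text, IEnumerable<DocTypeRule> rules)
    {
        int headerLength = Math.Min(text.Length, 150);

        foreach (var rule in rules)
        {
            // Searches only positions 0..headerLength-1; no substring or
            // upper-cased copy of the text is allocated.
            if (text.IndexOf(rule.SearchText, 0, headerLength,
                             StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return rule;   // first match wins, like the loop with break
            }
        }
        return null;
    }
}
```

When the result is null, the caller would fall back to the "Default" document type and vault, as the original code does.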

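I would bet that the most expensive part is ExtractTextFromPdf. Running the code through a profiler, or instrumenting it with a few Stopwatch instances, will show where the time goes.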
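A minimal sketch of the Stopwatch approach: one accumulating timer per phase (text extraction, document type detection, AddDocument), read out after the loop. The `PhaseTimer` class is a hypothetical helper, not part of Lucene.net or the .NET framework, and as written it is meant for the single-threaded loop only, since Stopwatch is not safe to share across threads.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper that accumulates the total time spent in one phase.
public sealed class PhaseTimer
{
    private readonly Stopwatch _watch = new Stopwatch();

    // Total time measured so far across all Measure calls.
    public TimeSpan Elapsed
    {
        get { return _watch.Elapsed; }
    }

    // Runs work, adds its duration to the running total, and returns its result.
    public T Measure<T>(Func<T> work)
    {
        _watch.Start();          // Start/Stop accumulate until Reset is called
        try
        {
            return work();
        }
        finally
        {
            _watch.Stop();
        }
    }
}
```

In the indexing loop this could be used as `string Text = extractTimer.Measure(() => ExtractTextFromPdf(strFile));`, followed after the loop by `Console.WriteLine("Extraction: " + extractTimer.Elapsed);`. Comparing that total with the overall run time shows how much of the 40 minutes is spent outside Lucene.net.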