Indexing a PDF file in Elasticsearch 2 using NEST 2

Keywords: index, pdf file, Elasticsearch, NEST | Updated: 2023-09-27 18:34:19

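I want to index a PDF file into Elasticsearch as an attachment and then query its content. So far I have tried indexing the document, but the file is not attached to it, or at least ElasticHQ cannot display it and Elasticsearch prints an error.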

This is the indexing code:

var attachment = new Attachment ();
string path = "bankvsmartin.pdf";
attachment.Name = path;
attachment.Content = Convert.ToBase64String (File.ReadAllBytes(path));
attachment.ContentType = "application/pdf";
cases.Add( new Case{
    Author="Martin Luther 2",
    CaseName="Bank vs Martin",
    File= attachment
});
var indexName = "indexname";
client.Map<Case>(m => m.UpdateAllTypes());
foreach (var caze in cases)
{
    var rsp = client.Index (caze, i=>i.Index(indexName).Type("cases"));
}

And the class and mapping definitions:

[ElasticsearchType(Name = "cases")]
public class Case
{
    public string Author { get; set; }
    public string CaseName { get; set; }
    [Attachment(Store = true)]
    public Attachment File { get; set; }
    public Case ()
    {
    }
    public override string ToString()
    {
        return "Case: " + Author + " - " + File.Name;
    }
}

public class Attachment
{
    [String(Name = "_content")]
    public string Content { get; set; }
    [String(Name = "_content_type")]
    public string ContentType { get; set; }
    [String(Name = "_name")]
    public string Name { get; set; }
}

Elasticsearch error in the console when trying to retrieve the attachment:

RemoteTransportException[[Sin-Eater][127.0.0.1:9300][indices:data/read/search[phase/fetch/id]]]; nested: IllegalArgumentException[field [file] isn't a leaf field];
Caused by: java.lang.IllegalArgumentException: field [file] isn't a leaf field
    at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:138)
    at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:590)
    at org.elasticsearch.search.action.SearchServiceTransportAction$FetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:408)
    at org.elasticsearch.search.action.SearchServiceTransportAction$FetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:405)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:350)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

I am trying to accomplish almost the same thing as this question, but with the latest version of NEST.

Using Elasticsearch 2.2, NEST 2.0.2, Mono/.NET 4.5

Update

This is the generated mapping:

"mappings": {
  "cases": {
    "properties": {
      "author": {
        "type": "string"
      },
      "case_name": {
        "type": "string"
      },
      "file": {
        "properties": {
          "_content": {
            "type": "string"
          },
          "_content_type": {
            "type": "string"
          },
          "_name": {
            "type": "string"
          }
        }
      }
    }
  }
}

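I think this is because you cannot map an attachment using attributes. The attachment type in ES and NEST requires a complex mapping that attribute-based mapping cannot express. If you download the NEST source code and look at the unit tests, you will find many examples.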
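For comparison with the generated mapping in the question, where `file` ends up as a plain object with three string sub-properties, an attachment mapping in Elasticsearch 2.x looks roughly like the sketch below. It assumes the mapper-attachments plugin is installed on the node. The `store` and `term_vector` settings are illustrative choices, not requirements.

```json
{
  "cases": {
    "properties": {
      "file": {
        "type": "attachment",
        "fields": {
          "content": {
            "type": "string",
            "store": true,
            "term_vector": "with_positions_offsets"
          },
          "author": {
            "type": "string",
            "store": true,
            "term_vector": "with_positions_offsets"
          }
        }
      }
    }
  }
}
```

If `file` shows `"type": "attachment"` when you fetch the index mapping, the extracted text should be searchable under `file.content`. If it still shows a nested `properties` block, the attachment mapping was not applied. That would be consistent with the "isn't a leaf field" error above.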

You can define the mapping explicitly with NEST's fluent API. Here is an example:

var mappingResponse = elasticClient.Map<Case>(m => m
    .AutoMap()
    .Properties(ps => ps
        .String(s => s
            .Name(f => f.CaseName)
            .Index(FieldIndexOption.Analyzed)
            .Store(true))
        .Attachment(atm => atm
            .Name(p => p.File)
            .FileField(f => f
                .Name(p => p.File)
                .Index(FieldIndexOption.Analyzed)
                .Store(true)
                .TermVector(TermVectorOption.WithPositionsOffsets))
            .AuthorField(af => af
                .Name(p => p.Author)
                .Store(true)
                .Index(FieldIndexOption.Analyzed)
                .TermVector(TermVectorOption.WithPositionsOffsets)))));

After this issue was fixed, the following mapping works:

[ElasticsearchType(Name = "cases")]
public class Case
{
    public Case()
    {
    }
    [String(Name = "case_name")]
    public string CaseName { get; set; }
    [String(Name = "md5")]
    public string Md5 { get; set; }
    [Attachment(Name="file")]
    public Attachment File { get; set; }
}

public class Attachment
{
    public Attachment()
    {
    }
    [String(Name = "_author")]
    public string Author { get; set; }
    [String(Name = "_content_length")]
    public long ContentLength { get; set; }
    [String(Name = "_content_type")]
    public string ContentType { get; set; }
    [Date(Name = "_date")]
    public DateTime Date { get; set; }
    [String(Name = "_keywords")]
    public string Keywords { get; set; }
    [String(Name = "_language")]
    public string Language { get; set; }
    [String(Name = "_name")]
    public string Name { get; set; }
    [String(Name = "_title")]
    public string Title { get; set; }
    [String(Name = "_content")]
    public string Content { get; set; }
}