Searching a PDF file to find a string

Keywords: find string, search, PDF file | Updated: 2023-09-27 18:03:43

I need to search a PDF file to find a string. I know iTextSharp has this capability, and I can use this code:

// Requires: using System.IO; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
public bool SearchPdfFile(string fileName, string searchText)
{
    // A missing file is a caller error, so throw instead of silently returning false
    // (the original workflow was: if (!File.Exists(fileName)) return false;).
    if (!File.Exists(fileName))
        throw new FileNotFoundException("File not found", fileName);
    using (PdfReader reader = new PdfReader(fileName))
    {
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            // Use a fresh strategy per page so text from earlier pages does not accumulate.
            var strategy = new SimpleTextExtractionStrategy();
            var currentPageText = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
            if (currentPageText.Contains(searchText))
                return true;
        }
    }
    return false;
}

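However, I am using iText under the LGPL/MPL license (versions 3.0/4.0); the newer version 5.0 is free only if I release my own software under the AGPL. The class SimpleTextExtractionStrategy is not defined in this version of iText. Is there an alternative that works with the older version of iText?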


PDFClown. A silly name, but it is a fairly thorough and flexible PDF library. I have used it before. It is free under the LGPL. http://pdfclown.org/about/TheLicense

Adapted from an example on the PDFClown website (their example is in Java):

File file = new File(myFilePath);
// Define the text pattern to look for!
String textRegEx = "rabbit";
Pattern pattern = Pattern.compile(textRegEx, Pattern.CASE_INSENSITIVE);
// Instantiate the extractor!
TextExtractor textExtractor = new TextExtractor(true, true);
for(final Page page : file.getDocument().getPages())
{
  // Extract the page text!
  Map<Rectangle2D,List<ITextString>> textStrings = textExtractor.extract(page);
  // Find the text pattern matches!
  final Matcher matcher = pattern.matcher(TextExtractor.toString(textStrings));
}

Update (the same example in C#):

    File file = new File(myFilePath);
    // Define the text pattern to look for!
    var pattern = new Regex("rabbit", RegexOptions.IgnoreCase);
    // Instantiate the extractor!
    TextExtractor textExtractor = new TextExtractor(true, true);
    foreach (var page in file.Document.Pages)
    {
        // Extract the page text!
        var textStrings = textExtractor.Extract(page);
        // Find the text pattern matches!
        var matches = pattern.Matches(TextExtractor.ToString(textStrings));
    }
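
Neither snippet above returns the yes/no answer that SearchPdfFile in the question gives. A minimal sketch of that last step in Java, assuming the text of each page has already been extracted into a String (for example with TextExtractor.toString as above). The class PageTextSearch and the method firstPageContaining are hypothetical names for illustration and are not part of PDF Clown:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PageTextSearch {
    // Hypothetical helper: returns the 1-based number of the first page whose
    // text matches the regex (case-insensitive), or -1 if no page matches.
    public static int firstPageContaining(List<String> pageTexts, String textRegEx) {
        Pattern pattern = Pattern.compile(textRegEx, Pattern.CASE_INSENSITIVE);
        for (int i = 0; i < pageTexts.size(); i++) {
            if (pattern.matcher(pageTexts.get(i)).find()) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Stand-in strings; in real use these would come from the extractor.
        List<String> pages = Arrays.asList("The quick brown fox", "A white Rabbit ran past");
        System.out.println(firstPageContaining(pages, "rabbit"));   // prints 2
        System.out.println(firstPageContaining(pages, "hedgehog")); // prints -1
    }
}
```

If the search text is a literal rather than a regular expression, passing it through Pattern.quote first avoids surprises with characters such as '.' or '('.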