Using ITextExtractionStrategy and LocationTextExtractionStrategy in iTextSharp

Keywords: ITextExtractionStrategy, LocationTextExtractionStrategy, iTextSharp | Updated: 2023-09-27 18:05:26

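I have a PDF file and I am reading its text into a string using ITextExtractionStrategy. From that string I take a substring such as My name is XYZ, and I need to get the rectangular coordinates of that substring in the PDF file, but I have not been able to do it. From searching on Google I learned about LocationTextExtractionStrategy, but I don't know how to use it to get the coordinates.

Here is the code: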

ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
string getcoordinate="My name is XYZ";

How can I get the rectangular coordinates of this substring using iTextSharp?

Please help.


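This is a very, very simple version of an implementation.

Before implementing it, it is very important to know that PDFs have no concept of "words", "paragraphs", "sentences", and so on. Also, text within a PDF is not necessarily laid out left to right and top to bottom, and this has nothing to do with non-LTR languages. The phrase "Hello World" could be written into the PDF as: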

Draw H at (10, 10)
Draw ell at (20, 10)
Draw rld at (90, 10)
Draw o Wo at (50, 20)

It could also be written as:

Draw Hello World at (10,10)

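The ITextExtractionStrategy interface that you need to implement has a method called RenderText, which gets called once for every chunk of text within the PDF. Notice I said "chunk" and not "word". In the first example above, the method would be called four times for those two words. In the second example, it would be called once for those two words. This is the very important part to understand: PDFs don't have words, and because of this iTextSharp doesn't have words either. The "word" part is entirely up to you to solve.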
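To make the chunk-versus-word point concrete, here is a minimal sketch that puts out-of-order chunks back into reading order. It does not use iTextSharp at all: DrawnChunk and ChunkOrdering are hypothetical names invented for this illustration, and the sketch assumes plain horizontal text where every chunk's position is known.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for what one RenderText call tells you:
// the text of the chunk and where it was drawn.
public sealed class DrawnChunk
{
    public string Text { get; }
    public float X { get; }
    public float Y { get; }

    public DrawnChunk(string text, float x, float y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class ChunkOrdering
{
    // Rebuilds reading order for horizontal text: highest y first
    // (PDF y normally grows upward), then left to right.
    public static string Reassemble(IEnumerable<DrawnChunk> chunks)
    {
        var ordered = chunks
            .OrderByDescending(c => c.Y)
            .ThenBy(c => c.X);
        return string.Concat(ordered.Select(c => c.Text));
    }
}
```

Feeding it four chunks on the same baseline in drawing order ("H", "ell", "rld", "o Wo") yields the phrase in reading order, which is roughly what a location-based strategy has to do before any word search is possible.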

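Also, as I said above, PDFs don't have paragraphs. The reason to be aware of this is that PDFs cannot wrap text onto a new line. Any time you see something that looks like a paragraph return, you are actually seeing a brand new text drawing command that has a different y coordinate from the previous line.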
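Since a "new line" is only a change in y coordinate, lines have to be rebuilt by grouping chunks with similar y values. The following is a rough sketch under stated assumptions: chunks are (text, x, y) tuples, the text is horizontal, and the 0.5 tolerance is an arbitrary value chosen for illustration. LineGrouping is a hypothetical helper, not part of iTextSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineGrouping
{
    // Groups (text, x, y) chunks into lines. A chunk joins the current line
    // when its y is within `tolerance` of the y that started that line.
    public static List<string> ToLines(
        IEnumerable<Tuple<string, float, float>> chunks, float tolerance = 0.5f)
    {
        var lines = new List<List<Tuple<string, float, float>>>();
        float lineY = 0f;

        // Walk from the top of the page down (PDF y normally grows upward).
        foreach (var chunk in chunks.OrderByDescending(c => c.Item3))
        {
            if (lines.Count > 0 && Math.Abs(chunk.Item3 - lineY) <= tolerance)
            {
                lines[lines.Count - 1].Add(chunk);
            }
            else
            {
                lines.Add(new List<Tuple<string, float, float>> { chunk });
                lineY = chunk.Item3;
            }
        }

        // Within each line, order chunks left to right and join their text.
        return lines
            .Select(line => string.Concat(line.OrderBy(c => c.Item2).Select(c => c.Item1)))
            .ToList();
    }
}
```

Rotated text, superscripts, and multi-column layouts would all need more care than this.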

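The code below is a very simple implementation. For it I subclass LocationTextExtractionStrategy, which already implements ITextExtractionStrategy. On each call to RenderText() I find the rectangle of the current chunk (using Mark's code) and store it for later. I use this simple helper class to store the chunks and rectangles: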
//Helper class that stores our rectangle and text
public class RectAndText {
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public RectAndText(iTextSharp.text.Rectangle rect, String text) {
        this.Rect = rect;
        this.Text = text;
    }
}

And here is the subclass:

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();
    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);
        //Get the bounding box for the chunk of text
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();
        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
                                                bottomLeft[Vector.I1],
                                                bottomLeft[Vector.I2],
                                                topRight[Vector.I1],
                                                topRight[Vector.I2]
                                                );
        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
    }
}

And finally, an implementation of the above:

//Our test file
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
//Create our test file, nothing special
using (var fs = new FileStream(testFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
    using (var doc = new Document()) {
        using (var writer = PdfWriter.GetInstance(doc, fs)) {
            doc.Open();
            doc.Add(new Paragraph("This is my sample file"));
            doc.Close();
        }
    }
}
//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();
//Parse page 1 of the document above
using (var r = new PdfReader(testFile)) {
    var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
}
//Loop through each chunk found
foreach (var p in t.myPoints) {
    Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
}

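I can't stress enough that the above does not take "words" into account; that is up to you. The TextRenderInfo object that is passed into RenderText has a method called GetCharacterRenderInfos() that you might be able to use to get more information. You might also want to use GetBaseline() instead of GetDescentLine() if you don't care about descenders in the font.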
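As one possible way to approach the "words" part, the sketch below splits a run of per-character boxes at whitespace and merges the boxes of each word. CharBox, WordBox, and WordSplitter are hypothetical types made up for this example. In real code the box values would presumably be read from each TextRenderInfo returned by GetCharacterRenderInfos(), which this sketch does not do.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical per-character bounding box.
public sealed class CharBox
{
    public char Ch;
    public float Left, Bottom, Right, Top;
}

// Hypothetical result type: a word and the box that encloses it.
public sealed class WordBox
{
    public string Text = "";
    public float Left, Bottom, Right, Top;
}

public static class WordSplitter
{
    // Walks the characters in order, starting a new word after whitespace
    // and growing the current word's box to enclose each character.
    public static List<WordBox> Split(IEnumerable<CharBox> chars)
    {
        var words = new List<WordBox>();
        WordBox current = null;

        foreach (var c in chars)
        {
            if (char.IsWhiteSpace(c.Ch))
            {
                current = null; // the next visible character starts a new word
                continue;
            }
            if (current == null)
            {
                current = new WordBox { Left = c.Left, Bottom = c.Bottom, Right = c.Right, Top = c.Top };
                words.Add(current);
            }
            else
            {
                current.Left = Math.Min(current.Left, c.Left);
                current.Bottom = Math.Min(current.Bottom, c.Bottom);
                current.Right = Math.Max(current.Right, c.Right);
                current.Top = Math.Max(current.Top, c.Top);
            }
            current.Text += c.Ch;
        }
        return words;
    }
}
```

This only works when the characters arrive in reading order and the spaces are actually present in the text; many PDFs position words with gaps instead of space characters.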

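EDIT

(I had a great lunch so I'm feeling a little more helpful.)

Here is an updated version of MyLocationTextExtractionStrategy that does what my comments say: it takes a string to search for and searches each chunk for that string. For all of the reasons listed, this will not work in some/many/most/all cases. If the substring exists multiple times in a single chunk, it will also only return the first instance. Ligatures and diacritics could also cause trouble.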

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();
    //The string that we're searching for
    public String TextToSearchFor { get; set; }
    //How to compare strings
    public System.Globalization.CompareOptions CompareOptions { get; set; }
    public MyLocationTextExtractionStrategy(String textToSearchFor, System.Globalization.CompareOptions compareOptions = System.Globalization.CompareOptions.None) {
        this.TextToSearchFor = textToSearchFor;
        this.CompareOptions = compareOptions;
    }
    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);
        //See if the current chunk contains the text
        var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), this.TextToSearchFor, this.CompareOptions);
        //If not found bail
        if (startPosition < 0) {
            return;
        }
        //Grab the individual characters
        var chars = renderInfo.GetCharacterRenderInfos().Skip(startPosition).Take(this.TextToSearchFor.Length).ToList();
        //Grab the first and last character
        var firstChar = chars.First();
        var lastChar = chars.Last();

        //Get the bounding box for the chunk of text
        var bottomLeft = firstChar.GetDescentLine().GetStartPoint();
        var topRight = lastChar.GetAscentLine().GetEndPoint();
        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
                                                bottomLeft[Vector.I1],
                                                bottomLeft[Vector.I2],
                                                topRight[Vector.I1],
                                                topRight[Vector.I2]
                                                );
        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, this.TextToSearchFor));
    }
}

You use it the same way as before, except that the constructor now has a single required parameter:

var t = new MyLocationTextExtractionStrategy("sample");
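The "only the first instance per chunk" limitation mentioned above could be lifted by collecting every start position instead of just one. This is a sketch of that idea only, using the same culture-aware CompareInfo.IndexOf as the strategy; OccurrenceFinder is a hypothetical helper, and it assumes matches should not overlap and that each match is as long as the search string.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class OccurrenceFinder
{
    // Returns the start index of every non-overlapping occurrence of
    // `search` inside `chunkText`.
    public static List<int> FindAll(
        string chunkText, string search, CompareOptions options = CompareOptions.None)
    {
        var positions = new List<int>();
        if (string.IsNullOrEmpty(chunkText) || string.IsNullOrEmpty(search))
        {
            return positions;
        }

        CompareInfo compare = CultureInfo.CurrentCulture.CompareInfo;
        int start = 0;
        while (start < chunkText.Length)
        {
            int found = compare.IndexOf(chunkText, search, start, options);
            if (found < 0)
            {
                break;
            }
            positions.Add(found);
            start = found + search.Length; // continue after this match
        }
        return positions;
    }
}
```

Each returned position could then be used with Skip(position).Take(search.Length) on the character render infos, one rectangle per match, with the same caveats about ligatures and diacritics.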

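This is an old question, but I am leaving my answer here because I could not find a correct answer on the web.

As Chris Haas has pointed out, dealing with words is not as easy as dealing with chunks of text. The code Chris posted failed in most of my tests because a word is usually split across different chunks (he warns about this in his post).

To solve that problem, I used the following strategy:

  1. Split chunks into characters (each character is actually a TextRenderInfo object)
  2. Group the characters by line. This is not straightforward, because you have to deal with chunk alignment.
  3. Search each line for the word you need to find

I leave the code here. I tested it with several documents and it works pretty well, but it could fail in some scenarios because this chunk -> words transformation is a bit tricky.

Hope it helps someone.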

class LocationTextExtractionStrategyEx : LocationTextExtractionStrategy
{
    private List<LocationTextExtractionStrategyEx.ExtendedTextChunk> m_DocChunks = new List<ExtendedTextChunk>();
    private List<LocationTextExtractionStrategyEx.LineInfo> m_LinesTextInfo = new List<LineInfo>();
    public List<SearchResult> m_SearchResultsList = new List<SearchResult>();
    private String m_SearchText;
    public const float PDF_PX_TO_MM = 0.3528f;
    public float m_PageSizeY;

    public LocationTextExtractionStrategyEx(String sSearchText, float fPageSizeY)
        : base()
    {
        this.m_SearchText = sSearchText;
        this.m_PageSizeY = fPageSizeY;
    }
    private void searchText()
    {
        foreach (LineInfo aLineInfo in m_LinesTextInfo)
        {
            int iIndex = aLineInfo.m_Text.IndexOf(m_SearchText);
            if (iIndex != -1)
            {
                TextRenderInfo aFirstLetter = aLineInfo.m_LineCharsList.ElementAt(iIndex);
                SearchResult aSearchResult = new SearchResult(aFirstLetter, m_PageSizeY);
                this.m_SearchResultsList.Add(aSearchResult);
            }
        }
    }
    private void groupChunksbyLine()
    {                     
        LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk1 = null;
        LocationTextExtractionStrategyEx.LineInfo textInfo = null;
        foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk2 in this.m_DocChunks)
        {
            if (textChunk1 == null)
            {                    
                textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
                this.m_LinesTextInfo.Add(textInfo);
            }
            else if (textChunk2.sameLine(textChunk1))
            {                      
                textInfo.appendText(textChunk2);
            }
            else
            {                                        
                textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
                this.m_LinesTextInfo.Add(textInfo);
            }
            textChunk1 = textChunk2;
        }
    }
    public override string GetResultantText()
    {
        groupChunksbyLine();
        searchText();
        //In this case the return value is not useful
        return "";
    }
    public override void RenderText(TextRenderInfo renderInfo)
    {
        LineSegment baseline = renderInfo.GetBaseline();
        //Create ExtendedChunk
        ExtendedTextChunk aExtendedChunk = new ExtendedTextChunk(renderInfo.GetText(), baseline.GetStartPoint(), baseline.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), renderInfo.GetCharacterRenderInfos().ToList());
        this.m_DocChunks.Add(aExtendedChunk);
    }
    public class ExtendedTextChunk
    {
        public string m_text;
        private Vector m_startLocation;
        private Vector m_endLocation;
        private Vector m_orientationVector;
        private int m_orientationMagnitude;
        private int m_distPerpendicular;           
        private float m_charSpaceWidth;           
        public List<TextRenderInfo> m_ChunkChars;

        public ExtendedTextChunk(string txt, Vector startLoc, Vector endLoc, float charSpaceWidth,List<TextRenderInfo> chunkChars)
        {
            this.m_text = txt;
            this.m_startLocation = startLoc;
            this.m_endLocation = endLoc;
            this.m_charSpaceWidth = charSpaceWidth;                
            this.m_orientationVector = this.m_endLocation.Subtract(this.m_startLocation).Normalize();
            this.m_orientationMagnitude = (int)(Math.Atan2((double)this.m_orientationVector[1], (double)this.m_orientationVector[0]) * 1000.0);
            this.m_distPerpendicular = (int)this.m_startLocation.Subtract(new Vector(0.0f, 0.0f, 1f)).Cross(this.m_orientationVector)[2];                
            this.m_ChunkChars = chunkChars;
        }

        public bool sameLine(LocationTextExtractionStrategyEx.ExtendedTextChunk textChunkToCompare)
        {
            return this.m_orientationMagnitude == textChunkToCompare.m_orientationMagnitude && this.m_distPerpendicular == textChunkToCompare.m_distPerpendicular;
        }

    }
    public class SearchResult
    {
        public int iPosX;
        public int iPosY;
        public SearchResult(TextRenderInfo aCharcter, float fPageSizeY)
        {
            //Get position of upperLeft coordinate
            Vector vTopLeft = aCharcter.GetAscentLine().GetStartPoint();
            //PosX
            float fPosX = vTopLeft[Vector.I1]; 
            //PosY
            float fPosY = vTopLeft[Vector.I2];
            //Transform to mm and get y from top of page
            iPosX = Convert.ToInt32(fPosX * PDF_PX_TO_MM);
            iPosY = Convert.ToInt32((fPageSizeY - fPosY) * PDF_PX_TO_MM);
        }
    }
    public class LineInfo
    {            
        public string m_Text;
        public List<TextRenderInfo> m_LineCharsList;
        public LineInfo(LocationTextExtractionStrategyEx.ExtendedTextChunk initialTextChunk)
        {                
            this.m_Text = initialTextChunk.m_text;
            this.m_LineCharsList = initialTextChunk.m_ChunkChars;
        }
        public void appendText(LocationTextExtractionStrategyEx.ExtendedTextChunk additionalTextChunk)
        {
            m_LineCharsList.AddRange(additionalTextChunk.m_ChunkChars);
            this.m_Text += additionalTextChunk.m_text;
        }
    }
}
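The sameLine test in ExtendedTextChunk compares two integers: a scaled angle and a truncated perpendicular distance. The sketch below reproduces that arithmetic with plain numbers so the effect of the truncation can be seen. It is my reading of the code above, not a statement about iTextSharp internals; LineKey is a hypothetical helper, and treating a zero-length baseline as horizontal is an assumption made only to avoid dividing by zero.

```csharp
using System;

public static class LineKey
{
    // Angle of the baseline from (x1, y1) to (x2, y2), in milliradians.
    public static int Orientation(float x1, float y1, float x2, float y2)
    {
        return (int)(Math.Atan2(y2 - y1, x2 - x1) * 1000.0);
    }

    // z component of (start - (0,0,1)) x orientation, truncated to int.
    // With start = (x1, y1, 1) this is x1 * oy - y1 * ox.
    public static int DistPerpendicular(float x1, float y1, float x2, float y2)
    {
        double dx = x2 - x1;
        double dy = y2 - y1;
        double length = Math.Sqrt(dx * dx + dy * dy);

        // Assumption: a zero-length baseline is treated as horizontal.
        double ox = length == 0 ? 1.0 : dx / length;
        double oy = length == 0 ? 0.0 : dy / length;

        return (int)(x1 * oy - y1 * ox);
    }
}
```

For horizontal text the key is simply the negated y, truncated toward zero. Baselines at y = 700.0 and y = 700.9 therefore share a key, while y = 699.9 and y = 700.0 do not, which may be one reason this grouping occasionally misfires.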

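I know this is a really old question, but below is what I ended up doing. I am posting it here in the hope that it will be useful to someone else.

The following code tells you the starting coordinates of the line(s) that contain the search text. It should not be hard to modify it to give the positions of words. Please note: I tested this on iTextSharp 5.5.11.0, and it will not work on some older versions.

As mentioned above, PDFs have no concept of words, lines, or paragraphs. But I found that LocationTextExtractionStrategy does a very good job of splitting lines and words, so my solution is based on that.

Disclaimer:

This solution is based on https://github.com/itext/itextsharp/blob/develop/src/core/iTextSharp/text/pdf/parser/LocationTextExtractionStrategy.cs, and that file has a comment saying that it is a development preview. So this might not work in the future.

Anyway, here is the code.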

using System.Collections.Generic;
using iTextSharp.text.pdf.parser;
namespace Logic
{
    public class LocationTextExtractionStrategyWithPosition : LocationTextExtractionStrategy
    {
        private readonly List<TextChunk> locationalResult = new List<TextChunk>();
        private readonly ITextChunkLocationStrategy tclStrat;
        public LocationTextExtractionStrategyWithPosition() : this(new TextChunkLocationStrategyDefaultImp())
        {
        }
        /**
         * Creates a new text extraction renderer, with a custom strategy for
         * creating new TextChunkLocation objects based on the input of the
         * TextRenderInfo.
         * @param strat the custom strategy
         */
        public LocationTextExtractionStrategyWithPosition(ITextChunkLocationStrategy strat)
        {
            tclStrat = strat;
        }

        private bool StartsWithSpace(string str)
        {
            if (str.Length == 0) return false;
            return str[0] == ' ';
        }

        private bool EndsWithSpace(string str)
        {
            if (str.Length == 0) return false;
            return str[str.Length - 1] == ' ';
        }
        /**
         * Filters the provided list with the provided filter
         * @param textChunks a list of all TextChunks that this strategy found during processing
         * @param filter the filter to apply.  If null, filtering will be skipped.
         * @return the filtered list
         * @since 5.3.3
         */
        private List<TextChunk> filterTextChunks(List<TextChunk> textChunks, ITextChunkFilter filter)
        {
            if (filter == null)
            {
                return textChunks;
            }
            var filtered = new List<TextChunk>();
            foreach (var textChunk in textChunks)
            {
                if (filter.Accept(textChunk))
                {
                    filtered.Add(textChunk);
                }
            }
            return filtered;
        }
        public override void RenderText(TextRenderInfo renderInfo)
        {
            LineSegment segment = renderInfo.GetBaseline();
            if (renderInfo.GetRise() != 0)
            { // remove the rise from the baseline - we do this because the text from a super/subscript render operations should probably be considered as part of the baseline of the text the super/sub is relative to 
                Matrix riseOffsetTransform = new Matrix(0, -renderInfo.GetRise());
                segment = segment.TransformBy(riseOffsetTransform);
            }
            TextChunk tc = new TextChunk(renderInfo.GetText(), tclStrat.CreateLocation(renderInfo, segment));
            locationalResult.Add(tc);
        }

        public IList<TextLocation> GetLocations()
        {
            var filteredTextChunks = filterTextChunks(locationalResult, null);
            filteredTextChunks.Sort();
            TextChunk lastChunk = null;
            var textLocations = new List<TextLocation>();
            foreach (var chunk in filteredTextChunks)
            {
                if (lastChunk == null)
                {
                    //initial
                    textLocations.Add(new TextLocation
                    {
                        Text = chunk.Text,
                        X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
                        Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
                    });
                }
                else
                {
                    if (chunk.SameLine(lastChunk))
                    {
                        var text = "";
                        // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
                        if (IsChunkAtWordBoundary(chunk, lastChunk) && !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text))
                            text += ' ';
                        text += chunk.Text;
                        textLocations[textLocations.Count - 1].Text += text;
                    }
                    else
                    {
                        textLocations.Add(new TextLocation
                        {
                            Text = chunk.Text,
                            X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]),
                            Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1])
                        });
                    }
                }
                lastChunk = chunk;
            }
            //now find the location(s) with the given texts
            return textLocations;
        }
    }
    public class TextLocation
    {
        public float X { get; set; }
        public float Y { get; set; }
        public string Text { get; set; }
    }
}

How to call the method:

//Declared outside the using block so it is still in scope for the search below
IList<TextLocation> res;
using (var reader = new PdfReader(inputPdf))
{
    var parser = new PdfReaderContentParser(reader);
    var strategy = parser.ProcessContent(pageNumber, new LocationTextExtractionStrategyWithPosition());
    res = strategy.GetLocations();
}
//Lines containing the search text, highest Y first (requires System.Linq)
var searchResult = res.Where(p => p.Text.Contains(searchText)).OrderBy(p => p.Y).Reverse().ToList();
  • inputPdf is a byte[] that has the PDF data
  • pageNumber is the page you want to search in
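The X and Y values produced above are in millimetres, converted from PDF points, and Y is the untouched PDF coordinate, which by default is measured from the bottom of the page. If a top-down Y is wanted, as in the earlier answer that multiplies by 0.3528, the conversion can be sketched as follows. PdfUnits is a hypothetical helper written for this illustration; it relies only on the definition 1 point = 1/72 inch = 25.4/72 mm, and it assumes an unrotated page whose origin is the bottom-left corner.

```csharp
using System;

public static class PdfUnits
{
    // 1 pt = 1/72 in and 1 in = 25.4 mm, so about 0.35278 mm per point.
    public const float MillimetersPerPoint = 25.4f / 72f;

    public static float PointsToMillimeters(float points)
    {
        return points * MillimetersPerPoint;
    }

    // Distance from the top edge of the page, in millimetres, for a y
    // coordinate given in points from the bottom edge.
    public static float FromTopMillimeters(float yPoints, float pageHeightPoints)
    {
        return PointsToMillimeters(pageHeightPoints - yPoints);
    }
}
```

For example, a point 72 pt below the top of an 842 pt tall page lies 25.4 mm from the top edge.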

Here is how to use LocationTextExtractionStrategy in VB.NET

Class definition:

Class TextExtractor
    Inherits LocationTextExtractionStrategy
    Implements iTextSharp.text.pdf.parser.ITextExtractionStrategy
    Public oPoints As IList(Of RectAndText) = New List(Of RectAndText)
    Public Overrides Sub RenderText(renderInfo As TextRenderInfo) 'Implements IRenderListener.RenderText
        MyBase.RenderText(renderInfo)
        Dim bottomLeft As Vector = renderInfo.GetDescentLine().GetStartPoint()
        Dim topRight As Vector = renderInfo.GetAscentLine().GetEndPoint() 'GetBaseline
        Dim rect As Rectangle = New Rectangle(bottomLeft(Vector.I1), bottomLeft(Vector.I2), topRight(Vector.I1), topRight(Vector.I2))
        oPoints.Add(New RectAndText(rect, renderInfo.GetText()))
    End Sub
    Private Function GetLines() As Dictionary(Of Single, ArrayList)
        Dim oLines As New Dictionary(Of Single, ArrayList)
        For Each p As RectAndText In oPoints
            Dim iBottom = p.Rect.Bottom
            If oLines.ContainsKey(iBottom) = False Then
                oLines(iBottom) = New ArrayList()
            End If
            oLines(iBottom).Add(p)
        Next
        Return oLines
    End Function
    Public Function Find(ByVal sFind As String) As iTextSharp.text.Rectangle
        Dim oLines As Dictionary(Of Single, ArrayList) = GetLines()
        For Each oEntry As KeyValuePair(Of Single, ArrayList) In oLines
            'Dim iBottom As Integer = oEntry.Key
            Dim oRectAndTexts As ArrayList = oEntry.Value
            Dim sLine As String = ""
            For Each p As RectAndText In oRectAndTexts
                sLine += p.Text
                If sLine.IndexOf(sFind) <> -1 Then
                    Return p.Rect
                End If
            Next
        Next
        Return Nothing
    End Function
End Class
Public Class RectAndText
    Public Rect As iTextSharp.text.Rectangle
    Public Text As String
    Public Sub New(ByVal rect As iTextSharp.text.Rectangle, ByVal text As String)
        Me.Rect = rect
        Me.Text = text
    End Sub
End Class

Usage (inserts a signature box to the right of the found text)

Sub EncryptPdf(ByVal sInFilePath As String, ByVal sOutFilePath As String)
        Dim oPdfReader As iTextSharp.text.pdf.PdfReader = New iTextSharp.text.pdf.PdfReader(sInFilePath)
        Dim oPdfDoc As New iTextSharp.text.Document()
        Dim oPdfWriter As PdfWriter = PdfWriter.GetInstance(oPdfDoc, New FileStream(sOutFilePath, FileMode.Create))
        'oPdfWriter.SetEncryption(PdfWriter.STRENGTH40BITS, sPassword, sPassword, PdfWriter.AllowCopy)
        oPdfDoc.Open()
        oPdfDoc.SetPageSize(iTextSharp.text.PageSize.LEDGER.Rotate())
        Dim oDirectContent As iTextSharp.text.pdf.PdfContentByte = oPdfWriter.DirectContent
        Dim iNumberOfPages As Integer = oPdfReader.NumberOfPages
        Dim iPage As Integer = 0
        Dim iBottomMargin As Integer = txtBottomMargin.Text '10
        Dim iLeftMargin As Integer = txtLeftMargin.Text '500
        Dim iWidth As Integer = txtWidth.Text '120
        Dim iHeight As Integer = txtHeight.Text '780
        Dim oStrategy As New parser.SimpleTextExtractionStrategy()

        Do While (iPage < iNumberOfPages)
            iPage += 1
            oPdfDoc.SetPageSize(oPdfReader.GetPageSizeWithRotation(iPage))
            oPdfDoc.NewPage()
            Dim oPdfImportedPage As iTextSharp.text.pdf.PdfImportedPage =
            oPdfWriter.GetImportedPage(oPdfReader, iPage)
            Dim iRotation As Integer = oPdfReader.GetPageRotation(iPage)
            If (iRotation = 90) Or (iRotation = 270) Then
                oDirectContent.AddTemplate(oPdfImportedPage, 0, -1.0F, 1.0F,
                 0, 0, oPdfReader.GetPageSizeWithRotation(iPage).Height)
            Else
                oDirectContent.AddTemplate(oPdfImportedPage, 1.0F, 0, 0, 1.0F, 0, 0)
            End If
            'Dim sPageText As String = parser.PdfTextExtractor.GetTextFromPage(oPdfReader, iPage, oStrategy)
            'sPageText = System.Text.Encoding.UTF8.GetString(System.Text.ASCIIEncoding.Convert(System.Text.Encoding.Default, System.Text.Encoding.UTF8, System.Text.Encoding.Default.GetBytes(sPageText)))
            'If txtFind.Text = "" OrElse sPageText.IndexOf(txtFind.Text) <> -1 Then
            Dim oTextExtractor As New TextExtractor()
            PdfTextExtractor.GetTextFromPage(oPdfReader, iPage, oTextExtractor) 'Initialize oTextExtractor
            Dim oRect As iTextSharp.text.Rectangle = oTextExtractor.Find(txtFind.Text)
            If oRect IsNot Nothing Then
                Dim iX As Integer = oRect.Left + oRect.Width + iLeftMargin 'Move right
                Dim iY As Integer = oRect.Bottom - iBottomMargin 'Move down
                Dim field As PdfFormField = PdfFormField.CreateSignature(oPdfWriter)
                field.SetWidget(New Rectangle(iX, iY, iX + iWidth, iY + iHeight), PdfAnnotation.HIGHLIGHT_OUTLINE)
                field.FieldName = "myEmptySignatureField" & iPage
                oPdfWriter.AddAnnotation(field)
            End If
        Loop
        oPdfDoc.Close()
    End Sub

@Ivan Basart, thank you very much.

Here is the complete code, for anyone who wants to save time when working from Ivan's code.

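using System.Collections.Generic;
using System;
using System.Linq; //ElementAt() and ToList()
using iTextSharp.text.pdf; //PdfReader
using iTextSharp.text.pdf.parser; //PdfTextExtractor, TextRenderInfo, Vector, LineSegment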
class Program
{
    static void Main()
    {
        //Our test file
        var testFile = System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
        var searchText = "searchWords";
        using (var reader = new PdfReader(testFile))
        {
            for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
            {
                //Use the height of the page being searched, not always page 1
                LocationTextExtractionStrategyEx strategy = new LocationTextExtractionStrategyEx(searchText, reader.GetPageSize(pageNumber).Height);
                var ex = PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
                foreach (LocationTextExtractionStrategyEx.SearchResult result in strategy.m_SearchResultsList)
                {
                    Console.WriteLine("Found at position: X = {0}, Y = {1}", result.iPosX, result.iPosY);
                }
            }
        }
    }
}

class LocationTextExtractionStrategyEx : LocationTextExtractionStrategy
{
    private List<LocationTextExtractionStrategyEx.ExtendedTextChunk> m_DocChunks = new List<ExtendedTextChunk>();
    private List<LocationTextExtractionStrategyEx.LineInfo> m_LinesTextInfo = new List<LineInfo>();
    public List<SearchResult> m_SearchResultsList = new List<SearchResult>();
    private String m_SearchText;
    public const float PDF_PX_TO_MM = 0.3528f;
    public float m_PageSizeY;

    public LocationTextExtractionStrategyEx(String sSearchText, float fPageSizeY)
        : base()
    {
        this.m_SearchText = sSearchText;
        this.m_PageSizeY = fPageSizeY;
    }
    private void searchText()
    {
        foreach (LineInfo aLineInfo in m_LinesTextInfo)
        {
            int iIndex = aLineInfo.m_Text.IndexOf(m_SearchText);
            if (iIndex != -1)
            {
                TextRenderInfo aFirstLetter = aLineInfo.m_LineCharsList.ElementAt(iIndex);
                SearchResult aSearchResult = new SearchResult(aFirstLetter, m_PageSizeY);
                this.m_SearchResultsList.Add(aSearchResult);
            }
        }
    }
    private void groupChunksbyLine()
    {
        LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk1 = null;
        LocationTextExtractionStrategyEx.LineInfo textInfo = null;
        foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk2 in this.m_DocChunks)
        {
            if (textChunk1 == null)
            {
                textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
                this.m_LinesTextInfo.Add(textInfo);
            }
            else if (textChunk2.sameLine(textChunk1))
            {
                textInfo.appendText(textChunk2);
            }
            else
            {
                textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2);
                this.m_LinesTextInfo.Add(textInfo);
            }
            textChunk1 = textChunk2;
        }
    }
    public override string GetResultantText()
    {
        groupChunksbyLine();
        searchText();
        //In this case the return value is not useful
        return "";
    }
    public override void RenderText(TextRenderInfo renderInfo)
    {
        LineSegment baseline = renderInfo.GetBaseline();
        //Create ExtendedChunk
        ExtendedTextChunk aExtendedChunk = new ExtendedTextChunk(renderInfo.GetText(), baseline.GetStartPoint(), baseline.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), renderInfo.GetCharacterRenderInfos().ToList());
        this.m_DocChunks.Add(aExtendedChunk);
    }
    public class ExtendedTextChunk
    {
        public string m_text;
        private Vector m_startLocation;
        private Vector m_endLocation;
        private Vector m_orientationVector;
        private int m_orientationMagnitude;
        private int m_distPerpendicular;
        private float m_charSpaceWidth;
        public List<TextRenderInfo> m_ChunkChars;

        public ExtendedTextChunk(string txt, Vector startLoc, Vector endLoc, float charSpaceWidth, List<TextRenderInfo> chunkChars)
        {
            this.m_text = txt;
            this.m_startLocation = startLoc;
            this.m_endLocation = endLoc;
            this.m_charSpaceWidth = charSpaceWidth;
            this.m_orientationVector = this.m_endLocation.Subtract(this.m_startLocation).Normalize();
            this.m_orientationMagnitude = (int)(Math.Atan2((double)this.m_orientationVector[1], (double)this.m_orientationVector[0]) * 1000.0);
            this.m_distPerpendicular = (int)this.m_startLocation.Subtract(new Vector(0.0f, 0.0f, 1f)).Cross(this.m_orientationVector)[2];
            this.m_ChunkChars = chunkChars;
        }

        public bool sameLine(LocationTextExtractionStrategyEx.ExtendedTextChunk textChunkToCompare)
        {
            return this.m_orientationMagnitude == textChunkToCompare.m_orientationMagnitude && this.m_distPerpendicular == textChunkToCompare.m_distPerpendicular;
        }

    }
    public class SearchResult
    {
        public float iPosX;
        public float iPosY;
        public SearchResult(TextRenderInfo aCharcter, float fPageSizeY)
        {
            //Get position of upperLeft coordinate
            Vector vTopLeft = aCharcter.GetAscentLine().GetStartPoint();
            //PosX
            float fPosX = vTopLeft[Vector.I1];
            //PosY
            float fPosY = vTopLeft[Vector.I2];
            //Transform to mm and get y from top of page
            iPosX = Convert.ToInt32(fPosX * PDF_PX_TO_MM);
            iPosY = Convert.ToInt32((fPageSizeY - fPosY) * PDF_PX_TO_MM);
        }
    }
    public class LineInfo
    {
        public string m_Text;
        public List<TextRenderInfo> m_LineCharsList;
        public LineInfo(LocationTextExtractionStrategyEx.ExtendedTextChunk initialTextChunk)
        {
            this.m_Text = initialTextChunk.m_text;
            this.m_LineCharsList = initialTextChunk.m_ChunkChars;
        }
        public void appendText(LocationTextExtractionStrategyEx.ExtendedTextChunk additionalTextChunk)
        {
            m_LineCharsList.AddRange(additionalTextChunk.m_ChunkChars);
            this.m_Text += additionalTextChunk.m_text;
        }
    }
}