Getting text paragraphs from a PDF using iTextSharp
Updated: 2023-09-27 18:18:14
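Is there any logic for extracting paragraph text from a PDF file using iTextSharp? I know that a PDF only contains runs of text, so it is hard to tell which runs belong to which paragraph, and I also know that there is no <p> tag or any other tag that marks paragraphs in a PDF. I tried to get the coordinates of the text runs and build paragraphs from those coordinates, but had no luck :(. Here is my code snippet: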
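using System.Collections.Generic;
using System.Text;
using iTextSharp.text.pdf.parser;

// Text collected for the line currently being read
private StringBuilder result = new StringBuilder();
private Vector lastBaseLine;
// Text of each completed line (text runs that share one baseline)
public List<string> strings = new List<string>();
// Y coordinate of the baseline of each stored line
public List<float> baselines = new List<float>();

public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
{
    Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
    // A change in the baseline's Y coordinate means a new line has started
    if ((this.lastBaseLine != null) && (curBaseline[Vector.I2] != lastBaseLine[Vector.I2]))
    {
        // Store the finished line together with its Y coordinate
        if (!string.IsNullOrEmpty(this.result.ToString()))
        {
            this.baselines.Add(this.lastBaseLine[Vector.I2]);
            this.strings.Add(this.result.ToString());
        }
        result = new StringBuilder();
    }
    this.result.Append(renderInfo.GetText());
    this.lastBaseLine = curBaseline;
    // Note: the final line is still in result after the last call;
    // this snippet never adds it to strings
}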
Does anyone have any logic for this problem?
using (MemoryStream ms = new MemoryStream())
{
    Document document = new Document(PageSize.A4, 25, 25, 30, 30);
    PdfWriter writer = PdfWriter.GetInstance(document, ms);
    document.Open();
    document.Add(new Paragraph("Hello World"));
    document.Close();
    writer.Close();
    // The MIME type for PDF is "application/pdf"
    Response.ContentType = "application/pdf";
    // Quote the file name because it contains spaces
    Response.AddHeader("content-disposition",
        "attachment;filename=\"First PDF document.pdf\"");
    // ToArray() returns only the bytes written; GetBuffer() may include unused trailing bytes
    byte[] bytes = ms.ToArray();
    Response.OutputStream.Write(bytes, 0, bytes.Length);
}
Here are some examples that may help you...
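For the original question, one possible way to turn per-line baselines into paragraphs is to compare the vertical gap between consecutive lines with the typical line spacing. The following is only a minimal sketch, not something iTextSharp provides: the class name ParagraphGrouper, the method Group, and the gapFactor threshold of 1.5 are hypothetical, and it assumes a single-column page whose lines arrive in reading order, as in the strings and baselines lists from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of iTextSharp.
public static class ParagraphGrouper
{
    // Groups lines into paragraphs by the vertical gap between consecutive baselines.
    // Assumptions: lines are in reading order on a single-column page, and a gap
    // larger than gapFactor times the median gap marks a paragraph break.
    public static List<string> Group(IList<string> lines, IList<float> baselines,
                                     float gapFactor = 1.5f)
    {
        if (lines.Count != baselines.Count)
            throw new ArgumentException("lines and baselines must have the same length");

        var paragraphs = new List<string>();
        if (lines.Count == 0) return paragraphs;

        // Vertical distance between each line and the one before it.
        var gaps = new List<float>();
        for (int i = 1; i < baselines.Count; i++)
            gaps.Add(Math.Abs(baselines[i - 1] - baselines[i]));

        // The median gap approximates the normal line spacing.
        float median = 0f;
        if (gaps.Count > 0)
        {
            var sorted = gaps.OrderBy(g => g).ToList();
            median = sorted[sorted.Count / 2];
        }

        var current = new List<string> { lines[0] };
        for (int i = 1; i < lines.Count; i++)
        {
            // An unusually large gap closes the current paragraph.
            if (gaps[i - 1] > median * gapFactor)
            {
                paragraphs.Add(string.Join(" ", current));
                current.Clear();
            }
            current.Add(lines[i]);
        }
        paragraphs.Add(string.Join(" ", current));
        return paragraphs;
    }
}
```

With the listener from the question, the call would look like ParagraphGrouper.Group(listener.strings, listener.baselines), where listener is the object whose RenderText collected the lines. The median is used instead of the mean so that a few large paragraph gaps do not distort the estimate of normal line spacing. Documents that mark paragraphs by first-line indentation rather than extra spacing would need the X coordinate as well, which this sketch ignores.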