C# iTextSharp - Code overwrites instead of appending pages

Keywords: append, iTextSharp, code overwrite | Updated: 2023-09-27 18:37:18

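I've read a lot of posts that helped me get to where I am today; I'm new to programming. My goal is to take the files in the directory "sourceDir" and look for a regex match. When a match is found, I want to create a new file named after the match. If the code finds another file with the same match (so the file already exists), it should add a new page to that document.

The code currently runs, but instead of adding a new page it overwrites the first page of the document. Note: every document in the directory has only one page!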

string sourceDir = @"C:\Users\bob\Desktop\results\";
string destDir = @"C:\Users\bob\Desktop\results\final\";
string[] files = Directory.GetFiles(sourceDir);
foreach (string file in files)
    {
       using (var pdfReader = new PdfReader(file.ToString()))
            {
                for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                {
                    var text = new StringBuilder();
                    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                    var currentText = 
                    PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
                    currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
                    text.Append(currentText);
                    Regex reg = new Regex(@"ABCDEFG");
                    MatchCollection matches = reg.Matches(currentText);
                    foreach (Match m in matches)
                    {
                        string newFile = destDir + m.ToString() + ".pdf";
                        if (!File.Exists(newFile))
                        {
                            using (PdfReader reader = new PdfReader(File.ReadAllBytes(file)))
                            {
                                using (Document doc = new Document(reader.GetPageSizeWithRotation(page)))
                                {
                                    using (PdfCopy copy = new PdfCopy(doc, new FileStream(newFile, FileMode.Create)))
                                    {
                                        var importedPage = copy.GetImportedPage(reader, page);
                                        doc.Open();
                                        copy.AddPage(importedPage);
                                        doc.Close();
                                    }
                                }
                            }
                        }
                        else
                        {
                            using (PdfReader reader = new PdfReader(File.ReadAllBytes(newFile)))
                            {
                                using (Document doc = new Document(reader.GetPageSizeWithRotation(page)))
                                {
                                    using (PdfCopy copy = new PdfCopy(doc, new FileStream(newFile, FileMode.OpenOrCreate)))
                                    {
                                        var importedPage = copy.GetImportedPage(reader, page);
                                        doc.Open();
                                        copy.AddPage(importedPage);
                                        doc.Close();
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }

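Bruno did a great job explaining the problem and how to fix it, but since you've said you're new to programming, and you've since posted a very similar and related question, I'm going to go a little deeper to hopefully help you.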

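First, let's write down the knowns:

There is a directory full of PDFs
Each PDF has only one page

Then the goals:

Extract the text of each PDF
Compare the extracted text against a pattern
If there is a match, then using that match as a file name do one of the following:
1. If the file exists, append the source PDF to it
2. If the file does not exist, create a new file from the PDF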

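Before going further there are a few things you need to know. You tried to use FileMode.OpenOrCreate to work in an "append mode". That was a good guess, but it is incorrect. The PDF format has both a beginning and an end, so "start here" and "end here". When you attempt to append another PDF (or anything else, for that matter) to an existing file, you are just writing past the "end here" section. At best that is junk data that gets ignored, but more likely you will end up with a corrupt PDF. The same is true of almost any file format. Two concatenated XML files are invalid because an XML document can only have one root element.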
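As a side note, FileMode.OpenOrCreate does not even position the stream at the end of an existing file. It opens the file at offset 0, so whatever is written replaces the leading bytes. The sketch below shows this with plain text instead of a PDF; the class name and the scratch file name are made up for the demonstration and only the standard library is used.

```csharp
using System;
using System.IO;
using System.Text;

public static class OpenOrCreateDemo
{
    // Writes a 10-byte file, reopens it with OpenOrCreate, writes 3 bytes,
    // and returns the resulting file content.
    public static string Run()
    {
        // Hypothetical scratch file, used only for this demonstration
        var path = Path.Combine(Path.GetTempPath(), "openorcreate-demo.txt");

        // Pretend this is the existing output file
        File.WriteAllText(path, "AAAAAAAAAA", Encoding.ASCII);

        // Open it the way the question does and write something shorter
        using (var fs = new FileStream(path, FileMode.OpenOrCreate))
        {
            var bytes = Encoding.ASCII.GetBytes("BBB");
            fs.Write(bytes, 0, bytes.Length);
        }

        var result = File.ReadAllText(path, Encoding.ASCII);
        File.Delete(path);
        return result;
    }

    public static void Main()
    {
        // The new bytes replaced the start of the file; nothing was appended
        Console.WriteLine(Run()); // prints "BBBAAAAAAA"
    }
}
```

Even FileMode.Append would not help with a PDF, for the reason given above: bytes after the "end here" marker are not a second page.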

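Second, iText/iTextSharp cannot edit existing files. This is very important. It can, however, create brand new files that happen to contain exact or possibly modified versions of other files. I don't know if I can stress how important this is.

Third, you are using a line that gets copied around over and over but is very wrong and can actually corrupt your data. For why it is bad, see https://stackoverflow.com/a/10191879/231316.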

currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));

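Fourth, you are using a regular expression, which is an overly complicated way to do a search. Maybe the code you posted was just a sample, but if it wasn't I would recommend just using currentText.Contains("") or, if you need to ignore case, currentText.IndexOf( "", StringComparison.InvariantCultureIgnoreCase ). To give you the benefit of the doubt, the code below assumes that you have a more complex regular expression.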
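For reference, here is a minimal sketch of those two plain-string checks. The sample text and the helper name ContainsIgnoreCase are invented for illustration; "ABCDEFG" is the pattern from the question.

```csharp
using System;

public static class SearchDemo
{
    // True if 'text' contains 'term', ignoring case.
    // IndexOf returns -1 when the term is not found.
    public static bool ContainsIgnoreCase(string text, string term)
    {
        return text.IndexOf(term, StringComparison.InvariantCultureIgnoreCase) >= 0;
    }

    public static void Main()
    {
        // Stand-in for the text extracted from a page
        var currentText = "Invoice abcdefg, page 1";

        // Contains is case-sensitive, so the upper-case term is not found
        Console.WriteLine(currentText.Contains("ABCDEFG"));            // prints "False"

        // The IndexOf overload with a StringComparison ignores case
        Console.WriteLine(ContainsIgnoreCase(currentText, "ABCDEFG")); // prints "True"
    }
}
```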

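With all of that said, below is a full working example that should walk you through everything. Since we don't have access to your PDFs, the second section actually creates 100 sample PDFs with our search terms occasionally added to them. Your real code obviously wouldn't do this, but we need some common ground to work with. The third section is the search and merge feature that you are trying to build. Hopefully the comments in the code explain everything.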

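//Namespaces this snippet relies on (put the using directives at the top of your file):
//System, System.IO, System.Text.RegularExpressions,
//iTextSharp.text, iTextSharp.text.pdf, iTextSharp.text.pdf.parser
/**
 * Step 1 - Variable Setup
 */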
//This is the folder that we'll be basing all other directory paths on
var workingFolder = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
//This folder will hold our PDFs with text that we're searching for
var folderPathContainingPdfsToSearch = Path.Combine(workingFolder, "Pdfs");
var folderPathContainingPdfsCombined = Path.Combine(workingFolder, "Pdfs Combined");
//Create our directories if they don't already exist
System.IO.Directory.CreateDirectory(folderPathContainingPdfsToSearch);
System.IO.Directory.CreateDirectory(folderPathContainingPdfsCombined);
var searchText1 = "ABC";
var searchText2 = "DEF";
/**
 * Step 2 - Create sample PDFs
 */
//Create 100 sample PDFs
for (var i = 0; i < 100; i++) {
    using (var fs = new FileStream(Path.Combine(folderPathContainingPdfsToSearch, i.ToString() + ".pdf"), FileMode.Create, FileAccess.Write, FileShare.None)) {
        using (var doc = new Document()) {
            using (var writer = PdfWriter.GetInstance(doc, fs)) {
                doc.Open();
                //Add a title so we know what page we're on when we combine
                doc.Add(new Paragraph(String.Format("This is page {0}", i)));
                //Add various strings every once in a while.
                //(Yes, I know this isn't evenly distributed but I haven't
                // had enough coffee yet.)
                if (i % 10 == 3) {
                    doc.Add(new Paragraph(searchText1));
                } else if (i % 10 == 6) {
                    doc.Add(new Paragraph(searchText2));
                } else if (i % 10 == 9) {
                    doc.Add(new Paragraph(searchText1 + searchText2));
                } else {
                    doc.Add(new Paragraph("Blah blah blah"));
                }
                doc.Close();
            }
        }
    }
}
/**
 * Step 3 - Search and merge
 */

//We'll search for two different strings just to add some spice
var reg = new Regex("(" + searchText1 + "|" + searchText2 + ")");
//Loop through each file in the directory
foreach (var filePath in Directory.EnumerateFiles(folderPathContainingPdfsToSearch, "*.pdf")) {
    using (var pdfReader = new PdfReader(filePath)) {
        for (var page = 1; page <= pdfReader.NumberOfPages; page++) {
            //Get the text from the page
            var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, new SimpleTextExtractionStrategy());
            //DO NOT DO THIS EVER!! See this for why https://stackoverflow.com/a/10191879/231316
            //currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            //Match our pattern against the extracted text
            var matches = reg.Matches(currentText);
            //Bail early if we can
            if (matches.Count == 0) {
                continue;
            }
            //Loop through each match
            foreach (Match m in matches) {
                //This is the file path that we want to target
                var destFile = Path.Combine(folderPathContainingPdfsCombined, m.ToString() + ".pdf");
                //If the file doesn't already exist then just copy the file and move on
                if (!File.Exists(destFile)) {
                    System.IO.File.Copy(filePath, destFile);
                    continue;
                }
                //The file exists so we're going to "append" the page
                //However, writing to the end of file in Append mode doesn't work,
                //that would be like "add a file to a zip" by concatenating
                //two files. In this case, we're actually creating a brand new file
                //that "happens" to contain the original file and the matched file.
                //Instead of writing to disk for this new file we're going to keep it
                //in memory, delete the original file and write our new file
                //back onto the old file
                using (var ms = new MemoryStream()) {
                    //Use a wrapper helper provided by iText
                    var cc = new PdfConcatenate(ms);
                    //Open for writing
                    cc.Open();
                    //Import the existing file
                    using (var subReader = new PdfReader(destFile)) {
                        cc.AddPages(subReader);
                    }
                    //Import the matched file
                    //The OP stated a guarantee of only 1 page so we don't
                    //have to mess around with specifying which page to import.
                    //Also, PdfConcatenate closes the supplied PdfReader, so
                    //open a fresh reader here instead of reusing the outer
                    //pdfReader variable, which the page loop still needs.
                    using (var subReader = new PdfReader(filePath)) {
                        cc.AddPages(subReader);
                    }
                    //Close for writing
                    cc.Close();
                    //Erase our existing file
                    File.Delete(destFile);
                    //Write our new file
                    File.WriteAllBytes(destFile, ms.ToArray());
                }
            }
        }
    }
}

I'll write this in pseudo code.

You are doing something like this:

// loop over different single-page documents
for () {
    // introduce a condition
    if (condition == met) {
        // create single-page PDF
        new Document();
        new PdfCopy();
        document.Open();
        copy.addPage(singlePage);
        document.Close();
    }
}

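This means that you create a single-page PDF every time the condition is met. Incidentally, you are overwriting existing files many times.

What you should do is something like this: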

// Create a document with as many pages as times a condition is met
new Document();
new PdfCopy();
document.Open();
// loop over different single-page documents
for () {
    // introduce a condition
    if (condition == met) {
        copy.addPage(singlePage);
    }
}
document.Close();

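Now you are possibly adding more than one page to the new document you created with PdfCopy. Beware: an exception may be thrown if the condition is never met, because the document would then be closed without any pages.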
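One way the pseudo code above might translate to iTextSharp 5 is sketched here, under a few assumptions: every source PDF has exactly one page (as stated in the question), the class name GroupAndMerge and the parameters sourceDir, destDir and pattern are placeholders, and each match value is safe to use as a file name. It makes two passes: the first only collects which files belong to which match, the second creates one Document/PdfCopy per match and adds all of its pages before closing. Because a group only exists once at least one file matched, no document is ever closed without pages.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class GroupAndMerge
{
    // sourceDir, destDir and pattern are placeholders supplied by the caller
    public static void Run(string sourceDir, string destDir, Regex pattern)
    {
        // Pass 1: map each distinct match to the source files that contain it
        var groups = new Dictionary<string, List<string>>();
        foreach (var file in Directory.EnumerateFiles(sourceDir, "*.pdf"))
        {
            using (var reader = new PdfReader(file))
            {
                // Assumption from the question: each source PDF has one page
                var text = PdfTextExtractor.GetTextFromPage(reader, 1, new SimpleTextExtractionStrategy());
                foreach (Match m in pattern.Matches(text))
                {
                    List<string> list;
                    if (!groups.TryGetValue(m.Value, out list))
                    {
                        list = new List<string>();
                        groups[m.Value] = list;
                    }
                    // Add a file only once even if the match repeats on the page
                    if (!list.Contains(file)) list.Add(file);
                }
            }
        }

        // Pass 2: one new PDF per match, with one page per matching source file
        foreach (var group in groups)
        {
            var destFile = Path.Combine(destDir, group.Key + ".pdf");
            using (var fs = new FileStream(destFile, FileMode.Create, FileAccess.Write, FileShare.None))
            using (var doc = new Document())
            using (var copy = new PdfCopy(doc, fs))
            {
                doc.Open();
                foreach (var file in group.Value)
                {
                    using (var reader = new PdfReader(file))
                    {
                        copy.AddPage(copy.GetImportedPage(reader, 1));
                        // Flush what was imported from this reader before it is closed
                        copy.FreeReader(reader);
                    }
                }
                doc.Close();
            }
        }
    }
}
```

Compared with rebuilding the output file each time a new match turns up, this writes every output file exactly once, at the cost of opening each matching source PDF twice.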