如何使用PDFSharp从PDF中提取FlateEcoded图像

本文关键字:提取 FlateEcoded 图像 PDF 何使用 PDFSharp | 更新日期: 2023-09-27 18:30:09

如何使用PDFSharp从PDF文档中提取经过FlateDecoded(如PNG)的图像?

我在PDFSharp:的样本中发现了这一评论

// TODO: You can put the code here that converts vom PDF internal image format to a
// Windows bitmap
// and use GDI+ to save it in PNG format.
// [...]
// Take a look at the file
// PdfSharp.Pdf.Advanced/PdfImage.cs to see how we create the PDF image formats.

有人能解决这个问题吗?

谢谢你的回复。

编辑:因为我无法在8小时内回答自己的问题,所以我用这种方式回答:

谢谢你很快的回复。

我在方法"ExportAsPngImage"中添加了一些代码,但没有得到想要的结果。它只是提取了更多的图像(png),它们没有正确的颜色,并且失真了。

这是我的实际代码:

PdfSharp.Pdf.Filters.FlateDecode flate = new PdfSharp.Pdf.Filters.FlateDecode();
        byte[] decodedBytes = flate.Decode(bytes);
        System.Drawing.Imaging.PixelFormat pixelFormat;
        switch (bitsPerComponent)
        {
            case 1:
                pixelFormat = PixelFormat.Format1bppIndexed;
                break;
            case 8:
                pixelFormat = PixelFormat.Format8bppIndexed;
                break;
            case 24:
                pixelFormat = PixelFormat.Format24bppRgb;
                break;
            default:
                throw new Exception("Unknown pixel format " + bitsPerComponent);
        }
        Bitmap bmp = new Bitmap(width, height, pixelFormat);
        var bmpData = bmp.LockBits(new Rectangle(0, 0, width, height), ImageLockMode.WriteOnly, pixelFormat);
        int length = (int)Math.Ceiling(width * bitsPerComponent / 8.0);
        for (int i = 0; i < height; i++)
        {
            int offset = i * length;
            int scanOffset = i * bmpData.Stride;
            Marshal.Copy(decodedBytes, offset, new IntPtr(bmpData.Scan0.ToInt32() + scanOffset), length);
        }
        bmp.UnlockBits(bmpData);
        using (FileStream fs = new FileStream(@"C:'Export'PdfSharp'" + String.Format("Image{0}.png", count), FileMode.Create, FileAccess.Write))
        {
            bmp.Save(fs, System.Drawing.Imaging.ImageFormat.Png);
        }

这是正确的方式吗?还是我应该选择另一种方式?非常感谢!

如何使用PDFSharp从PDF中提取FlateEcoded图像

我知道这个答案可能会晚几年,但也许它会帮助其他人。

在我的案例中,由于image.Elements.GetInteger(PdfImage.Keys.BitsPerComponent)似乎没有返回正确的值,所以会出现比例失调。正如Vive la déraison在您的问题中指出的那样,您可以获得使用Marshal.Copy的BGR格式。因此,在执行Marshal.Copy之后反转Bytes并旋转Bitmap将完成此任务。

生成的代码如下所示:

private static void ExportAsPngImage(PdfDictionary image, ref int count)
    {
        int width = image.Elements.GetInteger(PdfImage.Keys.Width);
        int height = image.Elements.GetInteger(PdfImage.Keys.Height);
        var canUnfilter = image.Stream.TryUnfilter();
        byte[] decodedBytes;
        if (canUnfilter)
        {
            decodedBytes = image.Stream.Value;
        }
        else
        {
            PdfSharp.Pdf.Filters.FlateDecode flate = new PdfSharp.Pdf.Filters.FlateDecode();
            decodedBytes = flate.Decode(image.Stream.Value);
        }
        int bitsPerComponent = 0;
        while (decodedBytes.Length - ((width * height) * bitsPerComponent / 8) != 0)
        {
            bitsPerComponent++;
        }
        System.Drawing.Imaging.PixelFormat pixelFormat;
        switch (bitsPerComponent)
        {
            case 1:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format1bppIndexed;
                break;
            case 8:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format8bppIndexed;
                break;
            case 16:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format16bppArgb1555;
                break;
            case 24:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format24bppRgb;
                break;
            case 32:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format32bppArgb;
                break;
            case 64:
                pixelFormat = System.Drawing.Imaging.PixelFormat.Format64bppArgb;
                break;
            default:
                throw new Exception("Unknown pixel format " + bitsPerComponent);
        }
        decodedBytes = decodedBytes.Reverse().ToArray();
        Bitmap bmp = new Bitmap(width, height, pixelFormat);
        BitmapData bmpData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.WriteOnly, bmp.PixelFormat);
        int length = (int)Math.Ceiling(width * (bitsPerComponent / 8.0));
        for (int i = 0; i < height; i++)
        {
            int offset = i * length;
            int scanOffset = i * bmpData.Stride;
            Marshal.Copy(decodedBytes, offset, new IntPtr(bmpData.Scan0.ToInt32() + scanOffset), length);
        }
        bmp.UnlockBits(bmpData);
        bmp.RotateFlip(RotateFlipType.Rotate180FlipNone);
        bmp.Save(String.Format("exported_Images''Image{0}.png", count++), System.Drawing.Imaging.ImageFormat.Png);
    }

代码可能需要一些优化,但在我的情况下,它确实正确地导出了FlateEcoded Images。

要获得Windows BMP,只需创建一个位图标头,然后将图像数据复制到位图中。PDF图像是字节对齐的(每条新行从字节边界开始),而Windows BMP是DWORD对齐的(每一条新行从DWORD边界开始(由于历史原因,DWORD为4字节))。位图标头所需的所有信息都可以在过滤器参数中找到,也可以进行计算。

调色板是PDF中的另一个FlateEncoded对象。也可以将其复制到BMP中。

这必须针对几种格式(每像素1位、8bpp、24bpp、32bpp)执行。

以下是我的完整代码。

我正在从PDF中提取UPS运输标签,以便提前了解格式。如果提取的图像类型未知,则需要检查bitsPerComponent并进行相应处理。我也只处理第一页上的第一个图像。

注意:我使用TryUnfilter来"deflate",它使用任何应用的过滤器,并为我解码数据。无需显式调用"deflate"。

    var file = @"c:'temp'PackageLabels.pdf";
    var doc = PdfReader.Open(file);
    var page = doc.Pages[0];
    {
        // Get resources dictionary
        PdfDictionary resources = page.Elements.GetDictionary("/Resources");
        if (resources != null)
        {
            // Get external objects dictionary
            PdfDictionary xObjects = resources.Elements.GetDictionary("/XObject");
            if (xObjects != null)
            {
                ICollection<PdfItem> items = xObjects.Elements.Values;
                // Iterate references to external objects
                foreach (PdfItem item in items)
                {
                    PdfReference reference = item as PdfReference;
                    if (reference != null)
                    {
                        PdfDictionary xObject = reference.Value as PdfDictionary;
                        // Is external object an image?
                        if (xObject != null && xObject.Elements.GetString("/Subtype") == "/Image")
                        {
                            // do something with your image here 
                            // only the first image is handled here
                            var bitmap = ExportImage(xObject);
                            bmp.Save(@"c:'temp'exported.png", System.Drawing.Imaging.ImageFormat.Bmp);
                        }
                    }
                }
            }
        }
    }

使用这些辅助功能

    private static Bitmap ExportImage(PdfDictionary image)
    {
        string filter = image.Elements.GetName("/Filter");
        switch (filter)
        {
            case "/FlateDecode":
                return ExportAsPngImage(image);
            default:
                throw new ApplicationException(filter + " filter not implemented");
        }
    }
    private static Bitmap ExportAsPngImage(PdfDictionary image)
    {
        int width = image.Elements.GetInteger(PdfImage.Keys.Width);
        int height = image.Elements.GetInteger(PdfImage.Keys.Height);
        int bitsPerComponent = image.Elements.GetInteger(PdfImage.Keys.BitsPerComponent);   
        var canUnfilter = image.Stream.TryUnfilter();
        var decoded = image.Stream.Value;
        Bitmap bmp = new Bitmap(width, height, System.Drawing.Imaging.PixelFormat.Format8bppIndexed);
        BitmapData bmpData = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.WriteOnly, bmp.PixelFormat);
        Marshal.Copy(decoded, 0, bmpData.Scan0, decoded.Length);
        bmp.UnlockBits(bmpData);
        return bmp;
    }

到目前为止。。。我的代码。。。它可以处理许多png文件,但不能处理来自adobephotoshop的带有颜色空间索引的文件:

    private bool ExportAsPngImage(PdfDictionary image, string SaveAsName)
        {
            int width = image.Elements.GetInteger(PdfSharp.Pdf.Advanced.PdfImage.Keys.Width);
            int height = image.Elements.GetInteger(PdfSharp.Pdf.Advanced.PdfImage.Keys.Height);
            int bitsPerComponent = image.Elements.GetInteger(PdfSharp.Pdf.Advanced.PdfImage.Keys.BitsPerComponent);
            var ColorSpace = image.Elements.GetArray(PdfImage.Keys.ColorSpace);
System.Drawing.Imaging.PixelFormat pixelFormat= System.Drawing.Imaging.PixelFormat.Format24bppRgb; //24 just for initalize
            if (ColorSpace is null) //no colorspace.. bufferedimage?? is in BGR order instead of RGB so change the byte order. Right now it works
            {
                byte[] origineel_byte_boundary = image.Stream.UnfilteredValue;
                bitsPerComponent = (origineel_byte_boundary.Length) / (width * height);
                switch (bitsPerComponent)
                {
                    case 4:
                        pixelFormat = System.Drawing.Imaging.PixelFormat.Format32bppPArgb;
                        break;
                    case 3:
                        pixelFormat = System.Drawing.Imaging.PixelFormat.Format24bppRgb;
                        break;
                    default:
                        {
                            MessageBox.Show("Unknown pixel format " + bitsPerComponent, "Error", MessageBoxButtons.OK, MessageBoxIcon.Warning);
                            return false;
                        }
                        break;
                }
                Bitmap bmp = new Bitmap(width, height, pixelFormat); //copy raw bytes to "master" bitmap so we are out of pdf format to work with 
                System.Drawing.Imaging.BitmapData bmd = bmp.LockBits(new Rectangle(0, 0, width, height), System.Drawing.Imaging.ImageLockMode.WriteOnly, pixelFormat);
                System.Runtime.InteropServices.Marshal.Copy(origineel_byte_boundary, 0, bmd.Scan0, origineel_byte_boundary.Length);
                bmp.UnlockBits(bmd);
                Bitmap bmp2 = new Bitmap(width, height, pixelFormat);
                for (int indicex = 0; indicex < bmp.Width; indicex++)
                {
                    for (int indicey = 0; indicey < bmp.Height; indicey++)
                    {
                        Color nuevocolor = bmp.GetPixel(indicex, indicey);
                        Color colorintercambiado = Color.FromArgb(nuevocolor.A, nuevocolor.B, nuevocolor.G, nuevocolor.R);
                        bmp2.SetPixel(indicex, indicey, colorintercambiado);
                    }
                }
                using (FileStream fs = new FileStream(SaveAsName, FileMode.Create, FileAccess.Write))
                {
                    bmp2.Save(fs, System.Drawing.Imaging.ImageFormat.Png);
                }
                bmp2.Dispose();
                bmp.Dispose();
            }
            else
            {
// this is the case of photoshop... work needs to be done here. I ´m able to get the color palette but no idea how to put it back or create the png file... 
                switch (bitsPerComponent)
                {
                    case 4:
                        pixelFormat = System.Drawing.Imaging.PixelFormat.Format32bppArgb;
                        break;
                    default:
                        {
                            MessageBox.Show("Unknown pixel format " + bitsPerComponent, "Error", MessageBoxButtons.OK, MessageBoxIcon.Warning);
                            return false;
                        }
                        break;
                }
                if ((ColorSpace.Elements.GetName(0) == "/Indexed") && (ColorSpace.Elements.GetName(1) == "/DeviceRGB"))
                {
                    //we need to create the palette
                    int paletteColorCount = ColorSpace.Elements.GetInteger(2);
                    List<System.Drawing.Color> paletteList = new List<Color>();
                    //Color[] palette = new Color[paletteColorCount+1]; // no idea why but it seams that there´s always 1 color more. ¿transparency?
                    PdfObject paletteObj = ColorSpace.Elements.GetObject(3);
                    PdfDictionary paletteReference = (PdfDictionary)paletteObj;
                    byte[] palettevalues = paletteReference.Stream.Value;
                    for (int index = 0; index < (paletteColorCount + 1); index++)
                    {
                        //palette[index] = Color.FromArgb(1, palettevalues[(index*3)], palettevalues[(index*3)+1], palettevalues[(index*3)+2]); // RGB
                        paletteList.Add(Color.FromArgb(1, palettevalues[(index * 3)], palettevalues[(index * 3) + 1], palettevalues[(index * 3) + 2])); // RGB
                    }                  
                }
            }
            return true;
        }

PDF可能包含带有掩码和不同颜色空间选项的图像,这就是为什么在某些情况下简单解码图像对象可能无法正常工作的原因。

因此,代码还需要检查PDF中的图像掩码(/ImageMask)和图像对象的其他属性(看看图像是否也应该使用反转颜色或使用索引颜色),以重新创建类似于PDF中显示的图像。请参阅官方PDF参考中的Image object、/ImageMask和/Decode字典。

不确定PDFSharp是否能够在PDF中找到图像掩码对象,但iTextSharp能够访问图像掩码对象(请参阅PdfName.Mask对象类型)。

像PDF Extractor SDK这样的商业工具能够提取原始形式和"渲染"形式的图像。

我为ByteScout工作,PDF Extractor SDK的制造商

也许不能直接回答这个问题,但从PDF中提取图像的另一个选项是使用FreeSpire.PDF,它可以轻松地从PDF中抽取图像。它以Nuget包的形式提供https://www.nuget.org/packages/FreeSpire.PDF/.它们处理所有的图像格式,并可以导出为PNG。他们的样本代码是

using System;
using System.Collections.Generic;
using System.Text;
using System.Drawing;
using Spire.Pdf;
namespace ExtractImagesFromPDF
{
    class Program
    {
        static void Main(string[] args)
        {
            //Instantiate an object of Spire.Pdf.PdfDocument
            PdfDocument doc = new PdfDocument();
            //Load a PDF file 
            doc.LoadFromFile("sample.pdf");
            List<Image> ListImage = new List<Image>();
            for (int i = 0; i < doc.Pages.Count; i++)
            {
                // Get an object of Spire.Pdf.PdfPageBase
                PdfPageBase page = doc.Pages[i];
                // Extract images from Spire.Pdf.PdfPageBase
                Image[] images = page.ExtractImages();
                if (images != null && images.Length > 0)
                {
                    ListImage.AddRange(images);
                }
            }
            if (ListImage.Count > 0)
            {
                for (int i = 0; i < ListImage.Count; i++)
                {
                    Image image = ListImage[i];
                    image.Save("image" + (i + 1).ToString() + ".png", System.Drawing.Imaging.ImageFormat.Png);
                }
                System.Diagnostics.Process.Start("image1.png");
            }  
        }
    }
}

(代码取自https://www.e-iceblue.com/Tutorials/Spire.PDF/Spire.PDF-Program-Guide/How-to-Extract-Image-From-PDF-in-C.html)