Hebrew content not displayed when converting HTML to PDF with iTextSharp 5.5.8

Updated: 2023-09-27 18:27:15

I am using the following code to convert an HTML file to PDF with iTextSharp:

    Document doc = new Document(iTextSharp.text.PageSize.A4, 10, 20, 5, 35);
    var writer = PdfWriter.GetInstance(doc, new FileStream(savePath, FileMode.Create));
    var xmlWorkerFontProvider = new XMLWorkerFontProvider();
    var cssAppliers = new CssAppliersImpl(new MyFontProvider());
    CssFilesImpl cssFiles = new CssFilesImpl();
    StyleAttrCSSResolver cssResolver = new StyleAttrCSSResolver(cssFiles);
    HtmlPipelineContext htmlContext = new HtmlPipelineContext(cssAppliers);
    htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());
    htmlContext.SetImageProvider(new ITextImageHandler());
    IPipeline pipeline = new CssResolverPipeline(cssResolver, new HtmlPipeline(htmlContext, new PdfWriterPipeline(doc, writer)));
    XMLWorker worker = new XMLWorker(pipeline, true);
    XMLParser xmlParser = new XMLParser(true, worker, Encoding.Unicode);
    doc.Open();
    doc.NewPage();
    xmlParser.Parse(new StringReader(htmlString.ToString()));
    doc.Close();

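This works fine for English content. But if the content is in Hebrew, the text does not appear in the PDF.

I have already checked other answers on Stack Overflow related to this, but they seem to use HtmlParser, which is deprecated, so I would rather not use it.

Let me know if anything else is needed. Thank you for your time.

Edit: After reading the comments I also tried setting the font, but still no luck. Here is the updated code.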

    Document document = new Document();
    PdfWriter writer =
        PdfWriter.GetInstance(document, new FileStream(savePath, FileMode.Create));
    document.Open();
    var cssResolver = new StyleAttrCSSResolver();
    XMLWorkerFontProvider fontProvider =
        new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
    fontProvider.Register(@"E:\fonts\NotoSansHebrew-Regular.ttf");

    CssAppliers cssAppliers = new CssAppliersImpl(fontProvider);
    HtmlPipelineContext htmlContext = new HtmlPipelineContext(cssAppliers);
    htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());
    htmlContext.SetImageProvider(new ITextImageHandler());

    PdfWriterPipeline pdf = new PdfWriterPipeline(document, writer);
    HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
    CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);

    XMLWorker worker = new XMLWorker(css, true);
    XMLParser p = new XMLParser(worker);
    p.Parse(new StringReader(htmlString.ToString()));
    document.Close();


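Below is an adaptation of Bruno's code with some actual HTML. To run it, you only need to download the Noto Sans Hebrew font and put it on your desktop. Try running this code, which works for me, without any modifications (except possibly the file paths). (I tested it against 5.5.5, so it should work with 5.5.8 too.)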

    // Namespaces needed: System, System.IO, iTextSharp.text, iTextSharp.text.pdf,
    // iTextSharp.tool.xml, iTextSharp.tool.xml.html, iTextSharp.tool.xml.parser,
    // iTextSharp.tool.xml.pipeline.css, iTextSharp.tool.xml.pipeline.end,
    // iTextSharp.tool.xml.pipeline.html

    // Output PDF and font file both live on the desktop
    var file = System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
    var fontFile = System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "NotoSansHebrew-Regular.ttf");

    // The font-family value must match the name the font is registered under
    var htmlText = @"<div dir=""rtl"" style=""font-family: Noto Sans Hebrew;"">שלום עולם</div>";
    using (var FS = new System.IO.FileStream(file, FileMode.Create, FileAccess.Write, FileShare.None)) {
        using (var document = new Document()) {
            using (var writer = PdfWriter.GetInstance(document, FS)) {
                document.Open();
                var cssResolver = new StyleAttrCSSResolver();

                // Register only our font instead of scanning the system font folders
                var fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
                fontProvider.Register(fontFile);
                var cssAppliers = new CssAppliersImpl(fontProvider);
                var htmlContext = new HtmlPipelineContext(cssAppliers);
                htmlContext.SetTagFactory(Tags.GetHtmlTagProcessorFactory());

                // Build the pipeline back to front: CSS -> HTML -> PDF
                var pdf = new PdfWriterPipeline(document, writer);
                var html = new HtmlPipeline(htmlContext, pdf);
                var css = new CssResolverPipeline(cssResolver, html);

                var worker = new XMLWorker(css, true);
                var p = new XMLParser(worker);

                // Feed the HTML to the parser as UTF-8
                using (var ms = new System.IO.MemoryStream(System.Text.Encoding.UTF8.GetBytes(htmlText))) {
                    using (var sr = new StreamReader(ms)) {
                        p.Parse(sr);
                    }
                }
                document.Close();
            }
        }
    }

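The trick to this whole thing is getting the font's name in the HTML to match exactly what is in the font file. The confusing part is that a font can contain a whole bunch of names, and the older the font, the more likely it is to have them. If I remember correctly, iText has some heuristics for determining the font's name, but if you want to be safe you can also use an alias and call it whatever you want. For instance, you could change your HTML to:

    var htmlText = @"<div dir=""rtl"" style=""font-family: Gerp;"">שלום עולם</div>";

As long as you use the alias when registering, everything should work fine:

    fontProvider.Register(fontFile, "Gerp");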
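
If you are not sure which name iText picked up from the font file, one way to check is to list what the font provider has registered and compare that with the font-family in your HTML. This is only a sketch: the class name and the fallback path are made up for illustration, and it assumes the `RegisteredFamilies` and `RegisteredFonts` properties that `XMLWorkerFontProvider` inherits from `FontFactoryImp` in iTextSharp 5.x.

```csharp
using System;
using iTextSharp.tool.xml;

// Hypothetical helper, not part of the original answer.
class ListRegisteredFonts {
    static void Main(string[] args) {
        // Hypothetical fallback path; pass the real font file as the first argument.
        var fontFile = args.Length > 0 ? args[0] : "NotoSansHebrew-Regular.ttf";

        // Same setup as above: do not scan the system font folders.
        var fontProvider = new XMLWorkerFontProvider(XMLWorkerFontProvider.DONTLOOKFORFONTS);
        fontProvider.Register(fontFile);

        // Family names known to the provider (assumed to be inherited from FontFactoryImp).
        foreach (var family in fontProvider.RegisteredFamilies) {
            Console.WriteLine("family: " + family);
        }

        // Individual font names known to the provider.
        foreach (var font in fontProvider.RegisteredFonts) {
            Console.WriteLine("font: " + font);
        }
    }
}
```

As far as I know the provider also lists the built-in base fonts, so expect more entries than just the one you registered. I believe the names are stored in lower case. If the name you expect is missing, registering the file with an alias as shown above is the safer route.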