How to convert HTML to a Word document without losing paragraphs

Keywords: paragraph, document, html, conversion | Updated: 2023-09-27 18:07:38

I'm trying to get text from a website, for example http://ahmetturanalkan.net/yazi/laik-cemaatin-kokleri-kadikoyde-mi/

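It gets exported to Word as plain text. I mean the </p> <p> tags are gone, so the text has no paragraphs. How can I convert it to text with proper paragraphs, like on the original website?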

This is my method for getting the text:

    private string yazial(string s)
    {
        // getsource is not shown here; presumably it downloads the HTML for the URL in s.
        string htmlContent = getsource(s);
        HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
        document.LoadHtml(htmlContent);
        // Article body as HTML, tags included.
        var nodlar = document.DocumentNode.SelectSingleNode("*//article").InnerHtml;
        // Page header (the title) as plain text.
        var nodlar1 = document.DocumentNode.SelectSingleNode("*//article/div[@class='page-header']").InnerText;
        docyap(nodlar, nodlar1);
        return nodlar;
    }

This is my method for exporting the Word document:

    private void docyap(string s, string g)
    {
        Microsoft.Office.Interop.Word.Application oWord = new Microsoft.Office.Interop.Word.Application();
        oWord.Visible = true;
        object oMissing = System.Reflection.Missing.Value;
        Microsoft.Office.Interop.Word.Document wBelge = oWord.Documents.Add(ref oMissing, ref oMissing,
            ref oMissing, ref oMissing);
        // Title paragraph, using the localized (Turkish) name of the "Heading 1" style.
        Microsoft.Office.Interop.Word.Paragraph baslik = wBelge.Paragraphs.Add(ref oMissing);
        object styleHeading = "Başlık 1";
        baslik.Range.set_Style(styleHeading);
        baslik.Range.Text = g;
        baslik.Range.InsertParagraphAfter();
        // Body: the whole article is written as one block of text.
        Microsoft.Office.Interop.Word.Paragraph paragraf2;
        paragraf2 = wBelge.Paragraphs.Add(ref oMissing);
        paragraf2.Range.Text = s;
        paragraf2.Range.InsertParagraphAfter();
        wBelge.SaveAs(ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing);
    }


I ended up using something like this:

    // Indent each paragraph and end it with a newline.
    string yeni = nodlar.Replace("<p>", "    ").Replace("</p>", "\n");
    // Drop <div> blocks, then strip whatever tags are left.
    yeni = System.Text.RegularExpressions.Regex.Replace(yeni, "<div(.*)</div>", string.Empty);
    yeni = System.Text.RegularExpressions.Regex.Replace(yeni, "<div(.*)</a>", string.Empty);
    yeni = System.Text.RegularExpressions.Regex.Replace(yeni, @"<[^>]*>", string.Empty);

But it would be good to know a cleaner solution, if one exists.
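
One possibly cleaner direction, offered only as a sketch: since HtmlAgilityPack is already in use, the <p> nodes can be read one by one instead of stripping tags with regular expressions. The class and method names below (ParagraphExtractor, ExtractParagraphs, ToWordText) are made up for illustration, and the XPath "//article//p" is an assumption about how the page is laid out. Paragraphs are joined with "\r", the character Word uses as its paragraph mark.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ParagraphExtractor
{
    // Returns the text of every non-empty <p> inside <article>, one entry per paragraph.
    // The XPath is an assumption about the page structure; adjust it for other sites.
    public static List<string> ExtractParagraphs(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var paragraphs = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//article//p");
        if (nodes == null)
        {
            return paragraphs;
        }

        foreach (HtmlNode node in nodes)
        {
            // InnerText drops nested tags; DeEntitize turns entities such as &amp; back into characters.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0)
            {
                paragraphs.Add(text);
            }
        }

        return paragraphs;
    }

    // Joins the paragraphs with "\r" so that Word starts a new paragraph at each one.
    public static string ToWordText(string html)
    {
        return string.Join("\r", ExtractParagraphs(html));
    }
}
```

With this in place, docyap could be given ParagraphExtractor.ToWordText(htmlContent) instead of the raw InnerHtml, so that paragraf2.Range.Text receives text that already contains paragraph marks.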