How to get only the headings and their subheadings from a Word file

Keywords: Word file, heading, subheading | Updated: 2023-09-27 18:37:09

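I want to programmatically get all the headings, together with their subheadings, from a Word file in C#. For example, I have the following: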

Heading 1 XYZ
Heading 2
Heading 3
Heading 1 ABC
Heading 1 DEF
Heading 2 Lorem Ipsum

So my code should return:

Heading 1 XYZ
Heading 2
Heading 3

as a separate group, and similarly for the remaining headings and subheadings.

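I have tried this, but my code does not return each heading with its subheadings separately; it returns them all together. Here is my code for getting the headings: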

foreach (Microsoft.Office.Interop.Word.Paragraph paragraph in oMyDoc.Paragraphs )
{
    Microsoft.Office.Interop.Word.Style style = 
        paragraph.get_Style() as Microsoft.Office.Interop.Word.Style;
    string styleName = style.NameLocal;
    string text = paragraph.Range.Text;
    if (styleName == "Title")
    {
        title = text.ToString();
    }
    else if (styleName == "Subtitle")
    {
        st = text.ToString() + "\n";
    }
    else if (styleName=="Heading 1")
    {
        heading1[h1c] = text.ToString() + "\n";
    }
}


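I assume you declared title and st as strings, so on every iteration of the loop the old value is replaced by the current one. If you use lists instead, you can add each title and subtitle to them, and then easily do whatever you want with the results.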

// Requires: using System.Collections.Generic;
// oMyDoc is an already opened Microsoft.Office.Interop.Word.Document.
List<string> title = new List<string>();
List<string> st = new List<string>();
List<string> heading1 = new List<string>();
foreach (Microsoft.Office.Interop.Word.Paragraph paragraph in oMyDoc.Paragraphs)
{
    Microsoft.Office.Interop.Word.Style style =
        paragraph.get_Style() as Microsoft.Office.Interop.Word.Style;
    string styleName = style.NameLocal;
    string text = paragraph.Range.Text;

    if (styleName == "Title")
    {
        title.Add(text);
    }
    else if (styleName == "Subtitle")
    {
        st.Add(text);
    }
    else if (styleName == "Heading 1")
    {
        // A list avoids overwriting the same array slot on every match.
        heading1.Add(text);
    }
}
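Separate lists still do not tell you which subheading belongs to which heading. One way to keep them together is to start a new group at every top-level heading, sketched below on plain (style name, text) pairs so that it runs without Word. The names HeadingGrouper and GroupByTopHeading are hypothetical, and the sketch assumes the input contains only heading paragraphs in document order (for example, pairs you add inside the Interop loop above).

```csharp
using System;
using System.Collections.Generic;

public static class HeadingGrouper
{
    // Splits a flat sequence of (styleName, text) pairs into groups.
    // Each paragraph whose style equals topStyle starts a new group;
    // every following paragraph joins that group until the next topStyle.
    public static List<List<string>> GroupByTopHeading(
        IEnumerable<KeyValuePair<string, string>> paragraphs,
        string topStyle = "Heading 1")
    {
        var groups = new List<List<string>>();
        List<string> current = null;
        foreach (var p in paragraphs)
        {
            if (p.Key == topStyle)
            {
                current = new List<string>();
                groups.Add(current);
            }
            // Paragraphs before the first top-level heading are skipped.
            if (current != null)
            {
                // Trim() drops the trailing paragraph mark that Range.Text includes.
                current.Add(p.Value.Trim());
            }
        }
        return groups;
    }
}
```

For the example in the question, the first group would hold "XYZ" followed by its two subheadings, and "ABC" would end up in a group of its own.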

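If you want the whole outline (as it appears in a table of contents), "Heading" style names are not very reliable, because any style can be renamed or copied. Other styles, such as "H1" or "标题1" (the Chinese name for "Heading 1"), can therefore act as headings and show up in the table of contents and the navigation pane.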

I have even seen "Normal" act as a heading in a document. That made me give up on using styles to find headings.

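Try Paragraph.OutlineLevel instead. Its value ranges from WdOutlineLevel.wdOutlineLevel1 to wdOutlineLevel9 (meaning the paragraph is some kind of heading) and ends with WdOutlineLevel.wdOutlineLevelBodyText (meaning it is just body text).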

Here is my code. It even builds a tree-like list of headings (one heading per line, indented by level).

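// Requires: using System.Collections.Generic; using System.ComponentModel;
//           using System.Text; using Microsoft.Office.Interop.Word;
public static class WordBridge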
{
    public static Dictionary<WdOutlineLevel, string> Level2Spaces = new Dictionary<WdOutlineLevel, string>()
    {
        {WdOutlineLevel.wdOutlineLevel1, ""},
        {WdOutlineLevel.wdOutlineLevel2, " "},
        {WdOutlineLevel.wdOutlineLevel3, "  "},
        {WdOutlineLevel.wdOutlineLevel4, "   "},
        {WdOutlineLevel.wdOutlineLevel5, "    "},
        {WdOutlineLevel.wdOutlineLevel6, "     "},
        {WdOutlineLevel.wdOutlineLevel7, "      "},
        {WdOutlineLevel.wdOutlineLevel8, "       "},
        {WdOutlineLevel.wdOutlineLevel9, "        "},
        {WdOutlineLevel.wdOutlineLevelBodyText, "         "},
    };
    public static string GetOutlines(object? sender, Document currentWordDoc)
    {
        var sb = new StringBuilder();
        var countFinished = 0;
        foreach (Paragraph paragraph in currentWordDoc.Paragraphs)
        {
            countFinished++;
            (sender as BackgroundWorker)?.ReportProgress(countFinished);
            if (paragraph.OutlineLevel == WdOutlineLevel.wdOutlineLevelBodyText)
                continue;
            if (Level2Spaces.ContainsKey(paragraph.OutlineLevel))
                sb.Append(Level2Spaces[paragraph.OutlineLevel] + paragraph.Range.Text);
        }
        return sb.ToString();
    }
}

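I might turn it into a real tree in the near future. By the way, if you don't have a progress bar, remove the ReportProgress call. Be aware that it is really slow, because it runs on a single thread.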
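As a rough idea of what a real tree could look like, here is a minimal sketch that builds one from (level, text) pairs with a stack. OutlineNode and OutlineTree are hypothetical names, not part of the answer's code. The sketch assumes the level is the integer value of paragraph.OutlineLevel (1 to 9 for headings) and that body-text paragraphs have already been filtered out.

```csharp
using System;
using System.Collections.Generic;

public sealed class OutlineNode
{
    public int Level;
    public string Text;
    public List<OutlineNode> Children = new List<OutlineNode>();

    public OutlineNode(int level, string text)
    {
        Level = level;
        Text = text;
    }
}

public static class OutlineTree
{
    // Builds a tree from headings given in document order.
    // Key is the outline level (assumed to be >= 1), Value is the heading text.
    public static OutlineNode Build(IEnumerable<KeyValuePair<int, string>> headings)
    {
        // Artificial root at level 0, so it is never popped for levels >= 1.
        var root = new OutlineNode(0, "(root)");
        var stack = new Stack<OutlineNode>();
        stack.Push(root);

        foreach (var h in headings)
        {
            // Pop until the top of the stack is shallower than the new heading:
            // that node is the parent.
            while (stack.Peek().Level >= h.Key)
            {
                stack.Pop();
            }
            var node = new OutlineNode(h.Key, h.Value.Trim());
            stack.Peek().Children.Add(node);
            stack.Push(node);
        }
        return root;
    }
}
```

A heading that skips a level (say, level 3 directly under level 1) simply becomes a child of the nearest shallower heading, which seems like the least surprising behavior for real documents.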

// Returns each "Heading 1" together with the text below it.
// Requires: using System; using System.Collections.Generic;
//           using System.Text.RegularExpressions; using System.Windows.Forms;
//           using Microsoft.Office.Interop.Word;
private List<Tuple<string, string>> GetPlainTextByHeaderFromWordDoc(string docname)
{
    var docPlainTextWithHeaderList = new List<Tuple<string, string>>();
    try
    {
        Document doc = ReadMsWord(docname);
        if (doc.Paragraphs.Count > 0)
        {
            foreach (Paragraph paragraph in doc.Paragraphs)
            {
                Style style = paragraph.get_Style() as Style;
                if (style == null || style.NameLocal != "Heading 1")
                    continue;

                string headerText = paragraph.Range.Text.Trim();
                string finalTextBelowHeader = string.Empty;

                // Collect the paragraphs that follow, up to the next "Heading 1".
                for (int i = 1; i < doc.Paragraphs.Count; i++)
                {
                    Paragraph next = paragraph.Next(i);
                    if (next == null)
                        break; // end of the document
                    Style nextStyle = next.get_Style() as Style;
                    if (nextStyle != null && nextStyle.NameLocal == "Heading 1")
                        break; // the next section starts here
                    finalTextBelowHeader += next.Range.Text;
                }

                // Keep only letters and whitespace in the header.
                string header = Regex.Replace(headerText, @"[^a-zA-Z\s]", string.Empty).Trim();
                // Strip all whitespace and bell characters (Word's cell markers) from the body.
                string belowText = Regex.Replace(finalTextBelowHeader, @"\s+", string.Empty);
                belowText = belowText.Trim().Replace("\a", string.Empty);
                docPlainTextWithHeaderList.Add(new Tuple<string, string>(header, belowText));
            }
        }
        else
        {
            // error msg: unable to read
        }
        doc.Close(Type.Missing, Type.Missing, Type.Missing);
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.StackTrace);
    }
    return docPlainTextWithHeaderList;
}
// Opens the Word document and returns it.
// Requires: using System; using System.IO; using Microsoft.Office.Interop.Word;
private Document ReadMsWord(string docName)
{
    Document docs = null;
    try
    {
        // Path of the file to open; assumes docName is the file name without its extension.
        string filePath = Path.Combine(@"C:\Kaustubh_Tupe\WordRepository", docName + ".docx");
        // Create the Word application.
        Microsoft.Office.Interop.Word.Application word = new Microsoft.Office.Interop.Word.Application();
        // Placeholder for the optional arguments we do not set.
        object miss = System.Reflection.Missing.Value;
        // Boxed arguments, because Documents.Open takes them by reference.
        object path = filePath;
        // false opens the document for reading and writing.
        object readOnly = false;
        // Open the document.
        docs = word.Documents.Open(ref path, ref miss, ref readOnly,
            ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss,
            ref miss, ref miss, ref miss, ref miss, ref miss);
        // Select the whole document in the active window.
        docs.ActiveWindow.Selection.WholeStory();
        // Copy the selection to the clipboard.
        docs.ActiveWindow.Selection.Copy();
    }
    catch (Exception ex)
    {
        // The exception is swallowed here, so the caller receives null on failure.
        //MessageBox.Show(ex.ToString());
    }
    return docs;
}
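The nested loop above walks forward from every heading, so the same paragraphs can be visited many times through COM. A single pass over the paragraphs should give the same header/body pairs. The sketch below shows that idea on plain (style name, text) pairs rather than on Interop objects; SectionCollector and Collect are hypothetical names, and the whitespace stripping from the code above is deliberately left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SectionCollector
{
    // Walks the paragraphs once and pairs each heading with the text
    // that follows it, up to the next paragraph with the same heading style.
    public static List<Tuple<string, string>> Collect(
        IEnumerable<KeyValuePair<string, string>> paragraphs,
        string headingStyle = "Heading 1")
    {
        var result = new List<Tuple<string, string>>();
        string header = null;
        var body = new StringBuilder();

        foreach (var p in paragraphs)
        {
            if (p.Key == headingStyle)
            {
                // Close the previous section before starting a new one.
                if (header != null)
                {
                    result.Add(Tuple.Create(header, body.ToString()));
                }
                header = p.Value.Trim();
                body.Clear();
            }
            else if (header != null)
            {
                // Text before the first heading is ignored.
                body.Append(p.Value);
            }
        }

        // Close the last section.
        if (header != null)
        {
            result.Add(Tuple.Create(header, body.ToString()));
        }
        return result;
    }
}
```

With Interop, the pairs could be filled in one foreach over doc.Paragraphs, reading each paragraph's style name and Range.Text once.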