How do I split a string of bullet points (each with a title and body text) into a multidimensional array?

Updated: 2023-09-27 18:17:33

I have extracted some text from a PDF document that contains a bulleted list like the following:

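3 BILL REFERRED TO MAIL COMMITTEE
Mr Fitzgibbon (Chief Government Whip), by leave, moved that the Tax Laws Amendment (2011 Measures No. 7) Bill 2011 be referred to the Main Committee for further consideration. Question put and passed.
4 CORPORATIONS AMENDMENT (FUTURE OF FINANCIAL ADVICE) BILL 2011
Mr Shorten (Minister for Financial Services and Superannuation), pursuant to notice, presented a Bill for an Act to amend the law in relation to financial advice, and for related purposes. Document Mr Shorten presented an explanatory memorandum to the bill. Bill read a first time. Mr Shorten moved that the bill be now read a second time. Debate adjourned (Mr Randall), and the resumption of the debate made an order of the day for the next sitting.
5 . TAX LAWS AMENDMENT (2011 MEASURES NO. 5) BILL 2011
Mr Shorten (Minister for Financial Services and Superannuation) presented a Bill for an Act to amend the law relating to taxation, and for related purposes. Document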

I need to split them up so that each bullet point ends up in an array like this:

[0,0] = Title
[0,1] = Body
[1,0] = Title
[1,1] = Body

I have amended this example to include some real-world content.

Any help would be much appreciated.
I am using C# on the .NET Framework.


You can use LINQ:

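// Split into lines, drop blank ones, then group each title line
// with the body lines that follow it.
var result = input
    .Split(new[] { "\r\n", "\n" }, StringSplitOptions.None)
    .Where(x => !string.IsNullOrWhiteSpace(x))
    // A line joins the current group unless it starts with a digit (a new title).
    .GroupAdjacent((g, x) => !char.IsDigit(x[0]))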
    .Select(g => new
    {
        Title = g.First().Trim(),
        Body = string.Join(" ", g.Skip(1).Select(x => x.Trim()))
    })
    .ToArray();

Example:

string input = @"3 BILL REFERRED TO MAIL COMMITTEE
Mr Fitzgibbon (Chief Government Whip), by leave, moved—That the
Tax Laws Amendment (2011 Measures No. 7) Bill 2011 be referred
to the Main Committee for further consideration. Question—put
and passed.
4 CORPORATIONS AMENDMENT (FUTURE OF FINANCIAL ADVICE) BILL 2011
Mr Shorten (Minister for Financial Services and Superannuation),
pursuant to notice, presented a Bill for an Act to amend the law
in relation to financial advice,and for related purposes. Mr
Shorten presented an explanatory memorandum to the bill. Bill
read a first time. Mr Shorten moved—That the bill be now read
a second time. Debate adjourned (Mr Randall), and the resumption
of the debate made an order of the day for the next sitting.
5 TAX LAWS AMENDMENT (2011 MEASURES NO. 8) BILL 2011
Mr Shorten (Minister for Financial Services and Superannuation)
presented a Bill for an Act to amend the law relating to
taxation, and for related purposes.";

Output:

result[0] == { Title = "3 BILL REFERRED ...", Body = "Mr Fitzgibbon ..." }
result[1] == { Title = "4 CORPORATIONS ...",  Body = "Mr Shorten ..." }
result[2] == { Title = "5 TAX LAWS ...",      Body = "Mr Shorten ..." }

Extension method:

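// Groups consecutive elements: a new group starts whenever
// adjacent(currentGroup, nextElement) returns false.
// Declare this inside a static class; it needs System.Collections.Generic.
public static IEnumerable<IEnumerable<T>> GroupAdjacent<T>(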
    this IEnumerable<T> source, Func<IEnumerable<T>, T, bool> adjacent)
{
    var g = new List<T>();
    foreach (var x in source)
    {
        if (g.Count != 0 && !adjacent(g, x))
        {
            yield return g;
            g = new List<T>();
        }
        g.Add(x);
    }
    // Do not emit an empty group when the source has no elements.
    if (g.Count != 0)
        yield return g;
}
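
The query above yields an array of objects with Title and Body properties, whereas the question asks for a two-dimensional array. As a rough sketch of one way to get that shape, the same grouping rule can be written as a plain loop that fills a string[,]. The names BulletSplitter and ToTitleBodyArray are hypothetical and not part of the answer above, and the code keeps the same assumption that a title is any line whose first character is a digit.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; the class and method names are illustrative only.
public static class BulletSplitter
{
    // Returns an array where [i, 0] is the title and [i, 1] is the body of bullet i.
    // Assumption: a title is any non-blank line whose first character is a digit.
    public static string[,] ToTitleBodyArray(string input)
    {
        var titles = new List<string>();
        var bodies = new List<List<string>>();

        foreach (var raw in input.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None))
        {
            var line = raw.Trim();
            if (line.Length == 0)
                continue; // skip blank lines

            if (char.IsDigit(line[0]))
            {
                // A new bullet starts here.
                titles.Add(line);
                bodies.Add(new List<string>());
            }
            else if (titles.Count > 0)
            {
                // Body line belonging to the most recent title.
                bodies[bodies.Count - 1].Add(line);
            }
            // Text that appears before the first title is ignored.
        }

        var result = new string[titles.Count, 2];
        for (int i = 0; i < titles.Count; i++)
        {
            result[i, 0] = titles[i];
            result[i, 1] = string.Join(" ", bodies[i]);
        }
        return result;
    }
}
```

Calling BulletSplitter.ToTitleBodyArray(input) on the sample text would put each numbered heading in column 0 and its joined body text in column 1. As with the LINQ version, a wrapped body line that happens to begin with a digit (for example a line starting with "2011") would be treated as a new title, so real PDF text may need a stricter title test.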