Regex to match multiple strings in a large text file

Keywords: string, file, large, text, regular expression | Updated: 2023-09-27 18:18:11

>Question

I have a fairly large text file (about 10 MB, 700,000 lines) that contains HTML code.

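My goal is to extract certain information from it. I believe a regular expression is the best approach, because I have several more files that need the same treatment.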

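I have a regex that I believe matches the data I need, but I think I am running into a problem with anchors. I have been using regex101.com to help me match and learn regular expressions, but I can only match one part of the data at a time. I have tried \A, $, and ^ as start- and end-of-string anchors, with no luck. I searched Google for this, but found only one post that seemed to match my use case; it used Perl, and its solution was to read the entire text file into a single string, which I don't think is a good idea.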

Sample input file

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type"  content="text/html; charset=ISO-8859-1">
<title></title>
</head>
<body dir="LTR" bgcolor="#ffffff">
<!-- Created by Oracle Reports 04:00 Fri Aug 15 04:00:37 AM, 2014 -->
<table border=0 cellspacing=0 cellpadding=0 width=774>
<tr><td width=15></td><td width=1></td><td width=3></td><td width=6></td><td width=44></td><td width=1></td><td width=15></td><td width=4></td><td width=17></td><td width=1></td><td width=11></td><td width=1></td><td width=14></td><td width=1></td><td width=11></td><td width=1></td><td width=17></td><td width=11></td><td width=4></td><td width=11></td><td width=2></td><td width=13></td><td width=45></td><td width=1></td><td width=15></td><td width=3></td><td width=9></td><td width=8></td><td width=1></td><td width=11></td><td width=1></td><td width=14></td><td width=1></td><td width=11></td><td width=1></td><td width=17></td><td width=12></td><td width=17></td><td width=12></td><td width=45></td><td width=1></td><td width=9></td><td width=6></td><td width=4></td><td width=16></td><td width=1></td><td width=11></td><td width=1></td><td width=13></td><td width=1></td><td width=1></td><td width=11></td><td width=1></td><td width=17></td><td width=12></td><td width=17></td><td width=13></td><td width=36></td><td width=8></td><td width=1></td><td width=15></td><td width=4></td><td width=17></td><td width=1></td><td width=11></td><td width=1></td><td width=14></td><td width=1></td><td width=11></td><td width=1></td><td width=17></td><td width=12></td><td width=17></td><td width=8></td><td width=1></td><td width=10></td><td width=25></td></tr>
<tr><td colspan=77 height=9></td></tr>
<tr valign=top>
  <td height=9></td>
  <td colspan=23></td>
  <td colspan=2></td>
</tr>
<tr><td colspan=77 height=9></td></tr>
<tr valign=top>
  <td height=9></td>
  <td width=174 colspan=19 rowspan=2><font face="helvetica" color="#007f7f"><b>15-AUG-2014</b></font></td>
  <td colspan=38></td>
  <td width=139 colspan=16 rowspan=2 align=center> <font face="helvetica" color="#007f7f"><b>Page&nbsp;</b></font><font face="helvetica" color="#007f7f"><b>1</b></font><font face="helvetica" color="#007f7f"><b>&nbsp;of&nbsp;</b></font><font face="helvetica" color="#007f7f"><b>58</b></font><br></td>
  <td colspan=3></td>
</tr>
<tr valign=top>
  <td height=9></td>
  <td colspan=38></td>
  <td colspan=3></td>
</tr>
<tr valign=top>
  <td height=9 colspan=3></td>
  <td></td>
</tr>
<tr valign=top>
  <td height=9 colspan=3></td>
  <td></td>
</tr>
<tr><td colspan=77 height=9></td></tr>
<tr valign=top>
  <td height=9 colspan=2></td>
  <td></td>
</tr>
<tr valign=top>
  <td height=9 colspan=27></td>
  <td colspan=28></td>
</tr>
<tr valign=top>
  <td height=9 colspan=4></td>
  <td width=44><font size=2 face="helvetica">08/14/14</font></td>
  <td></td>
  <td width=15 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">5</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 align=right><font size=2 face="helvetica">7</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">3</font></td>
  <td></td>
  <td width=17 colspan=3 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=45><font size=2 face="helvetica">07/19/14</font></td>
  <td></td>
  <td width=15 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=17 colspan=2 align=right><font size=2 face="helvetica">9</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 align=right><font size=2 face="helvetica">2</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">4</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=45><font size=2 face="helvetica">06/23/14</font></td>
  <td></td>
  <td width=15 colspan=2 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=16 align=right><font size=2 face="helvetica">0</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 colspan=2 align=right><font size=2 face="helvetica">5</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">6</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=44 colspan=2><font size=2 face="helvetica">05/28/14</font></td>
  <td></td>
  <td width=15 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">5</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 align=right><font size=2 face="helvetica">3</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">1</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td colspan=4></td>
</tr>
<tr><td colspan=77 height=1></td></tr>
<tr valign=top>
  <td height=9 colspan=4></td>
  <td width=44 rowspan=2><font size=2 face="helvetica">08/14/14</font></td>
  <td></td>
  <td width=15 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;M</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">4</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 rowspan=2 align=right><font size=2 face="helvetica">3</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
  <td></td>
  <td width=17 colspan=3 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=45 rowspan=2><font size=2 face="helvetica">07/19/14</font></td>
  <td></td>
  <td width=15 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;M</font></td>
  <td></td>
  <td width=17 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">5</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 rowspan=2 align=right><font size=2 face="helvetica">6</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">5</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=45 rowspan=2><font size=2 face="helvetica">06/23/14</font></td>
  <td></td>
  <td width=15 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;M</font></td>
  <td></td>
  <td width=16 rowspan=2 align=right><font size=2 face="helvetica">7</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">8</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">6</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=44 colspan=2 rowspan=2><font size=2 face="helvetica">05/28/14</font></td>
  <td></td>
  <td width=15 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;M</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">2</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">6</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td colspan=4></td>
</tr>
<tr valign=top>
  <td height=9 colspan=4></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td colspan=4></td>
</tr>
<tr><td colspan=77 height=1></td></tr>
<tr valign=top>
  <td height=9 colspan=4></td>
  <td width=44 rowspan=2><font size=2 face="helvetica">08/13/14</font></td>
  <td></td>
  <td width=15 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">8</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">6</font></td>
  <td></td>
  <td width=17 colspan=3 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=45 rowspan=2><font size=2 face="helvetica">07/18/14</font></td>
  <td></td>
  <td width=15 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=17 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">0</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 rowspan=2 align=right><font size=2 face="helvetica">4</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">3</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=45 rowspan=2><font size=2 face="helvetica">06/22/14</font></td>
  <td></td>
  <td width=15 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=16 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=44 colspan=2 rowspan=2><font size=2 face="helvetica">05/27/14</font></td>
  <td></td>
  <td width=15 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">4</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 rowspan=2 align=right><font size=2 face="helvetica">5</font></td>
  <td></td>
  <td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">2</font></td>
  <td></td>
  <td width=17 rowspan=2 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td colspan=4></td>
</tr>
<tr valign=top>
  <td height=9 colspan=4></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td></td>
  <td colspan=4></td>
</tr>

Regex

Using the global and multiline modifiers

\s*<td width=\d* rowspan=\d*><font size=\d face="helvetica">(?<Date>\d+.\d+.\d+)<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica">&nbsp;(?<Time>E|M)<.font><.td>
\s*<td width=\d* colspan=\d* rowspan=\d* align=right><font size=\d* face="helvetica">(?<FirstNum>\d)<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica">-<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica">(?<SecondNum>\d)<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica">-<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica">(?<ThirdNum>\d)<.font><.td>
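
As a point of reference for the modifiers: in .NET there is no separate global flag, since `Regex.Matches` already returns every match, and `RegexOptions.Multiline` only changes where `^` and `$` may match. It does not, by itself, let a pattern see text beyond the string it is given. A minimal sketch (the `AnchorDemo` class and its method names are made up for illustration):

```csharp
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // With RegexOptions.Multiline, ^ matches at the start of every line.
    public static int CountAtLineStarts(string text)
    {
        return Regex.Matches(text, @"^<td", RegexOptions.Multiline).Count;
    }

    // \A matches only at the very start of the input string,
    // whether or not RegexOptions.Multiline is set.
    public static int CountAtStringStart(string text)
    {
        return Regex.Matches(text, @"\A<td", RegexOptions.Multiline).Count;
    }
}
```

Neither anchor is required for a pattern that should match anywhere in the text; they only restrict where a match may begin or end.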

C# source code

static void Main(string[] args)
{
    string filePathDirty = @"DataBase/InputFile.htm";
    string filePathClean = @"DataBase/InputFile-CLEAN.htm";
    int totalLines = File.ReadAllLines(filePathDirty).Length;
    try
    {
        string[] lines = File.ReadAllLines(filePathDirty);
        string cleanLine;
        int progress = 0;
        string pattern = String.Empty;
            // Group Name: Date
            pattern += @"\s*<td width=\d* rowspan=\d*><font size=\d face=""helvetica"">(?<Date>\d+.\d+.\d+)<.font><.td>";
            // Group Name: Time
            pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">&nbsp;(?<Time>E|M)<.font><.td>";
            // Group Name: FirstNumber
            pattern += @"\s*<td width=\d* colspan=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">(?<FirstNum>\d)<.font><.td>";
            pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">-<.font><.td>";
            // Group Name: SecondNumber
            pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">(?<SecondNum>\d)<.font><.td>";
            pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">-<.font><.td>";
            // Group Name: ThirdNumber
            pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">(?<ThirdNum>\d)<.font><.td>";
        foreach (string line in lines)
        {
            // Skip the first 69 lines; they contain no data
            if (progress > 69)
            {
                foreach (Match match in Regex.Matches(line, pattern))
                {
                        cleanLine = String.Format("{0} | {1} | {2} | {3} | {4}\r\n", match.Groups["Date"].Value, match.Groups["Time"].Value, match.Groups["FirstNum"].Value, match.Groups["SecondNum"].Value, match.Groups["ThirdNum"].Value);
                        WriteToFile(cleanLine, filePathClean);
                }
            }
            progress++;
        }
    }
    catch (Exception e)
    {
        Console.WriteLine("The file could not be read:");
        Console.WriteLine(e.Message);
    }
}
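
One thing worth noting about the code above: the pattern describes several consecutive lines of HTML, but `Regex.Matches` is given one line at a time, so the pattern as a whole can never match. The sample also has empty `<td></td>` spacer cells between the data cells and does not always carry `rowspan` or `colspan`, which the pattern does not allow for. Below is a minimal sketch of matching against the whole text instead, assuming a file of about 10 MB fits comfortably in memory. The `RecordExtractor` class, its `Extract` method, and the more tolerant pattern are illustrative assumptions, not the original poster's code.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RecordExtractor
{
    // Any <td ...><font ...> opening pair, whatever attributes it carries.
    private const string Open = @"<td[^>]*><font[^>]*>";
    private const string Close = @"</font></td>";
    // Whitespace plus the optional empty <td></td> spacer between data cells.
    private const string Gap = @"\s*(?:<td></td>\s*)?";

    private static readonly Regex Record = new Regex(
        Open + @"(?<Date>\d{2}/\d{2}/\d{2})" + Close + Gap +
        Open + @"&nbsp;(?<Time>[EM])" + Close + Gap +
        Open + @"(?<FirstNum>\d)" + Close + Gap +
        Open + "-" + Close + Gap +
        Open + @"(?<SecondNum>\d)" + Close + Gap +
        Open + "-" + Close + Gap +
        Open + @"(?<ThirdNum>\d)" + Close,
        RegexOptions.Compiled);

    // Returns one "Date | Time | First | Second | Third" line per record found.
    public static List<string> Extract(string html)
    {
        var rows = new List<string>();
        foreach (Match m in Record.Matches(html))
        {
            rows.Add(string.Join(" | ",
                m.Groups["Date"].Value,
                m.Groups["Time"].Value,
                m.Groups["FirstNum"].Value,
                m.Groups["SecondNum"].Value,
                m.Groups["ThirdNum"].Value));
        }
        return rows;
    }
}
```

It could be called as `File.WriteAllLines(filePathClean, RecordExtractor.Extract(File.ReadAllText(filePathDirty)))`, which writes all records in a single call.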

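Simplified specification

Very little of the HTML is data that needs to be extracted. I have added comments to help identify where the data is and how it is formatted.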

<!-- Start Matching -->
<tr valign=top>
  <td height=9 colspan=4></td>
<!-- Line Below Has the Date // 08/14/14 -->
  <td width=44><font size=2 face="helvetica">08/14/14</font></td>
  <td></td>
<!-- Line Below Has the Time // E -->
<!-- Will Either be a Capital E or M for Evening or Morning -->
  <td width=15 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
<!-- Line Below Has the First Number // 5 -->
  <td width=17 align=right><font size=2 face="helvetica">5</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
<!-- Line Below Has the Second Number // 7 -->
  <td width=14 align=right><font size=2 face="helvetica">7</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
<!-- Line Below Has the Third Number // 3 -->
  <td width=17 align=right><font size=2 face="helvetica">3</font></td>
  <td></td>
  <td width=17 colspan=3 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
<!-- End of Matching // There are Three Sets of Data per HTML Table Row -->
  <td width=45><font size=2 face="helvetica">07/19/14</font></td>
  <td></td>
  <td width=15 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=17 colspan=2 align=right><font size=2 face="helvetica">9</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 align=right><font size=2 face="helvetica">2</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">4</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=45><font size=2 face="helvetica">06/23/14</font></td>
  <td></td>
  <td width=15 colspan=2 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=16 align=right><font size=2 face="helvetica">0</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 colspan=2 align=right><font size=2 face="helvetica">5</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">6</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td></td>
  <td width=44 colspan=2><font size=2 face="helvetica">05/28/14</font></td>
  <td></td>
  <td width=15 align=right><font size=2 face="helvetica">&nbsp;E</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">5</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=14 align=right><font size=2 face="helvetica">3</font></td>
  <td></td>
  <td width=11 align=right><font size=2 face="helvetica">-</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">1</font></td>
  <td></td>
  <td width=17 align=right><font size=2 face="helvetica">&nbsp;</font></td>
  <td colspan=4></td>
</tr>

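I want to group these sets into a new flat file in the following format, so that it can be imported cleanly into a database.

Date | Time | First Number | Second Number | Third Number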

>Answer

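Consider another approach:

  1. First convert the HTML document/HTML table to XML (free tools and code are available for this).
  2. Write your own XQuery/XML parsing code to pull out the details you need and do the rest. Hope this helps.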
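
The two steps above might look like this in C#, the language used in the question. This is only a sketch under stated assumptions: step 1 has already produced well-formed XML (quoted attributes, `&nbsp;` turned into `&#160;`), LINQ to XML stands in for XQuery, and the `XmlTableReader` class and its `Extract` method are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XmlTableReader
{
    // Assumption: each record is Date, Time, and three digits once the
    // empty cells and the "-" separator cells have been dropped.
    private const int FieldsPerRecord = 5;

    public static List<string> Extract(string xml)
    {
        var result = new List<string>();
        XDocument doc = XDocument.Parse(xml);

        foreach (XElement row in doc.Descendants("tr"))
        {
            // Cell text with surrounding whitespace (including the
            // non-breaking space) trimmed; empty cells and dashes removed.
            List<string> cells = row.Elements("td")
                .Select(td => td.Value.Trim())
                .Where(text => text.Length > 0 && text != "-")
                .ToList();

            // Skip layout rows and anything that is not whole records.
            if (cells.Count == 0 || cells.Count % FieldsPerRecord != 0)
                continue;

            for (int i = 0; i < cells.Count; i += FieldsPerRecord)
                result.Add(string.Join(" | ", cells.Skip(i).Take(FieldsPerRecord)));
        }
        return result;
    }
}
```

The row filter here is deliberately crude; a real report may need a stricter check, for example that the first field of each record looks like a date.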