Matching multiple strings in a large text file with regular expressions
Keywords: string, file, large, text, regular expression | Updated: 2023-09-27 18:18:11
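>Question
I have a fairly large text file (about 10 megabytes, 700,000 lines) that contains HTML code.
My goal is to extract certain information from it. I believe regular expressions are the best approach, because I have several files that need the same processing.
I have what I believe is a regular expression that matches the data I need, but I think I am running into a problem with anchors. I have been using regex101.com to help me match and learn regular expressions, but I can only match one part of the data at a time. I have tried \A, $, and ^ as the start and end of the string, with no luck. I searched Google for this, but I found only one article that seemed to match my use case; it used Perl, and its solution was to build a single string from the entire text file, which I don't think is a good idea.
Sample input file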
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title></title>
</head>
<body dir="LTR" bgcolor="#ffffff">
<!-- Created by Oracle Reports 04:00 Fri Aug 15 04:00:37 AM, 2014 -->
<table border=0 cellspacing=0 cellpadding=0 width=774>
<tr><td width=15></td><td width=1></td><td width=3></td><td width=6></td><td width=44></td><td width=1></td><td width=15></td><td width=4></td><td width=17></td><td width=1></td><td width=11></td><td width=1></td><td width=14></td><td width=1></td><td width=11></td><td width=1></td><td width=17></td><td width=11></td><td width=4></td><td width=11></td><td width=2></td><td width=13></td><td width=45></td><td width=1></td><td width=15></td><td width=3></td><td width=9></td><td width=8></td><td width=1></td><td width=11></td><td width=1></td><td width=14></td><td width=1></td><td width=11></td><td width=1></td><td width=17></td><td width=12></td><td width=17></td><td width=12></td><td width=45></td><td width=1></td><td width=9></td><td width=6></td><td width=4></td><td width=16></td><td width=1></td><td width=11></td><td width=1></td><td width=13></td><td width=1></td><td width=1></td><td width=11></td><td width=1></td><td width=17></td><td width=12></td><td width=17></td><td width=13></td><td width=36></td><td width=8></td><td width=1></td><td width=15></td><td width=4></td><td width=17></td><td width=1></td><td width=11></td><td width=1></td><td width=14></td><td width=1></td><td width=11></td><td width=1></td><td width=17></td><td width=12></td><td width=17></td><td width=8></td><td width=1></td><td width=10></td><td width=25></td></tr>
<tr><td colspan=77 height=9></td></tr>
<tr valign=top>
<td height=9></td>
<td colspan=23></td>
<td colspan=2></td>
</tr>
<tr><td colspan=77 height=9></td></tr>
<tr valign=top>
<td height=9></td>
<td width=174 colspan=19 rowspan=2><font face="helvetica" color="#007f7f"><b>15-AUG-2014</b></font></td>
<td colspan=38></td>
<td width=139 colspan=16 rowspan=2 align=center> <font face="helvetica" color="#007f7f"><b>Page </b></font><font face="helvetica" color="#007f7f"><b>1</b></font><font face="helvetica" color="#007f7f"><b> of </b></font><font face="helvetica" color="#007f7f"><b>58</b></font><br></td>
<td colspan=3></td>
</tr>
<tr valign=top>
<td height=9></td>
<td colspan=38></td>
<td colspan=3></td>
</tr>
<tr valign=top>
<td height=9 colspan=3></td>
<td></td>
</tr>
<tr valign=top>
<td height=9 colspan=3></td>
<td></td>
</tr>
<tr><td colspan=77 height=9></td></tr>
<tr valign=top>
<td height=9 colspan=2></td>
<td></td>
</tr>
<tr valign=top>
<td height=9 colspan=27></td>
<td colspan=28></td>
</tr>
<tr valign=top>
<td height=9 colspan=4></td>
<td width=44><font size=2 face="helvetica">08/14/14</font></td>
<td></td>
<td width=15 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">5</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 align=right><font size=2 face="helvetica">7</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">3</font></td>
<td></td>
<td width=17 colspan=3 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=45><font size=2 face="helvetica">07/19/14</font></td>
<td></td>
<td width=15 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=17 colspan=2 align=right><font size=2 face="helvetica">9</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 align=right><font size=2 face="helvetica">2</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">4</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=45><font size=2 face="helvetica">06/23/14</font></td>
<td></td>
<td width=15 colspan=2 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=16 align=right><font size=2 face="helvetica">0</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 colspan=2 align=right><font size=2 face="helvetica">5</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">6</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=44 colspan=2><font size=2 face="helvetica">05/28/14</font></td>
<td></td>
<td width=15 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">5</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 align=right><font size=2 face="helvetica">3</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">1</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica"> </font></td>
<td colspan=4></td>
</tr>
<tr><td colspan=77 height=1></td></tr>
<tr valign=top>
<td height=9 colspan=4></td>
<td width=44 rowspan=2><font size=2 face="helvetica">08/14/14</font></td>
<td></td>
<td width=15 rowspan=2 align=right><font size=2 face="helvetica"> M</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">4</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 rowspan=2 align=right><font size=2 face="helvetica">3</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
<td></td>
<td width=17 colspan=3 rowspan=2 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=45 rowspan=2><font size=2 face="helvetica">07/19/14</font></td>
<td></td>
<td width=15 rowspan=2 align=right><font size=2 face="helvetica"> M</font></td>
<td></td>
<td width=17 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">5</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 rowspan=2 align=right><font size=2 face="helvetica">6</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">5</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=45 rowspan=2><font size=2 face="helvetica">06/23/14</font></td>
<td></td>
<td width=15 colspan=2 rowspan=2 align=right><font size=2 face="helvetica"> M</font></td>
<td></td>
<td width=16 rowspan=2 align=right><font size=2 face="helvetica">7</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">8</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">6</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=44 colspan=2 rowspan=2><font size=2 face="helvetica">05/28/14</font></td>
<td></td>
<td width=15 rowspan=2 align=right><font size=2 face="helvetica"> M</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">2</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">6</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica"> </font></td>
<td colspan=4></td>
</tr>
<tr valign=top>
<td height=9 colspan=4></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td colspan=4></td>
</tr>
<tr><td colspan=77 height=1></td></tr>
<tr valign=top>
<td height=9 colspan=4></td>
<td width=44 rowspan=2><font size=2 face="helvetica">08/13/14</font></td>
<td></td>
<td width=15 rowspan=2 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">8</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">6</font></td>
<td></td>
<td width=17 colspan=3 rowspan=2 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=45 rowspan=2><font size=2 face="helvetica">07/18/14</font></td>
<td></td>
<td width=15 rowspan=2 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=17 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">0</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 rowspan=2 align=right><font size=2 face="helvetica">4</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">3</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=45 rowspan=2><font size=2 face="helvetica">06/22/14</font></td>
<td></td>
<td width=15 colspan=2 rowspan=2 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=16 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 colspan=2 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">9</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=44 colspan=2 rowspan=2><font size=2 face="helvetica">05/27/14</font></td>
<td></td>
<td width=15 rowspan=2 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">4</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 rowspan=2 align=right><font size=2 face="helvetica">5</font></td>
<td></td>
<td width=11 rowspan=2 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica">2</font></td>
<td></td>
<td width=17 rowspan=2 align=right><font size=2 face="helvetica"> </font></td>
<td colspan=4></td>
</tr>
<tr valign=top>
<td height=9 colspan=4></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td colspan=4></td>
</tr>
Regular expression
Using the global and multiline modifiers
\s*<td width=\d* rowspan=\d*><font size=\d face="helvetica">(?<Date>\d+.\d+.\d+)<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica"> (?<Time>E|M)<.font><.td>
\s*<td width=\d* colspan=\d* rowspan=\d* align=right><font size=\d* face="helvetica">(?<FirstNum>\d)<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica">-<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica">(?<SecondNum>\d)<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica">-<.font><.td>
\s*<td width=\d* rowspan=\d* align=right><font size=\d* face="helvetica">(?<ThirdNum>\d)<.font><.td>
C# source code
static void Main(string[] args)
{
    string filePathDirty = @"DataBase/InputFile.htm";
    string filePathClean = @"DataBase/InputFile-CLEAN.htm";
    int totalLines = File.ReadAllLines(filePathDirty).Length;
    try
    {
        string[] lines = File.ReadAllLines(filePathDirty);
        string cleanLine;
        int progress = 0;
        string pattern = String.Empty;
        // Group Name: Date
        pattern += @"\s*<td width=\d* rowspan=\d*><font size=\d face=""helvetica"">(?<Date>\d+.\d+.\d+)<.font><.td>";
        // Group Name: Time
        pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica""> (?<Time>E|M)<.font><.td>";
        // Group Name: FirstNumber
        pattern += @"\s*<td width=\d* colspan=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">(?<FirstNum>\d)<.font><.td>";
        pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">-<.font><.td>";
        // Group Name: SecondNumber
        pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">(?<SecondNum>\d)<.font><.td>";
        pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">-<.font><.td>";
        // Group Name: ThirdNumber
        pattern += @"\s*<td width=\d* rowspan=\d* align=right><font size=\d* face=""helvetica"">(?<ThirdNum>\d)<.font><.td>";
        foreach (string line in lines)
        {
            // Skip the First 69 Lines, No Need to Since there is no Data
            if (progress > 69)
            {
                foreach (Match match in Regex.Matches(line, pattern))
                {
                    cleanLine = String.Format("{0} | {1} | {2} | {3} | {4}\r\n", match.Groups["Date"].Value, match.Groups["Time"].Value, match.Groups["FirstNum"].Value, match.Groups["SecondNum"].Value, match.Groups["ThirdNum"].Value);
                    WriteToFile(cleanLine, filePathClean);
                }
            }
            progress++;
        }
    }
    catch (Exception e)
    {
        Console.WriteLine("The file could not be read:");
        Console.WriteLine(e.Message);
    }
}
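Simplified specification
Only a small part of the HTML needs to be extracted. I have added comments to help show where the data is and how it is formatted.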
<!-- Start Matching -->
<tr valign=top>
<td height=9 colspan=4></td>
<!-- Line Below Has the Date // 08/14/14 -->
<td width=44><font size=2 face="helvetica">08/14/14</font></td>
<td></td>
<!-- Line Below Has the Time // E -->
<!-- Will Either be a Capital E or M for Evening or Morning -->
<td width=15 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<!-- Line Below Has the First Number // 5 -->
<td width=17 align=right><font size=2 face="helvetica">5</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<!-- Line Below Has the Second Number // 7 -->
<td width=14 align=right><font size=2 face="helvetica">7</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<!-- Line Below Has the Third Number // 3 -->
<td width=17 align=right><font size=2 face="helvetica">3</font></td>
<td></td>
<td width=17 colspan=3 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<!-- End of Matching // There are Three Sets of Data per HTML Table Row -->
<td width=45><font size=2 face="helvetica">07/19/14</font></td>
<td></td>
<td width=15 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=17 colspan=2 align=right><font size=2 face="helvetica">9</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 align=right><font size=2 face="helvetica">2</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">4</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=45><font size=2 face="helvetica">06/23/14</font></td>
<td></td>
<td width=15 colspan=2 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=16 align=right><font size=2 face="helvetica">0</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 colspan=2 align=right><font size=2 face="helvetica">5</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">6</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica"> </font></td>
<td></td>
<td width=44 colspan=2><font size=2 face="helvetica">05/28/14</font></td>
<td></td>
<td width=15 align=right><font size=2 face="helvetica"> E</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">5</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=14 align=right><font size=2 face="helvetica">3</font></td>
<td></td>
<td width=11 align=right><font size=2 face="helvetica">-</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica">1</font></td>
<td></td>
<td width=17 align=right><font size=2 face="helvetica"> </font></td>
<td colspan=4></td>
</tr>
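I want to group these sets into a new flat file in the following format, so that it can be imported cleanly into a database.
Date | Time | First Number | Second Number | Third Number
Consider another approach:
- First convert the HTML document/HTML table to XML (free tools and code are available to do this).
- Write your own XQuery/XML parsing code to pull out the required details and do the rest. Hope this helps.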
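If adding an HTML-to-XML conversion step is not an option, the same "read the structure, not one giant pattern" idea can be sketched with the standard library alone. Note that the code in the question applies a pattern spanning seven `<td>` lines to one line at a time, so it can never match; the pattern also requires `rowspan=` on every cell and `colspan=` on the first number, which many rows in the sample lack. The sketch below is not the XML approach described above but a simpler substitute: it assumes the whole file fits in memory (10 MB should be fine), pulls the text of every `<td><font>…</font></td>` cell regardless of attributes, and then looks for the sequence date, E/M, digit, `-`, digit, `-`, digit. The names `ReportParser`, `ExtractRecords`, `CellPattern`, and `DatePattern` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ReportParser
{
    // One table cell whose only content is a <font> element holding plain text.
    // Attributes (width, colspan, rowspan, align, ...) are allowed but not inspected.
    private static readonly Regex CellPattern = new Regex(
        @"<td[^>]*>\s*<font[^>]*>(?<text>[^<]*)</font>\s*</td>",
        RegexOptions.Compiled);

    // Dates in the sample look like 08/14/14.
    private static readonly Regex DatePattern = new Regex(@"^\d{2}/\d{2}/\d{2}$");

    public static List<string> ExtractRecords(string html)
    {
        // Step 1: collect the trimmed text of every non-empty cell, in document order.
        var cells = new List<string>();
        foreach (Match m in CellPattern.Matches(html))
        {
            string text = m.Groups["text"].Value.Trim();
            if (text.Length > 0)
            {
                cells.Add(text);
            }
        }

        // Step 2: scan for the sequence date, E|M, digit, "-", digit, "-", digit.
        var records = new List<string>();
        int i = 0;
        while (i + 6 < cells.Count)
        {
            bool isRecord =
                DatePattern.IsMatch(cells[i])
                && (cells[i + 1] == "E" || cells[i + 1] == "M")
                && IsDigit(cells[i + 2]) && cells[i + 3] == "-"
                && IsDigit(cells[i + 4]) && cells[i + 5] == "-"
                && IsDigit(cells[i + 6]);

            if (isRecord)
            {
                records.Add(string.Join(" | ",
                    cells[i], cells[i + 1], cells[i + 2], cells[i + 4], cells[i + 6]));
                i += 7; // skip past the cells just consumed
            }
            else
            {
                i++; // not the start of a record; try the next cell
            }
        }
        return records;
    }

    private static bool IsDigit(string s)
    {
        return s.Length == 1 && char.IsDigit(s[0]);
    }
}
```

To use it with the paths from the question, something like `File.WriteAllLines(filePathClean, ReportParser.ExtractRecords(File.ReadAllText(filePathDirty)));` should do (both calls are in `System.IO`). Writing all lines in one call also avoids reopening the output file for every record. Header cells such as the report date and page number wrap their text in `<b>`, so `CellPattern` does not pick them up and no line-skipping counter is needed.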