How do I iterate over multiple log/text files of about 200 MB each in C# and apply a regular expression?
Updated: 2023-09-27 17:52:50
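I have to develop a utility that accepts the path of a folder containing multiple log/text files, each about 200 MB, then iterates over all the files and picks four elements from the lines in which they occur.
I have tried several solutions. All of them work fine for smaller files, but when I load larger files the Windows Form either hangs or throws an "OutOfMemory" exception. Please help.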
Solution 1:
string textFile;
string re1 = "((?:2|1)\\d{3}(?:-|\\/)(?:(?:0[1-9])|(?:1[0-2]))(?:-|\\/)(?:(?:0[1-9])|(?:[1-2][0-9])|(?:3[0-1]))(?:T|\\s)(?:(?:[0-1][0-9])|(?:2[0-3])):(?:[0-5][0-9]):(?:[0-5][0-9]))";
FolderBrowserDialog fbd = new FolderBrowserDialog();
DialogResult result = fbd.ShowDialog();
if (!string.IsNullOrWhiteSpace(fbd.SelectedPath))
{
    string[] files = Directory.GetFiles(fbd.SelectedPath);
    System.Windows.Forms.MessageBox.Show("Files found: " + files.Length.ToString(), "Message");
    foreach (string fileName in files)
    {
        textFile = File.ReadAllText(fileName);
        MatchCollection mc = Regex.Matches(textFile, re1);
        foreach (Match m in mc)
        {
            string a = m.ToString();
            Path.Text += a; //Temporary, Just to check the output
            Path.Text += Environment.NewLine;
        }
    }
}
Solution 2:
string re1 = "((?:2|1)\\d{3}(?:-|\\/)(?:(?:0[1-9])|(?:1[0-2]))(?:-|\\/)(?:(?:0[1-9])|(?:[1-2][0-9])|(?:3[0-1]))(?:T|\\s)(?:(?:[0-1][0-9])|(?:2[0-3])):(?:[0-5][0-9]):(?:[0-5][0-9]))";
FolderBrowserDialog fbd = new FolderBrowserDialog();
DialogResult result = fbd.ShowDialog();
foreach (string file in System.IO.Directory.GetFiles(fbd.SelectedPath))
{
    const Int32 BufferSize = 512;
    using (var fileStream = File.OpenRead(file))
    using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize))
    {
        String line;
        while ((line = streamReader.ReadLine()) != null)
        {
            MatchCollection mc = Regex.Matches(line, re1);
            foreach (Match m in mc)
            {
                string a = m.ToString();
                Path.Text += a; //Temporary, Just to check the output
                Path.Text += Environment.NewLine;
            }
        }
    }
}
Solution 3:
string re1 = "((?:2|1)\\d{3}(?:-|\\/)(?:(?:0[1-9])|(?:1[0-2]))(?:-|\\/)(?:(?:0[1-9])|(?:[1-2][0-9])|(?:3[0-1]))(?:T|\\s)(?:(?:[0-1][0-9])|(?:2[0-3])):(?:[0-5][0-9]):(?:[0-5][0-9]))";
FolderBrowserDialog fbd = new FolderBrowserDialog();
DialogResult result = fbd.ShowDialog();
// "file" is the path of the file being processed; the loop over the folder is not shown in this snippet
using (StreamReader r = new StreamReader(file))
{
    try
    {
        string line = String.Empty;
        while (!r.EndOfStream)
        {
            line = r.ReadLine();
            MatchCollection mc = Regex.Matches(line, re1);
            foreach (Match m in mc)
            {
                string a = m.ToString();
                Path.Text += a; //Temporary, Just to check the output
                Path.Text += Environment.NewLine;
            }
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}
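A few things to note:
- Don't append to a string with Path.Text += ... . I assume this is just test code that will, hopefully, be thrown away.
- You can use a simple File.ReadLines call; there is no real difference in file reading speed.
- You should compile your Regex.
- You could try to simplify your regular expression.
- You could add a simple string-based pre-check before doing the regex match.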
Here is sample code that implements the guidelines above:
string re1 = "((?:2|1)\\d{3}(?:-|\\/)(?:(?:0[1-9])|(?:1[0-2]))(?:-|\\/)(?:(?:0[1-9])|(?:[1-2][0-9])|(?:3[0-1]))(?:T|\\s)(?:(?:[0-1][0-9])|(?:2[0-3])):(?:[0-5][0-9]):(?:[0-5][0-9]))";
var buf = new List<string>();
var re2 = new Regex(re1, RegexOptions.Compiled); // compile once, reuse for every line
char[] dateSeparators = { '-', '/' };            // the pattern accepts either separator
int indx;
FolderBrowserDialog fbd = new FolderBrowserDialog();
DialogResult result = fbd.ShowDialog();
foreach (string file in System.IO.Directory.GetFiles(fbd.SelectedPath)) {
    // File.ReadLines streams the file lazily instead of loading it all into memory
    foreach (var line in File.ReadLines(file)) {
        // Cheap pre-check: skip lines with no date separator followed by a ':'
        if ((indx = line.IndexOfAny(dateSeparators)) == -1 || line.IndexOf(':', indx + 1) == -1)
            continue;
        MatchCollection mc = re2.Matches(line);
        foreach (Match m in mc) {
            string a = m.ToString();
            buf.Add(a + Environment.NewLine); //Temporary, Just to check the output
        }
    }
}
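The "simplify your regular expression" suggestion is not spelled out in the sample above. One possible simplification, offered as a sketch rather than the answerer's own pattern, replaces the single-character alternations with character classes and drops the redundant groups. It is intended to match the same timestamps as the original pattern. The class and method names (TimestampScanner, FindTimestamps) are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TimestampScanner
{
    // Assumed simplification of the question's pattern:
    // year 1xxx/2xxx, '-' or '/', month 01-12, '-' or '/', day 01-31,
    // 'T' or whitespace, then HH:mm:ss with range-checked fields.
    private static readonly Regex Timestamp = new Regex(
        @"[12]\d{3}[-/](?:0[1-9]|1[0-2])[-/](?:0[1-9]|[12]\d|3[01])[T\s](?:[01]\d|2[0-3]):[0-5]\d:[0-5]\d",
        RegexOptions.Compiled);

    // Returns every timestamp found in a single line of text.
    public static IEnumerable<string> FindTimestamps(string line)
    {
        foreach (Match m in Timestamp.Matches(line))
            yield return m.Value;
    }
}
```

Because the verbatim string (@"...") needs no doubled backslashes, the pattern is also easier to read and compare against the log format.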
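Your "Path" debug output is probably concatenating a huge number of strings. Change it to a StringBuilder instead of += concatenation and see whether that is the cause of the memory problem.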
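A minimal sketch of that StringBuilder suggestion, assuming the matches have already been collected into a sequence of strings. The helper name OutputBuffer.JoinLines is hypothetical and not part of the original code.

```csharp
using System.Collections.Generic;
using System.Text;

public static class OutputBuffer
{
    // Builds the whole debug output in one StringBuilder, one match per line,
    // instead of creating a new string for every "+=" on the text box.
    public static string JoinLines(IEnumerable<string> matches)
    {
        var sb = new StringBuilder();
        foreach (string m in matches)
            sb.AppendLine(m); // appends m followed by Environment.NewLine
        return sb.ToString();
    }
}
```

With this in place, the text box from the question would be assigned once after all files are processed, for example Path.Text = OutputBuffer.JoinLines(matches);, rather than being updated inside the inner loop.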
Have you looked at MS Log Parser 2.2 as an alternative?