How do I iterate over multiple log/text files of about 200 MB each in C# and apply a regular expression?

Updated: 2023-09-27 17:52:50

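I have to develop a utility that accepts the path of a folder containing multiple log/text files, each about 200 MB, then iterates over all the files and picks four elements out of the lines in which they occur.

I have tried several solutions. All of them work fine for smaller files, but when I load the larger files the Windows Form simply hangs, or it throws an "OutOfMemory" exception. Please help.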

Solution 1:

string textFile;
string re1 = "((?:2|1)\\d{3}(?:-|\\/)(?:(?:0[1-9])|(?:1[0-2]))(?:-|\\/)(?:(?:0[1-9])|(?:[1-2][0-9])|(?:3[0-1]))(?:T|\\s)(?:(?:[0-1][0-9])|(?:2[0-3])):(?:[0-5][0-9]):(?:[0-5][0-9]))";
FolderBrowserDialog fbd = new FolderBrowserDialog();
DialogResult result = fbd.ShowDialog();
if (!string.IsNullOrWhiteSpace(fbd.SelectedPath))
{
    string[] files = Directory.GetFiles(fbd.SelectedPath);
    System.Windows.Forms.MessageBox.Show("Files found: " + files.Length.ToString(), "Message");
    foreach (string fileName in files)
    {
        textFile = File.ReadAllText(fileName); 
        MatchCollection mc = Regex.Matches(textFile, re1);
        foreach (Match m in mc)
        {
            string a = m.ToString();
            Path.Text += a; //Temporary, Just to check the output
            Path.Text += Environment.NewLine;
        }
    }
}

Solution 2:

string re1 = "((?:2|1)\\d{3}(?:-|\\/)(?:(?:0[1-9])|(?:1[0-2]))(?:-|\\/)(?:(?:0[1-9])|(?:[1-2][0-9])|(?:3[0-1]))(?:T|\\s)(?:(?:[0-1][0-9])|(?:2[0-3])):(?:[0-5][0-9]):(?:[0-5][0-9]))";
FolderBrowserDialog fbd = new FolderBrowserDialog();
DialogResult result = fbd.ShowDialog();
foreach (string file in System.IO.Directory.GetFiles(fbd.SelectedPath))
{
    const Int32 BufferSize = 512;
    using (var fileStream = File.OpenRead(file))
    using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize))

    {
        String line;
        while ((line = streamReader.ReadLine()) != null)
        {
            MatchCollection mc = Regex.Matches(line, re1);
            foreach (Match m in mc)
            {
                string a = m.ToString();
                Path.Text += a; //Temporary, Just to check the output
                Path.Text += Environment.NewLine;
            }
        }
    }
}

Solution 3:

string re1 = "((?:2|1)\\d{3}(?:-|\\/)(?:(?:0[1-9])|(?:1[0-2]))(?:-|\\/)(?:(?:0[1-9])|(?:[1-2][0-9])|(?:3[0-1]))(?:T|\\s)(?:(?:[0-1][0-9])|(?:2[0-3])):(?:[0-5][0-9]):(?:[0-5][0-9]))";
FolderBrowserDialog fbd = new FolderBrowserDialog();
DialogResult result = fbd.ShowDialog();
using (StreamReader r = new StreamReader(file))
{
    try
    {
        string line = String.Empty;
        while (!r.EndOfStream)
        {
            line = r.ReadLine();
            MatchCollection mc = Regex.Matches(line, re1);
            foreach (Match m in mc)
            {
                string a = m.ToString();
                Path.Text += a; //Temporary, Just to check the output
                Path.Text += Environment.NewLine;
            }
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}


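There are a few things to note:

  1. Don't append to a string with Path.Text += .... I assume this is just test code that will hopefully be thrown away.
  2. You can use a simple File.ReadLines call; there is no real difference in file reading speed.
  3. You should compile your Regex.
  4. You can try to simplify your regular expression.
  5. You can add a simple string-based pre-check before doing the regex match.

Here is sample code that follows these guidelines: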

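// Requires: System.Collections.Generic, System.IO, System.Text.RegularExpressions, System.Windows.Forms
string re1 = "((?:2|1)\\d{3}(?:-|\\/)(?:(?:0[1-9])|(?:1[0-2]))(?:-|\\/)(?:(?:0[1-9])|(?:[1-2][0-9])|(?:3[0-1]))(?:T|\\s)(?:(?:[0-1][0-9])|(?:2[0-3])):(?:[0-5][0-9]):(?:[0-5][0-9]))";
var buf = new List<string>();
var re2 = new Regex(re1, RegexOptions.Compiled); // compile once, reuse for every line
char[] dateSeparators = { '-', '/' };            // the pattern accepts either separator
FolderBrowserDialog fbd = new FolderBrowserDialog();
DialogResult result = fbd.ShowDialog();
foreach (string file in System.IO.Directory.GetFiles(fbd.SelectedPath)) {
    // File.ReadLines streams the file lazily, one line at a time
    foreach (var line in File.ReadLines(file)) {
        // Cheap pre-check: skip lines without a date separator followed by a ':'
        int indx = line.IndexOfAny(dateSeparators);
        if (indx == -1 || line.IndexOf(':', indx + 1) == -1)
            continue;
        MatchCollection mc = re2.Matches(line);
        foreach (Match m in mc) {
            string a = m.ToString();
            buf.Add(a + Environment.NewLine); //Temporary, Just to check the output
        }
    }
}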
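
Point 4 in the list (simplifying the regular expression) is not reflected in the sample, which reuses the original pattern. The following is only a sketch of what a simpler pattern might look like. It assumes that checking the shape of the timestamp (digit counts and separators) is sufficient, so unlike the original it would also accept out-of-range values such as month 13. The class name SimplifiedTimestamp and the helper FirstMatchOrNull are hypothetical names introduced for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SimplifiedTimestamp
{
    // Assumption: validating only the shape is enough. Character classes
    // replace the nested alternations of the original pattern, so ranges
    // (month 01-12, hour 00-23, ...) are no longer enforced.
    public static readonly Regex Pattern = new Regex(
        @"[12]\d{3}[-/]\d{2}[-/]\d{2}[T\s]\d{2}:\d{2}:\d{2}",
        RegexOptions.Compiled);

    // Returns the first timestamp-shaped substring, or null if there is none.
    public static string FirstMatchOrNull(string line)
    {
        Match m = Pattern.Match(line);
        return m.Success ? m.Value : null;
    }
}
```

Whether this is actually faster than the original pattern on your logs is something to measure rather than assume. If strict range validation matters, the matched text could be checked afterwards, for example with DateTime.TryParseExact.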

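Your "Path" debug output is probably concatenating a huge number of strings. Change it to a StringBuilder instead of += concatenation and see whether that is the cause of the memory problem.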
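
The StringBuilder suggestion could look roughly like the sketch below. The class TimestampCollector and its methods are hypothetical names, not part of the question's code. The pattern is the one from the question with redundant groups removed and written as a verbatim string.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class TimestampCollector
{
    // Timestamp pattern from the question, compiled once and reused.
    private static readonly Regex Timestamp = new Regex(
        @"(?:2|1)\d{3}(?:-|\/)(?:0[1-9]|1[0-2])(?:-|\/)(?:0[1-9]|[1-2][0-9]|3[0-1])(?:T|\s)(?:[0-1][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]",
        RegexOptions.Compiled);

    // Appends every match to a single StringBuilder instead of building
    // a new string for each "+=".
    public static string Collect(IEnumerable<string> lines)
    {
        var sb = new StringBuilder();
        foreach (string line in lines)
        {
            foreach (Match m in Timestamp.Matches(line))
            {
                sb.AppendLine(m.Value);
            }
        }
        return sb.ToString();
    }

    // Hypothetical helper: streams each file in the folder line by line.
    public static string CollectFromFolder(string folder)
    {
        var sb = new StringBuilder();
        foreach (string file in Directory.GetFiles(folder))
        {
            sb.Append(Collect(File.ReadLines(file)));
        }
        return sb.ToString();
    }
}
```

In the form, the text box would then be assigned once, for example Path.Text = TimestampCollector.CollectFromFolder(fbd.SelectedPath);, rather than once per match.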

Have you looked at MS Log Parser 2.2 as an alternative?