How to remove duplicate lines from a text file with more than 200K lines and 1 GB in size

Keywords: GB, text, file, delete, size, 200K | Updated: 2023-09-27 18:18:38

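I am currently using the code below. It works, but I have only run it on a text file of about 300 lines, and even that takes 2 minutes to execute. My real text file has more than 200K lines, so this code is not suitable for it. Please help me solve this problem. Thanks in advance.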

string[] source = System.IO.File.ReadAllLines(@"C:\Documents and Settings\finaloutput.txt");
var q1 = (from line in source
          let fields = line.Split(',')
          select new
          {
              autoid = fields[0],
              ATMID = fields[4],
              DATE = fields[2],
              TIME = fields[3],
              CARDNo = fields[5],
              TRANSId = fields[6],
              SEQNo = fields[7],
              TRANSIT = fields[8],
              CheckNo = fields[9],
              CATEGORY = fields[10],
              SCORE = fields[11],
              //THRESHOLD = fields[12]
          });

    var ids = (from d in q1
               where d.CATEGORY != "Accepted"
               group d by new { d.ATMID, d.DATE, d.CARDNo, d.TRANSIT, d.CheckNo } into grp
               select grp.Min(x => x.autoid));

    var toDelete = (from d in q1
                    where !ids.Contains(d.autoid) && d.CATEGORY != "Accepted"
                    select d.autoid);
    // source1.DeleteOnSubmit(toDelete);
    var distinct = (from d in q1
                    where !toDelete.Contains(d.autoid)
                    select d);

    // Makes a list of the DeletedFields  
    // var list_Of_CSV_ItemsDeleted = distinct.Select(x => string.Join(",", x.autoid));
    // Makes a list of the distinct Fields  
    var list_Of_CSV_ItemsDistinct = distinct.Select(x => string.Join(",", x.autoid, x.ATMID, x.DATE, x.TIME, x.CARDNo, x.TRANSId, x.SEQNo, x.TRANSIT, x.CheckNo, x.CATEGORY, x.SCORE)); 
    System.IO.File.WriteAllLines(@"C:\Documents and Settings\distict1.txt", list_Of_CSV_ItemsDistinct);


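I'm not going to rewrite this for you, but one thing you need to do is take advantage of deferred execution. Consider the following code: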

var enumerable = File.ReadLines(filePath);

This returns an IEnumerable<string>, so it reads a line from the file only when you ask for one. Now consider this code:

var next100 = enumerable.Take(100);

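This takes 100 lines and lets you work with them. That is what you want to do. You can still use almost the same LINQ queries, but on only one section of the file at a time.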
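To make the deferred-execution point concrete, here is a minimal sketch. The class and method names (`DeferredDemo`, `FirstLines`) are hypothetical and only serve the illustration; `File.ReadLines` and `Take` are the calls discussed above.

```csharp
using System;
using System.IO;
using System.Linq;

public static class DeferredDemo
{
    // Hypothetical helper: returns at most `count` lines from the start of the file.
    public static string[] FirstLines(string filePath, int count)
    {
        // File.ReadLines is lazy: no line is read until enumeration starts.
        var enumerable = File.ReadLines(filePath);

        // Take stops the enumeration after `count` lines, so the whole
        // file is never loaded into memory the way ReadAllLines would.
        return enumerable.Take(count).ToArray();
    }
}
```

Calling `DeferredDemo.FirstLines(path, 100)` on a 1 GB file should return quickly, because only the beginning of the file is actually read.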

So, instead of this:

var q1 = (from line in source ...

it might be something like this:

var q1 = (from line in source.Take(100) ...
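Note that `source.Take(100)` on its own only covers the first 100 lines. To walk the whole file one section at a time, something along the following lines could work. This is a sketch under stated assumptions, not a drop-in replacement: `BatchedDedup`, `Batch` and `RemoveDuplicates` are hypothetical names, the column positions are copied from the question, and a `HashSet<string>` is swapped in to remember keys across batches (a per-batch query alone would miss duplicates that fall into different batches). It keeps the first occurrence of each key, which matches the question's "minimum autoid" rule only if the file is in ascending autoid order.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class BatchedDedup
{
    // Hypothetical helper: lazily groups a sequence into lists of up to `size` items.
    public static IEnumerable<List<T>> Batch<T>(IEnumerable<T> source, int size)
    {
        var batch = new List<T>(size);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == size)
            {
                yield return batch;
                batch = new List<T>(size);
            }
        }
        if (batch.Count > 0)
            yield return batch; // final, possibly shorter batch
    }

    // Streams inputPath to outputPath and returns the number of lines written.
    // Rows whose CATEGORY is "Accepted" are always kept; other rows are kept
    // only the first time their key is seen. Lines are written unchanged,
    // whereas the question's code re-joins a subset of the fields.
    public static int RemoveDuplicates(string inputPath, string outputPath, int batchSize = 100)
    {
        var seenKeys = new HashSet<string>(); // keys seen so far, across all batches
        int written = 0;

        using (var writer = new StreamWriter(outputPath))
        {
            foreach (var batch in Batch(File.ReadLines(inputPath), batchSize))
            {
                var kept = from line in batch
                           let fields = line.Split(',')
                           // Same key columns as the question's group-by:
                           // ATMID, DATE, CARDNo, TRANSIT, CheckNo.
                           let key = string.Join(",", fields[4], fields[2], fields[5], fields[8], fields[9])
                           // HashSet.Add returns false when the key is already present.
                           where fields[10] == "Accepted" || seenKeys.Add(key)
                           select line;

                foreach (var line in kept)
                {
                    writer.WriteLine(line);
                    written++;
                }
            }
        }
        return written;
    }
}
```

A call might look like `BatchedDedup.RemoveDuplicates(inputPath, outputPath, 1000)`. Memory use then grows with the number of distinct keys rather than with the size of the file, and each line is examined once instead of being rescanned by repeated `Contains` calls.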