How to remove duplicate lines from a text file with over 200K lines and a size of 1 GB
Keywords: GB, text file, delete, 200K | Updated: 2023-09-27 18:18:38
I am currently using the following code. It works, but only on a 300-line text file, and even then it takes 2 minutes to run. My actual text file has more than 200K lines, so this code is not suitable for it. Please help me solve this problem. Thanks in advance.
string[] source = System.IO.File.ReadAllLines(@"C:\Documents and Settings\finaloutput.txt");
var q1 = (from line in source
          let fields = line.Split(',')
          select new
          {
              autoid = fields[0],
              ATMID = fields[4],
              DATE = fields[2],
              TIME = fields[3],
              CARDNo = fields[5],
              TRANSId = fields[6],
              SEQNo = fields[7],
              TRANSIT = fields[8],
              CheckNo = fields[9],
              CATEGORY = fields[10],
              SCORE = fields[11],
              //THRESHOLD = fields[12]
          });
var ids = (from d in q1
           where d.CATEGORY != "Accepted"
           group d by new { d.ATMID, d.DATE, d.CARDNo, d.TRANSIT, d.CheckNo } into grp
           select grp.Min(x => x.autoid));
var toDelete = (from d in q1
                where !ids.Contains(d.autoid) && d.CATEGORY != "Accepted"
                select d.autoid);
// source1.DeleteOnSubmit(toDelete);
var distinct = (from d in q1
                where !toDelete.Contains(d.autoid)
                select d);
// Makes a list of the deleted fields
// var list_Of_CSV_ItemsDeleted = distinct.Select(x => string.Join(",", x.autoid));
// Makes a list of the distinct fields
var list_Of_CSV_ItemsDistinct = distinct.Select(x => string.Join(",", x.autoid, x.ATMID, x.DATE, x.TIME, x.CARDNo, x.TRANSId, x.SEQNo, x.TRANSIT, x.CheckNo, x.CATEGORY, x.SCORE));
System.IO.File.WriteAllLines(@"C:\Documents and Settings\distict1.txt", list_Of_CSV_ItemsDistinct);
I'm not going to rewrite this for you, but one thing you need to do is take advantage of deferred execution. Consider the following code:
var enumerable = File.ReadLines(filePath);
This returns an IEnumerable<string>, so it reads a line from the file only when you ask for one. Now consider this code:
var next100 = enumerable.Take(100);
This will pull 100 lines and let you work with them. That is what you need to do. You can still use almost the same LINQ query, but operating on one section at a time.
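One way to sketch that "one section at a time" idea (the chunk size of 100 and the variable `filePath` are illustrative, not prescriptive):

```csharp
// Pull sections of 100 lines at a time from the lazy enumerable.
// Note: each Skip(offset) pass re-enumerates the file from the start,
// so for very large files a single-pass enumerator-based chunker is cheaper.
var source = System.IO.File.ReadLines(filePath);
for (int offset = 0; ; offset += 100)
{
    var section = source.Skip(offset).Take(100).ToList();
    if (section.Count == 0)
        break;
    // run the same LINQ query against 'section' here
}
```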
So, instead of this:
var q1 = (from line in source ...
You might have something like this:
var q1 = (from line in source.Take(100) ...
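The streaming idea can be taken further: instead of batching the original LINQ query, a single pass with a HashSet keyed on the grouping columns removes duplicates in one read with bounded memory. A minimal sketch, assuming the file is ordered by autoid so the first occurrence of each key is also the one with the smallest autoid (paths and column indices are taken from the question):

```csharp
using System.Collections.Generic;
using System.IO;

class Dedup
{
    static void Main()
    {
        string input  = @"C:\Documents and Settings\finaloutput.txt";
        string output = @"C:\Documents and Settings\distict1.txt";

        // Tracks the composite key (ATMID, DATE, CARDNo, TRANSIT, CheckNo)
        // of every non-"Accepted" row seen so far.
        var seen = new HashSet<string>();

        using (var writer = new StreamWriter(output))
        {
            // File.ReadLines streams one line at a time, so memory use
            // stays flat regardless of file size.
            foreach (var line in File.ReadLines(input))
            {
                var f = line.Split(',');
                if (f[10] == "Accepted")
                {
                    writer.WriteLine(line); // "Accepted" rows are always kept
                    continue;
                }
                string key = string.Join("|", f[4], f[2], f[5], f[8], f[9]);
                if (seen.Add(key))          // Add returns true only on first occurrence
                    writer.WriteLine(line);
            }
        }
    }
}
```

The HashSet holds one short string per distinct key rather than the whole file, which is what lets this scale to 200K+ lines where ReadAllLines plus repeated `Contains` scans did not.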