Looping between File1.txt and File2.txt is really slow; both files are 280MB
Updated: 2023-09-27 18:27:45
I have two large text files, each with 400,000 lines of text. For the current line in File1.txt, I need to find the line in File2.txt that contains the same userId. Once I've found the matching line in File2.txt, I do some calculations and write the line to a new text file.
The code I wrote runs very slowly. I've tried rewriting it in various ways, but it always grinds along and never finishes. How can I make it fast?
private void btnExecute_Click(object sender, EventArgs e) {
    string line1 = "";
    string line2 = "";
    //the new text file we are creating. Located in IVR_Text_Update\bin\Debug
    StreamWriter sw = new StreamWriter("NewFile.txt");
    //the new text file which contains the registrants which need removing
    StreamWriter sw_removeRegs = new StreamWriter("RemoveRegistrants.txt");
    //address has changed so we write the line to the address file
    StreamWriter sw_addressChange = new StreamWriter("AddressChanged.txt");
    List<string> lines_secondFile = new List<string>();
    using (StreamReader sr = new StreamReader(openFileDialog2.FileName)) {
        string line;
        while ((line = sr.ReadLine()) != null) {
            lines_secondFile.Add(line);
        }
    }
    //loop through the frozen file one line at a time
    while ((line1 = sr1.ReadLine()) != null) {
        //get the line from the update file, assign it to line2
        //function accepts (userId, List)
        line2 = getLine(line1.Substring(3, 8), lines_secondFile);
        //if line2 is null then userId was not found therefore we write
        //the line to the Remove Registrants file
        if (line2 == null) {
            sw_removeRegs.Write(line1 + Environment.NewLine);
        }
        //address between the two lines was found to be different so we still write
        //them to the new text file but don't update codes
        else if (line1.Substring(93, 53) != line2.Substring(93, 53)) {
            sw_addressChange.Write(line1 + Environment.NewLine);
            sw.Write(line1 + Environment.NewLine);
        }
        //test for null then write the new line in our new text file
        else if ((line1 != null) && (line2 != null)) {
            sw.Write(line1.Substring(0, 608) +
                     line2.Substring(608, 9) +
                     line2.Substring(617, 9) +
                     line2.Substring(626, 9) +
                     line2.Substring(635, 9) +
                     line2.Substring(644, 9) +
                     line2.Substring(653, 9) +
                     line2.Substring(662, 9) +
                     line2.Substring(671, 9) +
                     line2.Substring(680, 9) +
                     line1.Substring(680, 19) +
                     Environment.NewLine);
        }
    }
    textBox1.Text = "Finished.";
    sr1.Close();
    sw.Close();
    sw_removeRegs.Close();
    sw_addressChange.Close();
}
//returns the line from the update file which has the corresponding userId
//from the frozen file
string getLine(string userId, List<string> lines_secondFile) {
    foreach (string currentLine in lines_secondFile) {
        if (currentLine.Contains(userId)) {
            return currentLine;
        }
    }
    return null;
}
Disk access speed aside, your current algorithm is O(n^2): for each line in the first file, you scan the entire list for the user id. You could use some caching to avoid looking up the same userId more than once; I'm assuming you have fewer than 400k distinct users, so most lookups should be repeats:
private Dictionary<string, string> userMap = new Dictionary<string, string>();

string getLine(string userId, List<string> lines_secondFile)
{
    if (userMap.ContainsKey(userId))
        return userMap[userId];

    foreach (string currentLine in lines_secondFile)
    {
        if (currentLine.Contains(userId))
        {
            userMap.Add(userId, currentLine);
            return currentLine;
        }
    }
    return null;
}
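You can go further and index the whole second file once, which makes every lookup O(1) instead of a 400,000-line scan. A minimal sketch, assuming the userId sits at a fixed offset in File2's records too (offset 3, length 8 here, mirroring the question's `Substring(3, 8)` on File1 — adjust to your real layout):

```csharp
using System;
using System.Collections.Generic;

class FixedWidthIndex {
    // Build the index once: a single O(n) pass over the second file.
    // ASSUMPTION: the userId occupies a fixed slice of each File2 line;
    // the offsets below are illustrative, not taken from the question.
    public static Dictionary<string, string> Build(IEnumerable<string> lines) {
        var map = new Dictionary<string, string>();
        foreach (string line in lines) {
            if (line.Length >= 11) {
                string userId = line.Substring(3, 8);
                if (!map.ContainsKey(userId))
                    map[userId] = line;   // keep the first match, like getLine does
            }
        }
        return map;
    }

    static void Main() {
        var secondFile = new List<string> {
            "00X12345678 rest of record A",
            "00X87654321 rest of record B"
        };
        var index = Build(secondFile);

        // Constant-time lookup per line of the frozen file.
        string line2;
        bool found = index.TryGetValue("12345678", out line2);
        Console.WriteLine(found ? line2 : "not found -> RemoveRegistrants");
    }
}
```

With the index built up front, the overall algorithm drops from O(n^2) to O(n), which matters far more here than any caching of repeated lookups.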
Instead of reading line by line, try reading the whole file at once. That is much faster than issuing many small read requests against one file, because file access is far slower than memory access. Try File.ReadAllText.
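A minimal sketch of the bulk-read approach (a tiny generated file stands in for the real 280MB update file):

```csharp
using System;
using System.IO;

class ReadAllExample {
    static void Main() {
        // Small stand-in for the real 280MB update file.
        File.WriteAllLines("File2.txt", new[] { "record one", "record two" });

        // One bulk read instead of 400,000 ReadLine calls.
        string[] lines = File.ReadAllLines("File2.txt");
        Console.WriteLine("loaded {0} lines", lines.Length); // loaded 2 lines

        // File.ReadAllText returns the entire file as a single string,
        // if you'd rather split the records yourself.
        string everything = File.ReadAllText("File2.txt");
        Console.WriteLine("loaded {0} chars", everything.Length);
    }
}
```

Note that the question's code already buffers the second file into a List via a ReadLine loop; ReadAllLines does the same thing in one call, so the bigger win is still fixing the O(n^2) lookup.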
That said, you should profile the code to find out where the bottlenecks actually are.
If you have the resources, you can put the entire file in memory, which should improve speed. Before .NET 4 you had to use the Win32 API to memory-map a file, but .NET 4 added System.IO.MemoryMappedFiles.MemoryMappedFile.
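A minimal sketch of mapping a file with that class (a tiny generated file stands in for the real frozen file):

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class MmfExample {
    static void Main() {
        // Small stand-in for the real frozen file.
        File.WriteAllText("File1.txt", "line a\nline b\n");

        // Map the file into the process's address space; the OS pages it
        // in on demand instead of the program issuing many small reads.
        using (var mmf = MemoryMappedFile.CreateFromFile("File1.txt", FileMode.Open))
        using (var accessor = mmf.CreateViewAccessor()) {
            // Read the first byte of the mapped file directly from memory.
            byte first = accessor.ReadByte(0);
            Console.WriteLine((char)first); // l
        }
    }
}
```

One caveat: views are rounded up to the system page size, so bytes past the end of the file read back as zeros; account for the real file length when scanning a mapped view.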
You could also implement a multithreaded approach that processes parts of the file in parallel, but that adds extra complexity.
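Since each line of the frozen file is processed independently, the per-line work can be spread across cores with Parallel.ForEach. A sketch of the idea only — the per-line transform below is a stand-in, not the question's real computation, and results are collected in a thread-safe bag so the StreamWriters don't need locking:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class ParallelSketch {
    static void Main() {
        var lines = new List<string> { "rec1", "rec2", "rec3", "rec4" };

        // Thread-safe collection: many worker threads add, one thread drains.
        var results = new ConcurrentBag<string>();

        // Each iteration runs the per-line work on a thread-pool thread.
        Parallel.ForEach(lines, line1 => {
            results.Add(line1.ToUpperInvariant()); // stand-in for the real matching/writing
        });

        // Drain and write the results on a single thread afterwards.
        Console.WriteLine("{0} results", results.Count); // 4 results
    }
}
```

This only pays off once the per-line lookup is already fast (e.g. a Dictionary); parallelizing an O(n^2) scan just burns more cores on the same wasted work.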