Looping between File1.txt and File2.txt is really slow. Both files are 280MB

Updated: 2023-09-27 18:27:45

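I have two large text files, each with 400,000 lines of text. For every line in File1.txt, I need to find the line in File2.txt that contains the userId from the current File1.txt line. Once I have found the right line in File2.txt, I do some calculations and write the line to a new text file.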

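The code I wrote runs extremely slowly. I have tried rewriting it in various ways, but it always chugs along and never finishes. How can I make it fast?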

private void btnExecute_Click(object sender, EventArgs e) {        
    string line1 = "";
    string line2 = "";
    //the new text file we are creating. Located in IVR_Text_Update\bin\Debug
    StreamWriter sw = new StreamWriter("NewFile.txt");
    //the new text file which contains the registrants which need removing
    StreamWriter sw_removeRegs = new StreamWriter("RemoveRegistrants.txt");
    //address has changed so we write the line to the address file
    StreamWriter sw_addressChange = new StreamWriter("AddressChanged.txt");
    List<string> lines_secondFile = new List<string>();
    using (StreamReader sr = new StreamReader(openFileDialog2.FileName)) {
        string line;
        while ((line = sr.ReadLine()) != null) {
            lines_secondFile.Add(line);
        }
    }
    //loop through the frozen file one line at a time
    //(sr1 is the StreamReader for the frozen file; its declaration is not shown in this snippet)
    while ((line1 = sr1.ReadLine()) != null) {
        //get the line from the update file, assign it to line2
        //function accepts (userId, List)
        line2 = getLine(line1.Substring(3, 8), lines_secondFile);
        //if line2 is null then userId was not found therefore we write
        //the line to Remove Registrants file
        if (line2 == null) {
            sw_removeRegs.Write(line1 + Environment.NewLine);
        }
        //address between the two lines was found to be different so we still write
        //them to the new text file but don't update codes
        else if (line1.Substring(93, 53) != line2.Substring(93, 53)) {
            sw_addressChange.Write(line1 + Environment.NewLine);
            sw.Write(line1 + Environment.NewLine);
        }
        //test for null then write the new line in our new text file
        else if ((line1 != null) && (line2 != null)) {
            sw.Write(line1.Substring(0, 608) +                    
                     line2.Substring(608, 9) +
                     line2.Substring(617, 9) +
                     line2.Substring(626, 9) +
                     line2.Substring(635, 9) +
                     line2.Substring(644, 9) +
                     line2.Substring(653, 9) +
                     line2.Substring(662, 9) +
                     line2.Substring(671, 9) +
                     line2.Substring(680, 9) +
                     line1.Substring(680, 19) + 
                     Environment.NewLine);
        }
    }
    textBox1.Text = "Finished.";
    sr1.Close();
    sw.Close();
    sw_removeRegs.Close();
    sw_addressChange.Close();
}
//returns the line from the update file which has the corresponding userId
//from the frozen file
string getLine(string userId, List<string> lines_secondFile) {
    foreach (string currentLine in lines_secondFile) {
        if (currentLine.Contains(userId)) {
            return currentLine;
        }
    }
    return null;
}


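Leaving disk access speed aside, your current algorithm is O(n^2): for every line in the first file you scan the whole list looking for the user id. You can use some caching to avoid looking up the same user id more than once. I am assuming you have fewer than 400k users, so most lookups should be repeats: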

private Dictionary<string, string> userMap = new Dictionary<string, string>();

string getLine(string userId, List<string> lines_secondFile)
{
    //return the cached line if this userId has been looked up before
    string cachedLine;
    if (userMap.TryGetValue(userId, out cachedLine))
        return cachedLine;

    //otherwise scan the list once and remember the result
    foreach (string currentLine in lines_secondFile)
    {
        if (currentLine.Contains(userId))
        {
            userMap.Add(userId, currentLine);
            return currentLine;
        }
    }
    return null;
}
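
The cache above still falls back to a linear scan the first time each user id is seen, so the worst case stays O(n^2) when most ids are distinct. A minimal sketch of taking the same idea one step further is to index the second file once, up front, so that every lookup becomes a hash lookup. It assumes the user id sits at the same fixed position in File2.txt lines (offset 3, length 8) as it does in File1.txt, which the question does not confirm; `UserIndex`, `BuildIndex` and `GetLine` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

static class UserIndex
{
    // Assumed layout: the user id occupies 8 characters starting at offset 3.
    const int UserIdStart = 3;
    const int UserIdLength = 8;

    // Builds a userId -> line map in a single pass over the second file's lines.
    public static Dictionary<string, string> BuildIndex(IEnumerable<string> lines)
    {
        var index = new Dictionary<string, string>(StringComparer.Ordinal);
        foreach (string line in lines)
        {
            // Skip lines that are too short to hold a user id.
            if (line.Length < UserIdStart + UserIdLength)
                continue;

            string userId = line.Substring(UserIdStart, UserIdLength);

            // Keep the first line seen for each id, as the linear search did.
            if (!index.ContainsKey(userId))
                index.Add(userId, line);
        }
        return index;
    }

    // Returns the matching line, or null when the user id is not present.
    public static string GetLine(string userId, Dictionary<string, string> index)
    {
        string line;
        return index.TryGetValue(userId, out line) ? line : null;
    }
}
```

In the question's loop, the index would be built once before the `while`, and `getLine(line1.Substring(3, 8), lines_secondFile)` would become `UserIndex.GetLine(line1.Substring(3, 8), index)`. Note that this matches on position rather than with `Contains`, so it only behaves the same if the layout assumption holds.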

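Instead of reading line by line, try reading the whole file at once. That is much faster than issuing many separate read requests against a file, because file access is far slower than memory access. Try File.ReadAllText.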
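
As a rough sketch of that suggestion, the helper below uses `File.ReadAllLines`, a sibling of `File.ReadAllText` that returns the lines already split, which is closer to what the question's code needs. `WholeFileReader` and `LoadLines` are hypothetical names, and the sketch assumes the machine has enough free memory: .NET strings are UTF-16, so a 280MB single-byte text file can take roughly twice that once loaded.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class WholeFileReader
{
    // Reads every line of the file into memory with one call
    // instead of a hand-written ReadLine loop.
    public static List<string> LoadLines(string path)
    {
        string[] lines = File.ReadAllLines(path);
        return new List<string>(lines);
    }
}
```

In the question's code this would replace the `StreamReader` loop that fills `lines_secondFile`, for example `List<string> lines_secondFile = WholeFileReader.LoadLines(openFileDialog2.FileName);`.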

That said, you should profile the code to find out where the bottleneck actually is.
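
Short of a full profiler, one simple way to check where the time goes is to wrap each phase in a `System.Diagnostics.Stopwatch`. The helper below is a minimal sketch; `Timing` and `Measure` are hypothetical names, not part of the question's code.

```csharp
using System;
using System.Diagnostics;

static class Timing
{
    // Runs the action once and returns how long it took.
    public static TimeSpan Measure(Action action)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

For example, `TimeSpan lookupTime = Timing.Measure(() => getLine(userId, lines_secondFile));` times a single lookup, and timing the file-loading loop the same way shows whether reading or searching dominates.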

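If you have the resources, you can hold the whole file in memory, which should improve speed. Before .NET Framework 4 (released alongside C# 4) you had to use the WIN32 API to memory-map files, but .NET 4 added System.IO.MemoryMappedFiles.MemoryMappedFile.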
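
A minimal sketch of reading lines through a memory-mapped view might look like the following. `MappedFileReader` and `ReadLines` are hypothetical names; the sketch assumes a non-empty file that `StreamReader` can decode with its default UTF-8 handling, and it passes the file length explicitly so the view does not extend past the end of the file.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedFileReader
{
    // Maps the file into memory read-only and reads it line by line.
    public static List<string> ReadLines(string path)
    {
        var lines = new List<string>();
        long fileLength = new FileInfo(path).Length;

        // Capacity 0 means "use the current size of the file on disk".
        using (MemoryMappedFile mappedFile = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        // Limit the view to the real file length.
        using (MemoryMappedViewStream view = mappedFile.CreateViewStream(
            0, fileLength, MemoryMappedFileAccess.Read))
        using (var reader = new StreamReader(view))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                lines.Add(line);
        }
        return lines;
    }
}
```

Whether this beats a plain buffered read for a sequential pass over a 280MB file is something to measure rather than assume.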

You could also implement a multithreaded approach that processes parts of the file in parallel, but that adds extra complexity.
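
One way to sketch the parallel idea without managing threads by hand is PLINQ. The example below assumes both files are already in memory and that the second file has been indexed beforehand into a dictionary keyed by user id (a hypothetical structure, not something the question's code has); `ParallelMatcher` and `MatchAll` are hypothetical names. It relies on `Dictionary` being safe for concurrent reads as long as nothing writes to it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ParallelMatcher
{
    // Returns an array aligned with firstFileLines: element i is the line of
    // the second file that matches firstFileLines[i], or null if none does.
    public static string[] MatchAll(
        IList<string> firstFileLines,
        Dictionary<string, string> secondFileByUserId)
    {
        return firstFileLines
            .AsParallel()
            .AsOrdered() // keep results in the same order as the input
            .Select(line1 =>
            {
                // The user id position (offset 3, length 8) is taken from the question.
                string line2;
                secondFileByUserId.TryGetValue(line1.Substring(3, 8), out line2);
                return line2; // null when no match was found
            })
            .ToArray();
    }
}
```

Writing to the three output files would stay on a single thread, iterating over the input lines and the returned matches together. Once lookups are this cheap the job is likely to be limited by disk I/O, so the gain from parallelism may be small.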