Merging line contents of text files in C#
Keywords: file, text, merge | Updated: 2023-09-27 18:20:06
I have two huge text files in the following format.
File 1:
ID1,20
ID2,20
ID3,30
File 2:
ID3,75
ID1,84
ID2,70
Both files contain more than 200,000 lines. I need to read both files and create a third file in this format:
File 3:
ID1,20,84
ID2,20,70
ID3,30,75
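An ID can be any string entered by the user. The third file should be created by matching the ID on each line of File 1 with the ID on a line of File 2. I have written some code, but generating File 3 takes a long time. The task at hand involves parallelization, so I want the code to save me as much time as possible. Please suggest a faster, more efficient way to handle this.
(This is the code I am using.)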
    public void positionCure(string afile,string bfile,string dfile)
    {
        string alphaFilePath = afile;
        List<string> alphaFileContent = new List<string>();
        using (FileStream fs = new FileStream(alphaFilePath, FileMode.Open))
        using(StreamReader rdr = new StreamReader(fs))
        {
            while(!rdr.EndOfStream)
            {
                alphaFileContent.Add(rdr.ReadLine());
            }
        }
        string betaFilePath = bfile;
        StringBuilder sb = new StringBuilder();
        using (FileStream fs = new FileStream(betaFilePath, FileMode.Open))
        using (StreamReader rdr = new StreamReader(fs))
        {
            while(! rdr.EndOfStream)
            {
                string[] betaFileLine = rdr.ReadLine().Split(Convert.ToChar(","));
                foreach (string alphaline in alphaFileContent)
                {
                    string[] alphaFileLine = alphaline.Split(Convert.ToChar(","));
                    if (alphaFileLine[0].Equals(betaFileLine[0].ToString()))
                    {
                        sb.AppendLine(String.Format("{0}, {1}, {2}", betaFileLine[0], betaFileLine[1], alphaline.Substring(alphaline.IndexOf(Convert.ToChar(","))+1)));
                    }
                }
            }
        }
        using (FileStream fs = new FileStream(dfile, FileMode.Create))
        using (StreamWriter writer = new StreamWriter(fs))
        {
            writer.Write(sb.ToString());
        }
    }
}
I would do something like this:
string[] files = new string[] { @"c:\temp\file1.txt", @"c:\temp\file2.txt" };
// Maps each ID to the values seen for it (the inner dictionary is used as a set).
var hash = new Dictionary<string, Dictionary<string, bool>>();
foreach (var file in files)
{
    string[] fileContents = File.ReadAllLines(file);
    foreach (string line in fileContents)
    {
        string[] a = line.Split(',');
        if (!hash.ContainsKey(a[0]))
            hash[a[0]] = new Dictionary<string, bool>();
        // Note: if an ID has the same value in both files, that value is kept only once.
        hash[a[0]][a[1]] = true;
    }
}
foreach (var key in hash.Keys)
    Console.WriteLine(key + "," + string.Join(",", hash[key].Keys.ToArray()));
I would suggest using a Dictionary:
var combined = new Dictionary<string, string>();
// Loop through each of the rows in the first file, then the second file.
// The file names here are placeholders, not paths from the question.
foreach (var path in new[] { "file1.txt", "file2.txt" })
{
    foreach (var line in File.ReadLines(path))
    {
        // Split the line into the two variables.
        string[] parts = line.Split(',');
        string id = parts[0];
        string number = parts[1];
        if (combined.ContainsKey(id))
        {
            combined[id] = combined[id] + "," + number; // or do something else, this is just an example
            // Changing it from a Dictionary<string, string> to a Dictionary<string, List<string>> might be more performant, especially if you have a bunch of files you want to do this to, but it also might not be necessary.
        }
        else
        {
            combined[id] = number;
        }
    }
}
// You can then run through the dictionary and write it out.
using (var file = new StreamWriter("file3.txt")) // placeholder output path
{
    foreach (var pair in combined)
    {
        file.Write(pair.Key);
        file.Write(",");
        file.WriteLine(pair.Value);
    }
}
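The comment in the code above suggests that a Dictionary<string, List<string>> might be a better fit when more files are involved. A minimal sketch of that variant follows; the class name ListMerge and its two methods are hypothetical names chosen for illustration, and the sketch assumes every useful line has the form ID,value (lines without a comma are skipped).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ListMerge
{
    // Hypothetical helper: collects the values for each ID across any number of inputs.
    public static Dictionary<string, List<string>> Combine(params IEnumerable<string>[] sources)
    {
        var combined = new Dictionary<string, List<string>>();
        foreach (var source in sources)
        {
            foreach (var line in source)
            {
                int comma = line.IndexOf(',');
                if (comma < 0) continue; // assumption: skip lines that are not "ID,value"

                string id = line.Substring(0, comma);
                string number = line.Substring(comma + 1);

                // One lookup per line: create the list the first time an ID is seen.
                if (!combined.TryGetValue(id, out List<string> values))
                {
                    values = new List<string>();
                    combined[id] = values;
                }
                values.Add(number);
            }
        }
        return combined;
    }

    // Formats each entry as "ID,value1,value2,...".
    public static IEnumerable<string> Format(Dictionary<string, List<string>> combined)
    {
        return combined.Select(pair => pair.Key + "," + string.Join(",", pair.Value));
    }
}
```

With real files, this could be driven by something like File.WriteAllLines(outputPath, ListMerge.Format(ListMerge.Combine(File.ReadLines(path1), File.ReadLines(path2)))), where the three paths are whatever your program uses. Appending to a list avoids rebuilding a growing string for each ID, which matters more as the number of files increases.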
There are several good solutions here. Here is another simple one:
Load the contents into a dictionary:
private Dictionary<string, string> LoadFile(string path)
{
    string line;
    Dictionary<string, string> vals = new Dictionary<string, string>();
    using (StreamReader file = new StreamReader(path))
    {
        while ((line = file.ReadLine()) != null)
        {
            string[] parts = line.Split(',');
            vals.Add(parts[0], parts[1]);
        }
    }
    return vals;
}
Then, in your program, load each file and merge:
Dictionary<string, string> fileAValues = LoadFile(@"C:\Temp\FileA.txt");
Dictionary<string, string> fileBValues = LoadFile(@"C:\Temp\FileB.txt");
using (StreamWriter sr = new StreamWriter(@"C:\Temp\FileC.txt"))
{
    foreach (string key in fileAValues.Keys)
    {
        // Only IDs present in both files are written.
        if (fileBValues.ContainsKey(key))
        {
            string combined = key + "," +
                String.Join(",", fileAValues[key],
                    fileBValues[key]);
            sr.WriteLine(combined);
        }
    }
}
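In case anyone is interested in a VB.NET version (I'm always too slow with C#), here it is for completeness. I'm also using the dictionary approach.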
Dim dic1 As New Dictionary(Of String, List(Of String))
Dim file1 = System.IO.File.ReadAllLines("C:\Temp\File1.txt")
For Each l In file1
    Dim cols = l.Split(","c)
    If cols.Any Then
        Dim key = cols(0)
        If Not dic1.ContainsKey(key) Then
            Dim values = (From col In cols Skip (1)).ToList
            dic1.Add(key, values)
        End If
    End If
Next
Dim file2 = System.IO.File.ReadAllLines("C:\Temp\File2.txt")
For Each l In file2
    Dim cols = l.Split(","c)
    If cols.Any Then
        Dim key = cols(0)
        If dic1.ContainsKey(key) Then
            ' append
            Dim values = (From col In cols Skip (1)).ToList
            dic1(key).AddRange(values)
        Else
            Dim values = (From col In cols Skip (1)).ToList
            dic1.Add(key, values)
        End If
    End If
Next
Using writer = New System.IO.StreamWriter("C:\Temp\File3.txt")
    For Each entry In dic1
        writer.WriteLine(String.Format("{0},{1}", entry.Key, String.Join(",", entry.Value)))
    Next
End Using
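By structuring this as a LINQ query, you can take advantage of the AsParallel method to execute it on multiple threads. Given how much data you have, this should noticeably improve the performance of your algorithm.
First, we need to get a little more organized. We can model the values you are working with: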
public class InputLine
{
    public string Id { get; set; }
    public string Value { get; set; }
}
public class OutputLine
{
    public string Id { get; set; }
    public string Value1 { get; set; }
    public string Value2 { get; set; }
}
We can also define producers and consumers of these values:
public class InputFile
{
    private readonly string _path;
    public InputFile(string path)
    {
        _path = path;
    }
    public IEnumerable<InputLine> GetLines()
    {
        return
            from line in File.ReadAllLines(_path)
            let parts = line.Split(',')
            select new InputLine { Id = parts[0], Value = parts[1] };
    }
}
public class OutputFile
{
    private readonly string _path;
    public OutputFile(string path)
    {
        _path = path;
    }
    public void WriteLines(IEnumerable<OutputLine> lines)
    {
        File.WriteAllLines(_path, lines.Select(line => String.Join(",", line.Id, line.Value1, line.Value2)));
    }
}
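Now we have the ingredients to write a query that ties everything together. The query has two key aspects:
- It uses the .AsParallel() extension method to execute in parallel
- It uses the join operator to correlate the keys between the two input files
All we need are the two input files and the output file: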
public void WriteResults(InputFile file1, InputFile file2, OutputFile resultFile)
{
    var resultLines =
        from file1Line in file1.GetLines().AsParallel()
        // Both sides of a PLINQ join must be parallel queries; the overload that
        // takes a plain IEnumerable as the second source is obsolete and throws.
        join file2Line in file2.GetLines().AsParallel() on file1Line.Id equals file2Line.Id
        select new OutputLine
        {
            Id = file1Line.Id,
            Value1 = file1Line.Value,
            Value2 = file2Line.Value
        };
    resultFile.WriteLines(resultLines);
}
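Behind the scenes, the join operator uses an approach similar to a dictionary, and you also benefit from doing the correlation on multiple threads.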
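To try the join without touching the file system, the same idea can be sketched over in-memory lines. The class ParallelJoinSketch and its methods are hypothetical names for illustration, and the sketch assumes every line has the form ID,value. Because PLINQ does not preserve input order by default, the results are sorted at the end purely to make the output deterministic; drop the sort if order does not matter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParallelJoinSketch
{
    // Joins "ID,value" lines from two sources on the ID and returns "ID,value1,value2" lines.
    public static List<string> Merge(IEnumerable<string> file1Lines, IEnumerable<string> file2Lines)
    {
        var left = file1Lines.Select(line => Parse(line));
        var right = file2Lines.Select(line => Parse(line));

        return left.AsParallel()
            // Both sources are converted with AsParallel() so the PLINQ Join overload is used.
            .Join(right.AsParallel(),
                  a => a.Key,
                  b => b.Key,
                  (a, b) => a.Key + "," + a.Value + "," + b.Value)
            // Sorting only makes the output order repeatable.
            .OrderBy(line => line, StringComparer.Ordinal)
            .ToList();
    }

    // Splits a line at its first comma into (ID, value).
    private static KeyValuePair<string, string> Parse(string line)
    {
        int comma = line.IndexOf(',');
        return new KeyValuePair<string, string>(line.Substring(0, comma), line.Substring(comma + 1));
    }
}
```

For the sample data in the question, ParallelJoinSketch.Merge(file1Lines, file2Lines) yields ID1,20,84, ID2,20,70 and ID3,30,75. Whether the parallel version actually beats a single-threaded dictionary lookup depends on the data size and machine, so it is worth measuring both.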