Merging two text files and removing duplicates
Keywords: delete, file, text, two, merge | Updated: 2023-09-27 17:53:08
I have 2 text files that look like this (the large numbers such as 1466786391 are unique timestamps):
--- 10.0.0.6 ping statistics ---
50 packets transmitted, 49 packets received, 2% packet loss
round-trip min/avg/max = 20.917/70.216/147.258 ms
1466786342
PING 10.0.0.6 (10.0.0.6): 56 data bytes
....
--- 10.0.0.6 ping statistics ---
50 packets transmitted, 50 packets received, 0% packet loss
round-trip min/avg/max = 29.535/65.768/126.983 ms
1466786391
The second file:
--- 10.0.0.6 ping statistics ---
50 packets transmitted, 49 packets received, 2% packet loss
round-trip min/avg/max = 20.917/70.216/147.258 ms
1466786342
PING 10.0.0.6 (10.0.0.6): 56 data bytes
--- 10.0.0.6 ping statistics ---
50 packets transmitted, 50 packets received, 0% packet loss
round-trip min/avg/max = 29.535/65.768/126.983 ms
1466786391
PING 10.0.0.6 (10.0.0.6): 56 data byte
--- 10.0.0.6 ping statistics ---
50 packets transmitted, 44 packets received, 12% packet loss
round-trip min/avg/max = 30.238/62.772/102.959 ms
1466786442
PING 10.0.0.6 (10.0.0.6): 56 data bytes
....
So the first file ends with timestamp 1466786391, and the second file contains that same block of data somewhere in the middle, with more data after it; everything before that particular timestamp is identical to the first file.
So the output I want is:
--- 10.0.0.6 ping statistics ---
50 packets transmitted, 49 packets received, 2% packet loss
round-trip min/avg/max = 20.917/70.216/147.258 ms
1466786342
PING 10.0.0.6 (10.0.0.6): 56 data bytes
....
--- 10.0.0.6 ping statistics ---
50 packets transmitted, 50 packets received, 0% packet loss
round-trip min/avg/max = 29.535/65.768/126.983 ms
1466786391
--- 10.0.0.6 ping statistics ---
50 packets transmitted, 44 packets received, 12% packet loss
round-trip min/avg/max = 30.238/62.772/102.959 ms
1466786442
PING 10.0.0.6 (10.0.0.6): 56 data bytes
....
That is, concatenate the two files and create a third one, removing from the second file the duplicates (the blocks of text that already exist in the first file). Here is my code:
public static void UnionFiles()
{
    string folderPath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http");
    string outputFilePath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), @"http\union.dat");
    var union = Enumerable.Empty<string>();

    foreach (string filePath in Directory
        .EnumerateFiles(folderPath, "*.txt")
        .OrderBy(x => Path.GetFileNameWithoutExtension(x)))
    {
        union = union.Union(File.ReadAllLines(filePath));
    }

    File.WriteAllLines(outputFilePath, union);
}
This is the erroneous output I get (the file structure is destroyed):
--- 10.0.0.6 ping statistics ---
50 packets transmitted, 49 packets received, 2% packet loss
round-trip min/avg/max = 20.917/70.216/147.258 ms
1466786342
PING 10.0.0.6 (10.0.0.6): 56 data bytes
--- 10.0.0.6 ping statistics ---
50 packets transmitted, 50 packets received, 0% packet loss
round-trip min/avg/max = 29.535/65.768/126.983 ms
1466786391
round-trip min/avg/max = 30.238/62.772/102.959 ms
1466786442
round-trip min/avg/max = 5.475/40.986/96.964 ms
1466786492
round-trip min/avg/max = 5.276/61.309/112.530 ms
EDIT: This code is meant to handle multiple files, but I would be happy even if it worked correctly for just 2.
However, this does not remove the text blocks the way it should; it removes several useful lines, making the output completely useless. I am stuck.
How can I achieve this? Thanks.
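For what it's worth, the failure mode can be reproduced in isolation: Enumerable.Union deduplicates individual lines across the whole sequence, so any line that recurs in several blocks (the PING header, an identical round-trip line) survives only once. A minimal sketch with made-up line data:

```csharp
using System;
using System.Linq;

class UnionDemo
{
    // Line-level union of two files' contents, as in the question's code.
    public static string[] MergeLines(string[] a, string[] b) =>
        a.Union(b).ToArray();

    static void Main()
    {
        // Two "files" where the same header line appears in different blocks.
        var file1 = new[] { "PING 10.0.0.6", "1466786342", "PING 10.0.0.6", "1466786391" };
        var file2 = new[] { "PING 10.0.0.6", "1466786442" };

        // Union keeps only the first occurrence of each distinct line, so
        // every repeated "PING 10.0.0.6" header after the first is dropped.
        Console.WriteLine(string.Join(" | ", MergeLines(file1, file2)));
        // PING 10.0.0.6 | 1466786342 | 1466786391 | 1466786442
    }
}
```

This is exactly how the block structure gets destroyed: the deduplication unit needs to be the block, not the line.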
I think you want to compare blocks, not individual lines.
Something like this should work:
public static void UnionFiles()
{
    var firstFilePath = "log1.txt";
    var secondFilePath = "log2.txt";

    var firstLogBlocks = ReadFileAsLogBlocks(firstFilePath);
    var secondLogBlocks = ReadFileAsLogBlocks(secondFilePath);

    var cleanLogBlocks = firstLogBlocks.Union(secondLogBlocks);

    var cleanLog = new StringBuilder();
    foreach (var block in cleanLogBlocks)
    {
        cleanLog.Append(block);
    }

    File.WriteAllText("cleanLog.txt", cleanLog.ToString());
}
private static List<LogBlock> ReadFileAsLogBlocks(string filePath)
{
    var allLinesLog = File.ReadAllLines(filePath);
    var logBlocks = new List<LogBlock>();
    var currentBlock = new List<string>();
    var i = 0;

    foreach (var line in allLinesLog)
    {
        if (!string.IsNullOrEmpty(line))
        {
            currentBlock.Add(line);
            if (i == 4) // assumes every block spans 5 non-empty lines
            {
                logBlocks.Add(new LogBlock(currentBlock.ToArray()));
                currentBlock.Clear();
                i = 0;
            }
            else
            {
                i++;
            }
        }
    }

    return logBlocks;
}
with LogBlock defined as follows:
public class LogBlock
{
    private readonly string[] _logs;

    public LogBlock(string[] logs)
    {
        _logs = logs;
    }

    public override string ToString()
    {
        var logBlock = new StringBuilder();
        foreach (var log in _logs)
        {
            logBlock.AppendLine(log);
        }
        return logBlock.ToString();
    }

    public override bool Equals(object obj)
    {
        return obj is LogBlock && Equals((LogBlock)obj);
    }

    private bool Equals(LogBlock other)
    {
        return _logs.SequenceEqual(other._logs);
    }

    public override int GetHashCode()
    {
        var hashCode = 0;
        foreach (var log in _logs)
        {
            hashCode += log.GetHashCode();
        }
        return hashCode;
    }
}
Note that Equals is overridden in LogBlock together with a consistent GetHashCode implementation, because Union uses both, as described here.
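That consistency requirement matters in practice: Union buckets elements by GetHashCode first and only then confirms with Equals, so overriding one without the other silently keeps duplicates. A minimal sketch with a hypothetical Block type (not the LogBlock above):

```csharp
using System;
using System.Linq;

// A small value-like type that, similar to LogBlock, overrides both
// Equals and GetHashCode so LINQ's Union can recognize duplicates.
class Block
{
    private readonly string[] _lines;
    public Block(params string[] lines) { _lines = lines; }

    public override bool Equals(object obj) =>
        obj is Block other && _lines.SequenceEqual(other._lines);

    public override int GetHashCode()
    {
        // Must agree with Equals: equal line sequences yield equal hashes.
        var hash = 0;
        foreach (var line in _lines)
            hash = unchecked(hash * 31 + line.GetHashCode());
        return hash;
    }
}

class HashDemo
{
    static void Main()
    {
        var a = new Block("ping statistics", "1466786391");
        var b = new Block("ping statistics", "1466786391"); // same content, different reference

        // With both overrides in place, Union treats a and b as one element.
        Console.WriteLine(new[] { a }.Union(new[] { b }).Count()); // 1
    }
}
```

If GetHashCode were left at its default (reference-based) implementation, the two blocks would land in different hash buckets and Equals would never even be called, so both would be kept.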
A fairly crude solution using regular expressions:
var logBlockPattern = new Regex(@"(^---.*ping statistics ---$)\s+"
                              + @"(^.+packets transmitted.+packets received.+packet loss$)\s+"
                              + @"(^round-trip min/avg/max.+$)\s+"
                              + @"(^\d+$)\s*"
                              + @"(^PING.+$)?",
                              RegexOptions.Multiline);

var logBlocks1 = logBlockPattern.Matches(FileContent1).Cast<Match>().ToList();
var logBlocks2 = logBlockPattern.Matches(FileContent2).Cast<Match>().ToList();

var mergedLogBlocks = logBlocks1.Concat(logBlocks2.Where(lb2 =>
    logBlocks1.All(lb1 => lb1.Groups[4].Value != lb2.Groups[4].Value)));

var mergedLogContents = string.Join("\n\n", mergedLogBlocks);
The Groups collection of a regex Match contains each line of a log block (because every line is wrapped in parentheses () in the pattern), plus the full match at index 0. The match group at index 4 is therefore the timestamp, which we can use to compare log blocks.
Working example: https://dotnetfiddle.net/kAkGll
There is a problem in how the unique records are joined. Could you check the code below?
public static void UnionFiles()
{
    string folderPath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http");
    string outputFilePath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), @"http\union.dat");
    var union = new List<string>();

    foreach (string filePath in Directory
        .EnumerateFiles(folderPath, "*.txt")
        .OrderBy(x => Path.GetFileNameWithoutExtension(x)))
    {
        var filter = File.ReadAllLines(filePath).Where(x => !union.Contains(x)).ToList();
        union.AddRange(filter);
    }

    File.WriteAllLines(outputFilePath, union);
}
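As a side note on this last version (not part of the original answer): union.Contains performs a linear scan per line, so the loop is quadratic in the total line count. A HashSet alongside the output list keeps each lookup O(1) while still preserving first-seen order; a minimal sketch:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class OrderedUnionDemo
{
    // Merge several line sequences, keeping the first occurrence of each
    // distinct line in encounter order.
    public static List<string> OrderedUnion(IEnumerable<IEnumerable<string>> files)
    {
        var seen = new HashSet<string>(); // O(1) membership checks
        var union = new List<string>();   // preserves first-seen order

        foreach (var file in files)
            foreach (var line in file)
                if (seen.Add(line))       // Add returns false for duplicates
                    union.Add(line);

        return union;
    }

    static void Main()
    {
        var merged = OrderedUnion(new[]
        {
            new[] { "a", "b", "c" },
            new[] { "b", "c", "d" },
        });
        Console.WriteLine(string.Join(",", merged)); // a,b,c,d
    }
}
```

Note that this is still line-level deduplication, so it inherits the same block-structure problem the question describes; it only addresses the performance of the lookup, not the granularity of the comparison.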