How to parse an unstructured CSV file
Last updated: 2023-09-27 18:19:08
I have a CSV file that looks like this:
Processname:;ABC Buying
ID:;31
Message Date:;08-02-2012
Receiver (code):;12345
Object code:
Location (code):;12345
Date;time
2012.02.08;00:00;0;0,00
2012.02.08;00:15;0;0,00
2012.02.08;00:30;0;0,00
2012.02.08;00:45;0;0,00
2012.02.08;01:00;0;0,00
2012.02.08;01:15;0;0,00
The message above can occur one or more times. Assuming it occurs twice, the CSV file would look like this:
Processname:;ABC Buying
ID:;31
Message Date:;08-02-2012
Receiver (code):;12345
Object code:
Location (code):;12345
Date;time
2012.02.08;00:00;0;0,00
2012.02.08;00:15;0;0,00
2012.02.08;00:30;0;0,00
2012.02.08;00:45;0;0,00
2012.02.08;01:00;0;0,00
2012.02.08;01:15;0;0,00
Processname:;ABC Buying
ID:;41
Message Date:;08-02-2012
Receiver (code):;12345
Object code:
Location (code):;12345
Date;time
2012.02.08;00:00;0;17,00
2012.02.08;00:15;0;1,00
2012.02.08;00:30;0;15,00
2012.02.08;00:45;0;0,00
2012.02.08;01:00;0;0,00
2012.02.08;01:15;0;9,00
What is the best way to parse this CSV file?
Pseudocode for my approach:
// Read the complete file
var lines = File.ReadAllLines(filePath);
// Split the lines at each occurrence of "Processname:;ABC Buying"
var blocks = lines.SplitAtTheOccuranceOf("Processname:;ABC Buying");
// The results will go to
var results = new List<Result>();
// Loop through the collection
foreach (var b in blocks)
{
    var result = new Result();
    foreach (var l in b.lines)
    {
        // read the first line and check it contains "Processname"; if so, assign the value to result.ProcessName
        // read the 2nd line and check it contains "ID"; if so, assign the value to result.ID
        // check for "Object code"; if so, assign the value to result.ObjectCode
        // ignore string.Empty
        // check for "Location (code)"; if so, assign the value to result.LocationCode
        // parse all the other rows by splitting on ';': the first part is the date, the 2nd part is the time, the 3rd part is the value
    }
    results.Add(result);
}
What is the best way to do this?
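First, this doesn't look like a CSV file. Second, I would simply read the whole file line by line. When you hit a line like "Processname:;ABC Buying", which looks like the first line of your object, create a new object. Then parse each following line and update the object with whatever information that line carries. When you reach another "Processname:;ABC Buying", save the object you have been working on to the results list and create a new one. At the end of the file, save the last object as well.
Your question doesn't give enough detail to go into how to parse the individual lines, but the above is the approach I would use, and I doubt you'll get much better. It's worth noting that this is pretty much what you already have, except that instead of first splitting the file into the lines belonging to each object, I would do the splitting as I go.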
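The line-by-line approach described above could be sketched roughly as follows. This is only an illustration under assumptions: the `Result` and `Reading` types, their property names, and the `BlockParser` class are made up for the example, the header keys are taken from the sample file in the question, and every remaining line is treated as a `date;time;values...` row. Unlike the description above, the sketch adds each new object to the list as soon as it is created, so the last block does not need a separate save at the end of the file.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container types; the names are illustrative, not from the question.
public class Reading
{
    public string Date { get; set; }
    public string Time { get; set; }
    public string[] Values { get; set; }   // everything after date and time, unparsed
}

public class Result
{
    public string ProcessName { get; set; }
    public string Id { get; set; }
    public string MessageDate { get; set; }
    public string ReceiverCode { get; set; }
    public string ObjectCode { get; set; }
    public string LocationCode { get; set; }
    public List<Reading> Readings { get; } = new List<Reading>();
}

public static class BlockParser
{
    public static List<Result> Parse(IEnumerable<string> lines)
    {
        var results = new List<Result>();
        Result current = null;

        foreach (var raw in lines)
        {
            var line = raw.Trim();
            if (line.Length == 0) continue;            // skip blank lines

            var parts = line.Split(';');
            var key = parts[0].Trim();
            var value = parts.Length > 1 ? parts[1].Trim() : "";

            if (key == "Processname:")
            {
                // A new block starts here: create the object and register it right away.
                current = new Result { ProcessName = value };
                results.Add(current);
                continue;
            }
            if (current == null) continue;             // ignore anything before the first block

            switch (key)
            {
                case "ID:": current.Id = value; break;
                case "Message Date:": current.MessageDate = value; break;
                case "Receiver (code):": current.ReceiverCode = value; break;
                case "Object code:": current.ObjectCode = value; break;
                case "Location (code):": current.LocationCode = value; break;
                case "Date": break;                    // the "Date;time" column header
                default:
                    if (parts.Length >= 2)
                    {
                        // Assumed data row: date;time;value;value...
                        var values = new string[parts.Length - 2];
                        Array.Copy(parts, 2, values, 0, values.Length);
                        current.Readings.Add(new Reading { Date = parts[0], Time = parts[1], Values = values });
                    }
                    break;
            }
        }
        return results;
    }
}
```

It could be called as `BlockParser.Parse(File.ReadLines(filePath))`, which reads the file lazily instead of loading it all at once.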
What I would do is have a strongly typed object to hold this data, plus a parser that takes the string and breaks it down into its individual items:
using System;
using System.Collections.Generic;
using System.Text;

// Has no behaviour - only properties
public class Record
{
    public int ID { get; set; }
    public string ProcessName { get; set; }
    // Other fields
}

// ------------------
// Only has methods ...
public class RecordParser
{
    private readonly string content;

    public RecordParser(string content)
    {
        this.content = content;
    }

    public IEnumerable<Record> SplitRecords()
    {
        var list = new List<Record>();
        foreach (string section in this.content.Split(/* ... */))
        {
            var record = CreateRecordFromSection(section);
            list.Add(record);
        }
        return list;
    }

    private static Record CreateRecordFromSection(string content)
    {
        var currentText = new StringBuilder(content);
        var record = new Record
        {
            ID = SetID(currentText),
            ProcessName = SetProcessName(currentText),
            /* Set other properties */
        };
        return record;
    }

    /* Methods for specific behaviour */
    /* Modify the StringBuilder after you have trimmed the content required from it */
    private static string SetProcessName(StringBuilder content) { throw new NotImplementedException(); }
    private static int SetID(StringBuilder content) { throw new NotImplementedException(); }
    /* Others */
}
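The argument to `Split` is left open in the code above. One possible way to cut the content into sections, assuming every record starts with a line beginning with `Processname:` as in the sample file, is a regular expression with a lookahead so that the marker line stays with its section. `SectionSplitter` and `SplitSections` are hypothetical names used only for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SectionSplitter
{
    // Splits the raw text in front of every line that starts with "Processname:".
    // The lookahead (?=...) matches a position only, so no text is consumed and
    // each section keeps its own "Processname:" line.
    public static List<string> SplitSections(string content)
    {
        return Regex.Split(content, @"(?=^Processname:)", RegexOptions.Multiline)
                    .Where(section => section.Trim().Length > 0)   // drop the empty leading piece
                    .ToList();
    }
}
```

Each returned string can then be handed to a per-section method such as `CreateRecordFromSection`.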
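Having read Clean Code by Uncle Bob, here is another approach that may be more to your liking.
This version prefers variables held locally in the class (private fields) over passing variables in and out of methods. The idea is that you quickly notice how much data your class is moving around internally: if too many variables are declared, the class is doing too much. It also favors shorter methods over longer ones.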
public class RecordParser
{
    private List<Record> records;
    private Record currentRecord;
    private string allContent;
    private string currentSection;

    public RecordParser(string content)
    {
        this.allContent = content;
    }

    public IEnumerable<Record> Split()
    {
        records = new List<Record>();
        foreach (string section in GetSections())
        {
            this.currentSection = section;
            this.currentRecord = new Record();
            ParseSection();
            records.Add(currentRecord);
        }
        return records;
    }

    private IEnumerable<string> GetSections()
    {
        // Split allContent as needed and return the string sections
        throw new NotImplementedException();
    }

    private void ParseSection()
    {
        ParseId();
        ParseProcessName();
    }

    private void ParseId()
    {
        int id = default; // Get ID from 'currentSection'
        currentRecord.ID = id;
    }

    private void ParseProcessName()
    {
        string processName = null; // Get ProcessName from 'currentSection'
        currentRecord.ProcessName = processName;
    }

    /* Add more methods with no parameters that work on the private fields */
}
This approach can take a while to get used to, because you are not passing variables in and out, but it flows really nicely.
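Neither parser spells out how a data row such as `2012.02.08;00:15;0;17,00` would be converted into typed values. A minimal sketch, assuming the first two fields are always a `yyyy.MM.dd` date and an `HH:mm` time and that all remaining fields are numbers with a comma as the decimal separator (as the sample file suggests), might look like this. `RowParser` and `ParseRow` are hypothetical names, and what the numeric columns mean is not stated in the question.

```csharp
using System;
using System.Globalization;

public static class RowParser
{
    // Assumed number format: ',' is the decimal separator, as in "17,00".
    private static readonly NumberFormatInfo CommaDecimal =
        new NumberFormatInfo { NumberDecimalSeparator = ",", NumberGroupSeparator = "." };

    public static (DateTime Timestamp, decimal[] Values) ParseRow(string line)
    {
        var parts = line.Split(';');
        if (parts.Length < 2)
            throw new FormatException("Expected at least date;time but got: " + line);

        // Combine the date and time fields into a single timestamp.
        var timestamp = DateTime.ParseExact(
            parts[0] + " " + parts[1], "yyyy.MM.dd HH:mm", CultureInfo.InvariantCulture);

        // Parse every remaining field as a decimal, independent of the machine's culture.
        var values = new decimal[parts.Length - 2];
        for (int i = 0; i < values.Length; i++)
        {
            values[i] = decimal.Parse(
                parts[i + 2],
                NumberStyles.AllowLeadingSign | NumberStyles.AllowDecimalPoint,
                CommaDecimal);
        }

        return (timestamp, values);
    }
}
```

Passing an explicit format rather than relying on the current culture keeps the result the same on machines with different regional settings.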