Splitting data with an irregular pattern
Keywords: data, split, pattern, irregular | Updated: 2023-09-27 17:57:33
Here is some real sample data:
string s1 = "CLR DRBR|r 0004 BLCK|r 0006 WHIT|r 0006";
string s2 = "WGT WHGN|c 0004 YLGN|c 0006";
string s3 = "296 312|d 0004 137.2|n 0006";
string s4 = "HGT SH|r 0004";
string s5 = "ANLP ANLP1 PNPL|r 0004";
The data always follows the pattern [Group] [Value][Pipe + letter][Key], and the [Value][Pipe + letter][Key] part may repeat any number of times.
Is there a way to break this data apart into something like this:
string[] out1 = { "CLR", "DRBR", "|r 0004", "BLCK", "|r 0006", "WHIT", "|r 0006" };
string[] out2 = { "WGT", "WHGN", "|c 0004", "YLGN", "|c 0006" };
string[] out3 = { "296", "312", "|d 0004", "137.2", "|n 0006" };
string[] out4 = { "HGT", "SH", "|r 0004" };
string[] out5 = { "ANLP", "ANLP1 PNPL", "|r 0004" };
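Note that the pattern of s5 differs slightly from the others: its value contains a space.
This is all legacy data from the 1960s, so please don't ask me how or why they stored the data this way. Many thanks.
Looking at the data, you appear to have the following rules: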
Phase 1 : Read to the first space, split there, and drop the space.
Phase 2 : Read to the next `|` and split just before it.
Phase 3 : Keep the `|`, the code letter and the space after it (3 characters), then read to the next space or EOT; split there and drop the space if there is one.
Go back to Phase 2 if there is more data.
Something like this (you will probably want more error checking than I have):
// LINQPad script: Dump() is LINQPad's output helper.
void Main()
{
    string s1 = "CLR DRBR|r 0004 BLCK|r 0006 WHIT|r 0006";
    string s2 = "WGT WHGN|c 0004 YLGN|c 0006";
    string s3 = "296 312|d 0004 137.2|n 0006";
    string s4 = "HGT SH|r 0004";
    string s5 = "ANLP ANLP1 PNPL|r 0004";
    splitit(s1).Dump();
}

string[] splitit(string input)
{
    List<string> output = new List<string>();
    int index = 0;
    // phase one: the group is everything up to the first space
    while (index < input.Length && input[index] != ' ') index++;
    output.Add(input.Substring(0, index));
    // skip space
    while (index < input.Length && input[index] == ' ') index++;
    int indexTmp = index;
    while (index < input.Length)
    {
        // phase two: the value is everything up to the next '|'
        while (index < input.Length && input[index] != '|') index++;
        if (index >= input.Length) break; // no further '|': nothing left to split
        output.Add(input.Substring(indexTmp, index - indexTmp));
        // phase three: keep '|', the code letter and the space, then read the key
        indexTmp = index;
        index = Math.Min(index + 3, input.Length);
        // bounds check comes first so the last character of the key is not cut off
        while (index < input.Length && input[index] != ' ') index++;
        output.Add(input.Substring(indexTmp, index - indexTmp));
        // skip spaces
        while (index < input.Length && input[index] == ' ') index++;
        indexTmp = index;
    }
    return output.ToArray();
}
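As a cross-check of the three phases above, the same split can also be sketched with a regular expression instead of manual index arithmetic. This is not the approach from the answer, only an alternative under stated assumptions: the class name `RegexSplitter` and its `Split` method are hypothetical, the code after the pipe is assumed to be a single ASCII letter followed by one space, and keys are assumed to contain no spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexSplitter
{
    // One repeated unit (assumed shape): optional leading spaces,
    // a value containing no '|', then '|' + letter + space + key.
    private static readonly Regex Unit =
        new Regex(@"\s*([^|]+?)(\|[A-Za-z] \S+)");

    public static string[] Split(string input)
    {
        var output = new List<string>();

        // Phase 1: the group is everything before the first space.
        int firstSpace = input.IndexOf(' ');
        if (firstSpace < 0) return new[] { input }; // group only, no values

        output.Add(input.Substring(0, firstSpace));

        // Phases 2 and 3: each match yields a value and its "|x key" part.
        foreach (Match m in Unit.Matches(input, firstSpace))
        {
            output.Add(m.Groups[1].Value);
            output.Add(m.Groups[2].Value);
        }
        return output.ToArray();
    }
}

public static class RegexSplitterDemo
{
    public static void Main()
    {
        // The irregular record: its value contains a space.
        string[] parts = RegexSplitter.Split("ANLP ANLP1 PNPL|r 0004");
        Console.WriteLine(string.Join(", ", parts));
    }
}
```

Because the value group is lazy and excludes `|`, a value with an embedded space such as `ANLP1 PNPL` stays in one piece. Records that do not match the assumed shape are silently skipped by this sketch, so real legacy data would need validation on top.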
You already have an accepted answer, but since you said my approach wouldn't work, here is how I would do it:
int index;
List<string[]> output = new List<string[]>();
List<string> current = null;
string[] fields;
//i imagine this will be in an array when you read it in from a file
string[] input = new string[5];
input[0] = "CLR DRBR|r 0004 BLCK|r 0006 WHIT|r 0006";
input[1] = "WGT WHGN|c 0004 YLGN|c 0006";
input[2] = "296 312|d 0004 137.2|n 0006";
input[3] = "HGT SH|r 0004";
input[4] = "ANLP ANLP1 PNPL|r 0004";
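Now you just loop over the records. Within each record, split the first field on its first space; for the subsequent fields, check whether a second space occurs and handle it accordingly.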
bool first = true;
//loop through each of the input records
foreach (string record in input)
{
    //split the input records based on the pipe character
    fields = record.Split("|".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
    //loop through each of the fields
    foreach (string field in fields)
    {
        if (first) //split the first field based on the first space in field
        {
            current = new List<string>();
            index = field.IndexOf(" ");
            current.Add(field.Substring(0, index).Trim());
            current.Add(field.Substring(index + 1).Trim());
            first = false;
        }
        else //split subsequent fields based on second space if it exists
        {
            index = field.IndexOf(" ", 3);
            if (index == -1)
            {
                current.Add("|" + field);
            }
            else
            {
                current.Add("|" + field.Substring(0, index).Trim());
                current.Add(field.Substring(index + 1).Trim());
            }
        }
    }
    //control break processing
    first = true;
    output.Add(current.ToArray());
}
You can easily move the inner loop into a separate function. If you test it, I think you will find this is much faster.
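A minimal sketch of that refactoring, assuming the same split-on-pipe logic as the loop above: the class name `PipeSplitter` and the method `SplitRecord` are hypothetical, and the two guards (a first field with no space, and a field too short to search from position 3) are added assumptions rather than part of the original answer.

```csharp
using System;
using System.Collections.Generic;

public static class PipeSplitter
{
    // Hypothetical helper: splits one record into group, values and "|x key" parts.
    public static string[] SplitRecord(string record)
    {
        var current = new List<string>();
        string[] fields = record.Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < fields.Length; i++)
        {
            string field = fields[i];
            if (i == 0)
            {
                // First field: "Group Value", split on the first space only.
                int index = field.IndexOf(' ');
                if (index < 0) { current.Add(field.Trim()); continue; } // assumed guard
                current.Add(field.Substring(0, index).Trim());
                current.Add(field.Substring(index + 1).Trim());
            }
            else
            {
                // Later fields: "x key" or "x key NextValue".
                // Searching from position 3 skips the space after the code letter.
                int index = field.Length >= 3 ? field.IndexOf(' ', 3) : -1;
                if (index == -1)
                {
                    current.Add("|" + field);
                }
                else
                {
                    current.Add("|" + field.Substring(0, index).Trim());
                    current.Add(field.Substring(index + 1).Trim());
                }
            }
        }
        return current.ToArray();
    }

    public static void Main()
    {
        string[] input =
        {
            "CLR DRBR|r 0004 BLCK|r 0006 WHIT|r 0006",
            "ANLP ANLP1 PNPL|r 0004"
        };
        foreach (string record in input)
            Console.WriteLine(string.Join(", ", SplitRecord(record)));
    }
}
```

Using the loop counter instead of the `first` flag removes the need for the control-break reset after each record, since every call starts from a fresh list.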