Splitting data that follows an irregular pattern

Keywords: data, split, pattern, irregular | Updated: 2023-09-27 17:57:33

Here is some real sample data:

string s1 = "CLR DRBR|r 0004  BLCK|r 0006  WHIT|r 0006";
string s2 = "WGT WHGN|c 0004 YLGN|c 0006";
string s3 = "296  312|d 0004  137.2|n 0006";
string s4 = "HGT SH|r 0004";
string s5 = "ANLP  ANLP1 PNPL|r 0004";

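The data always follows the pattern [Group] [Value][Pipe + letter][Key], and the [Value][Pipe + letter][Key] part may repeat several times.

Is there a way to break this data apart into something like the following?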

string[] out1 = { "CLR", "DRBR", "|r 0004", "BLCK", "|r 0006", "WHIT", "|r 0006" };
string[] out2 = { "WGT", "WHGN", "|c 0004", "YLGN", "|c 0006" };
string[] out3 = { "296", "312", "|d 0004", "137.2", "|n 0006" };
string[] out4 = { "HGT", "SH", "|r 0004" };
string[] out5 = { "ANLP", "ANLP1 PNPL", "|r 0004" };

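Note that s5 differs slightly from the others: its value ("ANLP1 PNPL") contains a space.

This is legacy data from the 1960s, so please don't ask me how or why it was stored this way. Many thanks.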


Looking at the data, you seem to have the following rules:

Phase 1: Read up to the first space, split there, and drop the space(s).
Phase 2: Read up to the next `|` and split just before it.
Phase 3: Keep the `|`, the letter, and the space that follow it (three characters in all), then read up to the next space or the end of the text; split there and drop any spaces.
Go back to Phase 2 if there is more data.

Something like this (you will probably want more error checking than I have):

void Main()
{
    string s1 = "CLR DRBR|r 0004  BLCK|r 0006  WHIT|r 0006";
    string s2 = "WGT WHGN|c 0004 YLGN|c 0006";
    string s3 = "296  312|d 0004  137.2|n 0006";
    string s4 = "HGT SH|r 0004";
    string s5 = "ANLP  ANLP1 PNPL|r 0004";
    // Dump() is a LINQPad extension method; elsewhere, print with
    // Console.WriteLine(string.Join(", ", splitit(s1))) instead.
    splitit(s1).Dump();
}

string[] splitit(string input)
{
    List<string> output = new List<string>();
    int index = 0;
    // Phase 1: the group is everything up to the first space.
    while (input[index] != ' ') index++;
    output.Add(input.Substring(0, index));
    // Skip the space(s) after the group.
    while (input[index] == ' ') index++;
    int indexTmp = index;
    do
    {
        // Phase 2: the value is everything up to the next '|'.
        while (input[index] != '|') index++;
        output.Add(input.Substring(indexTmp, index - indexTmp));
        // Phase 3: the key starts at '|' and runs to the next space or end of input.
        indexTmp = index;
        index = index + 3; // step over '|', the code letter and the space
        // Check the bound first so the last character of the final key is kept.
        while (index < input.Length && input[index] != ' ') index++;
        output.Add(input.Substring(indexTmp, index - indexTmp));
        // Skip the space(s) before the next value, if any.
        while (index < input.Length && input[index] == ' ') index++;
        indexTmp = index;
    } while (index < input.Length);
    return output.ToArray();
}
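
For comparison, the three phases above can probably also be written as a single regular expression. The following is only a sketch, not part of the original answer: the class name `RegexSplitSketch`, the method name `Split`, and the group names are made up for illustration, and it assumes every record matches the [Group] [Value][Pipe + letter][Key] layout exactly, with a key that contains no spaces after the code letter. It relies on the .NET `Group.Captures` collection, which keeps every capture of a repeated group.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexSplitSketch
{
    // group: first run of non-space characters
    // value: shortest run of non-pipe characters before a '|'
    // key:   '|', one code character, one space, then non-space characters
    static readonly Regex Pattern = new Regex(
        @"^\s*(?<group>\S+)\s+(?:(?<value>[^|]+?)(?<key>\|\S \S+)\s*)+$");

    public static string[] Split(string input)
    {
        Match m = Pattern.Match(input);
        if (!m.Success)
            throw new FormatException("Unexpected record layout: " + input);

        var output = new List<string> { m.Groups["group"].Value };
        // A repeated group records one Capture per repetition, in order.
        CaptureCollection values = m.Groups["value"].Captures;
        CaptureCollection keys = m.Groups["key"].Captures;
        for (int i = 0; i < values.Count; i++)
        {
            output.Add(values[i].Value.Trim());
            output.Add(keys[i].Value);
        }
        return output.ToArray();
    }
}
```

Calling `RegexSplitSketch.Split(s5)` should keep "ANLP1 PNPL" together as one value, because the value part is allowed to contain spaces and only stops at the pipe. Unlike the hand-written loop, a record that does not fit the layout raises a `FormatException` instead of running off the end of the string.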

You already have an accepted answer, but since you said my approach wouldn't work, here is how I would do it:

int index;
List<string[]> output = new List<string[]>();
List<string> current = null;
string[] fields;
//i imagine this will be in an array when you read it in from a file
string[] input = new string[5];
input[0] = "CLR DRBR|r 0004  BLCK|r 0006  WHIT|r 0006";
input[1] = "WGT WHGN|c 0004 YLGN|c 0006";
input[2] = "296  312|d 0004  137.2|n 0006";
input[3] = "HGT SH|r 0004";
input[4] = "ANLP  ANLP1 PNPL|r 0004";

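Now you just loop: handle the first field of each record, and for the subsequent fields check whether a second space occurs and handle it accordingly.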

bool first = true;
//loop through each of the input records
foreach (string record in input)
{
    //split the input records based on the pipe character
    fields = record.Split("|".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
    //loop through each of the fields
    foreach (string field in fields)
    {
        if (first) //split the first field based on the first space in field
        {
            current = new List<string>();
            index = field.IndexOf(" ");
            current.Add(field.Substring(0, index).Trim());
            current.Add(field.Substring(index + 1).Trim());
            first = false;
        }
        else  //split subsequent fields based on the second space, if it exists
        {
            index = field.IndexOf(" ", 3);
            if (index == -1)
            {
                current.Add("|" + field);
            }
            else
            {
                current.Add("|" + field.Substring(0, index).Trim());
                current.Add(field.Substring(index + 1).Trim());
            }
        }
    }
    //control break processing
    first = true;
    output.Add(current.ToArray());
}

You can easily move the inner loop into a separate function. If you test it, I think you will find this approach considerably faster.
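
As a rough sketch of that refactoring, the per-record work could be pulled out as shown below. The names `PipeSplitSketch` and `SplitRecord` are hypothetical; the logic is the same pipe split as above, with the `first` flag replaced by a check on the field position. It assumes, as the original loop does, that the first field always contains a space.

```csharp
using System;
using System.Collections.Generic;

static class PipeSplitSketch
{
    // Splits one record into group, value, key, value, key, ...
    public static string[] SplitRecord(string record)
    {
        var current = new List<string>();
        string[] fields = record.Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < fields.Length; i++)
        {
            string field = fields[i];
            if (i == 0)
            {
                // First field is "<group> <value>": split at the first space.
                int index = field.IndexOf(' ');
                current.Add(field.Substring(0, index).Trim());
                current.Add(field.Substring(index + 1).Trim());
            }
            else
            {
                // Later fields are "<letter> <key>" optionally followed by
                // spaces and the next value; search from position 3 so the
                // space after the letter is not matched.
                int index = field.IndexOf(' ', 3);
                if (index == -1)
                {
                    current.Add("|" + field);
                }
                else
                {
                    current.Add("|" + field.Substring(0, index).Trim());
                    current.Add(field.Substring(index + 1).Trim());
                }
            }
        }
        return current.ToArray();
    }
}
```

The outer loop then shrinks to `foreach (string record in input) output.Add(PipeSplitSketch.SplitRecord(record));`, and the function can be timed or unit-tested on its own.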