How to split a CSV whose columns may contain commas

Updated: 2023-09-27 18:00:45

Given

2,1016,7/31/2008 14:22,Geoff Dalgas,6/5/2011 22:21,http://stackoverflow.com,"Corvallis, OR",7679,351,81,b437f461b3fd27387c5d8ab47a293d35,34

How can I use C# to split the above information into strings as follows:

2
1016
7/31/2008 14:22
Geoff Dalgas
6/5/2011 22:21
http://stackoverflow.com
Corvallis, OR
7679
351
81
b437f461b3fd27387c5d8ab47a293d35
34

As you can see, one of the columns contains a comma (Corvallis, OR).

Based on "C# Regex Split - commas outside quotes":

string[] result = Regex.Split(samplestring, ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");


Use the Microsoft.VisualBasic.FileIO.TextFieldParser class. It handles parsing a delimited file, TextReader, or Stream in which some fields are enclosed in quotes and some are not.

For example:

using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;
string csv = "2,1016,7/31/2008 14:22,Geoff Dalgas,6/5/2011 22:21,http://stackoverflow.com,\"Corvallis, OR\",7679,351,81,b437f461b3fd27387c5d8ab47a293d35,34";
TextFieldParser parser = new TextFieldParser(new StringReader(csv));
// You can also read from a file
// TextFieldParser parser = new TextFieldParser("mycsvfile.csv");
parser.HasFieldsEnclosedInQuotes = true;
parser.SetDelimiters(",");
string[] fields;
while (!parser.EndOfData)
{
    fields = parser.ReadFields();
    foreach (string field in fields)
    {
        Console.WriteLine(field);
    }
} 
parser.Close();

This should produce the following output:

2
1016
7/31/2008 14:22
Geoff Dalgas
6/5/2011 22:21
http://stackoverflow.com
Corvallis, OR
7679
351
81
b437f461b3fd27387c5d8ab47a293d35
34

See Microsoft.VisualBasic.FileIO.TextFieldParser for more information.

You need to add a reference to Microsoft.VisualBasic in the .NET tab of the Add Reference dialog.

It is very late, but this may still be helpful to someone. We can use a Regex as below.

Regex CSVParser = new Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");
String[] Fields = CSVParser.Split(Test);

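I see that if you paste CSV-delimited text into Excel and run "Text to Columns", it asks you for a "text qualifier". It defaults to a double quote, so text inside double quotes is treated as literal. I imagine Excel implements this by going one character at a time: when it encounters a "text qualifier", it keeps going until the next "qualifier". You can probably implement this yourself with a for loop and a boolean that records whether you are inside literal text.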

public string[] CsvParser(string csvText)
{
    List<string> tokens = new List<string>();
    int last = -1;
    int current = 0;
    bool inText = false;
    while(current < csvText.Length)
    {
        switch(csvText[current])
        {
            case '"':
                inText = !inText; break;
            case ',':
                if (!inText) 
                {
                    // Take the text between the previous delimiter and this one (the comma itself is excluded)
                    tokens.Add(csvText.Substring(last + 1, current - last - 1).Trim());
                    last = current;
                }
                break;
            default:
                break;
        }
        current++;
    }
    if (last != csvText.Length - 1) 
    {
        tokens.Add(csvText.Substring(last+1).Trim());
    }
    return tokens.ToArray();
}

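You can split on every comma that is followed by an even number of quotes.

You may also want to look at the spec for the CSV format to see how it handles commas.

Useful link: C# Regex Split - commas outside quotes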
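
To make the even-number-of-quotes idea concrete, here is a minimal sketch. It assumes each record sits on a single line and that the quotes are balanced; the class name EvenQuoteSplitter and its methods are made up for illustration, and only Regex.Split and the LINQ calls are real framework APIs. The lookahead accepts a comma only when the rest of the line contains an even number of double quotes, which means that comma cannot be inside a quoted field.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class EvenQuoteSplitter
{
    // A comma counts as a delimiter only when an even number of
    // double quotes follows it up to the end of the line.
    private static readonly Regex Delimiter =
        new Regex(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    public static string[] Split(string line)
    {
        return Delimiter.Split(line).Select(Unquote).ToArray();
    }

    // Removes one pair of enclosing quotes and collapses doubled quotes.
    private static string Unquote(string field)
    {
        if (field.Length >= 2 && field[0] == '"' && field[field.Length - 1] == '"')
        {
            return field.Substring(1, field.Length - 2).Replace("\"\"", "\"");
        }
        return field;
    }
}
```

Calling EvenQuoteSplitter.Split on the line from the question yields twelve fields, with Corvallis, OR as the seventh. Because the lookahead rescans the rest of the line for every comma, a character-by-character loop is probably the better choice for very long lines.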

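Use a library such as LumenWorks to do your CSV reading. It handles fields with quotes in them and, because it has been around for a long time, will likely be more robust overall than a custom solution.

There are many answers to this question and its duplicates. I tried this one, which looked promising, but found some bugs in it. I modified it heavily so that it would pass all of my tests.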

    /// <summary>
    /// Returns a collection of strings that are derived by splitting the given source string at
    /// characters given by the 'delimiter' parameter.  However, a substring may be enclosed between
    /// pairs of the 'qualifier' character so that instances of the delimiter can be taken as literal
    /// parts of the substring.  The method was originally developed to split comma-separated text
    /// where quotes could be used to qualify text that contains commas that are to be taken as literal
    /// parts of the substring.  For example, the following source:
    ///     A, B, "C, D", E, "F, G"
    /// would be split into 5 substrings:
    ///     A
    ///     B
    ///     C, D
    ///     E
    ///     F, G
    /// When enclosed inside of qualifiers, the literal for the qualifier character may be represented
    /// by two consecutive qualifiers.  The two consecutive qualifiers are distinguished from a closing
    /// qualifier character.  For example, the following source:
    ///     A, "B, ""C"""
    /// would be split into 2 substrings:
    ///     A
    ///     B, "C"
    /// </summary>
    /// <remarks>Originally based on: https://stackoverflow.com/a/43284485/2998072</remarks>
    /// <param name="source">The string that is to be split</param>
    /// <param name="delimiter">The character that separates the substrings</param>
    /// <param name="qualifier">The character that is used (in pairs) to enclose a substring</param>
    /// <param name="toTrim">If true, then whitespace is removed from the beginning and end of each
    /// substring.  If false, then whitespace is preserved at the beginning and end of each substring.
    /// </param>
    public static List<String> SplitQualified(this String source, Char delimiter, Char qualifier,
                                Boolean toTrim)
    {
        // Avoid throwing exception if the source is null
        if (String.IsNullOrEmpty(source))
            return new List<String> { "" };
        var results = new List<String>();
        var result = new StringBuilder();
        Boolean inQualifier = false;
        // The algorithm is designed to expect a delimiter at the end of each substring, but the
        // expectation of the caller is that the final substring is not terminated by delimiter.
        // Therefore, we add an artificial delimiter at the end before looping through the source string.
        String sourceX = source + delimiter;
        // Loop through each character of the source
        for (var idx = 0; idx < sourceX.Length; idx++)
        {
            // If current character is a delimiter
            // (except if we're inside of qualifiers, we ignore the delimiter)
            if (sourceX[idx] == delimiter && inQualifier == false)
            {
                // Terminate the current substring by adding it to the collection
                // (trim if specified by the method parameter)
                results.Add(toTrim ? result.ToString().Trim() : result.ToString());
                result.Clear();
            }
            // If current character is a qualifier
            else if (sourceX[idx] == qualifier)
            {
                // ...and we're already inside of qualifier
                if (inQualifier)
                {
                    // check for double-qualifiers, which is escape code for a single
                    // literal qualifier character.
                    if (idx + 1 < sourceX.Length && sourceX[idx + 1] == qualifier)
                    {
                        idx++;
                        result.Append(sourceX[idx]);
                        continue;
                    }
                    // Since we found only a single qualifier, that means that we've
                    // found the end of the enclosing qualifiers.
                    inQualifier = false;
                    continue;
                }
                else
                    // ...we found an opening qualifier
                    inQualifier = true;
            }
            // If current character is neither qualifier nor delimiter
            else
                result.Append(sourceX[idx]);
        }
        return results;
    }

Here are the test methods that show it works:

    [TestMethod()]
    public void SplitQualified_00()
    {
        // Example with no substrings
        String s = "";
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "" }, substrings);
    }
    [TestMethod()]
    public void SplitQualified_00A()
    {
        // just a single delimiter
        String s = ",";
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "", "" }, substrings);
    }
    [TestMethod()]
    public void SplitQualified_01()
    {
        // Example with no whitespace or qualifiers
        String s = "1,2,3,1,2,3";
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2", "3", "1", "2", "3" }, substrings);
    }
    [TestMethod()]
    public void SplitQualified_02()
    {
        // Example with whitespace and no qualifiers
        String s = " 1, 2 ,3,  1  ,2\t,   3   ";
        // whitespace should be removed
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2", "3", "1", "2", "3" }, substrings);
    }
    [TestMethod()]
    public void SplitQualified_03()
    {
        // Example with whitespace and no qualifiers
        String s = " 1, 2 ,3,  1  ,2\t,   3   ";
        // whitespace should be preserved
        var substrings = s.SplitQualified(',', '"', false);
        CollectionAssert.AreEquivalent(
            new List<String> { " 1", " 2 ", "3", "  1  ", "2\t", "   3   " },
            substrings);
    }
    [TestMethod()]
    public void SplitQualified_04()
    {
        // Example with no whitespace and trivial qualifiers.
        String s = "1,\"2\",3,1,2,\"3\"";
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2", "3", "1", "2", "3" }, substrings);
        s = "\"1\",\"2\",3,1,\"2\",3";
        substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2", "3", "1", "2", "3" }, substrings);
    }
    [TestMethod()]
    public void SplitQualified_05()
    {
        // Example with no whitespace and qualifiers that enclose delimiters
        String s = "1,\"2,2a\",3,1,2,\"3,3a\"";
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2,2a", "3", "1", "2", "3,3a" },
                                substrings);
        s = "\"1,1a\",\"2,2b\",3,1,\"2,2c\",3";
        substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1,1a", "2,2b", "3", "1", "2,2c", "3" },
                                substrings);
    }
    [TestMethod()]
    public void SplitQualified_06()
    {
        // Example with qualifiers enclosing whitespace but no delimiter
        String s = "\" 1 \",\"2 \",3,1,2,\"\t3\t\"";
        // whitespace should be removed
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2", "3", "1", "2", "3" },
                                substrings);
    }
    [TestMethod()]
    public void SplitQualified_07()
    {
        // Example with qualifiers enclosing whitespace but no delimiter
        String s = "\" 1 \",\"2 \",3,1,2,\"\t3\t\"";
        // whitespace should be preserved
        var substrings = s.SplitQualified(',', '"', false);
        CollectionAssert.AreEquivalent(new List<String> { " 1 ", "2 ", "3", "1", "2", "\t3\t" },
                                substrings);
    }
    [TestMethod()]
    public void SplitQualified_08()
    {
        // Example with qualifiers enclosing whitespace but no delimiter; also whitespace btwn delimiters
        String s = "\" 1 \", \"2 \"  ,  3,1, 2 ,\"  3  \"";
        // whitespace should be removed
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2", "3", "1", "2", "3" },
                                substrings);
    }
    [TestMethod()]
    public void SplitQualified_09()
    {
        // Example with qualifiers enclosing whitespace but no delimiter; also whitespace btwn delimiters
        String s = "\" 1 \", \"2 \"  ,  3,1, 2 ,\"  3  \"";
        // whitespace should be preserved
        var substrings = s.SplitQualified(',', '"', false);
        CollectionAssert.AreEquivalent(new List<String> { " 1 ", " 2   ", "  3", "1", " 2 ", "  3  " },
                                substrings);
    }
    [TestMethod()]
    public void SplitQualified_10()
    {
        // Example with qualifiers enclosing whitespace and delimiter
        String s = "\" 1 \",\"2 , 2b \",3,1,2,\"  3,3c  \"";
        // whitespace should be removed
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2 , 2b", "3", "1", "2", "3,3c" },
                                substrings);
    }
    [TestMethod()]
    public void SplitQualified_11()
    {
        // Example with qualifiers enclosing whitespace and delimiter; also whitespace btwn delimiters
        String s = "\" 1 \", \"2 , 2b \"  ,  3,1, 2 ,\"  3,3c  \"";
        // whitespace should be preserved
        var substrings = s.SplitQualified(',', '"', false);
        CollectionAssert.AreEquivalent(new List<String> { " 1 ", " 2 , 2b   ", "  3", "1", " 2 ", "  3,3c  " },
                                substrings);
    }
    [TestMethod()]
    public void SplitQualified_12()
    {
        // Example with tab characters between delimiters
        String s = "\t1,\t2\t,3,1,\t2\t,\t3\t";
        // whitespace should be removed
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2", "3", "1", "2", "3" }, substrings);
    }
    [TestMethod()]
    public void SplitQualified_13()
    {
        // Example with newline characters between delimiters
        String s = "\n1,\n2\n,3,1,\n2\n,\n3\n";
        // whitespace should be removed
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2", "3", "1", "2", "3" }, substrings);
    }
    [TestMethod()]
    public void SplitQualified_14()
    {
        // Example with qualifiers enclosing whitespace and delimiter, plus escaped qualifier
        String s = "\" 1 \",\"\"\"2 , 2b \"\"\",3,1,2,\"  \"\"3,3c  \"";
        // whitespace should be removed
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "\"2 , 2b \"", "3", "1", "2", "\"3,3c" },
                                substrings);
    }
    [TestMethod()]
    public void SplitQualified_14A()
    {
        // Example with qualifiers enclosing whitespace and delimiter, plus escaped qualifier
        String s = "\"\"\"1\"\"\"";
        // whitespace should be removed
        var substrings = s.SplitQualified(',', '"', true);
        CollectionAssert.AreEquivalent(new List<String> { "\"1\"" },
                                substrings);
    }

    [TestMethod()]
    public void SplitQualified_15()
    {
        // Instead of comma-delimited and quote-qualified, use pipe and hash
        // Example with no whitespace or qualifiers
        String s = "1|2|3|1|2,2f|3";
        var substrings = s.SplitQualified('|', '#', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2", "3", "1", "2,2f", "3" }, substrings);
    }
    [TestMethod()]
    public void SplitQualified_16()
    {
        // Instead of comma-delimited and quote-qualified, use pipe and hash
        // Example with qualifiers enclosing whitespace and delimiter
        String s = "# 1 #|#2 | 2b #|3|1|2|#  3|3c  #";
        // whitespace should be removed
        var substrings = s.SplitQualified('|', '#', true);
        CollectionAssert.AreEquivalent(new List<String> { "1", "2 | 2b", "3", "1", "2", "3|3c" },
                                substrings);
    }
    [TestMethod()]
    public void SplitQualified_17()
    {
        // Instead of comma-delimited and quote-qualified, use pipe and hash
        // Example with qualifiers enclosing whitespace and delimiter; also whitespace btwn delimiters
        String s = "# 1 #| #2 | 2b #  |  3|1| 2 |#  3|3c  #";
        // whitespace should be preserved
        var substrings = s.SplitQualified('|', '#', false);
        CollectionAssert.AreEquivalent(new List<String> { " 1 ", " 2 | 2b   ", "  3", "1", " 2 ", "  3|3c  " },
                                substrings);
    }

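Parsing a .csv file is tricky when the file may contain comma-separated strings, comma-separated quoted strings, or a chaotic combination of the two. The solution I came up with allows for any of the three possibilities.

I created a method, ParseCsvRow(), which returns an array from a csv string. I first deal with the double quotes in the string by splitting it on double quotes into an array called quotesArray. A quoted .csv file is only valid if it contains an even number of double quotes. A double quote inside a column value should be replaced with a pair of double quotes (this is Excel's approach). As long as the .csv file meets these requirements, you can expect delimiter commas to appear only outside pairs of double quotes. Commas inside a pair of double quotes are part of the column value and should be ignored when splitting the .csv into an array.

My method tests for commas outside pairs of double quotes by looking only at the even indexes of quotesArray. It also removes the double quotes from the start and end of column values.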
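
The quoting convention described above can be sketched from the writing side, which also shows why a well-formed row always contains an even number of double quotes. This is only an illustration under the stated rules; CsvFieldWriter, Escape, and JoinRow are hypothetical names, not part of the parser in this answer or of any library.

```csharp
using System;
using System.Linq;

// Hypothetical helper that applies the Excel-style quoting rules.
public static class CsvFieldWriter
{
    // Wraps the value in double quotes when it contains a comma, a double
    // quote, or a line break, and doubles every embedded double quote.
    public static string Escape(string value)
    {
        bool needsQuotes = value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!needsQuotes)
        {
            return value;
        }
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }

    // Joins already-separated column values into one CSV row.
    public static string JoinRow(params string[] values)
    {
        return string.Join(",", values.Select(Escape));
    }
}
```

For instance, the value Corvallis, OR is written as "Corvallis, OR", so the comma inside it sits between a pair of quotes and is not a delimiter.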

    public static string[] ParseCsvRow(string csvrow)
    {
        const string obscureCharacter = "ᖳ";
        if (csvrow.Contains(obscureCharacter)) throw new Exception("Error: csv row may not contain the " + obscureCharacter + " character");
        var unicodeSeparatedString = "";
        var quotesArray = csvrow.Split('"');  // Split string on double quote character
        if (quotesArray.Length > 1)
        {
            for (var i = 0; i < quotesArray.Length; i++)
            {
                // CSV must use double quotes to represent a quote inside a quoted cell
                // Quotes must be paired up
                // Even-indexed segments lie outside a pair of quotes, so every comma in them
                // is a delimiter: replace each one with the obscure Unicode character
                if (i % 2 == 0)
                {
                    quotesArray[i] = quotesArray[i].Replace(",", obscureCharacter);
                }
                // Build string and replace quotes where quotes were expected.
                unicodeSeparatedString += (i > 0 ? "\"" : "") + quotesArray[i].Trim();
            }
        }
        else
        {
            // String does not have any pairs of double quotes.  It should be safe to just replace the commas with the obscure character
            unicodeSeparatedString = csvrow.Replace(",", obscureCharacter);
        }
        var csvRowArray = unicodeSeparatedString.Split(obscureCharacter[0]); 
        for (var i = 0; i < csvRowArray.Length; i++)
        {
            var s = csvRowArray[i].Trim();
            if (s.StartsWith("\"") && s.EndsWith("\""))
            {
                csvRowArray[i] = s.Length > 2 ? s.Substring(1, s.Length - 2) : "";  // Remove start and end quotes.
            }
        }
        
        return csvRowArray;
    }

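One drawback of my approach is that it temporarily replaces delimiter commas with an obscure Unicode character. This character needs to be so obscure that it would never show up in your .csv file. You may want to put more handling around this.

I had a problem with a CSV that contains fields with a quote character in them, so using the TextFieldParser, I came up with the following: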

private static string[] parseCSVLine(string csvLine)
{
  using (TextFieldParser TFP = new TextFieldParser(new MemoryStream(Encoding.UTF8.GetBytes(csvLine))))
  {
    TFP.HasFieldsEnclosedInQuotes = true;
    TFP.SetDelimiters(",");
    try 
    {           
      return TFP.ReadFields();
    }
    catch (MalformedLineException)
    {
      StringBuilder m_sbLine = new StringBuilder();
      for (int i = 0; i < TFP.ErrorLine.Length; i++)
      {
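        // Double a quote that is not next to a comma; the upper-bound check keeps ErrorLine[i + 1] in range
        if (i > 0 && i < TFP.ErrorLine.Length - 1 && TFP.ErrorLine[i] == '"' && (TFP.ErrorLine[i + 1] != ',' && TFP.ErrorLine[i - 1] != ','))
          m_sbLine.Append("\"\"");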
        else
          m_sbLine.Append(TFP.ErrorLine[i]);
      }
      return parseCSVLine(m_sbLine.ToString());
    }
  }
}

A StreamReader is still used to read the CSV line by line, as follows:

using(StreamReader SR = new StreamReader(FileName))
{
  while (SR.Peek() >-1)
    myStringArray = parseCSVLine(SR.ReadLine());
}

With Cinchoo ETL, an open source library, column values that contain separators are handled automatically.

string csv = @"2,1016,7/31/2008 14:22,Geoff Dalgas,6/5/2011 22:21,http://stackoverflow.com,""Corvallis, OR"",7679,351,81,b437f461b3fd27387c5d8ab47a293d35,34";
using (var p = ChoCSVReader.LoadText(csv))
{
    Console.WriteLine(p.Dump());
}

Output:

Key: Column1 [Type: String]
Value: 2
Key: Column2 [Type: String]
Value: 1016
Key: Column3 [Type: String]
Value: 7/31/2008 14:22
Key: Column4 [Type: String]
Value: Geoff Dalgas
Key: Column5 [Type: String]
Value: 6/5/2011 22:21
Key: Column6 [Type: String]
Value: http://stackoverflow.com
Key: Column7 [Type: String]
Value: Corvallis, OR
Key: Column8 [Type: String]
Value: 7679
Key: Column9 [Type: String]
Value: 351
Key: Column10 [Type: String]
Value: 81
Key: Column11 [Type: String]
Value: b437f461b3fd27387c5d8ab47a293d35
Key: Column12 [Type: String]
Value: 34

For more information, please visit the CodeProject article.

Hope it helps.