C# Parsing a string
Keywords: string | Updated: 2023-09-27 17:55:22
I have a bunch of strings that look like this:
mc_gross=22.99invoice=ff1ca57d9fa80cf93e6b300dd7f063e1protection_eligibility=Ineligibleaddress_status=confirmedpayer_id=SGA8X3TX9HCVYtax=0.00address_street=155 5th ave sepayment_date=16:08:28 Nov 15, 2010 PSTpayment_status=Completedcharset=windows-1252address_zip=98045first_name=jackobmc_fee=1.08address_country_code=USaddress_name=john martinnotify_version=3.0custom=ff1ca5asdf7d9fa80cf93e6b300dd7f063e1payer_status=unverifiedbusiness=gold-me@hotmail.comaddress_country=United Statesaddress_city=north bendquantity=1verify_sign=AZussRXZRkuk7frhfirfxxTkj0BDJGA2dJF3eF263eEsjLixS.xRxCzfaYLpayer_email=me@gmail.comtxn_id=4DU53818WJ271531Mpayment_type=instantlast_name=Martinaddress_state=WAreceiver_email=cravbill@hotmail.compayment_fee=1.08receiver_id=QG8JPB4RZJGG4txn_type=web_acceptitem_name=Some item of consequenceSpecifiemc_currency=USDitem_number=G10W151residence_country=UShandling_amount=0.00transaction_subject=ff1ca57d9fad80cf93e6b300dd7f063e1payment_gross=22.99shipping=0.00
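What is the best way to parse this? You would think whoever created it would have put some kind of break in it...
Anyway, any help would be greatly appreciated.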
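Edit:
I appreciate everyone's posts. I was wondering whether I could do something like this:
- Create a list of tags: mc_gross=, first_name=, ...
- Perform a replace in the string for each tag:
thestring.Replace("first_name", "\r\nfirst_name")
I think this would give me the line breaks I need to parse it further.
What do you think?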
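The replace idea from the edit can be sketched roughly as follows. This is only a minimal illustration: the class and method names (KeyBreaker, InsertLineBreaks) are made up for this example, and it assumes you already have the full list of tags and that no tag is a suffix of another tag (otherwise a break could land in the middle of a longer tag).

```csharp
using System;

public static class KeyBreaker
{
    // Hypothetical helper: inserts a line break before every known "key=" token.
    public static string InsertLineBreaks(string raw, string[] keys)
    {
        string result = raw;
        foreach (string key in keys)
        {
            // Match "key=" rather than just "key" to reduce accidental hits inside values.
            result = result.Replace(key + "=", Environment.NewLine + key + "=");
        }
        // The first key also receives a leading break; trim it off.
        return result.TrimStart('\r', '\n');
    }
}
```

Calling KeyBreaker.InsertLineBreaks(thestring, tags) should then give one key=value pair per line, provided the assumptions above hold for your data.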
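Unless this is fixed width (highly doubtful), I'd say you will need to get a list of the keywords that indicate the fields. Put them in a data store (SQL, XML, CSV, etc. - it doesn't really matter where), then use them to parse the file. Hopefully the tags always come in the same order and none are left out. If so, take a substring that runs from just after the equals sign following a tag to the beginning of the next tag. That gives you the value for the corresponding tag.
So, for example, if we take just the first part, mc_gross=22.99invoice=ff1ca57d9fa80cf93e6b300dd7f063e1protection_eligibility=Ineligibleaddress_status=confirmed, our tags would be mc_gross, invoice, protection_eligibility, and address_status.
We would then start with mc_gross=, find it in the string, and extract the value with Substring. For the length, we would go until we find the next tag, invoice. The Substring line will be complicated, but it should do the job. Loop through each tag. When you get to the last tag, you need to find the end of the string instead of another tag.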
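The walk described above might look something like this. It is a sketch under the same assumptions the answer states (every tag is present and the tags appear in a known order); TagWalker and Extract are hypothetical names chosen for the example.

```csharp
using System;
using System.Collections.Generic;

public static class TagWalker
{
    // Assumes every tag is present and appears in the same order as in 'tags'.
    public static List<KeyValuePair<string, string>> Extract(string raw, string[] tags)
    {
        var result = new List<KeyValuePair<string, string>>();
        int searchFrom = 0;
        for (int i = 0; i < tags.Length; i++)
        {
            string marker = tags[i] + "=";
            int markerAt = raw.IndexOf(marker, searchFrom, StringComparison.Ordinal);
            if (markerAt < 0)
                throw new FormatException("Tag not found: " + tags[i]);

            int valueStart = markerAt + marker.Length;
            // The value ends where the next tag begins, or at the end of the string for the last tag.
            int valueEnd = raw.Length;
            if (i + 1 < tags.Length)
            {
                int nextAt = raw.IndexOf(tags[i + 1] + "=", valueStart, StringComparison.Ordinal);
                if (nextAt < 0)
                    throw new FormatException("Tag not found: " + tags[i + 1]);
                valueEnd = nextAt;
            }

            string value = raw.Substring(valueStart, valueEnd - valueStart);
            result.Add(new KeyValuePair<string, string>(tags[i], value));
            searchFrom = valueEnd;
        }
        return result;
    }
}
```

Throwing on a missing tag is a design choice for the sketch: it makes a line with an unexpected layout fail loudly rather than silently shifting values onto the wrong tags.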
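As others have said, unless you can get the original data to include line breaks in the appropriate places, the next best thing is to get a list of the key names.
I assume the other 60K lines have the same key names as the one sample line you provided? If so, and nobody can give you the list, identifying the key names yourself by hand (rather than programmatically) seems to be the only way.
I tried it myself. It wasn't too bad (a few minutes at most), but you will probably still need someone knowledgeable to confirm that the key list is correct.
Once you have the list, you can split on the keys and then recombine the pieces into a new list: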
using System;
using System.Collections.Generic;
using System.Linq;

string rawData =
"mc_gross=22.99invoice=ff1ca57d9fa80cf93e6b300dd7f063e1protection_eligibility=Ineligibleaddress_status=confirmedpayer_id=SGA8X3TX9HCVYtax=0.00address_street=155 5th ave sepayment_date=16:08:28 Nov 15, 2010 PSTpayment_status=Completedcharset=windows-1252address_zip=98045first_name=jackobmc_fee=1.08address_country_code=USaddress_name=john martinnotify_version=3.0custom=ff1ca5asdf7d9fa80cf93e6b300dd7f063e1payer_status=unverifiedbusiness=gold-me@hotmail.comaddress_country=United Statesaddress_city=north bendquantity=1verify_sign=AZussRXZRkuk7frhfirfxxTkj0BDJGA2dJF3eF263eEsjLixS.xRxCzfaYLpayer_email=me@gmail.comtxn_id=4DU53818WJ271531Mpayment_type=instantlast_name=Martinaddress_state=WAreceiver_email=cravbill@hotmail.compayment_fee=1.08receiver_id=QG8JPB4RZJGG4txn_type=web_acceptitem_name=Some item of consequenceSpecifiemc_currency=USDitem_number=G10W151residence_country=UShandling_amount=0.00transaction_subject=ff1ca57d9fad80cf93e6b300dd7f063e1payment_gross=22.99shipping=0.00";
string[] keys = {
    "mc_gross", "invoice", "protection_eligibility", "address_status", "payer_id", "tax",
    "address_street", "payment_date", "payment_status", "charset", "address_zip",
    "first_name", "mc_fee", "address_country_code", "address_name", "notify_version",
    "custom", "payer_status", "business", "address_country", "address_city", "quantity",
    "verify_sign", "payer_email", "txn_id", "payment_type", "last_name", "address_state",
    "receiver_email", "payment_fee", "receiver_id", "txn_type", "item_name",
    "mc_currency", "item_number", "residence_country", "handling_amount",
    "transaction_subject", "payment_gross", "shipping"
};
// Splitting on the key names leaves the "=value" pieces, in the order they appear in the data.
string[] values = rawData.Split(keys, StringSplitOptions.RemoveEmptyEntries);
// Pair each key with its "=value" piece. This relies on the keys array
// being in the same order as the keys in the data.
IEnumerable<string> parsedList = keys.Zip(values, (key, value) => key + value);
foreach (string item in parsedList)
{
    Console.WriteLine(item);
}
This outputs the data in the following format:
mc_gross=22.99
invoice=ff1ca57d9fa80cf93e6b300dd7f063e1
protection_eligibility=Ineligible
address_status=confirmed
payer_id=SGA8X3TX9HCVY
tax=0.00
address_street=155 5th ave se
payment_date=16:08:28 Nov 15, 2010 PST
payment_status=Completed
charset=windows-1252
address_zip=98045
first_name=jackob
mc_fee=1.08
address_country_code=US
address_name=john martin
notify_version=3.0
custom=ff1ca5asdf7d9fa80cf93e6b300dd7f063e1
payer_status=unverified
business=gold-me@hotmail.com
address_country=United States
address_city=north bend
quantity=1
verify_sign=AZussRXZRkuk7frhfirfxxTkj0BDJGA2dJF3eF263eEsjLixS.xRxCzfaYL
payer_email=me@gmail.com
txn_id=4DU53818WJ271531M
payment_type=instant
last_name=Martin
address_state=WA
receiver_email=cravbill@hotmail.com
payment_fee=1.08
receiver_id=QG8JPB4RZJGG4
txn_type=web_accept
item_name=Some item of consequenceSpecifie
mc_currency=USD
item_number=G10W151
residence_country=US
handling_amount=0.00
transaction_subject=ff1ca57d9fad80cf93e6b300dd7f063e1
payment_gross=22.99
shipping=0.00
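You can parse the list further by splitting each item on the equals sign ("="), or you can replace the original data string with one that now contains the missing line breaks: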
string newData = parsedList.Aggregate((data, next) => data + Environment.NewLine + next);
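Splitting each item on the equals sign could be done along these lines. This is a minimal sketch: PairParser and ToDictionary are hypothetical names, and the input is assumed to be a sequence of "key=value" strings such as parsedList. It splits on the first "=" only, on the assumption that a value might itself contain an equals sign.

```csharp
using System;
using System.Collections.Generic;

public static class PairParser
{
    public static Dictionary<string, string> ToDictionary(IEnumerable<string> items)
    {
        var fields = new Dictionary<string, string>();
        foreach (string item in items)
        {
            // Split on the first '=' only, so values that contain '=' stay intact.
            string[] parts = item.Split(new[] { '=' }, 2);
            if (parts.Length != 2)
                continue; // skip anything that is not in key=value form
            fields[parts[0]] = parts[1];
        }
        return fields;
    }
}
```

With that in place, something like PairParser.ToDictionary(parsedList)["payer_email"] would look up a single field by name.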
Look into System.Text.RegularExpressions; regular expressions will be very helpful here.
A simpler approach, though, is to use the Split method of the string class.
string head = "mc_gross=22.99invoice=ff1ca57d9fa80cf93e6b300dd7f063e1protection_eligibility=Ineligibleaddress_status=confirmedpayer_id=SGA8X3TX9HCVYtax=0.00address_street=155 5th ave sepayment_date=16:08:28 Nov 15, 2010 PSTpayment_status=Completedcharset=windows-1252address_zip=98045first_name=jackobmc_fee=1.08address_country_code=USaddress_name=john martinnotify_version=3.0custom=ff1ca5asdf7d9fa80cf93e6b300dd7f063e1payer_status=unverifiedbusiness=gold-me@hotmail.comaddress_country=United Statesaddress_city=north bendquantity=1verify_sign=AZussRXZRkuk7frhfirfxxTkj0BDJGA2dJF3eF263eEsjLixS.xRxCzfaYLpayer_email=me@gmail.comtxn_id=4DU53818WJ271531Mpayment_type=instantlast_name=Martinaddress_state=WAreceiver_email=cravbill@hotmail.compayment_fee=1.08receiver_id=QG8JPB4RZJGG4txn_type=web_acceptitem_name=Some item of consequenceSpecifiemc_currency=USDitem_number=G10W151residence_country=UShandling_amount=0.00transaction_subject=ff1ca57d9fad80cf93e6b300dd7f063e1payment_gross=22.99shipping=0.00";
string[] splitStrings = new string[2];
splitStrings[0] = "mc_gross";
splitStrings[1] = "invoice";
string[] headArray = head.Split(splitStrings, StringSplitOptions.RemoveEmptyEntries);
You get the idea; it breaks everything up into pieces.
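For the regular expression route, one possible shape is sketched below. It is not taken from the answer itself: RegexFieldParser and Parse are hypothetical names, and the pattern assumes you have the list of key names and that no value happens to contain the text of a key followed by "=".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexFieldParser
{
    public static Dictionary<string, string> Parse(string raw, IEnumerable<string> keys)
    {
        // Longest keys first, so a key that is a prefix of another cannot shadow it.
        string alternation = string.Join("|",
            keys.OrderByDescending(k => k.Length).Select(Regex.Escape));

        // A key, then '=', then the shortest run of text that is followed
        // by another "key=" or by the end of the input.
        string pattern = "(?<key>" + alternation + ")=(?<value>.*?)(?=(?:" + alternation + ")=|$)";

        var fields = new Dictionary<string, string>();
        foreach (Match m in Regex.Matches(raw, pattern, RegexOptions.Singleline))
        {
            fields[m.Groups["key"].Value] = m.Groups["value"].Value;
        }
        return fields;
    }
}
```

Unlike the Split approach, this does not depend on the keys arriving in a fixed order, since each match carries its own key name.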
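The equals sign is a good indicator. For the text between the equals signs, I would suggest using some kind of lexical tool with some type of inference engine.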