Regex to extract the encoding type from a given string
Keywords: extract, encoding, type, string, Regex | Updated: 2023-09-27 18:21:48
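Following a recent thread on Stack Overflow, I am posting a new question: I have several strings from which I want to extract the encoding type. I would like to use a regex for this.
Examples:
utf-8, quoted-printable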
string str = "=?utf-8?Q?=48=69=67=68=2d=45=6e=64=2d=44=65=73=69=67=6e=65=72=2d=57=61=74=63=68=2d=52=65=70=6c=69=63=61=73=2d=53=61=76=65=2d=54=48=4f=55=53=41=4e=44=53=2d=32=30=31=32=2d=4d=6f=64=65=6c=73?=";
utf-8, Base64
string fld4 = "=?utf-8?B?VmFsw6lyaWUgTWVqc25lcm93c2tp?= <Valerie.renamed@company.com>";
windows-1258, Base64
string msg2= "=?windows-1258?B?UkU6IFRyIDogUGxhbiBkZSBjb250aW51aXTpIGQnYWN0aXZpdOkgZGVz?= =?windows-1258?B?IHNlcnZldXJzIFdlYiBHb1ZveWFnZXN=?=";
iso-8859-1, quoted-printable
string fld2 = "=?iso-8859-1?Q?Fr=E9d=E9ric_Germain?= <Frederic.Germain@company.com>";
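And so on...
To write a generic decoding function, we need to extract:
the charset (utf-8, windows-1258, etc.)
the transfer encoding type (quoted-printable or Base64)
the encoded string
Any idea how to extract the pattern in between, i.e. the xxx in =?xxx?Q or =?xxx?B?
Note: the Q or B can be upper or lower case.
Thanks.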
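Here is a Rubular that does this for you. In short, this regex: =\?(.*?)\?[QBqb]
will capture the charset (group 1), with the Q or B encoding flag as the last character of the match. Note, however, that the third example you gave contains two matches, so make sure you decide how to handle the second one.
Here is a complete working solution: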
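To make the two-match caveat concrete, here is a minimal C# sketch. It follows the same idea as the regex above but adds named groups; the names `EncodingProbe`, `Probe`, `charset` and `enc` are chosen here purely for illustration and are not part of the answer. It collects one (charset, encoding) pair per encoded word instead of stopping at the first match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EncodingProbe
{
    // "=?" + charset (no '?' or whitespace) + "?" + Q or B (either case) + "?"
    private static readonly Regex Header =
        new Regex(@"=\?(?<charset>[^?\s]+)\?(?<enc>[QqBb])\?");

    // Returns one (charset, encoding) pair for every encoded word in the input.
    public static List<KeyValuePair<string, string>> Probe(string input)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Header.Matches(input))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["charset"].Value,
                m.Groups["enc"].Value.ToUpperInvariant())); // normalize q/b to Q/B
        }
        return result;
    }
}
```

Called on the windows-1258 example, `Probe` returns two entries, one per encoded word, so the caller can decide whether to decode both and concatenate the results or to keep only the first.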
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// One encoded word split into its three parts.
public class Encoded
{
    public string Charset;
    public string ContentTransfertEncoding;
    public string Data;
}

namespace ConsoleApplication2
{
    public class Decoding
    {
        public Decoding()
        {
        }

        // Finds every encoded word (=?charset?B-or-Q?payload?=) in the input.
        public List<Encoded> Process(string data)
        {
            List<Encoded> list = new List<Encoded>();
            // The payload may contain '_', '+', '/' and '=', so match anything except '?' and whitespace.
            var occurences = new Regex(@"=\?[^?\s]+\?[BbQq]\?[^?\s]*\?=", RegexOptions.IgnoreCase);
            var matches = occurences.Matches(data);
            foreach (Match match in matches)
            {
                Encoded cls = new Encoded();
                cls.Data = match.Groups[0].Value;
                cls.Charset = GetCharset(cls.Data);
                cls.ContentTransfertEncoding = GetContentTransfertEncoding(cls.Data);
                // cleanup data: drop the "=?charset?X?" prefix and the trailing "?="
                int pos = cls.Data.IndexOf("=?");
                pos = cls.Data.IndexOf("?", pos + 2);
                cls.Data = cls.Data.Substring(pos + 3);
                cls.Data = cls.Data.Replace("?=", "");
                list.Add(cls);
            }
            return list;
        }

        // Returns the Q or B flag that follows the charset.
        private string GetContentTransfertEncoding(string data)
        {
            var occurences = new Regex(@"=\?(.*?)\?[QBqb]", RegexOptions.IgnoreCase);
            Match match = occurences.Match(data);
            if (match.Success)
            {
                // The match looks like "=?utf-8?Q"; the flag follows the last '?'.
                int pos = match.Value.LastIndexOf('?');
                return match.Value.Substring(pos + 1);
            }
            return data;
        }

        // Returns the charset name between "=?" and the next '?'.
        public string GetCharset(string data)
        {
            var occurences = new Regex(@"=\?(.*?)\?[QBqb]", RegexOptions.IgnoreCase);
            Match match = occurences.Match(data); // there should be only 1 match
            if (match.Success)
            {
                return match.Groups[1].Value;
            }
            return data;
        }

        // public string Decode...() etc.
    }
}
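The solution above stops at an elided Decode method. Purely as an illustration of what the three extracted parts make possible, the sketch below decodes a single encoded word. The names `EncodedWordDecoder`, `Decode` and `DecodeQ` are hypothetical and not the author's. It assumes the charset name is one that `Encoding.GetEncoding` recognizes on the running platform; on .NET Core and later, code pages such as windows-1258 are only available after registering the provider from the System.Text.Encoding.CodePages package. Malformed input (bad Base64 or a non-hex `=XX` sequence) simply throws.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class EncodedWordDecoder
{
    // Decodes the payload of one encoded word.
    // transferEncoding is "B" (Base64) or "Q" (quoted-printable style), in either case.
    public static string Decode(string charset, string transferEncoding, string data)
    {
        Encoding encoding = Encoding.GetEncoding(charset);
        byte[] bytes = transferEncoding.Equals("B", StringComparison.OrdinalIgnoreCase)
            ? Convert.FromBase64String(data)
            : DecodeQ(data);
        return encoding.GetString(bytes);
    }

    // Q encoding: '_' stands for a space, "=XX" for the byte with hex value XX,
    // and every other character stands for itself.
    private static byte[] DecodeQ(string data)
    {
        var bytes = new List<byte>();
        for (int i = 0; i < data.Length; i++)
        {
            char c = data[i];
            if (c == '_')
            {
                bytes.Add((byte)' ');
            }
            else if (c == '=' && i + 2 < data.Length)
            {
                string hex = data.Substring(i + 1, 2);
                bytes.Add(byte.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture));
                i += 2; // skip the two hex digits just consumed
            }
            else
            {
                bytes.Add((byte)c);
            }
        }
        return bytes.ToArray();
    }
}
```

For the fourth example in the question, `EncodedWordDecoder.Decode("iso-8859-1", "Q", "Fr=E9d=E9ric_Germain")` yields Frédéric Germain. When a header contains several encoded words, as in the windows-1258 example, each one would be decoded separately and the results concatenated.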