大型正则表达式匹配导致程序挂起
本文关键字:程序 挂起 正则表达式 大型 | 更新日期: 2023-09-27 18:34:55
前几天我试着问这个问题,诚然,一开始没有很好地表达问题或邮政编码,答案已经关闭。 所以在这里我再次尝试,因为老实说,这很快就会让我发疯。 :)
我正在尝试实现这个地址解析器,它最初是一个基于控制台的 c# 程序。 我已经成功地将其转换为一个独立的 WPF 程序,该程序仅包含一个用于输入的TextBox
、一个用于激活解析的Button
和一个用于显示结果的TextBlock
。 在编写本文时,我确实将输出截断为主程序中所需的输出,并且仍然可以正常工作。 我在下面包含了整个代码。
下一步是将其嫁接到我的主程序中,我实际上是通过使用复制/粘贴来完成的。 但是,运行此命令后,程序在按下按钮后挂起。 最终 VS 给出一个错误,指出进程太长时间没有抽出消息,并且任务管理器中的内存使用量从 ~70k 逐渐增加到 3,000,000。 针对此,我将Parsing
方法分配给后台工作线程,希望减轻主进程的工作量。 这确实解决了程序冻结的问题,但后台线程只是做了同样的事情,提高了 RAM 使用率并且没有返回任何内容。
所以现在我有点陷入僵局。 我知道问题出在 var result = parser.ParseAddress(input);
语句中的某个地方,因为当对每行代码使用断点时,这是最后一个触发的断点。 但基本上我不明白为什么这会导致一个 WPF 程序而不是另一个程序出现问题。
如果有必要,我很乐意在某个地方发布主程序的完整源代码,但我无法想象在这里发布~20个不同的类文件和项目代码是个好主意。 :)
独立 WPF 应用
namespace AddressParseWPF
{
/// <summary>
/// Interaction logic for MainWindow.xaml
/// </summary>
public partial class MainWindow : Window
{
public MainWindow()
{
InitializeComponent();
}
public void Execute()
{
AddressParser.AddressParser parser = new AddressParser.AddressParser();
var input = inputTextBox.Text;
var result = parser.ParseAddress(input);
if (result == null)
{
outputTextBlock.Text = "ERROR. Input could not be parsed.";
}
else
{
outputTextBlock.Text = (result.StreetLine + ", " + result.City + ", " + result.State + " " + result.Zip);
}
}
private void actionButton_Click(object sender, RoutedEventArgs e)
{
Execute();
}
}
}
将解析器嫁接到的主程序
public void ExecuteAddressParse()
{
AddressParser.AddressParser parser = new AddressParser.AddressParser();
var input = inputTextBox.Text;
var result = parser.ParseAddress(input);
if (result == null)
{
outputTextBlock.Text = "ERROR. Input could not be parsed.";
}
else
{
outputTextBlock.Text = (result.StreetLine + ", " + result.City + ", " + result.State + " " + result.Zip);
}
}
private void actionButton_Click(object sender, RoutedEventArgs e)
{
ExecuteAddressParse();
}
解析地址方法
public AddressParseResult ParseAddress(string input)
{
if (!string.IsNullOrWhiteSpace(input))
{
var match = addressRegex.Match(input.ToUpperInvariant());
if (match.Success)
{
var extracted = GetApplicableFields(match);
return new AddressParseResult(Normalize(extracted));
}
}
return null;
}
正则表达式匹配方法
private static void InitializeRegex()
{
var suffixPattern = new Regex(
string.Join(
"|",
new [] {
string.Join("|", suffixes.Keys),
string.Join("|", suffixes.Values.Distinct())
}),
RegexOptions.Compiled);
var statePattern =
@"'b(?:" +
string.Join(
"|",
new [] {
string.Join("|", states.Keys.Select(x => Regex.Escape(x))),
string.Join("|", states.Values)
}) +
@")'b";
var directionalPattern =
string.Join(
"|",
new [] {
string.Join("|", directionals.Keys),
string.Join("|", directionals.Values),
string.Join("|", directionals.Values.Select(x => Regex.Replace(x, @"('w)", @"$1'.")))
});
var zipPattern = @"'d{5}(?:-?'d{4})?";
var numberPattern =
@"(
((?<NUMBER>'d+)(?<SECONDARYNUMBER>(-[0-9])|('-?[A-Z]))(?='b)) # Unit-attached
|(?<NUMBER>'d+['-' ]?'d+'/'d+) # Fractional
|(?<NUMBER>'d+-?'d*) # Normal Number
|(?<NUMBER>[NSWE]' ?'d+' ?[NSWE]' ?'d+) # Wisconsin/Illinois
)";
var streetPattern =
string.Format(
CultureInfo.InvariantCulture,
@"
(?:
# special case for addresses like 100 South Street
(?:(?<STREET>{0})'W+
(?<SUFFIX>{1})'b)
|
(?:(?<PREDIRECTIONAL>{0})'W+)?
(?:
(?<STREET>[^,]*'d)
(?:[^'w,]*(?<POSTDIRECTIONAL>{0})'b)
|
(?<STREET>[^,]+)
(?:[^'w,]+(?<SUFFIX>{1})'b)
(?:[^'w,]+(?<POSTDIRECTIONAL>{0})'b)?
|
(?<STREET>[^,]+?)
(?:[^'w,]+(?<SUFFIX>{1})'b)?
(?:[^'w,]+(?<POSTDIRECTIONAL>{0})'b)?
)
)
",
directionalPattern,
suffixPattern);
var rangedSecondaryUnitPattern =
@"(?<SECONDARYUNIT>" +
string.Join("|", rangedSecondaryUnits.Keys) +
@")(?![a-z])";
var rangelessSecondaryUnitPattern =
@"(?<SECONDARYUNIT>" +
string.Join(
"|",
string.Join("|", rangelessSecondaryUnits.Keys)) +
@")'b";
var allSecondaryUnitPattern = string.Format(
CultureInfo.InvariantCulture,
@"
(
(:?
(?: (?:{0} 'W*)
| (?<SECONDARYUNIT>'#)'W*
)
(?<SECONDARYNUMBER>['w-]+)
)
|{1}
),?
",
rangedSecondaryUnitPattern,
rangelessSecondaryUnitPattern);
var cityAndStatePattern = string.Format(
CultureInfo.InvariantCulture,
@"
(?:
(?<CITY>[^'d,]+?)'W+
(?<STATE>{0})
)
",
statePattern);
var placePattern = string.Format(
CultureInfo.InvariantCulture,
@"
(?:{0}'W*)?
(?:(?<ZIP>{1}))?
",
cityAndStatePattern,
zipPattern);
var addressPattern = string.Format(
CultureInfo.InvariantCulture,
@"
^
# Special case for APO/FPO/DPO addresses
(
[^'w'#]*
(?<STREETLINE>.+?)
(?<CITY>[AFD]PO)'W+
(?<STATE>A[AEP])'W+
(?<ZIP>{4})
'W*
)
|
# Special case for PO boxes
(
'W*
(?<STREETLINE>(P['.' ]?O['.' ]?' )?BOX' [0-9]+)'W+
{3}
'W*
)
|
(
[^'w'#]* # skip non-word chars except # (eg unit)
( {0} )'W*
{1}'W+
(?:{2}'W+)?
{3}
'W* # require on non-word chars at end
)
$ # right up to end of string
",
numberPattern,
streetPattern,
allSecondaryUnitPattern,
placePattern,
zipPattern);
addressRegex = new Regex(
addressPattern,
RegexOptions.Compiled |
RegexOptions.Singleline |
RegexOptions.IgnorePatternWhitespace);
}
省略RegexOptions.Compiled
标志时正则表达式是否有效?
回答是肯定的。
为什么呢?
似乎正则表达式编译器在(一些?(大模式下很慢。
这是你必须做出的权衡。
一些正则表达式是固有的(如 @Justin Morgan 提到的(。
这通常是加入可重用的碎片正则表达式的结果,它使
我畏缩。
但是,如果您要使用/执行此方法,最好打印出
构造后的实际正则表达式。并且,格式化后,针对它进行测试
示例并独立于主程序进行操作。这样更容易修复。
如果您看到可疑的子表达式,请尝试使其在此时失败,或者
通常,尝试在示例末尾附近插入失败。如果需要超过
眨眼间失败,然后严重倒退。
不过回溯还不错。它有一个巨大的优势。没有它,有些东西
只是不匹配。诀窍是隔离不影响的
子表达式相对于周围事物的结果,然后限制它与BACTRACKING。
我去了USPS网站并获取了一些示例状态/后缀/方向/次要
状态样本,足以生成地址正则表达式。下面是一个
清理版本从代码生成的正则表达式。
祝你好运!
^
# Special case for APO/FPO/DPO addresses
(
[^'w'#]*
(?<STREETLINE> .+? )
(?<CITY> [AFD] PO )
'W+
(?<STATE> A [AEP] )
'W+
(?<ZIP> 'd{5} (?: -? 'd{4} )? )
'W*
)
|
# Special case for PO boxes
(
'W*
(?<STREETLINE> ( P ['.' ]? O ['.' ]? ' )? BOX ' [0-9]+ )
'W+
(?:
(?:
(?<CITY> [^'d,]+? )
'W+
(?<STATE>
'b
(?:AL|AK|AS|AZ|AR|Alabama|Alaska|American Samoa|Arizona|Arkansas)
'b
)
)
'W*
)?
(?:
(?<ZIP> 'd{5} (?: -? 'd{4} )? )
)?
'W*
)
|
(
[^'w'#]* # skip non-word chars except # (eg unit)
(
(
(
(?<NUMBER> 'd+ )
(?<SECONDARYNUMBER> (-[0-9]) | ('-?[A-Z]) )
(?='b)
) # Unit-attached
|
(?<NUMBER> 'd+ ['-' ]? 'd+ '/ 'd+ ) # Fractional
|
(?<NUMBER> 'd+ -? 'd* ) # Normal Number
|
(?<NUMBER>[NSWE]' ?'d+' ?[NSWE]' ?'d+) # Wisconsin/Illinois
)
)
'W*
(?:
# special case for addresses like 100 South Street
(?:
(?<STREET>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N'.|E'.|S'.|W'.|N'.E'.|S'.E'.|N'.W'.|S'.W'.)
'W+
(?<SUFFIX>ALLEY|ALY|ALLY|ALLEE|ALLEY|ALY)
'b
)
|
(?:
(?<PREDIRECTIONAL>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N'.|E'.|S'.|W'.|N'.E'.|S'.E'.|N'.W'.|S'.W'.)
'W+
)?
(?:
(?<STREET> [^,]* 'd )
(?:
[^'w,]*
(?<POSTDIRECTIONAL>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N'.|E'.|S'.|W'.|N'.E'.|S'.E'.|N'.W'.|S'.W'.)
'b
)
|
(?<STREET> [^,]+ )
(?:
[^'w,]+
(?<SUFFIX>ALLEY|ALY|ALLY|ALLEE|ALLEY|ALY)
'b
)
(?:
[^'w,]+
(?<POSTDIRECTIONAL>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N'.|E'.|S'.|W'.|N'.E'.|S'.E'.|N'.W'.|S'.W'.)
'b
)?
|
(?<STREET> [^,]+? )
(?:
[^'w,]+
(?<SUFFIX>ALLEY|ALY|ALLY|ALLEE|ALLEY|ALY)
'b
)?
(?:
[^'w,]+
(?<POSTDIRECTIONAL>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N'.|E'.|S'.|W'.|N'.E'.|S'.E'.|N'.W'.|S'.W'.)
'b
)?
)
)
'W+
(?:
(
(
:?
(?:
(?:
(?<SECONDARYUNIT>APT|BLDG|DEPT|FL|HNGR|LOT|PIER|RM|SLIP|SPC|STOP|STE|TRLR|UNIT)
(?! [a-z] )
'W*
)
|
(?<SECONDARYUNIT> '# )
'W*
)
(?<SECONDARYNUMBER> ['w-]+ )
)
|
(?<SECONDARYUNIT>BSMT|FRNT|LBBY|LOWR|OFC|PH|REAR|SIDE|UPPR)
'b
)
,?
'W+
)?
(?:
(?:
(?<CITY> [^'d,]+? )
'W+
(?<STATE>
'b
(?:AL|AK|AS|AZ|AR|Alabama|Alaska|American Samoa|Arizona|Arkansas)
'b
)
)
'W*
)?
(?:
(?<ZIP> 'd{5} (?: -? 'd{4} )? )
)?
'W* # require on non-word chars at end
)
$ # right up to end of string
C# 代码
public static void InitializeRegex()
{
Dictionary<string, string> suffixes = new Dictionary<string, string>()
{
{"ALLEY", "ALLEE"},
{"ALY", "ALLEY"},
{"ALLY", "ALY"},
};
var suffixPattern = new Regex(
string.Join(
"|",
new[] {
string.Join("|", suffixes.Keys.ToArray()),
string.Join("|", suffixes.Values.Distinct().ToArray())
}),
RegexOptions.Compiled);
//Console.WriteLine("'n"+suffixPattern);
Dictionary<string, string> states = new Dictionary<string, string>()
{
{"AL", "Alabama"},
{"AK", "Alaska"},
{"AS", "American Samoa"},
{"AZ", "Arizona"},
{"AR", "Arkansas"}
};
var statePattern =
@"'b(?:" +
string.Join(
"|",
new[] {
string.Join("|", states.Keys.Select(x => Regex.Escape(x)).ToArray()),
string.Join("|", states.Values.ToArray())
}) +
@")'b";
//Console.WriteLine("'n" + statePattern);
Dictionary<string, string> directionals = new Dictionary<string, string>()
{
{"North", "N" },
{"East", "E" },
{"South", "S" },
{"West", "W" },
{"Northeast", "NE" },
{"Southeast", "SE" },
{"Northwest", "NW" },
{"Southwest", "SW" }
};
var directionalPattern =
string.Join(
"|",
new[] {
string.Join("|", directionals.Keys.ToArray()),
string.Join("|", directionals.Values.ToArray()),
string.Join("|", directionals.Values.Select(x => Regex.Replace(x, @"('w)", @"$1'.")).ToArray())
});
//Console.WriteLine("'n" + directionalPattern);
var zipPattern = @"'d{5}(?:-?'d{4})?";
//Console.WriteLine("'n" + zipPattern);
var numberPattern =
@"(
((?<NUMBER>'d+)(?<SECONDARYNUMBER>(-[0-9])|('-?[A-Z]))(?='b)) # Unit-attached
|(?<NUMBER>'d+['-' ]?'d+'/'d+) # Fractional
|(?<NUMBER>'d+-?'d*) # Normal Number
|(?<NUMBER>[NSWE]' ?'d+' ?[NSWE]' ?'d+) # Wisconsin/Illinois
)";
//Console.WriteLine("'n" + numberPattern);
var streetPattern =
string.Format(
CultureInfo.InvariantCulture,
@"
(?:
# special case for addresses like 100 South Street
(?:(?<STREET>{0})'W+
(?<SUFFIX>{1})'b)
|
(?:(?<PREDIRECTIONAL>{0})'W+)?
(?:
(?<STREET>[^,]*'d)
(?:[^'w,]*(?<POSTDIRECTIONAL>{0})'b)
|
(?<STREET>[^,]+)
(?:[^'w,]+(?<SUFFIX>{1})'b)
(?:[^'w,]+(?<POSTDIRECTIONAL>{0})'b)?
|
(?<STREET>[^,]+?)
(?:[^'w,]+(?<SUFFIX>{1})'b)?
(?:[^'w,]+(?<POSTDIRECTIONAL>{0})'b)?
)
)
",
directionalPattern,
suffixPattern);
//Console.WriteLine("'n" + streetPattern);
Dictionary<string, string> rangedSecondaryUnits = new Dictionary<string, string>()
{
{"APT", "APARTMENT"},
{"BLDG", "BUILDING"},
{"DEPT", "DEPARTMENT"},
{"FL", "FLOOR"},
{"HNGR", "HANGAR"},
{"LOT", "LOT"},
{"PIER", "PIER"},
{"RM", "ROOM"},
{"SLIP", "SLIP"},
{"SPC", "SPACE"},
{"STOP", "STOP"},
{"STE", "SUITE"},
{"TRLR", "TRAILER"},
{"UNIT", "UNIT"}
};
var rangedSecondaryUnitPattern =
@"(?<SECONDARYUNIT>" +
string.Join("|", rangedSecondaryUnits.Keys.ToArray()) +
@")(?![a-z])";
//Console.WriteLine("'n" + rangedSecondaryUnitPattern);
Dictionary<string, string> rangelessSecondaryUnits = new Dictionary<string, string>()
{
{"BSMT", "BASEMENT"},
{"FRNT", "FRONT"},
{"LBBY", "LOBBY"},
{"LOWR", "LOWER"},
{"OFC", "OFFICE"},
{"PH", "PENTHOUSE"},
{"REAR", "REAR"},
{"SIDE", "SIDE"},
{"UPPR", "UPPER"}
};
var rangelessSecondaryUnitPattern =
@"(?<SECONDARYUNIT>" +
string.Join("|", rangelessSecondaryUnits.Keys.ToArray()) +
@")'b";
//Console.WriteLine("'n" + rangelessSecondaryUnitPattern);
var allSecondaryUnitPattern = string.Format(
CultureInfo.InvariantCulture,
@"
(
(:?
(?: (?:{0} 'W*)
| (?<SECONDARYUNIT>'#)'W*
)
(?<SECONDARYNUMBER>['w-]+)
)
|{1}
),?
",
rangedSecondaryUnitPattern,
rangelessSecondaryUnitPattern);
//Console.WriteLine("'n" + allSecondaryUnitPattern);
var cityAndStatePattern = string.Format(
CultureInfo.InvariantCulture,
@"
(?:
(?<CITY>[^'d,]+?)'W+
(?<STATE>{0})
)
",
statePattern);
//Console.WriteLine("'n" + cityAndStatePattern);
var placePattern = string.Format(
CultureInfo.InvariantCulture,
@"
(?:{0}'W*)?
(?:(?<ZIP>{1}))?
",
cityAndStatePattern,
zipPattern);
//Console.WriteLine("'n" + placePattern);
var addressPattern = string.Format(
CultureInfo.InvariantCulture,
@"
^
# Special case for APO/FPO/DPO addresses
(
[^'w'#]*
(?<STREETLINE>.+?)
(?<CITY>[AFD]PO)'W+
(?<STATE>A[AEP])'W+
(?<ZIP>{4})
'W*
)
|
# Special case for PO boxes
(
'W*
(?<STREETLINE>(P['.' ]?O['.' ]?' )?BOX' [0-9]+)'W+
{3}
'W*
)
|
(
[^'w'#]* # skip non-word chars except # (eg unit)
( {0} )'W*
{1}'W+
(?:{2}'W+)?
{3}
'W* # require on non-word chars at end
)
$ # right up to end of string
",
numberPattern,
streetPattern,
allSecondaryUnitPattern,
placePattern,
zipPattern);
Console.WriteLine("'n-----------------------------'n'n" + addressPattern);
var addressRegex = new Regex(
addressPattern,
RegexOptions.Compiled |
RegexOptions.Singleline |
RegexOptions.IgnorePatternWhitespace);
}
像这样逐渐增加资源使用是灾难性回溯的吸烟枪。基本上,如果你有类似的东西,比如说,这部分:
(?<CITY>[^'d,]+?)'W+
。然后,关于输入的哪一部分与模式的哪一部分匹配,就会有歧义。几乎任何与'W
匹配的东西也可以匹配[^'d,]
.如果输入在第一次传递时不匹配,引擎将返回并尝试这两个组的不同排列,这会消耗资源。
例如,假设输入的"city"部分后面有一大堆空格。一长串空格将匹配[^'d,]+?
和'W+
,因此不清楚CITY组是否包含空格。基于这些量词的懒惰/贪婪行为,引擎将尝试仅将城市名称放入[^'d,]+?
并将所有空格放入'W+
中。然后它将继续前进并尝试匹配其余输入。
如果其余输入在第一次尝试时匹配,很好。但是,如果匹配失败,它将不得不返回并重试,这次其中一个空间被[^'d,]+?
匹配并捕获为CITY组的一部分。当失败时,它将再次尝试使用其中两个空格,依此类推。
您通常会看到这成为嵌套量词的问题,例如 ([ABC]+)*
.我在你的模式中没有看到任何正在发生的事情,但我可能在string.Format
的所有电话中都错过了一些东西。我的猜测是,这是一个如此长的模式,并且有如此多的量词和交流发电机要回溯(以及如此多的组要存储(,即使是单一级别的迭代也会杀死你。我敢打赌,使用与大多数模式匹配但无法匹配所有模式的长输入字符串,您会获得最大的性能影响。
在这种情况下,编译正则表达式可能会有所帮助,您应该这样做。但是,当你的应用程序同时有一千次(或多少次(点击时,我怀疑这会削减它。还会有一些输入字符串会导致大量的回溯,并在性能方面对你造成更大的打击。我最大的建议是找到并解决模式中的歧义。
查找具有大量量词(如*
和+
(彼此靠近的地方,并确保它们之间有清晰的、非可选的分隔符(例如,NUMBER 组中的'd+-?'d*
在'd+(-'d*)?
时表现更好,或者更好的'd+(-'d+)?'b
(。最后,确保分隔符不能与旁边的标记匹配。举个虚构的例子,如果你给它一长串空格,像'W+' 'W+
这样的东西会一直拖下去。