大型正则表达式匹配导致程序挂起

本文关键字:程序 挂起 正则表达式 大型 | 更新日期: 2023-09-27 18:34:55

前几天我试着问这个问题,诚然,一开始没有很好地表达问题或邮政编码,答案已经关闭。 所以在这里我再次尝试,因为老实说,这很快就会让我发疯。 :)

我正在尝试实现这个地址解析器,它最初是一个基于控制台的 c# 程序。 我已经成功地将其转换为一个独立的 WPF 程序,该程序仅包含一个用于输入的TextBox、一个用于激活解析的Button和一个用于显示结果的TextBlock。 在编写本文时,我确实将输出截断为主程序中所需的输出,并且仍然可以正常工作。 我在下面包含了整个代码。

我的

下一步是将其嫁接到我的主程序中,我实际上是通过使用复制/粘贴来完成的。 但是,运行此命令后,程序在按下按钮后挂起。 最终 VS 给出一个错误,指出进程太长时间没有抽出消息,并且任务管理器中的内存使用量从 ~70k 逐渐增加到 3,000,000。 针对此,我将Parsing方法分配给后台工作线程,希望减轻主进程的工作量。 这确实解决了程序冻结的问题,但后台线程只是做了同样的事情,提高了 RAM 使用率并且没有返回任何内容。

所以现在我有点陷入僵局。 我知道问题出在 var result = parser.ParseAddress(input); 语句中的某个地方,因为当对每行代码使用断点时,这是最后一个触发的断点。 但基本上我不明白为什么这会导致一个 WPF 程序而不是另一个程序出现问题。

如果有必要,我很乐意在某个地方发布主程序的完整源代码,但我无法想象在这里发布~20个不同的类文件和项目代码是个好主意。 :)

独立 WPF 应用

namespace AddressParseWPF
{
    /// <summary>
    /// Interaction logic for MainWindow.xaml
    /// </summary>
    public partial class MainWindow : Window
    {
        public MainWindow()
        {
            InitializeComponent();
        }
        public void Execute()
        {
            AddressParser.AddressParser parser = new AddressParser.AddressParser();
            var input = inputTextBox.Text;
            var result = parser.ParseAddress(input);
            if (result == null)
            {
                outputTextBlock.Text = "ERROR. Input could not be parsed.";
            }
            else
            {
                outputTextBlock.Text = (result.StreetLine + ", " + result.City + ", " + result.State + "  " + result.Zip);
            }
        }
        private void actionButton_Click(object sender, RoutedEventArgs e)
        {
            Execute();
        }
    }
}

将解析器嫁接到的主程序

public void ExecuteAddressParse()
{
    AddressParser.AddressParser parser = new AddressParser.AddressParser();
    var input = inputTextBox.Text;
    var result = parser.ParseAddress(input);
    if (result == null)
    {
        outputTextBlock.Text = "ERROR. Input could not be parsed.";
    }
    else
    {
        outputTextBlock.Text = (result.StreetLine + ", " + result.City + ", " + result.State + "  " + result.Zip);
    }
}       
private void actionButton_Click(object sender, RoutedEventArgs e)
{
    ExecuteAddressParse();
}

解析地址方法

public AddressParseResult ParseAddress(string input)
{
    if (!string.IsNullOrWhiteSpace(input))
    {
        var match = addressRegex.Match(input.ToUpperInvariant());
        if (match.Success)
        {
            var extracted = GetApplicableFields(match);
            return new AddressParseResult(Normalize(extracted));
        }
    }
    return null;
}

正则表达式匹配方法

private static void InitializeRegex()
{
    var suffixPattern = new Regex(
        string.Join(
            "|",
            new [] {
                string.Join("|", suffixes.Keys), 
                string.Join("|", suffixes.Values.Distinct())
            }),
        RegexOptions.Compiled);
    var statePattern = 
        @"'b(?:" + 
        string.Join(
            "|",
            new [] {
                string.Join("|", states.Keys.Select(x => Regex.Escape(x))),
                string.Join("|", states.Values)
            }) +
        @")'b";
    var directionalPattern =
        string.Join(
            "|",
            new [] {
                string.Join("|", directionals.Keys),
                string.Join("|", directionals.Values),
                string.Join("|", directionals.Values.Select(x => Regex.Replace(x, @"('w)", @"$1'.")))
            });
    var zipPattern = @"'d{5}(?:-?'d{4})?";
    var numberPattern =
        @"(
            ((?<NUMBER>'d+)(?<SECONDARYNUMBER>(-[0-9])|('-?[A-Z]))(?='b))    # Unit-attached
            |(?<NUMBER>'d+['-' ]?'d+'/'d+)                                   # Fractional
            |(?<NUMBER>'d+-?'d*)                                             # Normal Number
            |(?<NUMBER>[NSWE]' ?'d+' ?[NSWE]' ?'d+)                          # Wisconsin/Illinois
          )";
    var streetPattern =
        string.Format(
            CultureInfo.InvariantCulture,
            @"
                (?:
                  # special case for addresses like 100 South Street
                  (?:(?<STREET>{0})'W+
                     (?<SUFFIX>{1})'b)
                  |
                  (?:(?<PREDIRECTIONAL>{0})'W+)?
                  (?:
                    (?<STREET>[^,]*'d)
                    (?:[^'w,]*(?<POSTDIRECTIONAL>{0})'b)
                   |
                    (?<STREET>[^,]+)
                    (?:[^'w,]+(?<SUFFIX>{1})'b)
                    (?:[^'w,]+(?<POSTDIRECTIONAL>{0})'b)?
                   |
                    (?<STREET>[^,]+?)
                    (?:[^'w,]+(?<SUFFIX>{1})'b)?
                    (?:[^'w,]+(?<POSTDIRECTIONAL>{0})'b)?
                  )
                )
            ",
            directionalPattern,
            suffixPattern);
    var rangedSecondaryUnitPattern =
        @"(?<SECONDARYUNIT>" +
        string.Join("|", rangedSecondaryUnits.Keys) +
        @")(?![a-z])";
    var rangelessSecondaryUnitPattern =
        @"(?<SECONDARYUNIT>" +
        string.Join(
            "|",
            string.Join("|", rangelessSecondaryUnits.Keys)) +
        @")'b";
    var allSecondaryUnitPattern = string.Format(
        CultureInfo.InvariantCulture,
        @"
            (
                (:?
                    (?: (?:{0} 'W*)
                        | (?<SECONDARYUNIT>'#)'W*
                    )
                    (?<SECONDARYNUMBER>['w-]+)
                )
                |{1}
            ),?
        ",
         rangedSecondaryUnitPattern,
         rangelessSecondaryUnitPattern);
    var cityAndStatePattern = string.Format(
        CultureInfo.InvariantCulture,
        @"
            (?:
                (?<CITY>[^'d,]+?)'W+
                (?<STATE>{0})
            )
        ",
        statePattern);
    var placePattern = string.Format(
        CultureInfo.InvariantCulture,
        @"
            (?:{0}'W*)?
            (?:(?<ZIP>{1}))?
        ",
        cityAndStatePattern,
        zipPattern);
    var addressPattern = string.Format(
        CultureInfo.InvariantCulture,
        @"
            ^
            # Special case for APO/FPO/DPO addresses
            (
                [^'w'#]*
                (?<STREETLINE>.+?)
                (?<CITY>[AFD]PO)'W+
                (?<STATE>A[AEP])'W+
                (?<ZIP>{4})
                'W*
            )
            |
            # Special case for PO boxes
            (
                'W*
                (?<STREETLINE>(P['.' ]?O['.' ]?' )?BOX' [0-9]+)'W+
                {3}
                'W*
            )
            |
            (
                [^'w'#]*    # skip non-word chars except # (eg unit)
                (  {0} )'W*
                   {1}'W+
                (?:{2}'W+)?
                   {3}
                'W*         # require on non-word chars at end
            )
            $           # right up to end of string
        ",
        numberPattern,
        streetPattern,
        allSecondaryUnitPattern,
        placePattern,
        zipPattern);
    addressRegex = new Regex(
        addressPattern,
        RegexOptions.Compiled | 
        RegexOptions.Singleline | 
        RegexOptions.IgnorePatternWhitespace);
}

大型正则表达式匹配导致程序挂起

省略RegexOptions.Compiled标志时正则表达式是否有效?

回答是肯定的。

为什么呢?

似乎正则表达式编译器在(一些?(大模式下很慢。

这是你必须做出的权衡。

一些正则表达式是固有的(如 @Justin Morgan 提到的(。
这通常是加入可重用的碎片正则表达式的结果,它使
我畏缩。

但是,如果您要使用/执行此方法,最好打印出
构造后的实际正则表达式。并且,格式化后,针对它进行测试
示例并独立于主程序进行操作。这样更容易修复。
如果您看到可疑的子表达式,请尝试使其在此时失败,或者
通常,尝试在示例末尾附近插入失败。如果需要超过
眨眼间失败,然后严重倒退。

不过回溯还不错。它有一个巨大的优势。没有它,有些东西
只是不匹配。诀窍是隔离不影响的
子表达式相对于周围事物的结果,然后限制它与BACTRACKING。

我去了USPS网站并获取了一些示例状态/后缀/方向/次要
状态样本,足以生成地址正则表达式。下面是一个
清理版本从代码生成的正则表达式。

祝你好运!

 ^
   # Special case for APO/FPO/DPO addresses
   (
      [^'w'#]*
      (?<STREETLINE> .+? )
      (?<CITY> [AFD] PO )
      'W+
      (?<STATE> A [AEP] )
      'W+
      (?<ZIP> 'd{5} (?: -? 'd{4} )? )
      'W*
   )
 |         
   # Special case for PO boxes
   (
      'W*
      (?<STREETLINE> ( P ['.' ]? O ['.' ]? '  )? BOX '  [0-9]+ )
      'W+
      (?:
          (?:
              (?<CITY> [^'d,]+? )
              'W+
              (?<STATE>
                 'b
                 (?:AL|AK|AS|AZ|AR|Alabama|Alaska|American Samoa|Arizona|Arkansas)
                 'b
              )
          )
          'W*
      )?
      (?:
          (?<ZIP> 'd{5} (?: -? 'd{4} )? )
      )?
      'W*
   )
 |          
   (
       [^'w'#]*    # skip non-word chars except # (eg unit)
       (
         (
              (
                (?<NUMBER> 'd+ )
                (?<SECONDARYNUMBER> (-[0-9]) | ('-?[A-Z]) )
                (?='b)
              )                                                  # Unit-attached
           |          
             (?<NUMBER> 'd+ ['-' ]? 'd+ '/ 'd+ )                 # Fractional
           |
             (?<NUMBER> 'd+ -? 'd* )                             # Normal Number
           |
             (?<NUMBER>[NSWE]' ?'d+' ?[NSWE]' ?'d+)              # Wisconsin/Illinois
         )
       )
       'W*
       (?:
           # special case for addresses like 100 South Street
           (?:
               (?<STREET>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N'.|E'.|S'.|W'.|N'.E'.|S'.E'.|N'.W'.|S'.W'.)
               'W+
               (?<SUFFIX>ALLEY|ALY|ALLY|ALLEE|ALLEY|ALY)
               'b
           )
         |
           (?:
               (?<PREDIRECTIONAL>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N'.|E'.|S'.|W'.|N'.E'.|S'.E'.|N'.W'.|S'.W'.)
               'W+
           )?
           (?:
                (?<STREET> [^,]* 'd )
                (?:
                   [^'w,]*
                   (?<POSTDIRECTIONAL>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N'.|E'.|S'.|W'.|N'.E'.|S'.E'.|N'.W'.|S'.W'.)
                   'b
                )
             |
                (?<STREET> [^,]+ )
                (?:
                    [^'w,]+
                    (?<SUFFIX>ALLEY|ALY|ALLY|ALLEE|ALLEY|ALY)
                    'b
                )
                (?:
                    [^'w,]+
                    (?<POSTDIRECTIONAL>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N'.|E'.|S'.|W'.|N'.E'.|S'.E'.|N'.W'.|S'.W'.)
                    'b
                )?
             |
                (?<STREET> [^,]+? )
                (?:
                    [^'w,]+
                    (?<SUFFIX>ALLEY|ALY|ALLY|ALLEE|ALLEY|ALY)
                    'b
                )?
                (?:
                    [^'w,]+
                    (?<POSTDIRECTIONAL>North|East|South|West|Northeast|Southeast|Northwest|Southwest|N|E|S|W|NE|SE|NW|SW|N'.|E'.|S'.|W'.|N'.E'.|S'.E'.|N'.W'.|S'.W'.)
                    'b
                )?
           )
       )           
       'W+        
       (?:      
           (
               (
                  :?
                  (?:
                      (?:
                         (?<SECONDARYUNIT>APT|BLDG|DEPT|FL|HNGR|LOT|PIER|RM|SLIP|SPC|STOP|STE|TRLR|UNIT)
                         (?! [a-z] )
                         'W*
                       )
                    |
                       (?<SECONDARYUNIT> '# )
                       'W*
                  )
                  (?<SECONDARYNUMBER> ['w-]+ )
               )
             |
               (?<SECONDARYUNIT>BSMT|FRNT|LBBY|LOWR|OFC|PH|REAR|SIDE|UPPR)
               'b
           )
           ,?
           'W+
       )?
       (?:
           (?:
               (?<CITY> [^'d,]+? )
               'W+
               (?<STATE>
                  'b
                  (?:AL|AK|AS|AZ|AR|Alabama|Alaska|American Samoa|Arizona|Arkansas)
                  'b
               )
           )
           'W*
       )?
       (?:
           (?<ZIP> 'd{5} (?: -? 'd{4} )? )
       )?
       'W*         # require on non-word chars at end
   )
 $           # right up to end of string

C# 代码

   public static void InitializeRegex()
    {
        Dictionary<string, string> suffixes = new Dictionary<string, string>()
        {
          {"ALLEY",  "ALLEE"},
          {"ALY",  "ALLEY"},
          {"ALLY",  "ALY"},
        };
        var suffixPattern = new Regex(
            string.Join(
                "|",
                new[] {
            string.Join("|", suffixes.Keys.ToArray()), 
            string.Join("|", suffixes.Values.Distinct().ToArray())
        }),
            RegexOptions.Compiled);
        //Console.WriteLine("'n"+suffixPattern);
        Dictionary<string, string> states = new Dictionary<string, string>()
        {
           {"AL", "Alabama"},
           {"AK", "Alaska"},
           {"AS",  "American Samoa"},
           {"AZ",  "Arizona"},
           {"AR", "Arkansas"}
        };
        var statePattern =
            @"'b(?:" +
            string.Join(
                "|",
                new[] {
            string.Join("|", states.Keys.Select(x => Regex.Escape(x)).ToArray()),
            string.Join("|", states.Values.ToArray())
        }) +
            @")'b";
        //Console.WriteLine("'n" + statePattern);
        Dictionary<string, string> directionals = new Dictionary<string, string>()
        {
           {"North", "N" },
           {"East", "E" },
           {"South", "S" },
           {"West", "W" },
           {"Northeast", "NE" },
           {"Southeast", "SE" },
           {"Northwest", "NW" },
           {"Southwest", "SW" }
        };
        var directionalPattern =
            string.Join(
                "|",
                new[] {
            string.Join("|", directionals.Keys.ToArray()),
            string.Join("|", directionals.Values.ToArray()),
            string.Join("|", directionals.Values.Select(x => Regex.Replace(x, @"('w)", @"$1'.")).ToArray())
        });
        //Console.WriteLine("'n" + directionalPattern);
        var zipPattern = @"'d{5}(?:-?'d{4})?";
        //Console.WriteLine("'n" + zipPattern);
        var numberPattern =
            @"(
                ((?<NUMBER>'d+)(?<SECONDARYNUMBER>(-[0-9])|('-?[A-Z]))(?='b))    # Unit-attached
                |(?<NUMBER>'d+['-' ]?'d+'/'d+)                                   # Fractional
                |(?<NUMBER>'d+-?'d*)                                             # Normal Number
                |(?<NUMBER>[NSWE]' ?'d+' ?[NSWE]' ?'d+)                          # Wisconsin/Illinois
             )";
        //Console.WriteLine("'n" + numberPattern);
        var streetPattern =
            string.Format(
                CultureInfo.InvariantCulture,
                @"
                    (?:
                      # special case for addresses like 100 South Street
                      (?:(?<STREET>{0})'W+
                         (?<SUFFIX>{1})'b)
                      |
                      (?:(?<PREDIRECTIONAL>{0})'W+)?
                      (?:
                        (?<STREET>[^,]*'d)
                        (?:[^'w,]*(?<POSTDIRECTIONAL>{0})'b)
                       |
                        (?<STREET>[^,]+)
                        (?:[^'w,]+(?<SUFFIX>{1})'b)
                        (?:[^'w,]+(?<POSTDIRECTIONAL>{0})'b)?
                       |
                        (?<STREET>[^,]+?)
                        (?:[^'w,]+(?<SUFFIX>{1})'b)?
                        (?:[^'w,]+(?<POSTDIRECTIONAL>{0})'b)?
                      )
                    )
                ",
                directionalPattern,
                suffixPattern);
        //Console.WriteLine("'n" + streetPattern);

        Dictionary<string, string> rangedSecondaryUnits = new Dictionary<string, string>()
        {
            {"APT",  "APARTMENT"},
            {"BLDG", "BUILDING"}, 
            {"DEPT", "DEPARTMENT"}, 
            {"FL",   "FLOOR"}, 
            {"HNGR", "HANGAR"}, 
            {"LOT",  "LOT"}, 
            {"PIER", "PIER"}, 
            {"RM",   "ROOM"}, 
            {"SLIP", "SLIP"}, 
            {"SPC",  "SPACE"}, 
            {"STOP", "STOP"}, 
            {"STE",  "SUITE"}, 
            {"TRLR", "TRAILER"}, 
            {"UNIT", "UNIT"} 
        };
        var rangedSecondaryUnitPattern =
            @"(?<SECONDARYUNIT>" +
            string.Join("|", rangedSecondaryUnits.Keys.ToArray()) +
            @")(?![a-z])";
        //Console.WriteLine("'n" + rangedSecondaryUnitPattern);

        Dictionary<string, string> rangelessSecondaryUnits = new Dictionary<string, string>()
        {
            {"BSMT", "BASEMENT"},
            {"FRNT", "FRONT"},
            {"LBBY", "LOBBY"},
            {"LOWR", "LOWER"},
            {"OFC",  "OFFICE"},
            {"PH",   "PENTHOUSE"},
            {"REAR", "REAR"},
            {"SIDE", "SIDE"},
            {"UPPR", "UPPER"}
        };
        var rangelessSecondaryUnitPattern =
            @"(?<SECONDARYUNIT>" +
            string.Join("|", rangelessSecondaryUnits.Keys.ToArray()) +
            @")'b";
        //Console.WriteLine("'n" + rangelessSecondaryUnitPattern);
        var allSecondaryUnitPattern = string.Format(
            CultureInfo.InvariantCulture,
            @"
                (
                    (:?
                        (?: (?:{0} 'W*)
                            | (?<SECONDARYUNIT>'#)'W*
                        )
                        (?<SECONDARYNUMBER>['w-]+)
                    )
                    |{1}
                ),?
            ",
             rangedSecondaryUnitPattern,
             rangelessSecondaryUnitPattern);
        //Console.WriteLine("'n" + allSecondaryUnitPattern);
        var cityAndStatePattern = string.Format(
            CultureInfo.InvariantCulture,
            @"
                (?:
                    (?<CITY>[^'d,]+?)'W+
                    (?<STATE>{0})
                )
            ",
            statePattern);
        //Console.WriteLine("'n" + cityAndStatePattern);
        var placePattern = string.Format(
            CultureInfo.InvariantCulture,
            @"
                (?:{0}'W*)?
                (?:(?<ZIP>{1}))?
            ",
            cityAndStatePattern,
            zipPattern);
        //Console.WriteLine("'n" + placePattern);
        var addressPattern = string.Format(
            CultureInfo.InvariantCulture,
            @"
                ^
                # Special case for APO/FPO/DPO addresses
                (
                    [^'w'#]*
                    (?<STREETLINE>.+?)
                    (?<CITY>[AFD]PO)'W+
                    (?<STATE>A[AEP])'W+
                    (?<ZIP>{4})
                    'W*
                )
                |
                # Special case for PO boxes
                (
                    'W*
                    (?<STREETLINE>(P['.' ]?O['.' ]?' )?BOX' [0-9]+)'W+
                    {3}
                    'W*
                )
                |
                (
                    [^'w'#]*    # skip non-word chars except # (eg unit)
                    (  {0} )'W*
                       {1}'W+
                    (?:{2}'W+)?
                       {3}
                    'W*         # require on non-word chars at end
                )
                $           # right up to end of string
            ",
            numberPattern,
            streetPattern,
            allSecondaryUnitPattern,
            placePattern,
            zipPattern);
        Console.WriteLine("'n-----------------------------'n'n" + addressPattern);
        var addressRegex = new Regex(
            addressPattern,
            RegexOptions.Compiled |
            RegexOptions.Singleline |
            RegexOptions.IgnorePatternWhitespace);
    }

像这样逐渐增加资源使用是灾难性回溯的吸烟枪。基本上,如果你有类似的东西,比如说,这部分:

(?<CITY>[^'d,]+?)'W+

。然后,关于输入的哪一部分与模式的哪一部分匹配,就会有歧义。几乎任何与'W匹配的东西也可以匹配[^'d,] .如果输入在第一次传递时不匹配,引擎将返回并尝试这两个组的不同排列,这会消耗资源。

例如,假设输入的"city"部分后面有一大堆空格。一长串空格将匹配[^'d,]+?'W+,因此不清楚CITY组是否包含空格。基于这些量词的懒惰/贪婪行为,引擎将尝试仅将城市名称放入[^'d,]+?并将所有空格放入'W+中。然后它将继续前进并尝试匹配其余输入。

如果其余输入在第一次尝试时匹配,很好。但是,如果匹配失败,它将不得不返回并重试,这次其中一个空间被[^'d,]+?匹配并捕获为CITY组的一部分。当失败时,它将再次尝试使用其中两个空格,依此类推。

您通常会看到这成为嵌套量词的问题,例如 ([ABC]+)* .我在你的模式中没有看到任何正在发生的事情,但我可能在string.Format的所有电话中都错过了一些东西。我的猜测是,这是一个如此长的模式,并且有如此多的量词和交流发电机要回溯(以及如此多的组要存储(,即使是单一级别的迭代也会杀死你。我敢打赌,使用与大多数模式匹配但无法匹配所有模式的长输入字符串,您会获得最大的性能影响。

在这种情况下,编译正则表达式可能会有所帮助,您应该这样做。但是,当你的应用程序同时有一千次(或多少次(点击时,我怀疑这会削减它。还会有一些输入字符串会导致大量的回溯,并在性能方面对你造成更大的打击。我最大的建议是找到并解决模式中的歧义。

查找具有大量量词(如*+(彼此靠近的地方,并确保它们之间有清晰的、非可选的分隔符(例如,NUMBER 组中的'd+-?'d*'d+(-'d*)?时表现更好,或者更好的'd+(-'d+)?'b(。最后,确保分隔符不能与旁边的标记匹配。举个虚构的例子,如果你给它一长串空格,像'W+' 'W+这样的东西会一直拖下去。