字符串解析和匹配算法

本文关键字：匹配算法字符串 | 更新日期: 2023-09-27 18:21:48

我正在解决以下问题：

假设我有一个软件包列表，它们的名称可能看起来像这样（唯一已知的是，这些名称的形式像SOMETHING + VERSION，这意味着版本总是在名称之后）：

Efficient.Exclusive.Zip.Archiver-PROPER.v.122.24-EXTENDED
Efficient.Exclusive.Zip.Archiver.123.01
Efficient-Exclusive.Zip.Archiver(2011)-126.24-X
Zip.Archiver14.06
Zip-Archiver.v15.08-T
Custom.Zip.Archiver1.08
Custom.Zip.Archiver1

现在，我需要解析这个列表，并只选择每个包的最新版本。对于这个例子，预期的结果是：

Efficient-Exclusive.Zip.Archiver(2011)-126.24-X
Zip-Archiver.v15.08-T
Custom.Zip.Archiver1.08

我目前使用的方法可以用以下方式描述：

Split the initial strings into groups by their starting letter,
ignoring spaces, case and special symbols.
(`E`, `Z`, `C` for the example list above)
Foreach element {
    Apply the regular expression (or a set of regular expressions),
    which tries to deduce the version from the string and perform
    the following conversion `STRING -> (VERSION, STRING_BEFORE_VERSION)`
    // Example for this step:
    // 'Efficient.Exclusive.Zip.Archiver-PROPER.v.122.24-EXTENDED' ->
    // (122.24, Efficient.Exclusive.Zip.Archiver-PROPER)
    Search through the corresponding group (in this example - the 'E' group)
    and find every other strings, which starts from the 'STRING_BEFORE_VERSION' or
    from it's significant part. This comparison is performed in ignore-case and
    ignore-special-symbols mode.
    // The matches for this step:
    // Efficient.Exclusive.Zip.Archiver-PROPER, {122.24}
    // Efficient.Exclusive.Zip.Archiver, {123.01}
    // Efficient-Exclusive.Zip.Archiver, {126.24, 2011}
    // The last one will get picked, because year is ignored.
    Get the possible version from each match, ***pick the latest, yield that match.***
    Remove every possible match (including the initial element) from the list.
}

这个算法（正如我所假设的）应该适用于O(N * V + N lg N * M)，其中M代表平均字符串匹配时间，V代表版本regexp工作时间。

然而，我怀疑有更好的解决方案（总是有！），可能是特定的数据结构或更好的匹配方法

如果可以提出一些建议或对当前方法做一些笔记，请毫不犹豫地这样做。

字符串解析和匹配算法

这个怎么样？（伪码）

Dictionary<string,string> latestPackages=new Dictionary<string,string>(packageNameComparer);
foreach element
{
    (package,version)=applyRegex(element);
    if(!latestPackages.ContainsKey(package) || isNewer)
    {
        latestPackages[package]=version;
    }
}
//print out latestPackages

字典操作是O（1），所以总运行时间是O（n）。不需要预先分组，您只存储当前最新的匹配项，而不是存储所有匹配项。

Dictionary有一个接受IEqualityComparer对象的构造函数。在那里，您可以实现自己的包名称之间相等的语义。但是请记住，您需要在此IEqualityComparer中实现一个GetHashCode方法，该方法应该为您认为相等的对象返回相同的值。为了重现上面的例子，您可以为字符串中的第一个字符返回一个哈希代码，这将重现您在字典中的分组。然而，使用更智能的哈希代码，您将获得更高的性能，它不会有太多冲突。如果仍然能产生好的结果，也许可以使用更多的字符。

我认为您可能会使用DAWG(http://en.wikipedia.org/wiki/Directed_acyclic_word_graph)这里的效果很好。我认为你可以简单地循环每个节点，直到你找到一个只有1个"子"的节点。在这个节点上，您将在树和下面的版本字符串中使用公共前缀"up"。从那里，通过删除所有不是数字或句点的内容来解析版本字符串，将字符串按句点拆分，并将数组的每个元素转换为整数。这应该为每个版本字符串提供一个int数组。识别最高版本，将其记录下来，然后转到只有1个子节点的下一个节点。

编辑：填充一个大的DAWG是一个相当昂贵的操作，但查找确实很快。