What is the most efficient way to compare a large list of integers against a smaller list of integers?

Updated: 2023-09-27 18:35:10

At the moment I have a list of 1 million integers, and I check each integer against a blacklist of 2,000 integers. This takes about 2 minutes.

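for(int i = 0; i < MillionIntegerList.Length; i++)
{
    // Compare the list elements themselves, not the loop indices
    for(int blacklisted = 0; blacklisted < TwoThousandIntegerList.Length; blacklisted++)
        if(MillionIntegerList[i] == TwoThousandIntegerList[blacklisted])
            MillionIntegerList[i] = 0; //Zero is a sentinel value
}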

That makes 2,000,000,000 iterations (loops) in total. Is there a better way that I'm not seeing? Thanks.


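There are three options now. The first two are more general, in that they don't rely on MillionIntegerList being sorted (which wasn't originally specified). The third is preferable when the large list is already sorted.

Option 1

Yes, there's definitely a better way, using LINQ: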

var common = MillionIntegerList.Intersect(TwoThousandIntegerList).ToList();

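This will internally use a HashSet<int> built from TwoThousandIntegerList, then look up each element of MillionIntegerList within it, which is far more efficient than going through the whole of TwoThousandIntegerList each time.

If you only want the values that are not blacklisted, you need: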

var valid = MillionIntegerList.Except(TwoThousandIntegerList).ToList();

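Note that if you only need to iterate over the results once, you should remove the ToList call. I've included it to materialize the results so that they can be examined multiple times cheaply. If you're just iterating, the return value of Intersect or Except will simply stream the results, making it much cheaper in terms of memory usage.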
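To make the difference between the two calls concrete, here is a small sketch. The class and method names (BlacklistLinqDemo, Blacklisted, Allowed) are made up for illustration and are not from the question. One thing worth knowing: Intersect and Except are set operations, so both return distinct values.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class BlacklistLinqDemo
{
    // Values that occur in both sequences (distinct, in the order of 'values').
    public static List<int> Blacklisted(IEnumerable<int> values, IEnumerable<int> blacklist)
    {
        return values.Intersect(blacklist).ToList();
    }

    // Values that are not on the blacklist (distinct, in the order of 'values').
    public static List<int> Allowed(IEnumerable<int> values, IEnumerable<int> blacklist)
    {
        return values.Except(blacklist).ToList();
    }
}
```

For example, calling Allowed with the values 1, 2, 2, 3, 4 and the blacklist 2, 4 leaves 1 and 3, while Blacklisted returns 2 and 4 once each.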

Option 2

If you don't want to rely on the implementation details of LINQ to Objects, but you still want a hash-based approach:

var hashSet = new HashSet<int>(TwoThousandIntegerList);
hashSet.IntersectWith(MillionIntegerList);
// Now use hashSet
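Note that IntersectWith modifies hashSet in place, so afterwards it holds only the blacklisted values that actually occur in the large list. If you would rather keep the large list's order and duplicates and just drop the blacklisted entries, a sketch along these lines might do. HashSetFilterDemo and RemoveBlacklisted are hypothetical names, not part of any library.

```csharp
using System.Collections.Generic;

public static class HashSetFilterDemo
{
    // Returns the elements of 'values' that are not in 'blacklist',
    // preserving the original order and any duplicates.
    public static List<int> RemoveBlacklisted(IReadOnlyList<int> values, IEnumerable<int> blacklist)
    {
        var blocked = new HashSet<int>(blacklist);
        var result = new List<int>(values.Count);
        foreach (int value in values)
        {
            // HashSet<int>.Contains is an O(1) lookup on average.
            if (!blocked.Contains(value))
            {
                result.Add(value);
            }
        }
        return result;
    }
}
```

This does one hash lookup per element of the large list, so roughly a million lookups instead of two billion comparisons.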

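Option 3

An approach that uses the fact that the large list is sorted would definitely be useful.

Assuming you don't mind sorting the blacklist first as well, you could write a streaming (and general-purpose) implementation like this (untested):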

// Note: to use this, you'd need to make sure that *both* sequences are sorted.
// You could either sort TwoThousandIntegerList in place, or use LINQ's OrderBy
// method. As an extension method, this must be static and live in a static class.
public static IEnumerable<T> SortedIntersect<T>(this IEnumerable<T> first,
    IEnumerable<T> second) where T : IComparable<T>
{
    using (var firstIterator = first.GetEnumerator())
    {
        if (!firstIterator.MoveNext())
        {
            yield break;
        }
        using (var secondIterator = second.GetEnumerator())
        {
            if (!secondIterator.MoveNext())
            {
                yield break;
            }
            T firstValue = firstIterator.Current;
            T secondValue = secondIterator.Current;
            while (true)
            {
                int comparison = firstValue.CompareTo(secondValue);
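                if (comparison == 0) // firstValue == secondValue
                {
                    yield return firstValue;
                    // Advance the first sequence; without this the loop never terminates
                    if (!firstIterator.MoveNext())
                    {
                        yield break;
                    }
                    firstValue = firstIterator.Current;
                }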
                else if (comparison < 0) // firstValue < secondValue
                {
                    if (!firstIterator.MoveNext())
                    {
                        yield break;
                    }
                    firstValue = firstIterator.Current;
                }
                else // firstValue > secondValue
                {
                    if (!secondIterator.MoveNext())
                    {
                        yield break;
                    }
                    secondValue = secondIterator.Current;
                }  
            }                
        }
    }
}

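(You could take an IComparer<T> if you wanted, instead of relying on T being comparable.)

Since the large list is sorted, you will probably get the best results by sorting the small list (which is very fast) and then doing a linear merge. You only need to look at each item in the large (and small) list once, and there is no need to create a hash table in the background.

See the merge step of MergeSort for an idea of how to do this.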
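A rough sketch of that merge-style pass, assuming both inputs are already sorted in ascending order. SortedMergeDemo and ExceptSorted are illustrative names I've made up, and this version keeps the non-blacklisted values rather than reporting the matches.

```csharp
using System.Collections.Generic;

public static class SortedMergeDemo
{
    // Both inputs must be sorted ascending. Returns the elements of 'values'
    // that do not occur in 'blacklist', keeping duplicates.
    public static List<int> ExceptSorted(IReadOnlyList<int> values, IReadOnlyList<int> blacklist)
    {
        var result = new List<int>();
        int b = 0; // current position in the blacklist; it only ever moves forward
        foreach (int value in values)
        {
            // Skip blacklist entries that are smaller than the current value.
            while (b < blacklist.Count && blacklist[b] < value)
            {
                b++;
            }
            bool blocked = b < blacklist.Count && blacklist[b] == value;
            if (!blocked)
            {
                result.Add(value);
            }
        }
        return result;
    }
}
```

Each list is walked once, so the pass is O(n + m) after the small list has been sorted.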

In my opinion, what you need is the Enumerable.Except method (IEnumerable, IEnumerable).

Check it out here: http://msdn.microsoft.com/en-us/library/bb300779.aspx

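Your approach takes O(n*m) time, where n is the size of the large list and m is the size of the blacklist. Consider the following optimizations: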

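1) If the integers are not too large, you can use a bool array (for example, if the largest possible integer is 1000000, use bool[] b = new bool[1000001] so that index 1000000 is valid). To add a number K to the blacklist, set b[K] = true. The check is then trivial. This works in O(n). You can also use a BitArray.

2) If the integers can be large, use a binary search tree to store the blacklist (for example SortedSet). It has O(logN) insertion and lookup time, so overall it is O(N*logN). The usage is much the same as with a List (Add(int K), Contains(int K)), and duplicates are ignored.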
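Both variants can be sketched as follows. LookupTableDemo, BuildFlags and BuildTree are hypothetical names, and the flag-array version assumes every blacklisted value lies between 0 and maxValue.

```csharp
using System.Collections.Generic;

public static class LookupTableDemo
{
    // Variant 1: one flag per possible value.
    // Assumes 0 <= value <= maxValue for every blacklisted value.
    public static bool[] BuildFlags(IEnumerable<int> blacklist, int maxValue)
    {
        var flags = new bool[maxValue + 1];
        foreach (int k in blacklist)
        {
            flags[k] = true;
        }
        return flags;
    }

    // Variant 2: a sorted tree-based set; works for any int values,
    // with O(log m) lookups and duplicates ignored.
    public static SortedSet<int> BuildTree(IEnumerable<int> blacklist)
    {
        return new SortedSet<int>(blacklist);
    }
}
```

A membership check is then flags[value] for the first variant and tree.Contains(value) for the second.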

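I think the best solution is to use a Bloom filter. When the Bloom filter says an element may be in the blacklist, you just check that it isn't a false positive (which can be done in O(log(n)) if the blacklist is sorted). This solution is time efficient and uses very little extra space, which makes it much better than using a hash set.

This is the solution Google uses for the blacklist in Chrome.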
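For illustration only, a very small Bloom filter with the binary-search confirmation step might look like this. BloomBlacklist is a made-up class, the two hash functions use arbitrary untuned constants, and bitCount must be greater than zero. It is a sketch of the idea, not how any particular product implements it.

```csharp
using System;
using System.Collections;

public sealed class BloomBlacklist
{
    private readonly BitArray bits;
    private readonly int[] sortedBlacklist;

    public BloomBlacklist(int[] blacklist, int bitCount)
    {
        bits = new BitArray(bitCount);
        // Keep a sorted copy so that possible hits can be confirmed exactly.
        sortedBlacklist = (int[])blacklist.Clone();
        Array.Sort(sortedBlacklist);
        foreach (int value in sortedBlacklist)
        {
            bits[Index1(value)] = true;
            bits[Index2(value)] = true;
        }
    }

    // Two simple multiplicative hashes (illustrative constants, not tuned).
    private int Index1(int value)
    {
        return unchecked((int)(((uint)value * 2654435761u) % (uint)bits.Length));
    }

    private int Index2(int value)
    {
        return unchecked((int)((((uint)value ^ 0x9E3779B9u) * 40503u) % (uint)bits.Length));
    }

    public bool Contains(int value)
    {
        // If either bit is clear, the value is definitely not blacklisted.
        if (!bits[Index1(value)] || !bits[Index2(value)])
        {
            return false;
        }
        // Possible false positive: confirm with a binary search, O(log m).
        return Array.BinarySearch(sortedBlacklist, value) >= 0;
    }
}
```

Most non-blacklisted values are rejected by the bit test alone, and the binary search only runs for the rare candidates that pass it.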

How about doing a binary search on the longer list, since it is sorted?

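foreach (int blacklisted in TwoThousandIntegerList)
{
    // Array.BinarySearch returns a non-negative index when the value is found
    // (for a List<int>, use MillionIntegerList.BinarySearch(blacklisted) instead)
    int index = Array.BinarySearch(MillionIntegerList, blacklisted);
    if (index >= 0)
    {
        //Do your stuff
    }
}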

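This solution costs only O(m log n) time, where m is the size of the small list and n is the size of the longer list. Caveat: this solution assumes that MillionIntegerList has no duplicate values.

If that is not the case, you can iterate over the duplicates, since they must lie in one contiguous block. For this, I will assume MillionIntegerList is a list of records, each with a value and an index.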

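foreach (int blacklisted in TwoThousandIntegerList)
{
    // Pseudocode: binary search on the records' value, negative if not found
    int index = MillionIntegerList.binarySearch(blacklisted);
    if (index < 0) continue;
    //Find the index of the first occurrence of blacklisted value
    while (index > 0 && MillionIntegerList[index - 1].value == blacklisted) {
        --index;
    }
    // Stop at the end of the list as well as at the end of the run
    while (index < MillionIntegerList.Count && MillionIntegerList[index].value == blacklisted) {
        //Do your stuff
        ++index;
    }
}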

This solution costs O(m log n + mk), where k is the average number of duplicates in MillionIntegerList for each blacklisted integer found.
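Here is a compilable sketch of the same idea on a plain sorted List<int>, rather than a list of records. BinarySearchDemo and CountBlacklisted are made-up names, and it assumes the blacklist itself has no duplicates, since a repeated entry would be counted twice.

```csharp
using System.Collections.Generic;

public static class BinarySearchDemo
{
    // 'sortedValues' must be sorted ascending. Returns how many elements
    // of 'sortedValues' are equal to some entry in 'blacklist'.
    public static int CountBlacklisted(List<int> sortedValues, IEnumerable<int> blacklist)
    {
        int count = 0;
        foreach (int blacklisted in blacklist)
        {
            // List<int>.BinarySearch returns a negative number when not found.
            int index = sortedValues.BinarySearch(blacklisted);
            if (index < 0)
            {
                continue;
            }
            // Walk back to the first occurrence in the run of duplicates.
            while (index > 0 && sortedValues[index - 1] == blacklisted)
            {
                index--;
            }
            // Visit every occurrence in the run.
            while (index < sortedValues.Count && sortedValues[index] == blacklisted)
            {
                count++; // stands in for "do your stuff"
                index++;
            }
        }
        return count;
    }
}
```

The bounds check in the second loop matters when the blacklisted value is the largest one in the list.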

Use a HashSet for the block list.

var blockedList = new HashSet<int>(TwoThousandIntegerList);
foreach (int i in MillionIntegerList)
{
    //check if blockedlist contains i
    if (blockedList.Contains(i))
    {
        //do what ever you like.
    }
}

Use the Except method on the List. That will work.