在1-多配置中基于公共字段数据点筛选两个列表的最快方法

本文关键字：两个列表方法筛选配置于公共数据字段 | 更新日期: 2023-09-27 18:03:09

这个都是关于performance的。我有两个主要的lists of objects(在这里，我将使用PEOPLE/PERSON作为替身)。首先，我需要通过First_Name property来创建filter one list -然后我需要创建two filtered lists from each master list based on shared date -一个列表只有一个名称，另一个列表有每个名称，但两个列表只包含匹配的日期条目(一个列表中的日期不存在于另一个列表中)。我已经写了一个pseudo-code来简化这个问题到下面的核心问题。阅读时请理解，"生日"不是最好的选择，因为每个人都有多个日期条目。因此，在阅读下面的代码时，请假设每个人大约有5000个"生日":

public class Person
{
    public string first_Name;
    public string last_Name;
    public DateTime birthday;
}
public class filter_People
{
    List<Person> Group_1 = new List<Person>();// filled from DB Table "1982 Graduates" Group_1 contains all names and all dates
    List<Person> Group_2 = new List<Person>();// filled from DB Table "1983 Graduates" Group_2 contains all names and all dates
    public void filter(List<Person> group_One, List<Person> group_Two)
    {
        Group_1 = group_One;
        Group_2 = group_Two;
        //create a list of distinct first names from Group_1
        List<string> distinct_Group_1_Name = Group_1.Select(p => p.first_Name).Distinct().ToList();
        //Compare each first name in Group_1 to EVERY first name in Group 2, using only records with matching birthdays
        Parallel.For(0, distinct_Group_1_Name.Count, dI => {
            //Step 1 - create a list of person out of group_1 that match the first name being iterated
            List<Person> first_Name_List_1 = Group_1.Where(m => m.first_Name == distinct_Group_1_Name[dI]).ToList();
            //first_Name_List_1 now contains a list of everyone named X (Tom). We need to find people from group 2 who match Tom's birthday - regardless of name
            //step 2 - find matching birthdays by JOINing the filtered name list against Group_2  
            DateTime[] merged_Dates = first_Name_List_1.Join(Group_2, d => d.birthday, b => b.birthday, (d, b) => b.birthday).ToArray();
            //Step 3 - create filtered lists where Filtered_Group_1 contains ONLY people named Tom, and Filtered_Group_2 contains people with ANY name sharing Tom's birthday. No duplicates, no missing dates.
            List<Person> Filtered_Group_1 = first_Name_List_1.Where(p => p.birthday.In(merged_Dates)).ToList();
            List<Person> Filtered_Group_2 = Group_2.Where(p => p.birthday.In(merged_Dates)).ToList();
            //Step 4 -- move on adn process the two filtered lists (outside scope of question)
            //each name in Group_1 will then be compared to EVERY name in Group_2 sharing the same birthday
            //compare_Groups(Filtered_Group_1,Filtered_Group_2)
        });
    }
}
public static class Extension
{
    public static bool In<T>(this T source, params T[] list)
    {
        return list.Contains(source);
    }
}

这里，我们的想法是从DB中获取two different master name lists，并创建日期匹配的子列表(一个只有一个名字，另一个有所有名字)，允许基于datasets的one-to-many comparison具有相同长度的匹配日期索引。最初的想法是简单地从DB加载列表，但是列表很长并且加载所有名称数据并且使用SELECT/WHERE/JOIN要快得多。我说"快得多"，但那是相对的。

我已经尝试将Group_1和Group_2转换为字典和使用键匹配日期。没有太大的改善。Group_1 has about 12Million records(每个about 4800 distinct names都有多个日期)和Group_2具有大致相同的数据，因此这里的输入是12Million条记录，输出是无数条记录。尽管我将此方法作为一个单独的Task运行，并将结果排队等待另一个线程处理，但分割这些列表并保持同步需要花费永远的时间。

另外，我意识到这段代码使用类Person没有多大意义，但它实际上只是使用伪代码的问题的一个代表。实际上，该方法对日期上的多个数据集进行排序，并对多个数据集进行相关性比较。

任何关于如何以更有效的方式过滤这种一对多比较的帮助将不胜感激。

谢谢!

在1-多配置中基于公共字段数据点筛选两个列表的最快方法

在当前格式的代码中，我看到了太多的问题，使它成为您提到的数据类型的性能导向。Parallelism对于可怜的algorithm和data structure来说并不是灵丹妙药。

目前，对于每个比较，它都是linear search O(N)，因此M个操作都是M*O(N)，即使我们将这些操作设置为O(logN)，甚至更好的O(1)，执行时间也会有很大的改善。

代替取Distinct，然后在Parallel loop中使用Where子句进行搜索，使用GroupBy来aggregate / group记录，并在相同的操作中创建字典，这将确保轻松搜索给定名称的记录

var nameGroupList = Group_1.GroupBy(p => p.first_Name).ToDictionary(p => p.Key, p => p);

这将帮助您摆脱原始代码中的以下两个操作(其中一个在Parallel中是重复操作，这会严重损害性能)

List<string> distinct_Group_1_Name = Group_1.Select(p => p.first_Name).Distinct().ToList();
List<Person> first_Name_List_1 = Group_1.Where(m => m.first_Name == distinct_Group_1_Name[dI]).ToList();

Dictionary将属于Dictionary<string,IEnumerable<Person>>类型，因此您在O(1)时间内获得按名称排列的人员列表，并且没有重复搜索。这将处理的代码的另一个问题是在搜索原始列表/data时重新创建列表。

下一个需要处理的部分，它会损害性能，就是像这样的代码

p.birthday.In(merged_Dates)

因为在扩展方法中，每次都将list.Contains作为O(N)操作运行，这会降低性能。以下是可能的选项:

将以下操作也从Parallel循环中取出:

DateTime[] merged_Dates = first_Name_List_1.Join(Group_2, d => d.birthday, b => b.birthday, (d, b) => b.birthday).ToArray();

而不是创建类型为Dictionary<string, Hashset<DateTime>>的另一个Dictionary，通过交叉Dictionary<string,IEnumerable<Person>>先前创建的数据，使用来自Group2的数据，您可以为DateTime使用适当的IEqualityComparer，因此日期列表/数组的现成计算器将可用，无需每次都创建:

personDictionary["PersonCode"].Intersect(Group2,IEqualityComparer(using Date))

对于最终结果，请注意，您应该将结果存储为HashSet而不是List。好处是Contains将是O(log(N))操作而不是O(N)，从而使其更快。事实上，像Dictionary<string, Dictionary<DateTime,DateTime>>这样的结构也很好，这将使它成为O(1)操作。

尝试这些点，并建议是否有任何改进在代码的工作。