How can I make this code faster? List<custom_class>.Find(lambda_expression)

Keywords: class, expression, lambda, custom, code, List | Updated: 2023-09-27 18:15:00

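I need to match email sends with email bounces so I can find out whether they were delivered. The catch is that I have to limit a bounce to within 4 days of the send, to avoid matching the wrong send to a bounce. The send records span a period of 30 days.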

LinkedList<event_data> sent = GetMyHugeListOfSends(); //for example 1M+ records
List<event_data> bounced = GetMyListOfBounces(); //for example 150k records
bounced = bounced.OrderBy(o => o.event_date).ToList(); //this ensures the most accurate match of bounce to send (since we find the first match)
List<event_data> delivered = new List<event_data>();
event_data deliveredEmail = new event_data();
foreach (event_data sentEmail in sent)
{
     event_data bounce = bounced.Find(item => item.email.ToLower() == sentEmail.email.ToLower() && (item.event_date > sentEmail.event_date && item.event_date < sentEmail.event_date.AddDays(deliveredCalcDelayDays)));
     //create delivered records
     if (bounce != null)
     {
          //there was a bounce! don't add a delivered record!
          //remove bounce, it only applies to one send!
          bounced.Remove(bounce);
     }
     else
     {
          //if sent is not bounced, it's delivered
          deliveredEmail.sid = siteid;
          deliveredEmail.mlid = mlid;
          deliveredEmail.mid = mid;
          deliveredEmail.email = sentEmail.email;
          deliveredEmail.event_date = sentEmail.event_date;
          deliveredEmail.event_status = "Delivered";
          deliveredEmail.event_type = "Delivered";
          deliveredEmail.id = sentEmail.id;
          deliveredEmail.number = sentEmail.number;
          deliveredEmail.laststoretransaction = sentEmail.laststoretransaction;
          delivered.Add(deliveredEmail);   //add the new delivered
          deliveredEmail = new event_data();
     }
     if (bounced.Count() == 0)
     {
          break; //no more bounces to match!
     }
}
So I ran some tests, and it processes about 12 sent records per second. Processing 1M+ records will take 25+ hours!

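Two questions:

  1. How can I find out which line is taking the most time?
  2. I'm assuming it's the lambda expression that finds the bounce that takes the longest, since the loop was much faster before I put it in. How can I speed it up?

Thanks!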

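EDIT

--Thoughts--

  1. One idea that just occurred to me is to sort the sends by date, as I did for the bounces, so that searching through the bounces is more efficient, since an early send is likely to hit an early bounce as well.
  2. Another idea that just occurred to me is to run several of these processes in parallel, although I'd hate to multithread this simple application.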


I can say with a fair amount of confidence that yes, it's your Find that is taking the time.
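Short of running a real profiler, one low-tech way to check a suspicion like this is to wrap the suspect call in a System.Diagnostics.Stopwatch and accumulate the elapsed time across the loop. The following is only a minimal sketch: the class name FindTiming, the list of made-up addresses, and the sizes are all assumptions for illustration, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class FindTiming
{
    // Runs `lookups` linear searches over a list of `size` made-up addresses
    // and returns the total time spent inside List<T>.Find alone.
    public static TimeSpan TimeFind(int size, int lookups)
    {
        var items = new List<string>(size);
        for (int i = 0; i < size; i++)
        {
            items.Add("user" + i + "@example.com");
        }

        // Worst case for a linear search: the target is the last item.
        string target = "user" + (size - 1) + "@example.com";

        var findTimer = new Stopwatch();
        for (int i = 0; i < lookups; i++)
        {
            findTimer.Start();   // resumes timing, keeps the accumulated total
            string hit = items.Find(item => item == target);
            findTimer.Stop();

            if (hit == null)
            {
                throw new InvalidOperationException("expected a match");
            }
        }
        return findTimer.Elapsed;
    }

    public static void Main()
    {
        // Hypothetical sizes, loosely modeled on the 150k bounces in the question.
        Console.WriteLine("Time inside Find: " + TimeFind(150000, 200));
    }
}
```

Comparing that total against the wall-clock time of the whole loop shows how much of the run is spent searching.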

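It looks like you're sure the Find method will only ever return 0 or 1 records (not a list). In that case, the way to speed this up is to create a lookup (a dictionary): instead of a List<event_data> for your bounced variable, create a Dictionary<key, event_data>, so you can look the value up by key instead of doing a Find.

The trick is in creating your key (I don't know enough about your app to help with that), but it is essentially the same criteria as in your Find.

EDIT (adding some pseudocode)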

void Main()
{
    var hugeListOfEmails = GetHugeListOfEmails();
    var allBouncedEmails = GetAllBouncedEmails();
    // Build the lookup once, up front. The local needs a name different from the method's.
    IDictionary<string, EmailInfo> bouncedLookup = CreateLookupOfBouncedEmails(allBouncedEmails);
    foreach(var info in hugeListOfEmails)
    {
        if(bouncedLookup.ContainsKey(info.emailAddress))
        {
            // Email is bounced;
        }
        else
        {
            // Email is not bounced
        }
    }
}
public IEnumerable<EmailInfo> GetHugeListOfEmails()
{
    yield break;
}
public IEnumerable<EmailInfo> GetAllBouncedEmails()
{
    yield break;
}
public IDictionary<string, EmailInfo> CreateLookupOfBouncedEmails(IEnumerable<EmailInfo> emailList)
{
    var result = new Dictionary<string, EmailInfo>();
    foreach(var e in emailList)
    {
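        // Keep only the first entry seen for each address.
        if(!result.ContainsKey(e.emailAddress))
        {
            // Pseudocode: also check here that e satisfies the date conditions
            // before adding it.
            result.Add(e.emailAddress, e);
        }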
    }
    return result;
}
public class EmailInfo
{
    public string emailAddress { get; set; }
    public DateTime DateSent { get; set; }
}

You should use the ToLookup method to create a lookup table keyed by email address:

var bouncedLookup = bounced.ToLookup(k => k.email.ToLower());

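and use it inside the loop to find the first matching bounce for the email:

var filteredBounced = bouncedLookup[sentEmail.email.ToLower()];
// mini optimisation here
var endDate = sentEmail.event_date.AddDays(deliveredCalcDelayDays);
// the groups of a lookup are IEnumerable<T>, which has no Find; use FirstOrDefault (System.Linq)
event_data bounce = filteredBounced.FirstOrDefault(item => item.event_date > sentEmail.event_date && item.event_date < endDate);

I can't compile this right now, but I think it should do the job. Please give it a try.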
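For reference, here is a self-contained sketch of the ToLookup idea that does compile. The event_data class below is a stand-in with only the two fields the search needs, and LookupSketch and FindBounce are hypothetical names; the real type in the question has more fields.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the question's event_data; only the fields used here.
public class event_data
{
    public string email;
    public DateTime event_date;
}

public static class LookupSketch
{
    // Returns the earliest bounce for sentEmail that falls strictly inside
    // the window (send date, send date + deliveredCalcDelayDays), or null.
    public static event_data FindBounce(
        ILookup<string, event_data> bouncedLookup,
        event_data sentEmail,
        int deliveredCalcDelayDays)
    {
        DateTime endDate = sentEmail.event_date.AddDays(deliveredCalcDelayDays);

        // The lookup indexer returns an empty sequence for an unknown key,
        // so no null check is needed before filtering.
        return bouncedLookup[sentEmail.email.ToLowerInvariant()]
            .Where(b => b.event_date > sentEmail.event_date && b.event_date < endDate)
            .OrderBy(b => b.event_date)
            .FirstOrDefault();
    }

    public static void Main()
    {
        var bounced = new List<event_data>
        {
            new event_data { email = "A@example.com", event_date = new DateTime(2023, 1, 3) },
            new event_data { email = "a@example.com", event_date = new DateTime(2023, 1, 20) },
        };
        var bouncedLookup = bounced.ToLookup(b => b.email.ToLowerInvariant());

        var sent = new event_data { email = "a@EXAMPLE.com", event_date = new DateTime(2023, 1, 1) };
        event_data hit = FindBounce(bouncedLookup, sent, 4);
        Console.WriteLine(hit == null ? "delivered" : "bounced");
    }
}
```

One caveat: a lookup is read-only, so a matched bounce cannot be removed from it. If each bounce may only be matched to one send, that would have to be tracked separately, for example in a HashSet of already-used bounces.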

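You are finding items in a list. That means it has to traverse the whole list, so each search is an O(n) operation. Could you not store the sent emails in a dictionary, keyed by the email address you are searching on? Then go through the bounces, linking each one back to the email in the dictionary. Each lookup is constant time and you go through the bounces once, so it is O(n) overall. Your current method is O(n²).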
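A rough sketch of that approach follows. It is one possible reading of the suggestion, not a tested drop-in: event_data is a two-field stand-in, SendIndexSketch and GetDelivered are hypothetical names, and giving each bounce to the earliest eligible send is an assumption about the matching rule. The inner scan only covers the sends for a single address, so the cost stays close to linear as long as each address has few sends.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the question's event_data; only the fields used here.
public class event_data
{
    public string email;
    public DateTime event_date;
}

public static class SendIndexSketch
{
    public static List<event_data> GetDelivered(
        IEnumerable<event_data> sent,
        IEnumerable<event_data> bounced,
        int windowDays)
    {
        var sendList = sent.ToList();

        // One pass over the sends: group them by address, ignoring case.
        var sendsByEmail = new Dictionary<string, List<event_data>>(StringComparer.OrdinalIgnoreCase);
        foreach (var s in sendList)
        {
            List<event_data> list;
            if (!sendsByEmail.TryGetValue(s.email, out list))
            {
                list = new List<event_data>();
                sendsByEmail[s.email] = list;
            }
            list.Add(s);
        }

        // One pass over the bounces: each bounce flags at most one send.
        // HashSet uses reference equality here because event_data does not override Equals.
        var bouncedSends = new HashSet<event_data>();
        foreach (var b in bounced.OrderBy(x => x.event_date))
        {
            List<event_data> candidates;
            if (!sendsByEmail.TryGetValue(b.email, out candidates))
            {
                continue; // bounce for an address we never sent to
            }
            var match = candidates
                .Where(s => !bouncedSends.Contains(s)
                         && b.event_date > s.event_date
                         && b.event_date < s.event_date.AddDays(windowDays))
                .OrderBy(s => s.event_date)
                .FirstOrDefault();
            if (match != null)
            {
                bouncedSends.Add(match);
            }
        }

        // Every send that was not flagged counts as delivered.
        return sendList.Where(s => !bouncedSends.Contains(s)).ToList();
    }

    public static void Main()
    {
        var sent = new List<event_data>
        {
            new event_data { email = "a@example.com", event_date = new DateTime(2023, 1, 1) },
            new event_data { email = "b@example.com", event_date = new DateTime(2023, 1, 1) },
        };
        var bounced = new List<event_data>
        {
            new event_data { email = "A@example.com", event_date = new DateTime(2023, 1, 2) },
        };
        Console.WriteLine(GetDelivered(sent, bounced, 4).Count);
    }
}
```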

Converting bounced to a SortedList might be a good solution:

// note: ToDictionary throws if two bounces share the same email
SortedList<string, event_data> sl = new SortedList<string, event_data>(bounced.ToDictionary(s => s.email, s => s));

and to find a bounce use

sl.FirstOrDefault(c => c.Key.Equals(sentEmail.email, StringComparison.OrdinalIgnoreCase) && ...);

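There is one more issue with your code that I'd like to point out.

Memory consumption. I don't know your machine's configuration, but here are some thoughts about the code:

  1. Initially you allocate space for 1.2M+ objects of type event_data. I can't see the full definition of event_data, but assuming the emails are all unique and the type has quite a few properties, I'd guess that collection is fairly heavy (possibly hundreds of megabytes).
  2. Next, you allocate another batch of event_data objects (almost 1M, if I've counted right). Memory consumption keeps getting heavier.
  3. I don't know what other objects exist in your application's data model, but given everything above, you could easily get close to the memory limit of a 32-bit process and so force the GC to work very often. In fact, you could easily be getting a GC collection after every call to bounced.Remove(bounce); and that really would slow your application down significantly.

So even if you have plenty of memory and/or your app is 64-bit, I would try to minimize memory consumption. I'm pretty sure it will make your code run faster. For example, you could fully process each deliveredEmail without storing it, or load the initial event_data in chunks, etc.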

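Given that the number of bounces is relatively small, why not pre-optimize the bounce lookup as much as possible? This code creates a delegate for every possible bounce and groups them into a dictionary, keyed by email.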

// True when the bounce falls strictly after the send and inside the delay window.
private static bool DateInRange(
    DateTime sentDate,
    DateTime bouncedDate,
    int deliveredCalcDelayDays)
{
    if (bouncedDate <= sentDate)
    {
        return false;
    }
    return bouncedDate < sentDate.AddDays(deliveredCalcDelayDays);
}
static IEnumerable<event_data> GetDeliveredMails(
           IEnumerable<event_data> sent,
           IEnumerable<event_data> bounced,
           int siteId,
           int mlId,
           int mId,
           int deliveredCalcDelayDays)
{
    var grouped = bounced.GroupBy(
        b => b.email.ToLowerInvariant());
    var lookup = grouped.ToDictionary(
        g => g.Key,
        g => g.OrderBy(e => e.event_date).Select(
            e => new Func<DateTime, bool>(
                s => DateInRange(s, e.event_date, deliveredCalcDelayDays))).ToList());
    foreach (var s in sent)
    {
        var key = s.email.ToLowerInvariant();
        List<Func<DateTime, bool>> checks;
        if (lookup.TryGetValue(key, out checks))
        {
            var match = checks.FirstOrDefault(c => c(s.event_date));
            if (match != null)
            {
                checks.Remove(match);
                continue;
            }
        }
        yield return new event_data
            {
                sid = siteId,
                mlid = mlId,
                mid = mId,
                email = s.email,
                event_date = s.event_date,
                event_status = "Delivered",
                event_type = "Delivered",
                id = s.id,
                number = s.number,
                laststoretransaction = s.laststoretransaction
            };
    }
}

If that isn't fast enough, you could try precompiling the delegates in the lookup.

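OK, the final solution I settled on was a dictionary for the bounces.

The sent LinkedList is sorted by sent date, so the loop walks it in chronological order. That matters because I have to match the right send to the right bounce.

I made a Dictionary<string, List<event_data>>, so the key is the email and the value is a list of all the event_data bounces for that email address. Each list is sorted by event_date, because I want to make sure the first bounce is the one matched to the send.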
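The post does not show how that dictionary gets built. One possible way to build it, as a sketch only: BounceIndex and BuildBounceIndex are hypothetical names, event_data is reduced to the two fields needed, and treating addresses as case-insensitive is an assumption carried over from the ToLower() calls in the original question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the post's event_data; only the fields used here.
public class event_data
{
    public string email;
    public DateTime event_date;
}

public static class BounceIndex
{
    // Hypothetical helper: groups bounces by address (ignoring case) and
    // sorts each group by event_date, earliest first.
    public static Dictionary<string, List<event_data>> BuildBounceIndex(
        IEnumerable<event_data> bounces)
    {
        return bounces
            .GroupBy(b => b.email, StringComparer.OrdinalIgnoreCase)
            .ToDictionary(
                g => g.Key,
                g => g.OrderBy(b => b.event_date).ToList(),
                StringComparer.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        var bounces = new List<event_data>
        {
            new event_data { email = "a@example.com", event_date = new DateTime(2023, 1, 9) },
            new event_data { email = "A@example.com", event_date = new DateTime(2023, 1, 2) },
            new event_data { email = "b@example.com", event_date = new DateTime(2023, 1, 5) },
        };
        var index = BuildBounceIndex(bounces);

        List<event_data> forAddress;
        if (index.TryGetValue("A@EXAMPLE.COM", out forAddress))
        {
            Console.WriteLine(forAddress.Count + " bounces for this address");
        }
    }
}
```

Passing the comparer to the dictionary as well as to GroupBy means TryGetValue can be called with the send's address as-is, without lowercasing it on every iteration.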

The end result...

It went from 700 records/minute to 500k+ records/second. Here is the final code:

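LinkedList<event_data> sent = GetMyHugeListOfSends();
IEnumerable<event_data> sentOrdered = sent.OrderBy(send => send.event_date);
Dictionary<string, List<event_data>> bounced = BouncesAsDictionary();
List<event_data> delivered = new List<event_data>();
event_data deliveredEmail = new event_data();
List<event_data> bounces = null;
bool matchedBounce = false;
foreach (event_data sentEmail in sentOrdered)
{
 matchedBounce = false;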
 //create delivered records
 if (bounced.TryGetValue(sentEmail.email, out bounces))
 {
      //there was a bounce! find out if it was within 4 days after the send!
      foreach (event_data bounce in bounces)
      {
           if (bounce.event_date > sentEmail.event_date &&
               bounce.event_date <= sentEmail.event_date.AddDays(4))
           {
               matchedBounce = true;
               //remove the record because a bounce can only match once back to a send
               bounces.Remove(bounce);
               if(bounces.Count == 0) //no more bounces for this email
               {
                    bounced.Remove(sentEmail.email);
               }
               break;
          }
     }
     if (matchedBounce == false) //no matching bounces in the list!
     {
          //if sent is not bounced, it's delivered
          deliveredEmail.sid = siteid;
          deliveredEmail.mlid = mlid;
          deliveredEmail.mid = mid;
          deliveredEmail.email = sentEmail.email;
          deliveredEmail.event_date = sentEmail.event_date;
          deliveredEmail.event_status = "Delivered";
          deliveredEmail.event_type = "Delivered";
          deliveredEmail.id = sentEmail.id;
          deliveredEmail.number = sentEmail.number;
          deliveredEmail.laststoretransaction = sentEmail.laststoretransaction;
          delivered.Add(deliveredEmail);   //add the new delivered
          deliveredEmail = new event_data();
     }
 }
 else
 {
      //if sent is not bounced, it's delivered
      deliveredEmail.sid = siteid;
      deliveredEmail.mlid = mlid;
      deliveredEmail.mid = mid;
      deliveredEmail.email = sentEmail.email;
      deliveredEmail.event_date = sentEmail.event_date;
      deliveredEmail.event_status = "Delivered";
      deliveredEmail.event_type = "Delivered";
      deliveredEmail.id = sentEmail.id;
      deliveredEmail.number = sentEmail.number;
      deliveredEmail.laststoretransaction = sentEmail.laststoretransaction;
      delivered.Add(deliveredEmail);   //add the new delivered
      deliveredEmail = new event_data();
 }
 if (bounced.Count() == 0)
 {
      break; //no more bounces to match!
 }
}