获取表中所有记录的样本

本文关键字:记录 样本 获取 | 更新日期: 2023-09-27 18:18:14

我有一个包含几百万条记录的表。我想要得到一个抽样,返回表中每n条记录。我当前的解决方案是这样的:

myQuery.Where((rec, index) => index % interval == 0);

然而,Linq to Entities不支持此操作并抛出异常。我还试了这个:

myQuery.Select((rec, index) => new { Index = index, Record = rec })
       .Where(x => x.Index % interval == 0);

但是Linq to Entities也不支持。

即使它是这样工作的,它看起来也不是很优雅。还有别的办法吗?

获取表中所有记录的样本

这里有一个类似的Linq to SQL的讨论,Linq to SQL每n行从表,看看这个。

我认为SQL是正确的选择。在EF你可以把你的SQL定义查询

您似乎没有定义一个顺序,其中n是表的索引。由于表中的数据是无序的,不指定顺序的记录号是没有意义的,因此n第th记录是未定义的。这是否意味着您只想要某个随机百分比的记录?如果是这样,那么随机抽取1%的数据样本的示例如下:

Customer[] onePercentSample = db.Customers.Take(db.Customer.Count() / 100).ToArray();

实际上不能保证这是一个随机样本。它只是从数据库中的所有记录中返回一个未定义的样本,它可以在多个查询中是相同的集合,也可以不是。同样,这是由于表本身是一个无序的记录集。

尝试按索引%interval排序。如果有效,可以将第一个间隔/总记录作为示例。

直接的SQL方式可能是我要怎么做,但如果你有顺序的id,你可以利用最小和最大:

注意,我用Linqpad测试了这个,它使用linq to sql -但我认为结果在EF中是一样的。

var a = ReceivedPayments.Select(c=>c.Id);
int interval = 50;
var ids = new List<int>();
int min = a.Min();
int max = a.Max();
for (int i = min; i < max; i++)
{
    if((i % interval) == 0)
    {
        ids.Add(i);
    } 
}
var b = ReceivedPayments.Where(c=>ids.Contains(c.Id));
Sql生成

SELECT MIN([t0].[Id]) AS [value]
FROM [ReceivedPayments] AS [t0]
GO
SELECT MAX([t0].[Id]) AS [value]
FROM [ReceivedPayments] AS [t0]
GO
-- Region Parameters
DECLARE @p0 Int = 50
DECLARE @p1 Int = 100
DECLARE @p2 Int = 150
DECLARE @p3 Int = 200
DECLARE @p4 Int = 250
DECLARE @p5 Int = 300
DECLARE @p6 Int = 350
DECLARE @p7 Int = 400
DECLARE @p8 Int = 450
DECLARE @p9 Int = 500
DECLARE @p10 Int = 550
DECLARE @p11 Int = 600
DECLARE @p12 Int = 650
DECLARE @p13 Int = 700
DECLARE @p14 Int = 750
DECLARE @p15 Int = 800
DECLARE @p16 Int = 850
DECLARE @p17 Int = 900
DECLARE @p18 Int = 950
DECLARE @p19 Int = 1000
DECLARE @p20 Int = 1050
DECLARE @p21 Int = 1100
DECLARE @p22 Int = 1150
DECLARE @p23 Int = 1200
DECLARE @p24 Int = 1250
DECLARE @p25 Int = 1300
DECLARE @p26 Int = 1350
DECLARE @p27 Int = 1400
DECLARE @p28 Int = 1450
DECLARE @p29 Int = 1500
DECLARE @p30 Int = 1550
DECLARE @p31 Int = 1600
DECLARE @p32 Int = 1650
DECLARE @p33 Int = 1700
DECLARE @p34 Int = 1750
DECLARE @p35 Int = 1800
DECLARE @p36 Int = 1850
DECLARE @p37 Int = 1900
DECLARE @p38 Int = 1950
DECLARE @p39 Int = 2000
DECLARE @p40 Int = 2050
DECLARE @p41 Int = 2100
DECLARE @p42 Int = 2150
DECLARE @p43 Int = 2200
DECLARE @p44 Int = 2250
DECLARE @p45 Int = 2300
DECLARE @p46 Int = 2350
DECLARE @p47 Int = 2400
DECLARE @p48 Int = 2450
DECLARE @p49 Int = 2500
DECLARE @p50 Int = 2550
DECLARE @p51 Int = 2600
DECLARE @p52 Int = 2650
DECLARE @p53 Int = 2700
DECLARE @p54 Int = 2750
DECLARE @p55 Int = 2800
DECLARE @p56 Int = 2850
DECLARE @p57 Int = 2900
DECLARE @p58 Int = 2950
DECLARE @p59 Int = 3000
DECLARE @p60 Int = 3050
DECLARE @p61 Int = 3100
DECLARE @p62 Int = 3150
DECLARE @p63 Int = 3200
DECLARE @p64 Int = 3250
DECLARE @p65 Int = 3300
DECLARE @p66 Int = 3350
DECLARE @p67 Int = 3400
DECLARE @p68 Int = 3450
DECLARE @p69 Int = 3500
DECLARE @p70 Int = 3550
DECLARE @p71 Int = 3600
DECLARE @p72 Int = 3650
DECLARE @p73 Int = 3700
DECLARE @p74 Int = 3750
DECLARE @p75 Int = 3800
DECLARE @p76 Int = 3850
DECLARE @p77 Int = 3900
DECLARE @p78 Int = 3950
DECLARE @p79 Int = 4000
DECLARE @p80 Int = 4050
DECLARE @p81 Int = 4100
DECLARE @p82 Int = 4150
DECLARE @p83 Int = 4200
DECLARE @p84 Int = 4250
DECLARE @p85 Int = 4300
DECLARE @p86 Int = 4350
DECLARE @p87 Int = 4400
DECLARE @p88 Int = 4450
DECLARE @p89 Int = 4500
DECLARE @p90 Int = 4550
DECLARE @p91 Int = 4600
DECLARE @p92 Int = 4650
DECLARE @p93 Int = 4700
DECLARE @p94 Int = 4750
DECLARE @p95 Int = 4800
DECLARE @p96 Int = 4850
DECLARE @p97 Int = 4900
DECLARE @p98 Int = 4950
DECLARE @p99 Int = 5000
DECLARE @p100 Int = 5050
DECLARE @p101 Int = 5100
DECLARE @p102 Int = 5150
DECLARE @p103 Int = 5200
DECLARE @p104 Int = 5250
DECLARE @p105 Int = 5300
DECLARE @p106 Int = 5350
DECLARE @p107 Int = 5400
DECLARE @p108 Int = 5450
DECLARE @p109 Int = 5500
DECLARE @p110 Int = 5550
DECLARE @p111 Int = 5600
DECLARE @p112 Int = 5650
DECLARE @p113 Int = 5700
DECLARE @p114 Int = 5750
DECLARE @p115 Int = 5800
DECLARE @p116 Int = 5850
DECLARE @p117 Int = 5900
DECLARE @p118 Int = 5950
DECLARE @p119 Int = 6000
DECLARE @p120 Int = 6050
DECLARE @p121 Int = 6100
DECLARE @p122 Int = 6150
DECLARE @p123 Int = 6200
-- EndRegion
SELECT [t0].[Id], [t0].[TxnID], [t0].[TxnDate], [t0].[TotalAmount], [t0].[Memo], [t0].[AppliedToTxnTxnID], [t0].[AppliedToTxnTxnType], [t0].[AppliedToTxnAmount], [t0].[FQPrimaryKey], [t0].[RefNumber], [t0].[ARAccount_Id], [t0].[Customer_Id]
FROM [ReceivedPayments] AS [t0]
WHERE [t0].[Id] IN (@p0, @p1, @p2, @p3, @p4, @p5, @p6, @p7, @p8, @p9, @p10, @p11, @p12, @p13, @p14, @p15, @p16, @p17, @p18, @p19, @p20, @p21, @p22, @p23, @p24, @p25, @p26, @p27, @p28, @p29, @p30, @p31, @p32, @p33, @p34, @p35, @p36, @p37, @p38, @p39, @p40, @p41, @p42, @p43, @p44, @p45, @p46, @p47, @p48, @p49, @p50, @p51, @p52, @p53, @p54, @p55, @p56, @p57, @p58, @p59, @p60, @p61, @p62, @p63, @p64, @p65, @p66, @p67, @p68, @p69, @p70, @p71, @p72, @p73, @p74, @p75, @p76, @p77, @p78, @p79, @p80, @p81, @p82, @p83, @p84, @p85, @p86, @p87, @p88, @p89, @p90, @p91, @p92, @p93, @p94, @p95, @p96, @p97, @p98, @p99, @p100, @p101, @p102, @p103, @p104, @p105, @p106, @p107, @p108, @p109, @p110, @p111, @p112, @p113, @p114, @p115, @p116, @p117, @p118, @p119, @p120, @p121, @p122, @p123)

你总是可以用Skip and Take来做。我敢打赌它的效率非常低。

int n = 10;
int currentIndex = 0;
yourEntity current;
List<yourEntity> lstSampling = new List<yourEntity>();
while((current = context.yourEntities.Skip(currentIndex).Take(1).FirstOrDefault()) != null)
{
    listSampling.Add(current);
    currentIndex += n;
}