如何划分一个LINQ到对象查询?对象、LINQ

2023-09-04 23:53:40 作者:不拽怎能夺你心

这是一个资源分配的问题。我的目标是运行一个查询,以获取最优先转变为任何时隙。

该数据集是非常大的。在这个例子中,假设有各100班1000家公司(尽管真实数据还要大)。它们都加载到内存中,我需要对他们运行一个单一的LINQ to Objects查询:

  VAR topShifts =
            (从s轮班
            其中,(从S2轮班
                   其中,s2.CompanyId == s.CompanyId和放大器;&安培; s.TimeSlot == s2.TimeSlot
                   排序依据s2.Priority
                   选择S2)。首先()。等于(S)
            选择S).ToList();
 

现在的问题是,如果没有优化的LINQ to Objects将比较每一个对象在这两个集合,做一个交叉连接的所有1000×100对1000×100,这相当于10个十亿(百亿)的比较。我想是每一个公司内部只比较对象(如如果公司被收录在SQL表)。这应导致1000组100×100个对象,总共1000万美元(10,000,000)的比较。稍后,将线性扩展,而不是指数的公司数量的增长。

如 I4o技术会允许我做这样的事情,但不幸的是,我不吨有在这我执行这个查询的环境中使用自定义集合的奢侈品。另外,我只期望在任何给定的数据集运行此查询一次,所以一个持久索引的值是有限的。我期待使用扩展方法,将集团由公司的数据,然后运行对每个组的前pression。

LINQ的演变及其对C 设计的影响

全样本code:

 公共结构转变
{
    公共静态长迭代;

    私人诠释companyId;
    公众诠释CompanyId
    {
        获得{迭代++;返回companyId; }
        集合{companyId =价值; }
    }

    公众诠释标识;
    公众诠释时隙;
    公众诠释优先;
}

类节目
{
    静态无效的主要(字串[] args)
    {
        const int的公司= 1000;
        const int的班次= 100;
        Console.WriteLine(的String.Format({0}连×{1}班次,公司,班次));
        VAR定时器= Stopwatch.StartNew();

        Console.WriteLine(填充数据);
        VAR的变化=新的名单,其中,移>();
        对于(INT companyId = 0; companyId<公司; companyId ++)
        {
            对于(INT shiftId = 0; shiftId<班次; shiftId ++)
            {
                shifts.Add(新Shift键(){CompanyId = companyId,ID = shiftId,时隙= shiftId / 3,优先= shiftId%5});
            }
        }
        Console.WriteLine(的String.Format(已完成在{0:N}毫秒,timer.ElapsedMilliseconds));
        timer.Restart();

        Console.WriteLine(计算顶档);
        VAR topShifts =
                (从s轮班
                其中,(从S2轮班
                       其中,s2.CompanyId == s.CompanyId和放大器;&安培; s.TimeSlot == s2.TimeSlot
                       排序依据s2.Priority
                       选择S2)。首先()。等于(S)
                选择S).ToList();
        Console.WriteLine(的String.Format(已完成在{0:N}毫秒,timer.ElapsedMilliseconds));
        timer.Restart();

        Console.WriteLine(\ nShifts:);
        的foreach(在shifts.Take变种移(20))
        {
            Console.WriteLine(的String.Format(C {0}标识{1} T {2} p {3},shift.CompanyId,shift.Id,shift.TimeSlot,shift.Priority));
        }

        Console.WriteLine(\ NTOP班次:);
        的foreach(在topShifts.Take变种移(10))
        {
            Console.WriteLine(的String.Format(C {0}标识{1} T {2} p {3},shift.CompanyId,shift.Id,shift.TimeSlot,shift.Priority));
        }

        Console.WriteLine(的String.Format(\ n总计比较:{0:N},Shift.Iterations / 2));

        Console.WriteLine(任意键继续);
        Console.ReadKey();
    }
}
 

示例输出:

  

1000公司×100班次   填充数据   完成10.00ms   计算顶班次   完成520,721.00ms      班次:   C 0编号0 T 0 P0   C 0 ID 1 T 0的P1   C 0编号2 T 0 P2   C 0标识3个T 1 P3   C 0 ID 4牛逼1 P4   C 0标识5吨1 P0   C 0标识6吨2 P1   C 0 ID 7牛逼2 P2   C 0标识8,T 2 P3   C 0 ID 9 T第3 P4   C 0编号10吨3 P0   C 0 ID 11 T第3 P1   C 0编号12吨4 P2   C 0编号13吨4 P3   C 0编号14吨4 P4   C 0编号15吨5 P0   C 0编号16吨5 P1   C 0编号17牛逼5 P2   C 0编号18 T 6 P3   C 0编号19 T 6 P4      热门班次:   C 0编号0 T 0 P0   C 0标识5吨1 P0   C 0标识6吨2 P1   C 0编号10吨3 P0   C 0编号12吨4 P2   C 0编号15吨5 P0   C 0编号20吨6 P0   C 0编号21 T 7 P1   C 0编号25 T 8 P0   C 0 ID 27牛逼9 P2      总的比较:10,000,000,015.00   任意键继续

问题:

如何分区查询(同时仍然执行作为一个单一的LINQ查询),以获得比较下降10十亿10万? 是否有解决一个子查询的问题,而不是一种更有效的方式是什么? 解决方案

如何

  VAR topShifts =从S在shifts.GroupBy(S => s.CompanyId)
                        从在s.GroupBy(B => b.T​​imeSlot)
                        选择a.OrderBy(P => p.Priority)。首先();
 

似乎得到了相同的输出,但100015的比较

与@杰夫的编辑,他只是我的一半比较: - )

This is a resource-allocation problem. My goal is to run a query to fetch the top-priority shift for any time-slot.

The dataset is very large. For this example, let’s say there are 100 shifts each for 1000 companies (though the real dataset is even larger). They are all loaded into memory, and I need to run a single LINQ to Objects query against them:

    var topShifts =
            (from s in shifts
            where (from s2 in shifts
                   where s2.CompanyId == s.CompanyId && s.TimeSlot == s2.TimeSlot
                   orderby s2.Priority
                   select s2).First().Equals(s)
            select s).ToList();

The problem is that without optimization, LINQ to Objects will compare each and every object in both sets, doing a cross-join of all 1,000 x 100 against 1,000 x 100, which amounts to 10 billion (10,000,000,000) comparisons. What I want is to compare only objects within each company (as if Company were indexed in a SQL table). This should result in 1000 sets of 100 x 100 objects for a total of 10 million (10,000,000) comparisons. The later would scale linearly rather than exponentially as the number of companies grows.

Technologies like I4o would allow me to do something like this, but unfortunately, I don’t have the luxury of using a custom collection in the environment in which I’m executing this query. Also, I only expect to run this query once on any given dataset, so the value of a persistent index is limited. I’m expecting to use an extension method which would group the data by company, then run the expression on each group.

Full Sample code:

public struct Shift
{
    public static long Iterations;

    private int companyId;
    public int CompanyId
    {
        get { Iterations++; return companyId; }
        set { companyId = value; }
    }

    public int Id;
    public int TimeSlot;
    public int Priority;
}

class Program
{
    static void Main(string[] args)
    {
        const int Companies = 1000;
        const int Shifts = 100;
        Console.WriteLine(string.Format("{0} Companies x {1} Shifts", Companies, Shifts));
        var timer = Stopwatch.StartNew();

        Console.WriteLine("Populating data");
        var shifts = new List<Shift>();
        for (int companyId = 0; companyId < Companies; companyId++)
        {
            for (int shiftId = 0; shiftId < Shifts; shiftId++)
            {
                shifts.Add(new Shift() { CompanyId = companyId, Id = shiftId, TimeSlot = shiftId / 3, Priority = shiftId % 5 });
            }
        }
        Console.WriteLine(string.Format("Completed in {0:n}ms", timer.ElapsedMilliseconds));
        timer.Restart();

        Console.WriteLine("Computing Top Shifts");
        var topShifts =
                (from s in shifts
                where (from s2 in shifts
                       where s2.CompanyId == s.CompanyId && s.TimeSlot == s2.TimeSlot
                       orderby s2.Priority
                       select s2).First().Equals(s)
                select s).ToList();
        Console.WriteLine(string.Format("Completed in {0:n}ms", timer.ElapsedMilliseconds));
        timer.Restart();

        Console.WriteLine("\nShifts:");
        foreach (var shift in shifts.Take(20))
        {
            Console.WriteLine(string.Format("C {0} Id {1} T {2} P{3}", shift.CompanyId, shift.Id, shift.TimeSlot, shift.Priority));
        }

        Console.WriteLine("\nTop Shifts:");
        foreach (var shift in topShifts.Take(10))
        {
            Console.WriteLine(string.Format("C {0} Id {1} T {2} P{3}", shift.CompanyId, shift.Id, shift.TimeSlot, shift.Priority));
        }

        Console.WriteLine(string.Format("\nTotal Comparisons: {0:n}", Shift.Iterations/2));

        Console.WriteLine("Any key to continue");
        Console.ReadKey();
    }
}

Sample output:

1000 Companies x 100 Shifts Populating data Completed in 10.00ms Computing Top Shifts Completed in 520,721.00ms Shifts: C 0 Id 0 T 0 P0 C 0 Id 1 T 0 P1 C 0 Id 2 T 0 P2 C 0 Id 3 T 1 P3 C 0 Id 4 T 1 P4 C 0 Id 5 T 1 P0 C 0 Id 6 T 2 P1 C 0 Id 7 T 2 P2 C 0 Id 8 T 2 P3 C 0 Id 9 T 3 P4 C 0 Id 10 T 3 P0 C 0 Id 11 T 3 P1 C 0 Id 12 T 4 P2 C 0 Id 13 T 4 P3 C 0 Id 14 T 4 P4 C 0 Id 15 T 5 P0 C 0 Id 16 T 5 P1 C 0 Id 17 T 5 P2 C 0 Id 18 T 6 P3 C 0 Id 19 T 6 P4 Top Shifts: C 0 Id 0 T 0 P0 C 0 Id 5 T 1 P0 C 0 Id 6 T 2 P1 C 0 Id 10 T 3 P0 C 0 Id 12 T 4 P2 C 0 Id 15 T 5 P0 C 0 Id 20 T 6 P0 C 0 Id 21 T 7 P1 C 0 Id 25 T 8 P0 C 0 Id 27 T 9 P2 Total Comparisons: 10,000,000,015.00 Any key to continue

Questions:

How can I partition the query (while still executing as a single LinQ query) in order to get the comparisons down from 10 billion to 10 million? Is there a more efficient way of solving the problem instead of a sub-query?

解决方案

How about

            var topShifts = from s in shifts.GroupBy(s => s.CompanyId)
                        from a in s.GroupBy(b => b.TimeSlot)
                        select a.OrderBy(p => p.Priority).First();

Seems to get the same output but 100015 comparisons

with @Geoff's edit he just halved my comparisons :-)

 
精彩推荐