Fastest way to determine range overlap in Perl

2023-09-11 04:57:56 Author: 放下所有痛

I have two sets of ranges. Each range is a pair of integers (start and end) representing some sub-range of a single larger range. The two sets of ranges are in a structure similar to this (of course the ...s would be replaced with actual numbers).

$a_ranges =
{
  a_1 =>
  {
    start => ...,
    end   => ...,
  },
  a_2 =>
  {
    start => ...,
    end   => ...,
  },
  a_3 =>
  {
    start => ...,
    end   => ...,
  },
  # and so on
};

$b_ranges =
{
  b_1 =>
  {
    start => ...,
    end   => ...,
  },
  b_2 =>
  {
    start => ...,
    end   => ...,
  },
  b_3 =>
  {
    start => ...,
    end   => ...,
  },
  # and so on
};

I need to determine which ranges from set A overlap with which ranges from set B. Given two ranges, it's easy to determine whether they overlap. I've simply been using a double loop to do this--loop through all elements in set A in the outer loop, loop through all elements of set B in the inner loop, and keep track of which ones overlap.
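For reference, the double loop described above might look roughly like the sketch below. It assumes ranges are closed, so two ranges that merely share an endpoint count as overlapping. The names `ranges_overlap` and `overlaps_by_double_loop` are hypothetical and are not taken from the original code.

```perl
use strict;
use warnings;

# Two closed ranges overlap when each one starts no later than the other ends.
# Assumption: touching endpoints (e.g. 1..5 and 5..9) count as an overlap.
sub ranges_overlap {
    my ($x, $y) = @_;
    return $x->{start} <= $y->{end} && $y->{start} <= $x->{end};
}

# Compare every range in set A against every range in set B: O(m * n) checks.
# Returns a list of [a_id, b_id] pairs, one per overlap found.
sub overlaps_by_double_loop {
    my ($a_ranges, $b_ranges) = @_;
    my @pairs;
    for my $a_id (keys %$a_ranges) {
        for my $b_id (keys %$b_ranges) {
            push @pairs, [$a_id, $b_id]
                if ranges_overlap($a_ranges->{$a_id}, $b_ranges->{$b_id});
        }
    }
    return @pairs;
}
```

With a few thousand ranges in each set, the inner test runs millions of times even though only a handful of pairs ever match.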

I'm having two problems with this approach. First is that the overlap space is extremely sparse--even if there are thousands of ranges in each set, I expect each range from set A to overlap with maybe 1 or 2 ranges from set B. My approach enumerates every single possibility, which is overkill. This leads to my second problem--the fact that it scales very poorly. The code finishes very quickly (sub-minute) when there are hundreds of ranges in each set, but takes a very long time (+/- 30 minutes) when there are thousands of ranges in each set.

Is there a better way I can index these ranges so that I'm not doing so many unnecessary checks for overlap?

Update: The output I'm looking for is two hashes (one for each set of ranges) where the keys are range IDs and the values are the IDs of the ranges from the other set that overlap with the given range in this set.
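To make the desired output concrete, here is a small sketch that turns a list of overlapping ID pairs into the two hashes. It assumes the overlaps have already been found and are available as `[a_id, b_id]` array refs. The helper name `index_overlaps` and the sample IDs are made up for illustration.

```perl
use strict;
use warnings;

# Input: a list of [a_id, b_id] array refs, one per overlapping pair.
# Output: two hash refs mapping each range ID to the IDs it overlaps
# in the other set.
sub index_overlaps {
    my @pairs = @_;
    my (%a_overlaps, %b_overlaps);
    for my $pair (@pairs) {
        my ($a_id, $b_id) = @$pair;
        push @{ $a_overlaps{$a_id} }, $b_id;
        push @{ $b_overlaps{$b_id} }, $a_id;
    }
    return (\%a_overlaps, \%b_overlaps);
}

# Illustrative pairs only; real IDs would come from the overlap search.
my ($a_overlaps, $b_overlaps) =
    index_overlaps(['a_1', 'b_1'], ['a_2', 'b_1'], ['a_3', 'b_3']);
# $a_overlaps: { a_1 => ['b_1'], a_2 => ['b_1'], a_3 => ['b_3'] }
# $b_overlaps: { b_1 => ['a_1', 'a_2'], b_3 => ['a_3'] }
```

In this sketch a range with no overlaps simply has no key in its hash.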

Recommended answer

This sounds like a perfect use case for an interval tree, a data structure designed specifically to support this operation. If your two sets of intervals have sizes m and n, you can build one of them into an interval tree in time O(m lg m) and then run n intersection queries in total time O(n lg m + k), where k is the total number of intersections found. That gives a net runtime of O((m + n) lg m + k). Keep in mind that in the worst case k = O(nm), so this is no better than what you have now, but when intersections are sparse it can be substantially better than your current O(mn) runtime.
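As a rough illustration of the idea, here is a minimal Perl sketch of one simple interval tree variant: a balanced binary tree built once from set B sorted by start, where each node also records the largest end in its subtree so that whole subtrees can be skipped. This is not the only way to build an interval tree, and its exact bounds may differ slightly from those quoted above. It assumes closed ranges (touching endpoints overlap) and the `start`/`end` hash layout from the question. The function names `build_tree`, `query_tree`, and `find_overlaps` are hypothetical.

```perl
use strict;
use warnings;

# Build a balanced tree from intervals already sorted by start.
# Each node stores max_end, the largest end anywhere in its subtree.
sub build_tree {
    my ($sorted, $lo, $hi) = @_;
    return undef if $lo > $hi;    # explicit undef: also called in list context below
    my $mid  = int(($lo + $hi) / 2);
    my $node = {
        %{ $sorted->[$mid] },     # copies id, start, end
        left  => build_tree($sorted, $lo, $mid - 1),
        right => build_tree($sorted, $mid + 1, $hi),
    };
    $node->{max_end} = $node->{end};
    for my $child (grep { defined } @{$node}{qw(left right)}) {
        $node->{max_end} = $child->{max_end}
            if $child->{max_end} > $node->{max_end};
    }
    return $node;
}

# Append to @$hits the id of every stored interval overlapping [$start, $end].
sub query_tree {
    my ($node, $start, $end, $hits) = @_;
    # Skip the subtree if everything in it ends before the query starts.
    return if !defined $node || $node->{max_end} < $start;
    query_tree($node->{left}, $start, $end, $hits);
    push @$hits, $node->{id}
        if $node->{start} <= $end && $node->{end} >= $start;
    # Nodes to the right start at or after this node's start, so they can
    # only overlap if this node starts no later than the query ends.
    query_tree($node->{right}, $start, $end, $hits) if $node->{start} <= $end;
    return;
}

# Index set B once, then query it with each range from set A.
sub find_overlaps {
    my ($a_ranges, $b_ranges) = @_;
    my @sorted = sort { $a->{start} <=> $b->{start} }
                 map  { +{ id => $_, %{ $b_ranges->{$_} } } } keys %$b_ranges;
    my $tree = build_tree(\@sorted, 0, $#sorted);
    my (%a_hits, %b_hits);
    for my $a_id (keys %$a_ranges) {
        my @hits;
        query_tree($tree, @{ $a_ranges->{$a_id} }{qw(start end)}, \@hits);
        for my $b_id (@hits) {
            push @{ $a_hits{$a_id} }, $b_id;
            push @{ $b_hits{$b_id} }, $a_id;
        }
    }
    return (\%a_hits, \%b_hits);
}
```

Calling `my ($a_hits, $b_hits) = find_overlaps($a_ranges, $b_ranges);` returns the two hashes described in the update to the question. Ranges with no overlaps do not appear as keys. Building the tree from the larger of the two sets is a reasonable default, since each query then only descends into subtrees that can still contain a match.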

I don't have much experience working with interval trees (and zero experience in Perl, sorry!), but from the description they don't seem that hard to build. I'd be pretty surprised if a Perl implementation didn't already exist.

Hope this helps!