整数范围的数据结构构造和查找集数据结构、整数、范围

2023-09-11 23:04:49 作者:在家吃辣条总要偷偷摸摸的

我有一组 UINT32 整数,可能有几百万的一组项目。其中50-70%是连续的,但在输入流中出现的未predictable订单。

I have a set of uint32 integers, there may be millions of items in the set. 50-70% of them are consecutive, but in input stream they appear in unpredictable order.

我需要:

的COM preSS这组为范围,以达到节省空间的重新presentation。已经实施这种利用琐碎的算法,因为范围只计算一次的速度在这里并不重要。之后产生的范围这一转变数量通常在5 000-10 000人,其中不少是当然的单项。

Compress this set into ranges to achieve space efficient representation. Already implemented this using trivial algorithm, since ranges computed only once speed is not important here. After this transformation number of resulting ranges is typically within 5 000-10 000, many of them are single-item, of course.

一些整数的试验的成员,则不需要关于特定范围中的设置信息。这必须非常快 - O(1)。想着最小完美哈希函数的,但他们不与范围内发挥好。 位集很浪费空间。其他结构,如二叉树,有O(log n)的复杂性,最糟糕的事情与他们的实施,使得许多条件跳转和处理器不能predict他们也给表现不佳。

Test membership of some integer, information about specific range in the set is not required. This one must be very fast -- O(1). Was thinking about minimal perfect hash functions, but they do not play well with ranges. Bitsets are very space inefficient. Other structures, like binary trees, has complexity of O(log n), worst thing with them that implementation make many conditional jumps and processor can not predict them well giving poor performance.

有什么数据结构或者算法专业整数范围内解决这个任务?

Is there any data structure or algorithm specialized in integer ranges to solve this task?

推荐答案

关于第二个问题:

您可以查找在布鲁姆过滤器。布鲁姆过滤器是专门设计来回答O(1)的成员资格问题,虽然反应要么是没有也许 (这是不那么明朗的是/否:P)。

You could look-up on Bloom Filters. Bloom Filters are specifically designed to answer the membership question in O(1), though the response is either no or maybe (which is not as clear cut as a yes/no :p).

也许的情况下,当然,您需要进一步处理,以实际回答这个问题(除非概率的回答是足够你的情况),但即使是这样的布鲁姆滤波器可以充当看门,并拒绝大部分查询的彻底

In the maybe case, of course, you need further processing to actually answer the question (unless a probabilistic answer is sufficient in your case), but even so the Bloom Filter may act as a gate keeper, and reject most of the queries outright.

此外,您可能想保留的实际范围和不同的结构简并的范围(单元素)。

Also, you might want to keep actual ranges and degenerate ranges (single elements) in different structures.

单个元件可以最好存储在哈希表 实际范围可以存储在排序后的数组

此减少存储在排序后的数组中元素的数量,并且因此在那里进行二进制搜索的复杂性。因为你的国家,许多范围是堕落,我认为你只是有一些500-1000的范围(幅度以内即一个数量级),和日志(1000)〜10

This diminishes the number of elements stored in the sorted array, and thus the complexity of the binary search performed there. Since you state that many ranges are degenerate, I take it that you only have some 500-1000 ranges (ie, an order of magnitude less), and log(1000) ~ 10

我会因此建议以下步骤:

I would therefore suggest the following steps:

在布隆过滤器:如果没有,停止 排序的真正范围数组:如果是的话,停止 单一元素的哈希表

数组排序测试首先进行,因为从一些你给(百万数量的合并在AA几千范围)如果一个数字被包含,没准它会在一个范围内,而不是单一的:)

The Sorted Array test is performed first, because from the number you give (millions of number coalesced in a a few thousands of ranges) if a number is contained, chances are it'll be in a range rather than being single :)

最后一个注意:谨防O(1),尽管它可能看起来有吸引力,你是不是在这里的渐进情况。勉强5000-10000范围是少数,因为日志(10000)是这样13.因此,不要通过得到O(1)解决方案,以如此高的常数因子,它实际上运行速度比Ø慢pessimize你的实现(日志N )解决方案:)

One last note: beware of O(1), while it may seem appealing, you are not here in an asymptotic case. Barely 5000-10000 ranges is few, as log(10000) is something like 13. So don't pessimize your implementation by getting a O(1) solution with such a high constant factor that it actually runs slower than a O(log N) solution :)