Best-of-breed indexing data structures for extremely large time-series

2023-09-11 02:52:30, by 浪货界扛把子

I'd like to ask fellow SO'ers for their opinions regarding best of breed data structures to be used for indexing time-series (aka column-wise data, aka flat linear).

Two basic types of time-series exist based on the sampling/discretisation characteristic:

Regular discretisation (Every sample is taken with a common frequency)

Irregular discretisation (Samples are taken at arbitrary time-points)

Queries:

All values in the time range [t0,t1]

All values in the time range [t0,t1] that are greater/less than v0

All values in the time range [t0,t1] that are in the value range [v0,v1] (see the sketch after this list)
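
For concreteness, here is a minimal sketch of the three queries against a single in-memory series kept sorted by timestamp; the Sample type and function names are illustrative, not from any particular library. Query 1 is a binary-searchable interval in O(log n + k); queries 2 and 3 just add a pointwise value filter.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative sample type: one timestamped value (regular or irregular series).
struct Sample {
    int64_t t;  // timestamp
    double  v;  // value
};

// Query 1: all samples with t in [t0, t1], assuming `series` is sorted by t.
std::vector<Sample> query_time_range(const std::vector<Sample>& series,
                                     int64_t t0, int64_t t1) {
    auto lo = std::lower_bound(series.begin(), series.end(), t0,
                               [](const Sample& s, int64_t t) { return s.t < t; });
    auto hi = std::upper_bound(series.begin(), series.end(), t1,
                               [](int64_t t, const Sample& s) { return t < s.t; });
    return {lo, hi};
}

// Queries 2 and 3: same time window, plus a pointwise value filter.
std::vector<Sample> query_time_value_range(const std::vector<Sample>& series,
                                           int64_t t0, int64_t t1,
                                           double v0, double v1) {
    std::vector<Sample> out;
    for (const Sample& s : query_time_range(series, t0, t1))
        if (s.v >= v0 && s.v <= v1)  // for query 2, use s.v > v0 (or s.v < v0) instead
            out.push_back(s);
    return out;
}
```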

The data sets consist of summarized time-series (which sort of gets over the Irregular discretisation), and multivariate time-series. The data set(s) in question are about 15-20TB in size, hence processing is performed in a distributed manner - because some of the queries described above will result in datasets larger than the physical amount of memory available on any one system.

Distributed processing in this context also means dispatching the required data-specific computation along with the time-series query, so that the computation can occur as close to the data as possible - so as to reduce node-to-node communication (somewhat similar to the map/reduce paradigm). In short, proximity of computation and data is critical.
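
As a rough illustration of the "ship the computation with the query" idea (all names here are hypothetical, not from the question): each node folds over its local shard, and only small partial results travel between nodes, map/reduce style.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Small partial result: this is all that crosses the network, never raw samples.
struct Partial { double sum = 0; uint64_t count = 0; };

using ShardFold = std::function<Partial(const std::vector<double>&)>;

// Runs on each data node, next to its shard of the series (the "map" step).
Partial run_locally(const std::vector<double>& shard, const ShardFold& fold) {
    return fold(shard);
}

// Runs on the coordinator: merge the partials (the "reduce" step).
Partial combine(const std::vector<Partial>& partials) {
    Partial total;
    for (const Partial& p : partials) {
        total.sum   += p.sum;
        total.count += p.count;
    }
    return total;  // e.g. global mean = total.sum / total.count
}

// Example fold: local sum/count feeding a global mean.
ShardFold mean_fold = [](const std::vector<double>& xs) {
    Partial p;
    for (double x : xs) { p.sum += x; ++p.count; }
    return p;
};
```

In a real system the fold would be a registered kernel or a serialized query plan rather than an in-process std::function, since it has to cross process boundaries; the sketch only shows the data flow.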

Another issue that the index should be able to cope with is that the overwhelming majority of the data is static/historic (99.999...%), yet new data is added on a daily basis; think of "in the field sensors" or "market data". The idea/requirement is to be able to update any running calculations (averages, GARCHs, etc.) with as low a latency as possible; some of these running calculations require historical data, some of which will be more than what can reasonably be cached.
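
GARCH updates are model-specific, but the low-latency flavour of this requirement can be sketched with Welford's online algorithm, which folds each new sample into a running mean/variance in O(1) without revisiting history (the class name is illustrative):

```cpp
#include <cstdint>

// Welford's online algorithm: O(1) update per new sample; no historical
// data needs to be re-read for mean/variance. GARCH-style updates are
// similar in spirit but carry model state instead of simple moments.
class RunningMoments {
public:
    void update(double x) {
        ++n_;
        double delta = x - mean_;
        mean_ += delta / static_cast<double>(n_);
        m2_   += delta * (x - mean_);
    }
    double mean() const { return mean_; }
    double variance() const {  // sample variance
        return n_ > 1 ? m2_ / static_cast<double>(n_ - 1) : 0.0;
    }

private:
    uint64_t n_ = 0;
    double mean_ = 0.0;
    double m2_ = 0.0;
};
```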

I've already considered HDF5; it works well/efficiently for smaller datasets, but it starts to drag as the datasets become larger, and there are no native parallel-processing capabilities from the front-end.
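
For reference, the HDF5 access pattern in question looks roughly like the following: a single hyperslab selection pulls one contiguous time slice out of a 1-D double dataset through the C API. The file and dataset names are placeholders, and error checking is omitted for brevity.

```cpp
#include <hdf5.h>
#include <vector>

// Read samples [offset, offset + count) from a 1-D double dataset.
// "series.h5" and "/values" are made-up names.
std::vector<double> read_slice(hsize_t offset, hsize_t count) {
    hid_t file   = H5Fopen("series.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset   = H5Dopen2(file, "/values", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);

    // Select the time window in the file...
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &offset, nullptr, &count, nullptr);

    // ...and a matching window in memory.
    hid_t mspace = H5Screate_simple(1, &count, nullptr);

    std::vector<double> buf(count);
    H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf.data());

    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
    H5Fclose(file);
    return buf;
}
```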

Looking for suggestions, links, further reading, etc. (C or C++ solutions, libraries)

Recommended Answer

You would probably want to use some type of large, balanced tree. Like Tobias mentioned, B-trees would be the standard choice for solving the first problem. If you also care about getting fast insertions and updates, there is a lot of new work being done at places like MIT and CMU on these new "cache-oblivious B-trees". For some discussion of the implementation of these things, look up Tokutek DB; they've got a number of good presentations, like the following:

http://tokutek.com/downloads/mysqluc-2010-fractal-trees.pdf

Questions 2 and 3 are in general a lot harder, since they involve higher-dimensional range searching. The standard data structure for doing this would be the range tree (which gives O(log^{d-1}(n)) query time, at the cost of O(n log^d(n)) storage). You generally would not want to use a k-d tree for something like this. While it is true that k-d trees have optimal, O(n), storage costs, it is a fact that you can't evaluate range queries any faster than O(n^{(d-1)/d}) if you only use O(n) storage. For d=2, this would be O(sqrt(n)) time complexity; and frankly that isn't going to cut it if you have 10^10 data points (who wants to wait for O(10^5) disk reads to complete on a simple range query?)

Fortunately, it sounds like in your situation you really don't need to worry too much about the general case. Because all of your data comes from a time series, you only ever have at most one value per time coordinate. Hypothetically, what you could do is just use a range query to pull some interval of points, then as a post-process go through and apply the v constraints pointwise. This would be the first thing I would try (after getting a good database implementation), and if it works, then you are done! It really only makes sense to try optimizing the latter two queries if you keep running into situations where the number of points in [t0, t1] x [-infty, +infty] is orders of magnitude larger than the number of points in [t0, t1] x [v0, v1].
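
A minimal sketch of that suggestion, using std::map as an in-memory stand-in for the balanced tree / B-tree discussed above (a disk-based B-tree exposes the same ordered-traversal pattern; all names here are illustrative):

```cpp
#include <cstdint>
#include <map>
#include <vector>

// std::map is an in-memory balanced tree keyed by timestamp. Pull the
// [t0, t1] interval via the ordered index, then filter on value pointwise.
std::vector<double> range_then_filter(const std::map<int64_t, double>& index,
                                      int64_t t0, int64_t t1,
                                      double v0, double v1) {
    std::vector<double> out;
    for (auto it = index.lower_bound(t0); it != index.end() && it->first <= t1; ++it)
        if (it->second >= v0 && it->second <= v1)
            out.push_back(it->second);
    return out;
}
```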