需要算法的集合和子集的快速存储和检索(搜索)子集、算法、快速

2023-09-11 22:56:02 作者:吣無杂念

我需要存储套任意大小的后面快速查询的方法。 我将需要查询的结果数据结构已存储的子集或设置。

I need a way of storing sets of arbitrary size for fast query later on. I'll be needing to query the resulting data structure for subsets or sets that are already stored.

=== 后来编辑:为了澄清,一个公认的答案,这个问题将是一个链接到一个研究,提出了一种解决这个问题。我不期待为人们开发自己的算法。 我一直在寻找在元组聚类算法发现此处,但它不正是我想要的,因为从我的理解是集群中的元组成更简单的,离散/ aproximate形式和失去原有的元组。

=== Later edit: To clarify, an accepted answer to this question would be a link to a study that proposes a solution to this problem. I'm not expecting for people to develop the algorithm themselves. I've been looking over the tuple clustering algorithm found here, but it's not exactly what I want since from what I understand it 'clusters' the tuples into more simple, discrete/aproximate forms and loses the original tuples.

现在,一个更简单的例子:

Now, an even simpler example:

[α,β,γ,δ] [α,ε,δ[伽玛,牛,ω-ω-,β]

查询:

[α,δ]

结果:

[α,β,伽马,三角洲] [α,ε,增量]

因此​​,集合中的元素是正义的,独特的,不相关的元素。忘掉类型和值。这些元素之间能够测试的平等,就是这样。我在寻找一个既定的算法(它可能有一个名称,就可以了科学论文)不仅仅是创造1现在,当场。

So the set elements are just that, unique, unrelated elements. Forget about types and values. The elements can be tested among them for equality and that's it. I'm looking for an established algorithm (which probably has a name and a scientific paper on it) more than just creating one now, on the spot.

== 原始的例子:

例如,假设数据库中包含这些集合

For example, say the database contains these sets

[A1, B1, C1, D1], [A2, B2, C1], [A3, D3], [A1, D3, C1] 

如果我用 [A1,C1] 作为查询条件,这两组应返回的结果:

If I use [A1, C1] as a query, these two sets should be returned as a result:

[A1, B1, C1, D1], [A1, D3, C1]

例2:

数据库:

[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
[number of car seats: 2, Gasoline amount: 2L]

查询:

[Distance to berlin: 240km]

结果

[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]

可以有无限多的领域,如汽油量。的溶液可能会涉及到数据库的分组和连接具有共同的状态集(例如汽油量:240 )。以这样的方式,该查询是尽可能高效

There can be an unlimited number of 'fields' such as Gasoline amount. A solution would probably involve the database grouping and linking sets having common states (such as Gasoline amount: 240) in such a way that the query is as efficient as possible.

什么算法是否有这样的需求呢?

What algorithms are there for such needs?

我希望有已经是一个既定的解决这个问题,而不是只是试图找到我自己就在现场,这可能无法做到高效为一体的测试和改进后,被其他人随着时间的推移。

I am hoping there is already an established solution to this problem instead of just trying to find my own on the spot, which might not be as efficient as one tested and improved upon by other people over time.

澄清:

如果它有助于回答这个问题,我打算使用它们用于存储状态: 简单的例子: [有奶,无蛋,有糖] 我想这样的要求可能需要的图形或多维数组,但我不知道 If it helps answer the question, I'm intending on using them for storing states: Simple example: [Has milk, Doesn't have eggs, Has Sugar] I'm thinking such a requirement might require graphs or multidimensional arrays, but I'm not sure

结论 我实现了答案提出的两种算法,即设置 - 特里和倒排索引的,并做了一些初步的分析它们。下面示出为每个算法在给定的查询的持续时间。这两种算法的工作组成的成套相同的整数随机生成的数据集。该算法看起来相当的(或几乎)在性能方面:

Conclusion I've implemented the two algorithms proposed in the answers, that is Set-Trie and Inverted Index and did some rudimentary profiling on them. Illustrated below is the duration of a query for a given set for each algorithm. Both algorithms worked on the same randomly generated data set consisting of sets of integers. The algorithms seem equivalent (or almost) performance wise:

推荐答案

我相信,我现在可以向解决方案。一个可能比较有效的方法是:

I'm confident that I can now contribute to the solution. One possible quite efficient way is a:

特里通过的 Frankling马亮

这种特殊的树用于例如拼写检查或自动完成与实际接近你想要的行为,特别是允许搜索的子集非常方便。

Such a special tree is used for example in spell checking or autocompletion and that actually comes close to your desired behavior, especially allowing to search for subsets quite conveniently.

在你的情况不同的是,你不感兴趣,你的属性/功能的顺序。对于你的情况下,集 - 特里发明了Iztok萨维尼克。

The difference in your case is that you're not interested in the order of your attributes/features. For your case a Set-Trie was invented by Iztok Savnik.

什么是集树?一棵树,其中除了根的每个节点包含单个属性值(数字)和一个标记(布尔)如果在这个节点有一个数据输入。每个子树仅包含属性,其值比父节点的属性值大。设置树的根是空的。检索关键字是从根的路径树的某些节点。搜索结果是一组从根到包含标记所有节点路径,当你去了树,同时向上搜索键(见下文),您到达。

What is a Set-Tree? A tree where each node except the root contains a single attribute value (number) and a marker (bool) if at this node there is a data entry. Each subtree contains only attributes whose values are larger than the attribute value of the parent node. The root of the Set-Tree is empty. The search key is the path from the root to a certain node of the tree. The search result is the set of paths from the root to all nodes containing a marker that you reach when you go down the tree and up the search key simultaneously (see below).

但首先绘制由我:

的属性是{1,2,3,4,5}这才能真正是什么,但我们只是列举出来,所以自然获得订单。的数据为{{1,2,4},{1,3},{1,4},{2,3,5},{2,4}}这在图像是从根的一组路径任何圈子。圆圈是标记在图像中的数据。

The attributes are {1,2,3,4,5} which can be anything really but we just enumerate them and therefore naturally obtain an order. The data is {{1,2,4}, {1,3}, {1,4}, {2,3,5}, {2,4}} which in the picture is the set of paths from the root to any circle. The circles are the markers for the data in the picture.

请注意,从根的右子树不包含属性1可言。这就是线索。

Please note that the right subtree from root does not contain attribute 1 at all. That's the clue.

搜索包括子集说你要搜索你命令他们的属性4和1.首先,搜索键是{1,4}。现在,挑动从根你同时去了搜索键,下树。这意味着你通过所有子节点,其属性为小于或等于1,在里面你采取的下一个属性的关键(4),并参观只有一个,即1.取的关键(1)第一属性去所有子节点,其属性值小于4,即都是。您可以继续,直到有没有什么可以做,收集具有属性值正好是4(或关键中的最后一个属性)各界(数据项)。这些都是{1,2,4}和{1,4},但没有{1,3}(4号)和{2,4}(1号)。

Searching including subsets Say you want to search for attributes 4 and 1. First you order them, the search key is {1,4}. Now startin from root you go simultaneously up the search key and down the tree. This means you take the first attribute in the key (1) and go through all child nodes whose attribute is smaller or equal to 1. There is only one, namely 1. Inside you take the next attribute in the key (4) and visit all child nodes whose attribute value is smaller than 4, that are all. You continue until there is nothing left to do and collect all circles (data entries) that have the attribute value exactly 4 (or the last attribute in the key). These are {1,2,4} and {1,4} but not {1,3} (no 4) or {2,4} (no 1).

插入是很容易的。下去的树,并存储在适当的位置上的数据输入。例如数据输入{2.5}将被存储为子{2}

Insertion Is very easy. Go down the tree and store a data entry at the appropriate position. For example data entry {2.5} would be stored as child of {2}.

添加属性,动态自然是准备好了,你可以立即插入{1,4,6}。它会来,当然低于{1,4}。

Add attributes dynamically Is naturally ready, you could immediately insert {1,4,6}. It would come below {1,4} of course.

我希望你明白我想说一下设置发。在论文Iztok萨维尼克它在更详细的解释。他们可能是非常高效的。

I hope you understand what I want to say about Set-Tries. In the paper by Iztok Savnik it's explained in much more detail. They probably are very efficient.

我不知道,如果你仍然想将数据存储在数据库中。我认为这将进一步复杂化的东西,我不知道什么是最好做的话。

I don't know if you still want to store the data in a database. I think this would complicate things further and I don't know what is the best to do then.