数据结构为大量的模式数据结构、模式

2023-09-11 22:43:03 作者:福大運好

在接受记者采访时,有人问我拿出可容纳数以百万计的模式,并实现了快速搜索通过他们找到最长匹配的一种数据结构。

In an interview, I was asked to come up with data structure that can hold millions of patterns and enables fast searching through them to find the longest matching one.

例如,模式是这样的:

1- 8876 8893 87          | true
2- 8876 889              | false
3- 8876 8                | false
4- 887                   | true

的输入是一个数字与至少2个和最多18个数字,我们需要找到从数据结构最长的匹配,并提取布尔在端

The input is a number with at least 2 and at most 18 digits and we need to find the longest matching pattern from the data structure and extract the boolean at the end.

例如, 8876 8893 9943 53 将匹配 1 返回。 8876 8397 5430 74 将匹配 3 是回来了。

For example, 8876 8893 9943 53 would match the 1 and true is returned. 8876 8397 5430 74 would match the 3 and false is returned.

我的回答是使用一棵树,并具有键值对每个级别的列表。主要是数字和值为null或等于取决于如果它是一个模式或不月底为布尔。像:

My answer was to use a tree and having a list of key value pair at each level. Key being the digits and value is either null or equal to boolean depending on if it's the end of a pattern or not. Like:

# matching 8875
# start the search by first digit
[..., (7, null), (8, null), (9, null)] 
                  ^
                 [..., (7, null), (8, null), (9, null)] 
                                   ^
                                   [..., (7, true), (8, null), ...]
# at the last step because we don't have a pattern 
# to match the digit 5, we return the `true` from (7, true)

具有挑战性的部分在于,该模式是相当多的。数以百万计人。这是什么好?如果没有,你有什么建议。

The challenging part is that, the patterns are quite a lot. Millions of them. Is this any good? If not, what is your suggestion.

推荐答案

这非常符合您所描述的问题,一个很漂亮的数据结构,即集合结构,其中许多项目都有一个共同的preFIX(和/或后缀),并在您完成基于共享preFIX搜索是一个特里 。

A very nice data structure that fits very well with the problem you described, i.e. a collection structure where many of the entries share a common prefix (and/or suffix), and where you perform searches based upon a shared prefix is a Trie.

在计算机科学,一个线索,也称为数码树有时 基数树 或 preFIX树(因为它们可以通过prefixes被搜索),是用于存​​储一组动态或关联数组,其中键是通常串的有序树数据结构。不像二叉查找树,在树没有节点存储与该节点相关联的密钥;相反,其在树中的位置定义与它相关联的键。一个节点的所有后代都有一个共同的 preFIX 关联的字符串该节点,和根与空字符串相关联。值通常不与每一节点相关联,仅与叶片和对应于感兴趣键一些内的节点。对于preFIX树的优化空间presentation,请参见 紧凑型preFIX树 。

In computer science, a trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest. For the space-optimized presentation of prefix tree, see compact prefix tree.

特别紧凑preFIX树或帕特里夏·特里似乎是非常适合您的问题。

Specifically the compact prefix tree or patricia trie seems to be well suited for your problem.

鉴于提到的类型的尝试经常用于与键相关联存储值,如果是不需要为您的问题(即要没有存储输入图案串的原始索引,并返回一个搜索所需),还有一种可能适合甚至更好密切相关的解决方案。正如评论 阿霍Corasick字符串匹配算法注意到@JimMischel 建立状结构与内部节点之间的附加链路的线索。如果要匹配的图案组是固定的,并且数据结构建造,那么对于一个搜索其运行时间是线性的输入的长度加上匹配条目的数量。

Given that the mentioned types of tries are often used to store values associated with keys, if that is not required for your problem (i.e. you are not required to store the original index of the input patterns string and return that on a search), there is a closely related solution that may fit even better. As noted by @JimMischel in the comments the Aho–Corasick string matching algorithm builds a trie like structure with additional links between the internal nodes. If the set of patterns to be matched is fixed, and the data structure built, then for a search its run time is linear in the length of the input plus the number of matched entries.

正是在这样的讨论过质疑阿霍Corasick算法

It is discussed in this SO question Aho Corasick algorithm

您可以在例如发现它的一些实现在线 C#或的Java 或的哈斯克尔。

You can find some implementations of it online in for example C# or Java or Haskell.