扫描文本,并检查是否包含文字从指定列表文本、文字、列表

2023-09-12 21:23:47 作者:Moment丶初晴

要求

在目前,我们有一个包含超过一万的关键词或句​​子(数为N)名单 输入是一个长字符串,长度为L

问:检查字符串是否包含在给定的列表中的关键字或句子

Question: Check whether the character string contains keywords or sentences in list given

这个问题可以被描述为词过滤器在维基百科的,但我没有发现任何算法,该网页上。解决这个问题最简单的方法就是迭代器的所有关键字或句子,每一次检查长文本是否包含这样子。因为我们有很多的关键词,也考虑了很长的文字,表现非常糟糕。它使用O(NL)时间

The question can be described as word filter on wikipedia, but I didn't find any algorithm on that page. The simplest way to fix this problem is iterator all the keywords or sentences and each time check whether the long text contains such substring. As we have a number of keywords, also considering the long text, the performance is very bad. It uses O(NL) time

似乎是更好的解决方案应该在O(L)来完成。任何人都可以给这个些建议?

Seems the better solution should be finished in O(L). Could anyone give some suggestion about this?

推荐答案

有几种方法对这个问题有时间复杂度为O(M + L),其中L为字符串的长度和M组合的所有模式的长度:

There are several approaches to this problem having time complexity O(M + L), where L is length of the string and M is combined length of all patterns:

阿霍Corasick字符串匹配算法。 构造一个后缀树使用的 Ukkonen算法,然后找到匹配每个模式这个后缀树。 构造一个广义后缀树的一套模式,然后找到输入字符串之间的匹配这个后缀树。 构造一个后缀阵列字符串,并用它来搜索每个模式。这种方法具有时间复杂度为O(M + L + N日志L),其中N为模式的数目。 Commentz - 瓦尔特·算法。 Aho–Corasick string matching algorithm. Construct a suffix tree for the string using Ukkonen's algorithm, then find the match for each pattern to this suffix tree. Construct a generalized suffix tree for set of patterns, then find the match between input string and this suffix tree. Construct a Suffix array for the string, and use it to search each pattern. This approach has time complexity O(M + L + N log L), where N is the number of patterns. Commentz-Walter algorithm.

您可以找到详细信息所有这些算法(除了Commentz - 瓦尔特·算法)在这本书:算法对字符串,树木和序列丹Gusfield 。

You can find details for all these algorithms (except Commentz-Walter algorithm) in this book: Algorithms on Strings, Trees and Sequences by Dan Gusfield.

有几种不同的(简单的)方法可以使用​​,如果你能明确地提取单独的单词/句子从输入字符串。

Several different (simpler) approaches may be used if you can unambiguously extract separate words/sentences from input string.

prepare一个布鲁姆过滤器,大小,大到足以保证虚警概率低阳性对于N模式。添加到布隆过滤器位,按关键字/句子哈希函数来确定。然后扫描串,提取连续字/句子,并检查是否这些单词/句子可以在布隆过滤器发现。只有当一个单词/句子是present在布隆过滤器,在模式列表中搜索。该算法已经预计的时间复杂度为O(M + L),是节省空间的。 将所有的模式成为哈希集合。然后扫描串,提取连续字/句子,并检查是否任何人在哈希集合。这具有相同的预期时间复杂度为O(M + L),比所述布隆过滤器的方法更简单,但不是空间高效。 将所有的模式为基数树(线索)。然后用它来从字符串搜索词/句子。这不是从广义后缀树方法有很大不同,但是更简单的,并具有更好的性能。它具有最坏情况下的时间复杂度为O(M + L),可能比布隆过滤器或散集方法略慢,并且可以被制成非常空间高效如果必要的。 Prepare a Bloom filter with size, large enough to guarantee low probability of false positives for N patterns. Add to Bloom filter bits, determined by hash functions of keywords/sentences. Then scan the string, extracting consecutive words/sentences, and check if these words/sentences can be found in the Bloom filter. Only if a word/sentence is present in the Bloom filter, search it in the list of patterns. This algorithm has expected time complexity O(M + L) and is space-efficient. Insert all patterns into a hash set. Then scan the string, extracting consecutive words/sentences, and check if any of them is in the hash set. This has the same expected time complexity O(M + L), is simpler than the Bloom filter approach, but is not space-efficient. Insert all patterns into a radix tree (trie). Then use it to search words/sentences from the string. This is not very different from the generalized suffix tree approach, but is simpler and has better performance. It has worst-case time complexity O(M + L), is probably somewhat slower than Bloom filter or hash set approaches, and may be made very space-efficient if necessary.