什么是找到一个子串出现的所有的最快方法?有的、个子、最快、方法

2023-09-11 04:50:19 作者:眼幕半度微凉、

这纯粹是出于好奇。我浏览通过比较各种字符串搜索算法的文章,发现他们都旨在找到一个匹配子字符串。这让我思考......如果我想找到一个子串出现的所有?

This is purely out of curiosity. I was browsing through an article comparing various string search algorithms and noticed they were all designed to find the first matching substring. This got me thinking... What if I wanted to find all occurrences of a substring?

我敢肯定,我可以创建一个使用KMP或BM的变体和倾倒每发现出现到一个数组的循环,但这似乎很难像它会是最快的。

I'm sure I could create a loop that used a variant of KMP or BM and dumped each found occurrence into an array but this hardly seems like it would be the fastest.

难道不是分而治之算法是优越?

Wouldn't a divide and conquer algorithm be superior?

比如让说你寻找一个字符串abbcacabbcabcacbccbabc序列ABC。

For instance lets say your looking for the sequence "abc" in a string "abbcacabbcabcacbccbabc".

在第一遍找到的第一个字符的所有事件和存储他们的位置。 在每个额外的传球使用来自preceding传球的位置找到下一个字符的所有匹配,降低了考生的下传与每个迭代。

考虑到与我想出了这个想法容易我假设已经有人想出了它,并在30年前后,它提高了。

Considering the ease with which I came up with this idea I assume someone already came up with it and improved upon it 30 years ago.

推荐答案

请参阅后缀阵列

应用程序

字符串的后缀阵列可   用作索引来快速定位   在每次发生一个子   该字符串。发现每发生   子串的相当于   发现每个后缀开头   子串。归功于   按字典序,这些   后缀将分组在   后缀数组,可以发现   有效地与一个二进制搜索。如果   直截了当地实施,这   二进制搜索需要O(mlogn)时,   其中m是的长度   子。为了避免重做   比较,额外的数据结构   提供有关最长信息   普通$ P $后缀pfixes(LCP材料)是   构造,使O(M + LOGN)搜索   时间。

The suffix array of a string can be used as an index to quickly locate every occurrence of a substring within the string. Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array, and can be found efficiently with a binary search. If implemented straightforwardly, this binary search takes O(mlogn) time, where m is the length of the substring. To avoid redoing comparisons, extra data structures giving information about the longest common prefixes (LCPs) of suffixes are constructed, giving O(m + logn) search time.