快速算法的字符串搜索子字符串、算法、快速

2023-09-10 22:44:22 作者:娇软甜伤

我想一个高效的算法(或库),我可以使用在Java中搜索字符串中的子串。

I'd like an efficient algorithm (or library) that I can use in Java to search for substrings in a string.

我想这样做的是:

由于输入的字符串 - INSTR

Given an input string - INSTR:

BCDEFGH

和一组候选串 - CAND

And a set of candidate strings - CAND:

AB,CDE,FG,H,IJ

"AB", "CDE", "FG", "H", "IJ"

新搜索 CAND 字符串匹配的内子的 INSTR 的

Find any CAND strings that match as substrings within INSTR

在这个例子中,我将匹配CDE,FG和H(而不是AB和IJ)

In this example I would match "CDE", "FG", and "H" (but not "AB" and "IJ")

有可能是几千候选字符串(以CAND),但更重要的是我会做这个搜索数百万次,所以我需要它要快。

There could be many thousand candidate strings (in CAND), but more importantly I will be doing this search many millions of times so I need it to be FAST.

我想和字符数组的工作。另外,我不是intested建筑解决方案,如分发搜索 - 仅仅是最有效的功能/算法做局部

I'd like to work with char arrays. Also, I'm not intested in architectural solutions, like distributing the search - just the most efficient function/algorithm for doing it locally.

此外,在CAND和INSTR所有的弦都将是相对较小(小于50个字符) - 即目标字符串INSTR是不是相对于候选串长

Additionally, all the strings in CAND and INSTR will all be relatively small (< 50 chars) - i.e. the target string INSTR is NOT long relative to the candidate strings.

更新我应该提到,集CAND字符串是整个INSTR的所有值不变。

Update I should have mentioned, the set of CAND strings is invariant across all values of INSTR.

更新我只需要知道,有一个比赛 - 我并不需要知道比赛是

Update I only need to know that there was a match - and i don't need to know what the match was.

最后更新 我选择尝试AhoCorsick和拉宾,卡普,由于实现简单。 因为我有可变长度模式我用修改后的拉宾,卡普的散列每个模式,其中n是最小模式长度的前n个字符,N当时我的滚动字符串搜索窗口的长度。 对于阿霍Corsick我用这

Final Update I opted to try AhoCorsick and Rabin-Karp, due to simplicity of implementation. Because I have variable length patterns I used a modified Rabin-Karp that hashes the first n characters of each pattern, where n is the length of the smallest pattern, N was then the length of my rolling substring search window. For the Aho Corsick I used this

在我的测试中我搜索了1000模式在两个文件报纸文章,跨越1000次迭代等平均... 正常化次完成为:

In my test i searched for 1000 patterns in two documents news paper articles, averaged across 1000 iterations etc... Normalised times to complete were:

AhoCorsick :1

RabinKarp :1.8

天真搜索(检查每个模式和放大器;使用string.contains):50

Naive Search (check each pattern & use string.contains): 50

*描述中提到的回答下面的交易算法的一些资源:

*Some resources describing the algos mentioned in the answers below:

http://www.seas.gwu.edu/~simhaweb/cs151/lectures/module5/module5.html

http://www.cs.princeton.edu/courses/archive/spr09/cos226/lectures/18SubstringSearch-2x2.pdf

http://www-igm.univ-mlv.fr/~lecroq/string/index.html*

推荐答案

您最好的选择是阿霍Corasick算法和拉宾,卡普算法。因为我没有Java开发者,我不能告诉你有任何的即装即用的功能框架或库。

Your best options are the Aho-Corasick algorithm and the Rabin-Karp algorithm. Because I am no Java developer I cannot tell if there are any out-of-the-box framework functions or libraries.

我想补充 ​​- 如果你输入的是不是很大,你不想重复搜索了很多次,你没有很多的模式,这可能是连一个好主意,使用单个图案的算法几次。对搜索算法的维基百科的文章使很多算法,运行和preprocessing次这样你就可以判断选项​​。

Just to add - if your input is not that large, you don't want to repeat the search many times and you do not have many patterns, it might be even a good idea to use a single pattern algorithm several times. The Wikipedia article on search algorithms gives many algorithms with running and preprocessing times so you can judge the options.