您所在的位置：首页 > 最新热文 > 计算机探索

快速算法的字符串搜索子字符串、算法、快速

2023-09-10 22:44:22 作者：娇软甜伤

我想一个高效的算法（或库），我可以使用在Java中搜索字符串中的子串。

I'd like an efficient algorithm (or library) that I can use in Java to search for substrings in a string.

我想这样做的是：

由于输入的字符串 - INSTR

Given an input string - INSTR:

BCDEFGH

和一组候选串 - CAND

And a set of candidate strings - CAND:

AB，CDE，FG，H，IJ

"AB", "CDE", "FG", "H", "IJ"

新搜索 CAND 字符串匹配的内子的 INSTR 的

Find any CAND strings that match as substrings within INSTR

在这个例子中，我将匹配CDE，FG和H（而不是AB和IJ）

In this example I would match "CDE", "FG", and "H" (but not "AB" and "IJ")

有可能是几千候选字符串（以CAND），但更重要的是我会做这个搜索数百万次，所以我需要它要快。

There could be many thousand candidate strings (in CAND), but more importantly I will be doing this search many millions of times so I need it to be FAST.

我想和字符数组的工作。另外，我不是intested建筑解决方案，如分发搜索 - 仅仅是最有效的功能/算法做局部

I'd like to work with char arrays. Also, I'm not intested in architectural solutions, like distributing the search - just the most efficient function/algorithm for doing it locally.

此外，在CAND和INSTR所有的弦都将是相对较小（小于50个字符） - 即目标字符串INSTR是不是相对于候选串长

Additionally, all the strings in CAND and INSTR will all be relatively small (< 50 chars) - i.e. the target string INSTR is NOT long relative to the candidate strings.

更新我应该提到，集CAND字符串是整个INSTR的所有值不变。

Update I should have mentioned, the set of CAND strings is invariant across all values of INSTR.

更新我只需要知道，有一个比赛 - 我并不需要知道比赛是

Update I only need to know that there was a match - and i don't need to know what the match was.

最后更新我选择尝试AhoCorsick和拉宾，卡普，由于实现简单。因为我有可变长度模式我用修改后的拉宾，卡普的散列每个模式，其中n是最小模式长度的前n个字符，N当时我的滚动字符串搜索窗口的长度。对于阿霍Corsick我用这

Final Update I opted to try AhoCorsick and Rabin-Karp, due to simplicity of implementation. Because I have variable length patterns I used a modified Rabin-Karp that hashes the first n characters of each pattern, where n is the length of the smallest pattern, N was then the length of my rolling substring search window. For the Aho Corsick I used this

在我的测试中我搜索了1000模式在两个文件报纸文章，跨越1000次迭代等平均... 正常化次完成为：

In my test i searched for 1000 patterns in two documents news paper articles, averaged across 1000 iterations etc... Normalised times to complete were:

AhoCorsick ：1

RabinKarp ：1.8

天真搜索（检查每个模式和放大器;使用string.contains）：50

Naive Search (check each pattern & use string.contains): 50

*描述中提到的回答下面的交易算法的一些资源：

*Some resources describing the algos mentioned in the answers below:

http://www.seas.gwu.edu/~simhaweb/cs151/lectures/module5/module5.html

http://www.cs.princeton.edu/courses/archive/spr09/cos226/lectures/18SubstringSearch-2x2.pdf

http://www-igm.univ-mlv.fr/~lecroq/string/index.html*