我想一个高效的算法(或库),我可以使用在Java中搜索字符串中的子串。
I'd like an efficient algorithm (or library) that I can use in Java to search for substrings in a string.
我想这样做的是:
由于输入的字符串 - INSTR
Given an input string - INSTR:
BCDEFGH
和一组候选串 - CAND
And a set of candidate strings - CAND:
AB,CDE,FG,H,IJ
"AB", "CDE", "FG", "H", "IJ"
新搜索 CAND 字符串匹配的内子的 INSTR 的
Find any CAND strings that match as substrings within INSTR
在这个例子中,我将匹配CDE,FG和H(而不是AB和IJ)
In this example I would match "CDE", "FG", and "H" (but not "AB" and "IJ")
有可能是几千候选字符串(以CAND),但更重要的是我会做这个搜索数百万次,所以我需要它要快。
There could be many thousand candidate strings (in CAND), but more importantly I will be doing this search many millions of times so I need it to be FAST.
我想和字符数组的工作。另外,我不是intested建筑解决方案,如分发搜索 - 仅仅是最有效的功能/算法做局部
I'd like to work with char arrays. Also, I'm not intested in architectural solutions, like distributing the search - just the most efficient function/algorithm for doing it locally.
此外,在CAND和INSTR所有的弦都将是相对较小(小于50个字符) - 即目标字符串INSTR是不是相对于候选串长
Additionally, all the strings in CAND and INSTR will all be relatively small (< 50 chars) - i.e. the target string INSTR is NOT long relative to the candidate strings.
更新我应该提到,集CAND字符串是整个INSTR的所有值不变。
Update I should have mentioned, the set of CAND strings is invariant across all values of INSTR.
更新我只需要知道,有一个比赛 - 我并不需要知道比赛是
Update I only need to know that there was a match - and i don't need to know what the match was.
最后更新 我选择尝试AhoCorsick和拉宾,卡普,由于实现简单。 因为我有可变长度模式我用修改后的拉宾,卡普的散列每个模式,其中n是最小模式长度的前n个字符,N当时我的滚动字符串搜索窗口的长度。 对于阿霍Corsick我用这
Final Update I opted to try AhoCorsick and Rabin-Karp, due to simplicity of implementation. Because I have variable length patterns I used a modified Rabin-Karp that hashes the first n characters of each pattern, where n is the length of the smallest pattern, N was then the length of my rolling substring search window. For the Aho Corsick I used this
在我的测试中我搜索了1000模式在两个文件报纸文章,跨越1000次迭代等平均... 正常化次完成为:
In my test i searched for 1000 patterns in two documents news paper articles, averaged across 1000 iterations etc... Normalised times to complete were:
AhoCorsick :1
RabinKarp :1.8
天真搜索(检查每个模式和放大器;使用string.contains):50
Naive Search (check each pattern & use string.contains): 50
*描述中提到的回答下面的交易算法的一些资源:
*Some resources describing the algos mentioned in the answers below:
http://www.seas.gwu.edu/~simhaweb/cs151/lectures/module5/module5.html
http://www.cs.princeton.edu/courses/archive/spr09/cos226/lectures/18SubstringSearch-2x2.pdf
http://www-igm.univ-mlv.fr/~lecroq/string/index.html*
您最好的选择是阿霍Corasick算法和拉宾,卡普算法。因为我没有Java开发者,我不能告诉你有任何的即装即用的功能框架或库。
Your best options are the Aho-Corasick algorithm and the Rabin-Karp algorithm. Because I am no Java developer I cannot tell if there are any out-of-the-box framework functions or libraries.
我想补充 - 如果你输入的是不是很大,你不想重复搜索了很多次,你没有很多的模式,这可能是连一个好主意,使用单个图案的算法几次。对搜索算法的维基百科的文章使很多算法,运行和preprocessing次这样你就可以判断选项。
Just to add - if your input is not that large, you don't want to repeat the search many times and you do not have many patterns, it might be even a good idea to use a single pattern algorithm several times. The Wikipedia article on search algorithms gives many algorithms with running and preprocessing times so you can judge the options.