How can I efficiently check whether a string contains any member of a larger set of strings?

2023-09-11 06:34:23 Author: 颓废人物

Say I have a set of strings:

Set<String> things = new HashSet<String>();
things.add("coffee cup");
things.add("smartphone");
things.add("inkjet printer");
//   :
// list could be quite large (100K or so, perhaps loaded from a database)
//   :

Now I want to check whether another string completely contains any of the strings in the above set. So:

"a coffee cup" - matches
"android smartphone" - matches
"inkjet printer for sale" - matches
"laser printer" - does not match
"printer" - does not match

The only way I can think of is iterating through the set (and breaking out as soon as a match is found). Is there a more efficient and elegant way to do this?
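For reference, the linear scan described above might look like the following sketch. The class and method names (LinearScanSketch, containsAny) are made up for illustration. Every query visits each entry of the set until one matches, so the cost per query grows with the size of the set.

```java
import java.util.Set;

// Hypothetical baseline: test each set entry against the text in turn.
final class LinearScanSketch {
    static boolean containsAny(String text, Set<String> things) {
        for (String thing : things) {
            if (text.contains(thing)) {
                return true; // stop at the first entry found inside the text
            }
        }
        return false;
    }
}
```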

Recommended answer

You need the Aho-Corasick algorithm: http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm

https://github.com/raymanrt/aho-corasick

Time complexity is O(m) for preprocessing (where m is the total length of the strings in the set) and O(n) for matching (where n is the length of the string being searched). So it's asymptotically optimal.
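To make the idea concrete, here is a minimal sketch of an Aho-Corasick matcher. It is not the API of the linked library: the class name AhoCorasickSketch and the method containsAny are invented for illustration. It assumes matching on raw Java char values and uses a HashMap per trie node, so the stated bounds hold only in the expected, amortized sense.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: answers "does the text contain any pattern?"
final class AhoCorasickSketch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;        // node for the longest proper suffix that is also a trie prefix
        boolean terminal; // a pattern ends here or somewhere along the fail chain
    }

    private final Node root = new Node();

    AhoCorasickSketch(Collection<String> patterns) {
        // Step 1: insert every pattern into a trie.
        for (String p : patterns) {
            if (p.isEmpty()) continue; // assumption: empty patterns are ignored
            Node node = root;
            for (int i = 0; i < p.length(); i++) {
                node = node.next.computeIfAbsent(p.charAt(i), k -> new Node());
            }
            node.terminal = true;
        }
        // Step 2: breadth-first pass to compute failure links.
        Queue<Node> queue = new ArrayDeque<>();
        root.fail = root;
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(c)) f = f.fail;
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                // Inherit matches that end at a shorter suffix.
                child.terminal |= child.fail.terminal;
                queue.add(child);
            }
        }
    }

    // Single left-to-right pass over the text; returns at the first match.
    boolean containsAny(String text) {
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != root && !node.next.containsKey(c)) node = node.fail;
            Node step = node.next.get(c);
            node = (step != null) ? step : root;
            if (node.terminal) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        AhoCorasickSketch ac = new AhoCorasickSketch(
                List.of("coffee cup", "smartphone", "inkjet printer"));
        System.out.println(ac.containsAny("a coffee cup"));  // true
        System.out.println(ac.containsAny("laser printer")); // false
    }
}
```

The automaton is built once from the set and reused for every query. With around 100K entries loaded from a database, the build cost is paid once and is quickly recovered over repeated lookups.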