找到最长的重复子在一组字符串字符串、最长

2023-09-11 06:12:18 作者:Nuisance(讨厌鬼)

我试图找到一种方法来查找一组字符串的最大重复子。该最长重复子串问题通常适用于单个字符串,而不是一组字符串。什么类型的算法将是寻找最大的重复子在一组字符串有用吗?

查找最大的重复字符串中的一组文件(以删除重复的code在大型软件库)是主要的用例,我有一点,但仍然会有许多其他的用例这种算法为好。

例如,我想找到这组串的最长重复子:

 世界,你好,这是第一个字符串。
你好走向世界,这是第二个字符串。
世界,你好,这是第三个字符串。
这是第三个字符串。
 

在这种情况下,这是第三个字符串。将是最长的重复的字符串(例如,出现在一个以上的这些字符串的最长的字符串)

解决方案 创建一个特里数据结构(又名一个preFIX树)每串 让我们把它叫做 T(I)字符串 创建与关键的一个空哈希映射字符串和值 INT 让我们把它叫做 M 对于每个特里 T(I),对每个节点 P (其中是preFIX字符串)在 T(I), 如果键 P 已经在 M ,然后增加 M [P] 一样,插入 M [P] = 1 找到的(P * C *)对在 M 这样: C *> = 2 (*) 长度(P *)是最大所有这些对中 P * 是您要的字符串 怎样用C 找到字符串中的最长回文子串

(*)如果你想获得共同的 K 的字符串,你将取代 2 与 K

I'm trying to find a way to find the largest duplicate substring in a group of strings. The longest duplicate substring problem usually applies to a single string, instead of a group of strings. What type of algorithm would be useful for finding the largest duplicate substring in a group of strings?

Finding the largest duplicate string in a group of files (in order to remove duplicate code in large software libraries) is the main use case that I have in mind, but there would be many other use cases for this algorithm as well.

For example, I'd want to find the longest duplicate substring in this group of strings:

"Hello world, this is the first string."
"Hello to the world, this is the second string."
"Hello world.  This is the third string."
"This is the third string."

In this case, "This is the third string." would be the longest repeated string (i. e., the longest string that appears in more than one of these strings).

解决方案

Create a Trie data structure (a.k.a. a prefix tree) for each string Let's call it T(i) for string i Create an empty hash map with key string and value int Let's call it M For each Trie T(i), for each node P (where P is a prefix string) in T(i), if key P is already in M, then increment M[P] else, insert M[P] = 1 Find the (P*,C*) pair in M such that: C* >= 2 (*) length(P*) is maximum among all such pairs P* is the string that you want

(*) If you wanted to get the longest substring common to K of the strings, you would replace the 2 with K