Algorithm to find common substrings across N strings
Tags: string, algorithm, common

2023-09-11 01:53:03 Author: Stubborn.(顽固)

I'm familiar with LCS algorithms for 2 strings. I'm looking for suggestions for finding common substrings among 2..N strings. Each pair of strings may share several common substrings, and different subsets of the strings may share different common substrings.

strings: (ABCDEFGHIJKL) (DEF) (ABCDEF) (BIJKL) (FGH)

common strings:

1/2 (DEF)
1/3 (ABCDEF)
1/4 (IJKL)
1/5 (FGH)
2/3 (DEF)

longest common strings:

1/3 (ABCDEF)

most common strings:

1/2/3 (DEF)
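
For reference, the pairwise part of the example above can be reproduced by running a two-string longest-common-substring routine over every pair. The following is only a minimal sketch under that assumption: the function name `longest_common_substring` and the `pairs` dictionary are illustrative, it returns a single longest match per pair rather than every common substring, and it does not scale well as N grows.

```python
from itertools import combinations


def longest_common_substring(a, b):
    """Return one longest contiguous substring shared by a and b.

    Classic dynamic programme: curr[j] holds the length of the common
    suffix of a[:i] and b[:j]. Runs in O(len(a) * len(b)) time.
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


strings = ["ABCDEFGHIJKL", "DEF", "ABCDEF", "BIJKL", "FGH"]

# Keys are 1-based string numbers, matching the "1/2", "1/3" labels above.
pairs = {
    (i, j): longest_common_substring(strings[i - 1], strings[j - 1])
    for i, j in combinations(range(1, len(strings) + 1), 2)
}
```

With these inputs the sketch also reports single-character overlaps that the list above leaves out (for example `F` for strings 2 and 5), so a minimum-length filter would be needed to match the list exactly.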

Solution

This sort of thing is done all the time in DNA sequence analysis. You can find a variety of algorithms for it. One reasonable collection is listed here.

There's also the brute-force approach of making tables of every substring (if you're interested only in short ones): form a tree with one child per alphabet symbol at each level (26 children for letters, 256 for ASCII), and store histograms of the count at every node. If you prune off little-used nodes (to keep the memory requirements reasonable), you end up with an algorithm that finds all substrings of length up to M in something like N*M^2*log(M) time for input of length N. If the input is instead split into K separate strings, you can build the same tree structure and just read off the answer(s) in a single pass through the tree.
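
The tree described above might look roughly like the following sketch. It is one possible reading, not the answer's exact method: it assumes each node stores the set of string numbers that reach it (instead of a full histogram), it does no pruning of rarely used nodes, and the names `build_trie`, `shared_substrings` and `max_len` are made up for illustration.

```python
def build_trie(strings, max_len):
    """Insert every substring of length <= max_len into a trie.

    Each node records the 1-based numbers of the strings containing
    the substring spelled out by the path from the root to that node.
    """
    root = {"children": {}, "owners": set()}
    for idx, s in enumerate(strings, start=1):
        for start in range(len(s)):
            node = root
            for ch in s[start:start + max_len]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "owners": set()}
                )
                node["owners"].add(idx)
    return root


def shared_substrings(root, min_strings=2):
    """Walk the trie once, yielding (substring, owners) for every node
    reached by at least min_strings different strings."""
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        for ch, child in node["children"].items():
            # A child's owners are a subset of its parent's, so a branch
            # that falls below the threshold can be skipped entirely.
            if len(child["owners"]) >= min_strings:
                yield prefix + ch, frozenset(child["owners"])
                stack.append((prefix + ch, child))


strings = ["ABCDEFGHIJKL", "DEF", "ABCDEF", "BIJKL", "FGH"]
trie = build_trie(strings, max_len=6)
shared = dict(shared_substrings(trie))

longest = max(shared, key=len)   # longest substring found in 2+ strings
def_owners = shared["DEF"]       # which strings contain "DEF"
```

For the question's five strings, `longest` comes out as `ABCDEF` and `def_owners` as strings 1, 2 and 3. Single characters such as `F` are shared by even more strings, so ranking "most common" substrings in practice needs a minimum length as well as an owner count.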