如何让在子长重复的字符串的后缀树子长、后缀、字符串

2023-09-12 21:19:21 作者:命不断心不变°

我需要找到子最长的重复字符串。比方说,我有字符串bannana

维基百科 以下称:

  

在计算机科学中,最长的重复子问题是   发现出现在一个字符串的最长子串的问题   至少两次。在字符串ATCGATCGA $,最长的身影   重复子是ATCGA

所以,我认为,对于字符串bannana有两个同样长的子串(如不正确我好吗):NA

维基百科也说,为了这个目的,后缀树使用。为了更具体here报价是如何做到这一点(这在我看来比定义更understable在维基百科):

  

建立一个后缀树,然后找到最高点至少有2   后代。

我发现了后缀树几种实现。继code取自这里:

 使用严格的;
使用警告;
使用数据::自卸车;

子分类{
    我($ F,$ H)=(移位,{});
    用于(@_){推@ {$ H-> {$ F->($ _)}},$ _}
    返回$ H;
}
子后缀{
    我的$海峡=移位;
    图{SUBSTR $海峡,$ _} 0 ..长度($ STR) -  1;
}
子suffix_tree {
    返回+ {}如果@_ == 0;
    返回+ {$ _ [0] => + {}}如果@_ == 1;
    我的$ H = {};
    我的$ classif =分类子{SUBSTR转变,0,1},@_;
    我的$键(排序键%$ classif){
        我的$子树= suffix_tree(
            grep的$ _,图{SUBSTR $ _,1} @ {$ classif-> {$键}}
        );
        我@subkeys =键%$子树;
        如果(@subkeys == 1){
            我的$子项=移@subkeys;
            $ H-> {$键$子项} = $ subtree-> {$}子键;
        }其他{$ H-> {$键} = $子树}
    }
    返回$ H;
}

打印+自卸车suffix_tree后缀bannana $';
 

字符串bannana返回下面的树:

  $ VAR1 = {
          $=> {},
          'N'=> {
                   'A'=> {
                            呐$'=> {},
                            $=> {}
                          },
                   娜娜$'=> {}
                 },
          'A'=> {
                   $=> {},
                   'N'=> {
                            一个$'=> {},
                            娜娜$'=> {}
                          }
                 },
          bannana $'=> {}
        };
 
C C 简单界面求字符串的前缀 后缀 不与前后缀重复的子串

另一种实现是在网上从这里,对于字符串bannana 返回下面的树:

  7:一
 5:ANA
 2:annana
 1:bannana
 6:NA
 4:娜娜
 3:nnana

     |(1:bannana)|叶
树:|
     | |(4:娜娜)|叶
     |(2:一)|
     | |(7:1)|叶
     |
     | |(4:娜娜)|叶
     |(3:N)|
     | |(5:ANA)|叶
3分支节点
 

问题:

如何从这些图表获得NA字符串? 正如你所看到的树是不同的,他们是等价与否,如果是的话,为什么他们是不同的,如果不是它的算法是正确的? 如果perl的实现是错的是没有任何工作实施的Perl / Python的? 在我读过有关这也被提及与第二个示例页面上Ukkonen的算法,是否有使用这种算法提到的例子(如网络版采用这种算法还是没有我没赶上)?如果没有,则使用的算法较慢或有任何缺点相比,Ukkonen? 解决方案

1。我怎样才能从这些图表得到一个和娜琴弦?的

  

建立一个后缀树,然后找到最高点至少有2后裔。

字符串节点是连接字符串为每个节点从根到该节点。 最高的节点为节点,最大长度字符串节点

见树,我对第二个问题的答案。 (3:N)有2个子孙和路径节点是(2:1) - >(3:N),连击是。也为。(5:1) GET

2。正如你所看到的树是不同的,他们是等价与否,如果是的话,为什么他们是不同的,如果不是它的算法是正确的?的

这些树是不同的。重建第二棵树的​​字符串bannana $( 如在第一树):

  8:$
 7:A $
 5:ANA $
 2:annana $
 1:bannana $
 6:NA $
 4:娜娜$
 3:nnana $

     |(1:bannana $)|叶
树:|
     | | |(4:娜娜$)|叶
     | |(3:N)|
     | | |(7:A $)|叶
     |(2:1)|
     | |(8:$)|叶
     |
     | |(4:娜娜$)|叶
     |(3:N)|
     | | |(6:NA $)|叶
     | |(5:1)|
     | | |(8:$)|叶
     |
     |(8:$)|叶
5分支节点
 

3。如果perl的实现是错的是没有任何工作实施的Perl / Python的?的

我不知道Perl的,但树是否正确建立。

4。我读过有关这也被提及与第二个示例页面上Ukkonen的算法,不使用任何提到的例子,该算法(如网络版采用这种算法还是没有我没赶上)?如果没有,则使用的算法较慢或有任何缺点相比,Ukkonen?的

我前面说的,我不知道Perl的,但它在第一algorthim线意味着它可以至少为O(n ^ 2) N 是长度的字符串):

 地图{SUBSTR $海峡,$ _} 0 ..长度($ STR) -  1;
 

Ukkonen的算法工作的线性时间 O(N)

首先算法还递归这可能会影响使用的内存。

I need to find the longest repeating string in substring. Let's say I have string "bannana"

Wikipedia says following:

In computer science, the longest repeated substring problem is the problem of finding the longest substring of a string that occurs at least twice. In the figure with the string "ATCGATCGA$", the longest repeated substring is "ATCGA"

So I assume that for string "bannana" there are two equally long substrings (if not correct me please): "an" and "na".

Wikipedia also says that for this purpose suffix trees are used. To be more specific here is quotation how to do it (this seems to me more understable than definition on wikipedia):

build a Suffix tree, then find the highest node with at least 2 descendants.

I've found several implementations of suffix trees. Following code is taken from here:

use strict;
use warnings;
use Data::Dumper;

sub classify {
    my ($f, $h) = (shift, {});
    for (@_) { push @{$h->{$f->($_)}}, $_ }
    return $h;
}
sub suffixes {
    my $str = shift;
    map { substr $str, $_ } 0 .. length($str) - 1;
}
sub suffix_tree {
    return +{} if @_ == 0;
    return +{ $_[0] => +{} } if @_ == 1;
    my $h = {};
    my $classif = classify sub { substr shift, 0, 1 }, @_;
    for my $key (sort keys %$classif) {
        my $subtree = suffix_tree(
            grep "$_", map { substr $_, 1 } @{$classif->{$key}}
        );
        my @subkeys = keys %$subtree;
        if (@subkeys == 1) {
            my $subkey = shift @subkeys;
            $h->{"$key$subkey"} = $subtree->{$subkey};
        } else { $h->{$key} = $subtree }
    }
    return $h;
}

print +Dumper suffix_tree suffixes 'bannana$';

for string "bannana" it returns following tree:

$VAR1 = {
          '$' => {},
          'n' => {
                   'a' => {
                            'na$' => {},
                            '$' => {}
                          },
                   'nana$' => {}
                 },
          'a' => {
                   '$' => {},
                   'n' => {
                            'a$' => {},
                            'nana$' => {}
                          }
                 },
          'bannana$' => {}
        };

Another implementation is online from here, for string "bannana" it returns following tree:

 7: a
 5: ana
 2: annana
 1: bannana
 6: na
 4: nana
 3: nnana

     |(1:bannana)|leaf
tree:|
     |      |(4:nana)|leaf
     |(2:an)|
     |      |(7:a)|leaf
     |
     |     |(4:nana)|leaf
     |(3:n)|
     |     |(5:ana)|leaf
3 branching nodes

Questions:

How can I get from those graphs "an" and "na" strings? As you can see trees are different, are they equivalent or not, if yes why they are different, if not which algorithm is correct? If perl implementation is wrong is there any working implementation for perl/python? I've read about Ukkonen's algorithm which is also mentioned on page with 2nd example (I did not catch if the online version is using this algorithm or not), does any of the mentioned examples using this algorithm? If not, is used algorithm slower or has any drawbacks compared to Ukkonen?

解决方案

1. How can I get from those graphs "an" and "na" strings?

build a Suffix tree, then find the highest node with at least 2 descendants.

string-node is concatenate strings for every node from root to this node. highest node is node with maximum length string-node.

See tree in my answer for second question. (3:n) have 2 descendants and path to node is (2:a)->(3:n), concatenate is an. And also for (5:a) get na.

2. As you can see trees are different, are they equivalent or not, if yes why they are different, if not which algorithm is correct?

These trees are different. Rebuild second tree for string "bannana$" ( as in the first tree):

 8: $
 7: a$
 5: ana$
 2: annana$
 1: bannana$
 6: na$
 4: nana$
 3: nnana$

     |(1:bannana$)|leaf
tree:|
     |     |     |(4:nana$)|leaf
     |     |(3:n)|
     |     |     |(7:a$)|leaf
     |(2:a)|
     |     |(8:$)|leaf
     |
     |     |(4:nana$)|leaf
     |(3:n)|
     |     |     |(6:na$)|leaf
     |     |(5:a)|
     |     |     |(8:$)|leaf
     |
     |(8:$)|leaf
5 branching nodes

3. If perl implementation is wrong is there any working implementation for perl/python?

I don't know Perl, but the tree is built correctly.

4. I've read about Ukkonen's algorithm which is also mentioned on page with 2nd example (I did not catch if the online version is using this algorithm or not), does any of the mentioned examples using this algorithm? If not, is used algorithm slower or has any drawbacks compared to Ukkonen?

I said earlier that I don't know Perl, but it's a line in first algorthim means that it works at least O(n^2) (n it is length string):

map { substr $str, $_ } 0 .. length($str) - 1;

Ukkonen's algorithm works linear time O(n).

First algorithm also recursive which may affect used memory.