寻找最长的重复子无后缀的阵列或后缀树后缀、阵列、最长

2023-09-11 03:06:58 作者:口出恶言

借助最长的重复子问题如下:

给出一个字符串的是W 的,找到的的最长子是W 的出现至少在两个位置。

Given a string w, find the longest substring of w that appears in at least two locations.

这问题可以在线性时间内通过使用增强后缀数组的后缀树和线性的时间来解决。

This problem can be solved in linear time using suffix trees and in linear time using enhanced suffix arrays.

我的问题是这样的 - 都存在这个问题不涉及后缀树木或后缀阵列的任何线性时间的算法?我很好奇,因为后缀树和后缀数组是著名的难以code和操纵,这将是很好,如果有一个算法对于这个问题,并不需要其它结构的编码或内存开销。

My question is this - are there any linear-time algorithms for this problem that do not involve suffix trees or suffix arrays? I'm curious because suffix trees and suffix arrays are famously difficult to code up and manipulate and it would be nice if there were an algorithm for this problem that didn't require the coding or memory overhead of these other structures.

谢谢!

推荐答案

挖这一段时间之后,我也找到一种替代后缀树和后缀数组。执行本身很简单,就像你想要的,但即使code为(大约)高的简洁,这是非常低的直觉。

After digging on this for a while, I did find an alternative to suffix trees and suffix arrays. The implementation itself is simple, like you wanted, but even though the code is (roughly) high on brevity, it is extremely low on intuition.

有一个文件,的的后缀二叉搜索树和后缀AVL树的,由罗伯特·欧文和罗娜的爱,来自格拉斯哥大学,即presents的替代解决方案后缀树和后缀数组。作者声称相比,后缀树的管理和建设的后缀二叉搜索树基本上是简单,表明结构可以保证时间不超过 O(N日志(N))的后缀AVL树的情况下(这也适用于标准的,非均衡的原BST在平均情况下,但它可能会降低到为O(n ^ 2)在最坏的情况下)。

There is a paper, The Suffix Binary Search Tree and Suffix AVL Tree, by Robert W. Irving and Lorna Love, from University of Glasgow, that presents an alternative solution to suffix trees and suffix arrays. The authors claim that managing and building a suffix binary search tree is substantially simpler when compared to suffix trees, and show that construction can be guaranteed to take no longer than O(n log(n)) in the case of a suffix AVL tree (this is also true for the standard, non-balanced original BST in the average case, but it can degrade to O(n^2) in the worst case).

这是当然的,比传统的后缀树更糟的是,由于时间找到在SBST最长重复子串通过建立它的时间限制。所以,严格来说,这并不能回答你的问题,但我决定发布此以供将来参考,并为有兴趣的读者。

This is, of course, worse than traditional suffix trees, since the time to find the longest repeated substring in an SBST is bounded by the time of building it. So, strictly speaking, this doesn't answer your question, but I decided to post this for future reference and for interested readers.

此外,该文件指出,结果树是一个强有力的候选人替代传统的后缀树和后缀数组。纸张被写入周围发现在对数时间的图案构建树后的问题。您可以从鏁下载。

Furthermore, the paper states that the resulting tree is a strong candidate alternative to traditional suffix trees and suffix arrays. The paper is written around the problem of finding a pattern in logarithmic time after building the tree. You can download it from CiteSeerX.

我们的想法是,每个节点都与一个整数,其中是第一偏移有关字符的后缀的节点重新presents的目标文本。对于一个给定的字符串 [A1,A2,A3,...,一] 长度 N ,将有 N 节点和节点重presents后缀 [AI,AI + 1,...,一]

The idea is that each node is associated with an integer i, where i is the offset of the first character in the target text of the suffix that that node represents. For a given string [a1,a2,a3,...,an] of length n, there will be n nodes, and node i represents the suffix [ai,ai+1,...,an].

本文不涉及确定最长重复子串的问题,但所提出的方式来建立的二叉搜索树可以很容易地解决了最长的重复子问题。我将通过这个扩展一旦我谈谈建设进程。这关键是要了解如何构建树,以了解扩展解决最长重复的字符串问题(否则code看起来像变魔术一样)。

The paper doesn't address the problem of determining the longest repeated substring, but the proposed way to build the binary search tree makes it easy to solve the longest repeated substring problem. I will go through that extension once I talk about the building process. It is crucial to understand how to build the tree in order to understand the extension to solve the longest repeated string problem (and otherwise the code will look like magic).

因此​​,我将首先解释了如何树的建成,将这样做的基础上,我从纸拿到了想法。正式的证明,以及其他详细信息(包括如何把它变成一个AVL树)都进行了讨论。你可以自由前进跳到部分,在那里我谈谈我是如何扩展算法,但请记住,这可能很难理解,如果你这样做。

Hence, I will explain first how the tree is built, and will do so based on the ideas I got from the paper. Formal proofs, along with other details (including how to turn this into an AVL tree) are discussed in the paper. You are free to jump ahead to the section where I talk about how I extended the algorithm, but keep in mind that it may be hard to understand if you do so.

本文介绍了搜索在避免了节点从头比较带有后缀的模式,我们访问每个节点上的后缀二叉搜索树图案的优化算法。这样做会,​​在最坏的情况下,需要 L * M中字符比较,其中 M 是图案的大小和是搜索路径。我们不希望比较每个节点上从头字符决定分支哪个方向;相反,它是希望的至多一次模式中的每个字符比较。这是可能的,如果我们每个存储节点的两个附加属性。构建树的过程中紧密结合本优化(和我提出的扩展名),所以它是非常重要的,了解它。

The paper describes an optimized algorithm for searching a pattern in a suffix binary search tree that avoids comparing the pattern with the suffix from a node from scratch on each node we visit. Doing so would, in the worst case, require l*m character comparisons, where m is the size of the pattern, and l is the search path. We don't want to compare characters from scratch on each node to decide on which direction to branch; instead, it is desirable to compare each character in the pattern at most once. This is possible if we store two additional attributes per node. The process of building the tree is tightly bound with this optimization (and with my proposed extension), so it is very important to understand it.

首先,一些术语:给定节点,节点Ĵ被认为是一个合适的祖先如果 j的左子树 。左祖先类似地被定义。在最接近正确的的节点Ĵ,使得没有后代的Ĵ是我;再次,类似的定义适用于最接近的左侧祖先。从现在起,我将把的使用缩写形式 CRA(我)最接近正确的祖先,和最接近左祖我缩写为 CLA(我)。我们还定义了两个节点之间的最长的共同preFIX Ĵ,我们重新present为 LCP(I,J)

First, some terminology: for a given node i, a node j is said to be a right ancestor of i if i is in the left subtree of j. The left ancestor is defined similarly. The closest right ancestor of i is the node j such that no descendant of j is a right ancestor of i; again, a similar definition applies for the closest left ancestor. From now on, I will refer to the closest right ancestor of i using the the abbreviated form cra(i), and the closest left ancestor of i is abbreviated to cla(i). We also define the longest common prefix between two nodes j and i, and we represent it as lcp(i, j).

对于一个给定的节点,我们将存储两个属性, M(1) D(I) M(1)重presents为其 LCP(I,J)是最高的,其中的值Ĵ是集的祖先。需要注意的是,由于二叉树的特性,节点Ĵ CLA(我) CRA(我) D(I)是一个属性来跟踪,其中 M(1)来自;如果Ĵ== CLA(我),然后 D(I)(即是节点的右子树Ĵ为其LCP(我, j)的最大化),否则, D(I)

For a given node i, we will store two attributes, m(i) and d(i). m(i) represents the value for which lcp(i, j) is highest, where j is the set of ancestors of i. Note that, due to the binary tree's properties, node j is either cla(i) or cra(i). d(i) is an attribute to keep track of where m(i) comes from; if j == cla(i), then d(i) is RIGHT (meaning that i is in the right subtree of the node j for which lcp(i, j) is maximized), otherwise, d(i) is LEFT.

下面是一组定理,共同打造的基本算法,在SBST执行搜索特定模式的。该定理描述当搜索的模式到达一个节点,该怎么办。请参考本文的形式化证明,我会尽量提供的,为什么这些规则是这样的一个直观的证明。总之,这些定理形成的一组规则,其允许算法由最多一次每个字符图案相比较来搜索模式,这是pretty的整齐!

What follows is a set of theorems that together build the basic algorithm to perform a search of a given pattern in an SBST. The theorems describe what to do when the search for a pattern arrives at a node i. Please refer to the paper for the formal proofs, I will try to provide an intuitive proof of why the rules are like this. Together, these theorems form a set of rules that allows the algorithm to search for a pattern by comparing each character in the pattern at most once, which is pretty neat!

当搜索到达节点,我们利用2值, LLCP RLCP LLCP 的值,使得 LCP(图案,J)是最高的,其中Ĵ是集我的。 RLCP 是一样的,但最大接管所有左侧的祖先。此外,由于做二叉树的属性, LLCP 只是 LCP(图案,CRA(I)) RLCP LCP(图案,CLA(I))

When a search arrives at node i, we make use of 2 values, llcp and rlcp. llcp is the value for which lcp(pattern, j) is highest, where j is the set of all right ancestors of i. rlcp is the same, but the maximum is taken over all left ancestors of i. Again, due do the binary tree's properties, llcp is just lcp(pattern, cra(i)), and rlcp is lcp(pattern, cla(i)).

在深入定理之前,我认为这是借鉴了一张纸的样品SBST和可视化树上的每个定理的语义含义是个好主意。

Before diving into the theorems, I think it is a good idea to draw a sample SBST on a piece of paper and visualize the semantical meaning of each theorem on the tree.

定理1是最简单和覆盖壳体米(ⅰ)> MAX(LLCP,RLCP)。如果出现这种情况, LLCP RLCP 不改,因为我们将能够匹配尽可能多的与,因为我们与它的祖先,和搜索树枝做在同一个方向 D(I)。要知道为什么,认为 D(I)的情况下==左。如果 D(I)= = LEFT ,那么就意味着 M(1)来自比赛与 CRA(我)。如果我们访问,这是因为我们已经知道的模式比低CRA(我),和 LCP(CRA(我),图案)LT; LCP(我,CRA(I)),这使得它比少的格局,因此,我们将离开。同样的过程可以应用于案件 D(I)= = RIGHT ,所以,事实上,我们只需要按照 D的方向(我)

Theorem 1 is the simplest and covers the case m(i) > max(llcp, rlcp). If that happens, llcp and rlcp don't change, because we will be able to match as much with i as we did with its ancestors, and the search branches in the same direction as d(i). To see why, consider the case that d(i) == LEFT. If d(i) == LEFT, then it means that m(i) comes from a match with cra(i). If we're visiting i, it is because we already know that the pattern is lower than cra(i), and lcp(cra(i), pattern) < lcp(i, cra(i)), which makes the pattern less than i, so we move left. The same process can be applied for the case d(i) == RIGHT, so, indeed, we just need to follow the direction of d(i).

定理2涉及的情况 MI&LT; MAX(LLCP,RLCP)。这个人是很难理解的。让我们看看会发生什么时, MAX(LLCP,RLCP)== LLCP MAX(LLCP,RLCP)== RLCP

Theorem 2 deals with the case mi < max(llcp, rlcp). This one is harder to understand. Let's see what happens when max(llcp, rlcp) == llcp and when max(llcp, rlcp) == rlcp.

案例1:最大(LLCP,RLCP)== LLCP

如果 MAX(LLCP,RLCP)== LLCP ,那么就意味着该模式有更多的共同点与节点i的最接近正确的祖先比其最接近的左祖先。此外,因为 LLCP&GT; M(1),该模式具有更多的共同点 CRA(我)比节点 CRA(我)。这,与事实在一起的 CRA(我)(通过BST的定义),意味着更低该模式大于。因此,我们分支的权利。

If max(llcp, rlcp) == llcp, then it means that the pattern has more in common with node i's closest right ancestor than with its closest left ancestor. Furthermore, because llcp > m(i), the pattern has more in common with cra(i) than node i has with cra(i). This, together with the fact that i is lower than cra(i) (by definition of BST), implies that the pattern is greater than i. Thus, we branch right.

怎么样的更新 LLCP RLCP ?因为我们的分支权, CRA(我)将是下一个迭代相同,因此 LLCP 保持不变。更新 RLCP 是一个有点比较麻烦。当我们分支权,新的 CLA 将节点i。接下来会发生什么取决于 m是否(一)来自 CRA(我) CLA (一)。我们可以使用 D(I)要知道:如果 M(1)来自 CRA(我),然后 D(I)= = LEFT ,否则, D(I)= = RIGHT

What about the updates for llcp and rlcp? Because we're branching right, cra(i) will be the same in the next iteration, so llcp remains unchanged. Updating rlcp is a bit more tricky. When we branch right, the new cla will be node i. What happens next depends on whether m(i) came from cra(i) or cla(i). We can use d(i) to know that: if m(i) comes from cra(i), then d(i) == LEFT, otherwise, d(i) == RIGHT.

案例1.1:最大(LLCP,RLCP)== LLCP和放大器;&安培; D(I)= = RIGHT

在这种情况下,我们知道, M(1)来自 CLA(我),这意味着节点有更多的共同点 CLA(我) CRA(我)(记住,该模式具有更多的共同点 CRA(我))。正如我们之前所看到的,我们将分支权,这使得该模式比我更大。这意味着 CLA&LT; I&LT;模式,和,因此,模式之间的最长的共同preFIX 是一样的模式 CLA 之间;或者换句话说, LCP(ⅰ,CLA)&GT; LCP(我,图案)(否则,模式将必须小于 ),所以 RLCP ,这是 LCP(我,图案),仍然是相同的。

In this case, we know that m(i) comes from cla(i), which means that node i has more in common with cla(i) than with cra(i) (and remember, the pattern has more in common with cra(i)). As we saw before, we'll be branching right, which makes the pattern greater than i. This implies that cla < i < pattern, and, therefore, the longest common prefix between pattern and i is the same as between pattern and cla; or in other words, lcp(i, cla) > lcp(i, pattern) (otherwise, pattern would have to be less than i), so rlcp, which is lcp(i, pattern), remains the same.

案例1.2:最大(LLCP,RLCP)== LLCP和放大器;&安培; D(I)= = LEFT

现在,我们知道, M(1)来自 CRA(我),但该模式有更多的在共同与 CRA(我) CRA(我)。这意味着,节点作为一个瓶颈,使 RLCP 小 - RLCP 只能是一样大的我和 CRA(我),这等于 M(1)。因此,在这种情况下, RLCP 是一样的 M(1)

Now, we know that m(i) comes from cra(i), but the pattern has more in common with cra(i) than i has with cra(i). This means that node i acts as a "bottleneck", making rlcp smaller - rlcp can only be as big as the common prefix between i and cra(i), which is equal to m(i). So, in this case, rlcp is the same as m(i).

一个类似的分析可以为相反的情况,其中 MAX(LLCP,RLCP)== RLCP (再考虑分的情况下 D(I)= = RIGHT 和 D(I)= = LEFT )。要执行的操作的previous情况下,反向版本:我们分支离开, RLCP 保持不变,而 LLCP 变成 M(1)如果 D(I)= = RIGHT ,否则将保持不变。

A similar analysis can be performed for the reverse situation, where max(llcp, rlcp) == rlcp (and then consider the sub-cases d(i) == RIGHT and d(i) == LEFT). The actions to perform are the reverse version of the previous cases: we branch left, rlcp remains unchanged, and llcp becomes m(i) if d(i) == RIGHT, otherwise, it remains unchanged.

在短期:

定理2的结果

                                   d(i) == RIGHT                d(i) == LEFT                      

max(llcp, rlcp) == llcp |            branch right        |   branch right; rlcp = m(i)
max(llcp, rlcp) == rlcp |       branch left; llcp = m(i) |         branch left

定理3探讨的两个情况下,当米(ⅰ)等于与另一祖先的图案的最长公共preFIX。具体地,如果米(ⅰ)== LLCP&安培;&安培; LLCP&GT; RLCP&功放;&安培; D(I)= = RIGHT ,我们知道,匹配尽可能多的与 CLA CRA模式。由于 D(I)= = RIGHT M(1)== LLCP ,它遵循​​ LCP(我,CRA)&LT; LLCP I&LT; CRA ,这意味着该模式大于 - 我们分行的权利。类似的说法保持着相反的情况;如果 M(1)== RLCP和放大器;&安培; RLCP&GT; LLCP&功放;&安培; D(I)= = LEFT ,我们分支化离开了。在这两种情况下,无论是 LLCP RLCP 将保持不变:前, LLCP 永远不会改变,因为 CRA 仍然是相同的,而 RLCP 将不会改变,因为 D(I)==右放;&安培; M(1)== LLCP ,即与相匹配的格局 CLA 是相同的;同样的情况在后一种情况下

Theorem 3 explores two cases when m(i) is equal to the longest common prefix of the pattern with another ancestor. In particular, if m(i) == llcp && llcp > rlcp && d(i) == RIGHT, we know that i matches as much with cla as the pattern with cra. Since d(i) == RIGHT, and m(i) == llcp, it follows that lcp(i, cra) < llcp, and i < cra, implying that the pattern is greater than i - we branch right. A similar argument holds for the opposite case; if m(i) == rlcp && rlcp > llcp && d(i) == LEFT, we branch left. In either case, both llcp and rlcp will stay the same: in the former, llcp would never change because cra is still the same, and rlcp won't change because d(i) == RIGHT && m(i) == llcp, i.e., matching the pattern with cla or with i is the same; the same happens in the latter case.

定理4就是我们要进行实际的字符比较。出现这种情况,每次我们不能推断出任何关于模式和当前节点,即当 M(1)== == RLCP LLCP ,还是之间的相对顺序时间 M(1)== LLCP和放大器;&安培; LLCP&GT; RLCP&功放;&安培; D(I)= = LEFT ,或者相反 M(1)== RLCP和放大器;&安培; RLCP&GT; LLCP&功放;&安培; D(I)= = RIGHT 。直观地说,我们知道,没有什么可以推导有关订单和字符执行比较自该不是最长的共同preFIX部分的第一个字符,并停在第一个的是不同的(在这一点上,我们可以原因有关订单)。同样,如果我们分支正确的,那么 CRA 是一样的,所以 LLCP 并没有改变,而 RLCP 现在将持有的格局和节点之间的新计算最长的共同preFIX值。类似的过程发生了相反的情况:我们分支离开, RLCP 保持不变,而 LLCP 变成了previously计算的最长的共同preFIX值。这个定理是直觉正确的,所以我不会再往前走。

Theorem 4 is where we have to perform actual character comparisons. This happens every time we cannot infer anything about the relative order between the pattern and the current node, that is, when m(i) == rlcp == llcp, or m(i) == llcp && llcp > rlcp && d(i) == LEFT, or the reverse m(i) == rlcp && rlcp > llcp && d(i) == RIGHT. Intuitively, we know that nothing can be infered about the order, and character comparisons are performed beginning on the first character that is not part of the longest common prefix, and stopping on the first that is different (at which point we can reason about the order). Again, if we branch right, then cra remains the same, so llcp doesn't change, and rlcp will now hold the value of the newly computed longest common prefix between the pattern and node i. A similar process happens for the reverse case: we branch left, rlcp stays untouched, and llcp becomes the previously computed value of the longest common prefix. This theorem is intuitively correct, so I won't go any further.

定理3及4结果

                                     d(i) == RIGHT                d(i) == LEFT                      
m(i) == llcp && llcp > rlcp |         branch right       |               *
m(i) == rlcp && rlcp > llcp |             *              |         branch left

* = compare(); if branch == left then rlcp = computed_lcp else llcp = computed_lcp

下面是伪code带来这一切在一起,从9页采取:

Here's the pseudo-code that brings all of this together, taken from page 9:

我的修改算法

这可能需要一些时间来包装你的头左右,该算法的迷死人,但一旦你掌握所有的细节,可以看出,寻找最长的重复子串的问题等价于寻找节点为其 M(1)是最大的。毕竟,最长重复子是最长的共同preFIX可以永远不会任意两个节点之间的发现。这可以在没​​有任何显著的开销可以找到,如果我们跟踪它在构建树:最大 M(1)迄今为止看到必须保持,每当一个新节点Ĵ插入,我们比较 M(j)条与迄今看到的最高,并进行更新,如果 M(j)条大。事实上,这种做法无非是实现一种算法,相当于排序的所有后缀,并找到任何两个连续的后缀之间的最长的共同preFIX,与不进行不必要的字符比较的优势,一个奇特的方式。这是一个相当不错的改进。

It may take a little while to wrap your head around the awesomeness of this algorithm, but once you grasp all the details, it can be seen that the problem of finding the longest repeated substring is equivalent to finding the node i for which m(i) is greatest. After all, the longest repeated substring is the longest common prefix that can ever be found between any two nodes. This can be found without any significant overhead if we keep track of it while building the tree: the maximum m(i) seen so far must be kept, and whenever a new node j is inserted, we compare m(j) with the maximum seen so far, and update it if m(j) is greater. In fact, this approach is nothing more than a fancy way to implement an algorithm that is equivalent to sorting all the suffixes and finding the longest common prefix between any two consecutive suffixes, with the advantage of not performing unnecessary character comparisons. That's a considerably good improvement.

上面显示的假性code是几乎足以建立标准SBST。首先,我们添加一个根节点,用我== 1 ,再presenting整个文本。后缀,然后加入左到右。要插入一个新的后缀,我们查找一个模式,即后缀。这样做会使算法停在插入的确切位置。然而,本文不会进入太多细节上的插入过程。我们必须小心与返回我; 在最后的其他的定理4,我们只能返回,如果我们'再搜索。如果我们正在做的插入,达到了给我们带来定理4的这种情况下一个节点,那么就意味着在所有新后缀的匹配与一些previously插入后缀的字符。因为后缀正在从左至右插入时,我们也知道,新后缀具有比其它一个,这意味着新的后缀是低于对方少字符:正确的举动是来分支离开。由于我们支走后,最接近左侧的祖先保持不变,所以我们只需要更新 LLCP LLCP 变成后缀本身的大小,因为正如我们所看到的,它的所有匹配与目前最接近正确的祖先节点。

The pseudo-code shown above is almost enough to build the standard SBST. We start by adding a root node, with i == 1, representing the whole text. Suffixes are then added left to right. To insert a new suffix i, we search for a pattern that is suffix i. Doing so will make the algorithm stop at the exact spot of insertion. However, the paper doesn't go into much detail on the insertion process. We must be careful with the return i; in the final else for Theorem 4. We can only return if we're searching. If we're doing an insertion and reach a node that brings us to that case of theorem 4, then it means that all of the characters in the new suffix match with some previously inserted suffix. Because suffixes are being inserted from left to right, we also know that the new suffix has less characters than the other one, which means that the new suffix is lower than the other: the correct move is to branch left. Since we branched left, the closest left ancestor stays the same, so we only need to update llcp. llcp becomes the size of the suffix itself, because as we saw, all of it is matched with the node that is now the closest right ancestor.

显然,新节点将有 M(1)值等于 MAX(LLCP,RLCP) D(I)会,顾名思义,是如果 MAX(LLCP ,RLCP)== RLCP 否则。

Obviously, the new node will have a value of m(i) equal to max(llcp, rlcp), and d(i) will, by definition, be RIGHT if max(llcp, rlcp) == rlcp, and LEFT otherwise.

我在C实现归结为伪code,与插入逻辑在一起。有两个数据结构:结构sbst 重presents后缀二叉搜索树以及最大 M(1)看到为止;和结构节点,这是描述符树的节点。

My implementation in C boils down to the pseudo-code, together with the insertion logic. There are two data structures: struct sbst represents a suffix binary search tree along with the maximum m(i) seen so far; and struct node, which is the descriptor for a tree's node.

下面是完整的程序清单:

Here's the full program listing:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define LEFT 0
#define RIGHT 1
#define MODE_INSERT 1
#define MODE_FIND 2
#define MAX_TEXT_SIZE 1024
#define max(a,b) ((a)>(b)?(a):(b))

struct node {
  int m;
  int d;
  int i;
  struct node *left;
  struct node *right;
};

struct sbst {
  struct node *root;
  int max_mi; /* The maximum lcp in the tree */
  int max_i; /* The node i with m(i) == max_mi */
};

struct node *allocate_node(int m, int d, int i) {
  struct node *new_node = malloc(sizeof(*new_node));
  new_node->m = m;
  new_node->d = d;
  new_node->i = i;
  new_node->left = new_node->right = NULL;
  return new_node;
}

int lcp(char *str1, char *str2);

/* The core where all the work takes place.
   It is assumed that when this function is called, tree->root always points to a valid, allocated root. That is, it is assumed
   that the tree contains at least one node.
   This function provides both a find and an insert algorithm. To find a pattern, <mode> must be MODE_FIND. To insert,
   <mode> must be MODE_INSERT. In the latter case, the parameter <text_i> corresponds to the index in the original string
   of the suffix being inserted (index starts counting at 1, as described in the paper). Also, in MODE_INSERT, the pattern is
   the suffix being inserted.
   If in MODE_FIND, this function returns the index (starting at 1) in the text where <pattern> can be found, or 0 if no such
   pattern could be found.
*/
int find_insert_aux(struct sbst *tree, char *pattern, size_t pattern_len, char *text, size_t text_len, char mode, int text_i) {
  struct node *current, *prev;
  int llcp, rlcp;
  int next_dir;

  current = tree->root;
  llcp = rlcp = 0;

  while (current != NULL) {
    int max_pattern = max(llcp, rlcp);
    if (current->m > max_pattern) {
      next_dir = current->d;
    } else if (current->m < max_pattern) {
      if (llcp > rlcp) {
    next_dir = RIGHT;
    if (current->d == LEFT) {
      rlcp = current->m;
    }
      } else if (rlcp > llcp) {
    next_dir = LEFT;
    if (current->d == RIGHT) {
      llcp = current->m;
    }
      }
    } else if (current->m == llcp && llcp > rlcp && current->d == RIGHT) {
      next_dir = RIGHT;
    } else if (current->m == rlcp && rlcp > llcp && current->d == LEFT) {
      next_dir = LEFT;
    } else {
      int sub_lcp = lcp(pattern+current->m, text+current->m+current->i-1);
      int t = current->m + sub_lcp;
      if (t == pattern_len) {
    if (mode == MODE_FIND) {
      return current->i;
    } else {
      next_dir = LEFT;
      llcp = t;
    }
      } else if (current->i+t-1 == text_len || pattern[t] > text[t+current->i-1]) {
    next_dir = RIGHT;
    rlcp = t;
      } else {
    next_dir = LEFT;
    llcp = t;
      }
    }
    prev = current;
    current = (next_dir == RIGHT ? current->right : current->left);
  }
  if (mode == MODE_INSERT) {
    struct node *new_node = allocate_node(max(llcp, rlcp), (llcp > rlcp ? LEFT : RIGHT), text_i);
    if (next_dir == LEFT)
      prev->left = new_node;
    else
      prev->right = new_node;
    if (new_node->m > tree->max_mi) {
      tree->max_mi = new_node->m;
      tree->max_i = new_node->i;
    }
  }
  return 0;
}

void sbst_insert(struct sbst *tree, char *text, size_t text_size, int i) {
  (void) find_insert_aux(tree, text+i-1, text_size-i+1, text, text_size, MODE_INSERT, i);
}

int sbst_find(struct sbst *tree, char *text, size_t text_size, char *pattern, size_t pattern_size) {
  return find_insert_aux(tree, pattern, pattern_size, text, text_size, MODE_FIND, 0);
}

/* Builds a Suffix Binary Search Tree that keeps track of the highest m(i) as it is built. */
struct sbst *build_sbst(char *text, size_t text_size) {
  if (*text == '\0')
    return NULL;

  struct sbst *tree = malloc(sizeof(*tree));
  tree->root = allocate_node(0, 0, 1);
  tree->max_mi = 0;
  tree->max_i = 1;

 for (int i = 1; text[i] != '\0'; i++)
   sbst_insert(tree, text, text_size, i+1);

  return tree;
}

/* Given an SBST for the input, finds the longest repeated substring in O(1)
   Stores the offset in *offset, and the size of the lrs in *size
*/
void find_lrs(struct sbst *tree, int *offset, int *size) {
  *offset = tree->max_i-1;
  *size = tree->max_mi;
}

/* Debug section */
void dump(struct node *n, char *text, int depth) {
  if (!n)
    return;
  for (int i = 0; i < depth; i++)
    putchar(' '), putchar(' ');
  printf("%d|%d|%d|%s\n", n->m, n->d, n->i, text+n->i-1);
  dump(n->left, text, depth+1);
  dump(n->right, text, depth+1);
}

void dump_sorted(struct node *n, char *text) {
  if (!n)
    return;
  dump_sorted(n->left, text);
  printf("%s\n", text+n->i-1);
  dump_sorted(n->right, text);
}
/* End debug section */

int lcp(char *str1, char *str2) {
  int i;
  for (i = 0; str1[i] != '\0' && str1[i] == str2[i]; i++);
  return i;
}

int main(void) {
  char text[MAX_TEXT_SIZE];
  printf("Enter text, hit RETURN to terminate (max. %d chars): ", MAX_TEXT_SIZE-1);

  fgets(text, sizeof text, stdin);
  size_t text_size = strlen(text);

  /* Trim newline */
  text[--text_size] = '\0';

  struct sbst *tree = build_sbst(text, text_size);

  /* Debug */
  #ifdef DEBUG_MODE
  dump(tree->root, text, 0);
  printf("\n");
  dump_sorted(tree->root, text);
  #endif

  int lrs_offset, lrs_size;
  find_lrs(tree, &lrs_offset, &lrs_size);
  if (lrs_size == 0)
    printf("No longest repeated substring.\n");
  else {
    printf("Longest repeated substring found at offset %d with size %d: %.*s\n", lrs_offset, lrs_size, lrs_size, text+lrs_offset);
  }
  return 0;
}

如果我不看报纸,我会看在这code作为黑魔法。

Had I not read the paper, I would have looked upon this code as black magic.

请注意,这是不完全的最好的软件工程实践了很好的示范:人谁不熟悉的算法将有一个非常困难的时间阅读和理解它,它泄漏内存,它不检查返回值对的malloc ,并有一些其他的瑕疵太多,但我认为这是足以让我的观点。

Note that this is not exactly a good demonstration of the best software engineering practices: someone who is not familiar with the algorithm will have a very hard time reading and understanding it, it leaks memory, it doesn't check the return value of malloc, and has some other flaws too, but I think it's enough to make my point.

虽然这可能不如最优作为后缀树,它显然是简单建立和提供了一个很好的起点。例如,作为一个副产品,它可以执行的模式在对数时间匹配 - !这是pretty的好

While this may not be as optimal as a suffix tree, it is clearly simple to build and provides a good starting point. For example, as a by-product, it can perform pattern matching in logarithmic time - that's pretty good!

注意:我没有太多时间来检验的实施。我做了一些基本的测试,它似乎是工作,但我不能保证没有错误。

NOTE: I haven't had much time to test the implementation. I did some basic testing and it seems to be working, but I can't assure there are no errors.