Ukkonen的后缀树算法,用简单的英语?英语、后缀、算法、简单

2023-09-10 22:22:23 作者:现在最火的

我觉得有点厚,在这一点上。我花了几天试图全面总结我的脑袋周围后缀树的构造,而是因为我没有数学的背景下,很多的解释躲避我,他们开始做出过度使用数学符号的。最接近我已经找到了一个很好的解释的 快速的字符串搜索后缀树 的,但他掩盖了各个点和算法的某些方面仍不清楚。

一个一步一步的解释这个算法在这里对堆栈&NBSP的;溢出将是非常宝贵的许多人除了我之外,我敢肯定,

有关引用,这里的Ukkonen对算法纸:http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf

我的基本的了解,到目前为止:

我需要遍历一个给定的串T的每preFIX P 在我需要通过每个后缀s遍历在preFIX P和添加到树 要后缀s添加到树,我需要遍历S中的每个字符,与迭代由要么走与S中的同一组字符的C启动一个现有的分支和潜在的分裂边缘到后代节点,当我在后缀达到不同的字符,或者如果没有匹配的边缘走。如果没有匹配的边缘发现走了C,新叶边缘为C语言创建的。

基本算法似乎是为O(n 2 ),如指出,在大多数的解释,因为我们需要逐步完成所有的prefixes的,那么我们就需要步每一个后缀为每个preFIX的。 Ukkonen的算法是因为他使用的后缀指针技术显然是独一无二的,但我认为的是的是我有什么麻烦的理解。

我也遇到了麻烦理解:

在什么时候以及如何激活点分配,使用和修改 什么是与算法的册封方面的事情 为什么我已经看到了需要修复边界变量,他们使用的是实现

修改(2012年4月13日)

算法分析与设计 后缀树

下面是基于以下jogojapan的回答,我已经写和输出完成的源$ C ​​$ C。的code输出的需要,因为它构建树的步骤的详细描述和基于文本的图。这是第一个版本,也许可以做的优化等等,但它的作品,这是最主要的。

[删除URL,见下面的更新的链接]

修改(2012年4月15日)

该人士$ ​​C $ C已经完全从头开始重写,现在不仅可以正常工作,但它支持自动册封并呈现输出一个更好看的文字图形。来源$ C ​​$ c和采样输出是:

  

https://gist.github.com/2373868

解决方案

下面是试图描述Ukkonen算法,首先展示它做什么时,该字符串是简单的(即不包含任何重复的字符),然后它延伸到完整的算法。

首先,几个preliminary声明。

我们正在建设,是的基本的像一个搜索线索。因此,有 是根节点,边缘走出它导致新的节点,并 进一步边出去的那些,等等

但是:与在搜索线索,边标签不是单 字符。相反,每个边缘使用一对整数标记的 [从,到] 。这些都是指针到文本。在这个意义上说,每个 边进行任意长度的字符串标签,但只需要O(1) 空间(两个指针)。

基本原理

我想首先演示如何创建一个后缀树 特别是简单的字符串,字符串没有重复的字符:

  ABC
 

该算法的工作步骤,由左到右。有一个用于字符串的每个字符一步。每一步都可能涉及多个个体经营,但我们将看到(见最后意见底)的操作的总数是O(n)。

所以,我们从离开的,并且插入只有一个字符开始 A 通过创建从根节点的边(左边)到叶, 和标签为 [0,#] ,这意味着边缘重新presents的 子串在位置0和结束时的当前端的。一世 用符号来表示的当前端的,这是在位置1 (右后 A )。

因此​​,我们有一个初步的树,它看起来是这样的:

和它的意思是这样的:

现在我们进步到位置2(右后 B )。 我们在每一步的目标 是插入所有后缀上涨到目前的位置。我们这样做 通过

扩展现有的 A -edge到 AB 插入为一个新的边缘b

在我们重新presentation这看起来像

和它的意思是:

我们观察两件事情:

在边缘重新presentation为 AB 是同作为它曾经是 在最初的树: [0,#] 。它的意义已经自动更改 因为我们更新了当前位置从1至2。 在每个边消耗O(1)空间,因为它包含了只有两个 指针不管到文本,有多少字符, 再presents。

接下来,我们再次递增的位置和通过追加更新树 一个 C 每一个现有的边缘,插入一个新的边缘新 后缀 C

在我们重新presentation这看起来像

和它的意思是:

我们看到:

在该树是正确的后缀树的上涨到目前的位置的 每个步骤后 有,因为有字符的文本尽可能多的步骤 工作的每个步骤中的量是O(1),因为所有现有边 是通过增加自动更新,并插入 在最后一个字符一次新的边缘可以在O完成(1) 时间。因此,对于长度为n的字符串,只有O的要求(n)的时间。

第一次延长:简单的重复

当然,这工作得这么好,只是因为我们的字符串不 包含任何重复。现在我们看一个更现实的字符串:

  abcabxabcd
 

它开始与农行如previous例子,那么 AB 重复 其次为 X ,然后农行被重复,然后按 D

步骤1至3:后的第一个3个步骤,我们已经从previous例如树:

第四步:我们移动定位4。这隐含更新所有现有的 边缘到这一点:

和我们需要插入当前步骤的最终后缀, A ,在 根。

在我们这样做,我们介绍的两个变量(除 ),这当然一直在那里所有的时间,但我们还没有使用 他们至今:

在活动点,这是一个三 (active_node,active_edge,active_length)剩余,这是一个整数,表示许多新后缀 我们需要插入

这两个的确切含义将变得清晰很快,但现在 我们只能说:

在简单的农行例如,活动点总是 (根,'\ 0X',0),即 active_node 是根节点, active_edge 被指定为空字符'\ 0X active_length 为零。这样做的结果是,所述一个新的边缘,该边缘 我们插入在每一个步骤被插入到根节点作为 新创建的边缘。我们将很快看到,为什么三是必要的 再present此信息。 的剩余总是设置为1,在每个月初 步骤。这样做的意义是,后缀的数目,我们不得不 积极地插入在各步骤结束时为1(总是正好 最后一个字符)。

现在该是改变的。当我们插入当前最后 字符 A 的根,我们注意到,已经有传出 边缘开始与 A ,具体为: ABCA 。下面是我们做的 这样的案例:

我们不要插入根节点新鲜的边缘 [4,#] 。相反,我们 只是注意到后缀 A 已经在我们的 树。它结束在一个长边缘的中间,但我们不 受困扰。我们只是把事情会是这样的。 我们设置活动点到(根,'A',1)。这意味着活性 现在点位于根节点的传出边缘的,与开始的中间的某个地方一,具体地讲,后上边缘位置1。我们 请注意,是由它的第一个简单的规定的边缘 字符 A 。这足以因为有可能的只有的边缘 开始与任何特定的字符(确认整个描述看完之后,这是真的)。​​ 我们还增加剩余,因此在下一步的开始 这将是2。

观察:当最后的后缀,我们需要插入被发现 在树中存在已,树本身是没有改变在所有(我们只更新活动点和剩余)。那个树 然后没有后缀树的到一个准确的重新presentation 当前位置的任何更多的,但它的包含所有后缀(因为最终 后缀 A 包含的隐含的)。因此,除了更新 变量(这些都是固定长度的,所以这是O(1)),出现了 没有工作在这一步完成。

第五步:我们更新当前位置 5。这 自动更新树这样的:

和,因为剩余 2 ,我们需要插入两个决赛 当前位置的后缀: AB B 。这主要是因为:

从previous步骤后缀从来没有正确 插入。因此,它具有的仍然的,因为我们已经进步了1 步骤,它现在已经成长从 A AB 。 我们需要插入新的最终边缘 B

在实践中,这意味着,我们去活动点(指向 在 A 什么是现在 abcab 边缘),并插入背后 当前的最后一个字符 B 。 但是:此外,事实证明, B 是 也已经present在同一边缘。

所以,再次,我们不改变树。我们简单:

更新活动点(根,'一',2)(相同的节点和边缘 和以前一样,但现在我们指向背后的 B ) 递增的剩余 3,因为我们还没有正确 插入从previous一步的最终边缘,我们不要插入 目前最终的边缘无论是。

需要明确的是:我们必须插入 AB B 当前步骤,但 因为 AB 已经发现,我们更新了活动点,做 甚至不要尝试插入 B 。为什么?因为如果 AB 在树中, 每个后缀它(包括 B )必须在树中, 太。也许只有隐的,但它必须是因为在那里, 这样,我们已经建立了树为止。

我们进入第六步按递增。该树 自动更新为:

由于 剩余 3 ,我们必须插入 ABX BX X 。活动点告诉我们,其中 AB 结束,所以我们只需要 跳到那里,然后插入 X 。事实上, X 现在还没有,所以我们 拆分 abcabx 边,并插入一个内部节点:

边缘重新presentations仍然指针到文本,所以 分裂和插入一个内部节点可以在O完成(1)时间。

因此​​,我们已经处理了 ABX 和减量剩余 2。现在我们 需要插入的下一个剩余后缀, BX 。但是,在我们这样做 我们需要更新的活动点。该规则对于这一点,分裂后 并插入一个边缘,将被称为第1 的下方,它适用于任何的 active_node 是根(我们将进一步学习规则3其他案件 下面)。这是规则1:

这根插入后,

active_node 依然根 active_edge 设置为新后缀的第一个字符,我们 需要插入,即 B active_length 由1 减少

因此​​,新的活动点三(根,'B',1)表示 接下来的刀片具有在 bcabx 边进行,后面1个字符, 即后面 B 。我们可以找出在O(1)时间的插入点 检查是否 X 已经是present与否。如果是present,我们 将结束当前的步骤,并把一切事情是这样的。但 X 是不是present,所以我们通过分裂的边缘插入:

再次这场耗时O(1)时间,我们更新剩余 1和 积极点(根,'X',0)的规则1的状态。

但有一件事,我们需要做的。我们将这个规则2:

  

如果我们分手的边缘,并插入一个新的节点,如果是这样的不   第一个节点的过程中当前步骤创建,我们连接previously   插入的节点,并通过一个特殊的指针的新的节点,一个的后缀   链接。稍后我们将看到这是为什么有用。下面是我们得到的,   后缀链接psented作为点缀优势重新$ P $:

我们仍需要插入当前步骤的最终后缀, X 。由于活动节点的 active_length 分量已经下降 为0,最后的插入物在直接在根制成。由于没有 在根节点开始与 X 即将离任的优势,我们插入一个新的 优势:

我们可以看到,在目前的步骤中的所有剩余插入发了言。

我们进入第7步设置 = 7,它会自动将下一个字符, A ,所有的叶边,一如既往。然后我们试图插入新的最终 字符到活动点(根),并发现它的存在 已经。因此,我们结束当前的步骤,不需要插入任何与 更新活动点(根,'A',1)

在第8步 ,, = 8,我们追加 B ,和就像前,这只 意味着我们更新活动点(根,'一',2)和增量剩余没有做 别的,因为 B 已经是present。 但是,我们注意到(在O(1)的时间)的活动点 现在在边缘的末端。我们反映这个由它重新设置为 (节点,'\ 0X',0)。在这里,我用节点1 来指代 在内部节点的 AB 边缘处结束。

然后,在步 = 9 ,我们需要将C,这将帮助我们 了解最终的伎俩:

第二次延长:使用后缀链接

与往常一样,更新追加 C 自动向叶边 我们去到活动点,看看我们是否可以将C。事实证明 出'C'已经存在于这种优势,所以我们设置活动点 (节点,'C',1),增量剩余别的什么也不干。

现在在步 = 10 ,余数为4 ,所以我们首先需要插入 ABCD 通过在主动插入 D (这仍然是从3个步骤前) 点。

试图在活动点​​插入 D 引起的边缘分裂 O(1)时间:

active_node ,从中分离发起,被标记为 上面红色。这里是最终规则,规则3:

  

分裂的边缘,从经过 active_node 不是根   节点,我们按照后缀链路外出该节点的,如果有   任,并重置 active_node 的节点,它指向。如果有   没有后缀的链接,我们设置了 active_node 根。 active_edge   和 active_length 保持不变。

所以,积极点,现在是(节点2,'C',1)节点2 标记中 下面红:

由于插入 ABCD 完成后,我们递减剩余来 3,考虑当前步骤的下一个剩余后缀, BCD 。第3条设立活动点,以恰到好处的节点和边缘 所以插入 BCD 可以通过简单地插入它的最后一个字符来完成 D 的活动点。

这样做,因为规则2 这导致另一个边缘分裂,和,我们 必须创建从previously插入节点到新的后缀链接 之一:

我们看到:后缀的联系使我们能够重新设置活动点,所以我们   可以使下的剩下的插入的在O(1)努力。看着那(这   图表上方,以确认确实节点的标签 AB 链接到   b (它的后缀),并在节点ABC 被链接到节点在    BC

当前步骤尚未完成。 剩余现在是2,和我们 需要按照第3条再重新设置活动点。由于 电流 active_node (红色以上)没有后缀的链接,我们重置 根。活动点现在(根,'C',1)

因此​​,下一个插入发生在根节点的一个引出边 其标签开始 C cabxabcd ,第一个字符的后面, 即后面 C 。这会导致另一种分裂:

此外,由于这涉及到一个新的内部节点的创建,我们按照 规则2,设置从pviously创建了内部的$ P $一个新的后缀链接 节点:

(我使用 Graphviz的点这些小 图形。新后缀的链接导致点重新安排现有 边缘,所以仔细检查,以确认这是唯一 上方插入,是一种新的后缀链接。)

通过这个,剩余可设置为1,因为 active_node 是 根,我们使用规则1,更新活动点(根,'D',0)。本 指当前步骤的最后插入是插入一个 D 在根:

这是最后一步,我们正在做。还有的最终数目 观察,虽然

在我们移动前进1个职位的每一步。这将自动 更新为O的所有叶子节点(1)的时间。

不过,这并不涉及一)任何后缀的从previous剩余的 步骤,和b)与当前步骤的最后一个字符。

剩余告诉我们,我们有多少额外的插入需要 使。这些插入对应一对之一的最后的后缀 即在当前位置结束的字符串。我们思考一个 后等,使插入物。 重要:每个刀片是 在O(1)时间内完成,因为活动点告诉我们究竟在何处 走,我们需要在活动只添加一个单字符 点。为什么?因为其他字符的含有隐含的 (否则,活动点不会是它在哪里)。

在每一个这样的插件,我们递减剩余,并按照 如果有任何后缀链路。如果不是我们去根(第3条)。如果我们 处于根已经,我们修改使用规则1.在活动点 任何情况下,只需要O(1)时间。

如果,在这些刀片中的一个,我们发现,对字符我们想要 插入已经存在,我们没有做任何事情,并结束 当前步骤,即使剩余> 0。其理由是,任何 仍然存在的插入会,我们只是想在一个后缀 使。因此,他们都的隐的当前树。事实 该剩余> 0可以确保我们处理余下的后缀 更高版本。

如果在算法剩余> 0结束了吗?这将是 情况下,只要文字到底是什么发生了子 以前在什么地方。在这种情况下,我们必须附加一个额外的字符 在尚未发生前字符串的结尾。在里面 文学,通常是美元符号 $ 被用作一个符号 那。 为什么这件事 - >如果以后我们使用完成后缀树来搜索后缀,我们必须接受的匹配,只有当他们的到底在叶的。否则,我们就得到了很多虚假的比赛,因为有许多的字符串的隐含的包含在树中不是主要的字符串的实际后缀。强制剩余为0时结束本质上是一种方式,以确保所有后缀结尾处的叶节点。 但是,如果我们想用树来搜索的一般子的,不仅后缀的主字符串,这最后一步确实不需要建议的那样,低于OP的评论。

那么,什么是整个算法的复杂性?如果文本为n 字符长,有明显的n步(或N + 1,如果我们增加 美元符号)。在每个步骤中,我们要么什么也不做(除 更新变量),或者我们做剩余插入,每次取O(1) 时间。由于剩余表示多少次,我们都做了什么 在previous步骤,并减少对于每次插入,我们做 现在的总次数,我们做的事情是恰好n(或 的n + 1)。因此,总的复杂度为O(n)。

然而,有一个很小的事情,我没有好好解释一下: 它可以发生,我们遵循一个后缀链接,更新活动 点,然后发现它 active_length 组件不 以及与新的 active_node 工作。例如,考虑这样的情况 像这样的:

(虚线表示树的剩余部分。该虚线是一个 后缀链接。)

现在让活动点是(红色,'D',3),所以它指向的地方 后面的 F DEFG 边缘。现在假设我们做了必要的 更新现在跟随后缀链接更新活动点 根据规则3.新的活跃点(绿色,D,3)。然而, 在 D -edge走出去绿色节点是,所以它只有2 字符。为了找到正确的活动点,我们明显 需要遵循边缘到蓝色节点,并恢复到(蓝,'F',1)

在一个特别不好的情况下, active_length 可能是一样大 剩余,它可以大如正。它很可能发生 ,要找到正确的活动点,我们需要超过一不仅跳跃 内部节点,但或许许多,直到n在最坏的情况下。这是否 意味着该算法有一个隐藏的为O(n 2 )的复杂性,因为 在每一步剩余一般为O(n),以及后期调整 以下面的后缀的链接可能是为O(n),后主动节点呢?

没有。原因是,如果确实,我们必须调整活动点 (例如,从绿色到如上蓝色),这使我们到一个新的节点 有它自己的后缀链路,和 active_length 将减少。如 我们按照下来的后缀链接,我们做其余插入链条, active_length 只能 降低,并且活性位点的调整的数量,我们可以使上 在任何给定时间的方式不能比 active_length 大。以来 active_length ,并且不能超过剩余剩余大 为O(n)不仅在每一个步骤,但增量的总和 以往剩余在整个处理过程中是由 为O(n)也主动点调整的次数也北临 为O(n)。

I feel a bit thick at this point. I've spent days trying to fully wrap my head around suffix tree construction, but because I don't have a mathematical background, many of the explanations elude me as they start to make excessive use of mathematical symbology. The closest to a good explanation that I've found is Fast String Searching With Suffix Trees, but he glosses over various points and some aspects of the algorithm remain unclear.

A step-by-step explanation of this algorithm here on StackOverflow would be invaluable for many others besides me, I'm sure.

For reference, here's Ukkonen's paper on the algorithm: http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf

My basic understanding, so far:

I need to iterate through each prefix P of a given string T I need to iterate through each suffix S in prefix P and add that to tree To add suffix S to the tree, I need to iterate through each character in S, with the iterations consisting of either walking down an existing branch that starts with the same set of characters C in S and potentially splitting an edge into descendent nodes when I reach a differing character in the suffix, OR if there was no matching edge to walk down. When no matching edge is found to walk down for C, a new leaf edge is created for C.

The basic algorithm appears to be O(n2), as is pointed out in most explanations, as we need to step through all of the prefixes, then we need to step through each of the suffixes for each prefix. Ukkonen's algorithm is apparently unique because of the suffix pointer technique he uses, though I think that is what I'm having trouble understanding.

I'm also having trouble understanding:

exactly when and how the "active point" is assigned, used and changed what is going on with the canonization aspect of the algorithm Why the implementations I've seen need to "fix" bounding variables that they are using

EDIT (April 13, 2012)

Here is the completed source code that I've written and output based on jogojapan's answer below. The code outputs a detailed description and text-based diagram of the steps it takes as it builds the tree. It is a first version and could probably do with optimization and so forth, but it works, which is the main thing.

[Redacted URL, see updated link below]

EDIT (April 15, 2012)

The source code has been completely rewritten from scratch and now not only works correctly, but it supports automatic canonization and renders a nicer looking text graph of the output. Source code and sample output is at:

https://gist.github.com/2373868

解决方案

The following is an attempt to describe the Ukkonen algorithm by first showing what it does when the string is simple (i.e. does not contain any repeated characters), and then extending it to the full algorithm.

First, a few preliminary statements.

What we are building, is basically like a search trie. So there is a root node, edges going out of it leading to new nodes, and further edges going out of those, and so forth

But: Unlike in a search trie, the edge labels are not single characters. Instead, each edge is labeled using a pair of integers [from,to]. These are pointers into the text. In this sense, each edge carries a string label of arbitrary length, but takes only O(1) space (two pointers).

Basic principle

I would like to first demonstrate how to create the suffix tree of a particularly simple string, a string with no repeated characters:

abc

The algorithm works in steps, from left to right. There is one step for every character of the string. Each step might involve more than one individual operation, but we will see (see the final observations at the end) that the total number of operations is O(n).

So, we start from the left, and first insert only the single character a by creating an edge from the root node (on the left) to a leaf, and labeling it as [0,#], which means the edge represents the substring starting at position 0 and ending at the current end. I use the symbol # to mean the current end, which is at position 1 (right after a).

So we have an initial tree, which looks like this:

And what it means is this:

Now we progress to position 2 (right after b). Our goal at each step is to insert all suffixes up to the current position. We do this by

expanding the existing a-edge to ab inserting one new edge for b

In our representation this looks like

And what it means is:

We observe two things:

The edge representation for ab is the same as it used to be in the initial tree: [0,#]. Its meaning has automatically changed because we updated the current position # from 1 to 2. Each edge consumes O(1) space, because it consists of only two pointers into the text, regardless of how many characters it represents.

Next we increment the position again and update the tree by appending a c to every existing edge and inserting one new edge for the new suffix c.

In our representation this looks like

And what it means is:

We observe:

The tree is the correct suffix tree up to the current position after each step There are as many steps as there are characters in the text The amount of work in each step is O(1), because all existing edges are updated automatically by incrementing #, and inserting the one new edge for the final character can be done in O(1) time. Hence for a string of length n, only O(n) time is required.

First extension: Simple repetitions

Of course this works so nicely only because our string does not contain any repetitions. We now look at a more realistic string:

abcabxabcd

It starts with abc as in the previous example, then ab is repeated and followed by x, and then abc is repeated followed by d.

Steps 1 through 3: After the first 3 steps we have the tree from the previous example:

Step 4: We move # to position 4. This implicitly updates all existing edges to this:

and we need to insert the final suffix of the current step, a, at the root.

Before we do this, we introduce two more variables (in addition to #), which of course have been there all the time but we haven't used them so far:

The active point, which is a triple (active_node,active_edge,active_length) The remainder, which is an integer indicating how many new suffixes we need to insert

The exact meaning of these two will become clear soon, but for now let's just say:

In the simple abc example, the active point was always (root,'\0x',0), i.e. active_node was the root node, active_edge was specified as the null character '\0x', and active_length was zero. The effect of this was that the one new edge that we inserted in every step was inserted at the root node as a freshly created edge. We will see soon why a triple is necessary to represent this information. The remainder was always set to 1 at the beginning of each step. The meaning of this was that the number of suffixes we had to actively insert at the end of each step was 1 (always just the final character).

Now this is going to change. When we insert the current final character a at the root, we notice that there is already an outgoing edge starting with a, specifically: abca. Here is what we do in such a case:

We do not insert a fresh edge [4,#] at the root node. Instead we simply notice that the suffix a is already in our tree. It ends in the middle of a longer edge, but we are not bothered by that. We just leave things the way they are. We set the active point to (root,'a',1). That means the active point is now somewhere in the middle of outgoing edge of the root node that starts with a, specifically, after position 1 on that edge. We notice that the edge is specified simply by its first character a. That suffices because there can be only one edge starting with any particular character (confirm that this is true after reading through the entire description). We also increment remainder, so at the beginning of the next step it will be 2.

Observation: When the final suffix we need to insert is found to exist in the tree already, the tree itself is not changed at all (we only update the active point and remainder). The tree is then not an accurate representation of the suffix tree up to the current position any more, but it contains all suffixes (because the final suffix a is contained implicitly). Hence, apart from updating the variables (which are all of fixed length, so this is O(1)), there was no work done in this step.

Step 5: We update the current position # to 5. This automatically updates the tree to this:

And because remainder is 2, we need to insert two final suffixes of the current position: ab and b. This is basically because:

The a suffix from the previous step has never been properly inserted. So it has remained, and since we have progressed one step, it has now grown from a to ab. And we need to insert the new final edge b.

In practice this means that we go to the active point (which points to behind the a on what is now the abcab edge), and insert the current final character b. But: Again, it turns out that b is also already present on that same edge.

So, again, we do not change the tree. We simply:

Update the active point to (root,'a',2) (same node and edge as before, but now we point to behind the b) Increment the remainder to 3 because we still have not properly inserted the final edge from the previous step, and we don't insert the current final edge either.

To be clear: We had to insert ab and b in the current step, but because ab was already found, we updated the active point and did not even attempt to insert b. Why? Because if ab is in the tree, every suffix of it (including b) must be in the tree, too. Perhaps only implicitly, but it must be there, because of the way we have built the tree so far.

We proceed to step 6 by incrementing #. The tree is automatically updated to:

Because remainder is 3, we have to insert abx, bx and x. The active point tells us where ab ends, so we only need to jump there and insert the x. Indeed, x is not there yet, so we split the abcabx edge and insert an internal node:

The edge representations are still pointers into the text, so splitting and inserting an internal node can be done in O(1) time.

So we have dealt with abx and decrement remainder to 2. Now we need to insert the next remaining suffix, bx. But before we do that we need to update the active point. The rule for this, after splitting and inserting an edge, will be called Rule 1 below, and it applies whenever the active_node is root (we will learn rule 3 for other cases further below). Here is rule 1:

After an insertion from root,

active_node remains root active_edge is set to the first character of the new suffix we need to insert, i.e. b active_length is reduced by 1

Hence, the new active-point triple (root,'b',1) indicates that the next insert has to be made at the bcabx edge, behind 1 character, i.e. behind b. We can identify the insertion point in O(1) time and check whether x is already present or not. If it was present, we would end the current step and leave everything the way it is. But x is not present, so we insert it by splitting the edge:

Again, this took O(1) time and we update remainder to 1 and the active point to (root,'x',0) as rule 1 states.

But there is one more thing we need to do. We'll call this Rule 2:

If we split an edge and insert a new node, and if that is not the first node created during the current step, we connect the previously inserted node and the new node through a special pointer, a suffix link. We will later see why that is useful. Here is what we get, the suffix link is represented as a dotted edge:

We still need to insert the final suffix of the current step, x. Since the active_length component of the active node has fallen to 0, the final insert is made at the root directly. Since there is no outgoing edge at the root node starting with x, we insert a new edge:

As we can see, in the current step all remaining inserts were made.

We proceed to step 7 by setting #=7, which automatically appends the next character, a, to all leaf edges, as always. Then we attempt to insert the new final character to the active point (the root), and find that it is there already. So we end the current step without inserting anything and update the active point to (root,'a',1).

In step 8,, #=8, we append b, and as seen before, this only means we update the active point to (root,'a',2) and increment remainder without doing anything else, because b is already present. However, we notice (in O(1) time) that the active point is now at the end of an edge. We reflect this by re-setting it to (node1,'\0x',0). Here, I use node1 to refer to the internal node the ab edge ends at.

Then, in step #=9, we need to insert 'c' and this will help us to understand the final trick:

Second extension: Using suffix links

As always, the # update appends c automatically to the leaf edges and we go to the active point to see if we can insert 'c'. It turns out 'c' exists already at that edge, so we set the active point to (node1,'c',1), increment remainder and do nothing else.

Now in step #=10, remainder is 4, and so we first need to insert abcd (which remains from 3 steps ago) by inserting d at the active point.

Attempting to insert d at the active point causes an edge split in O(1) time:

The active_node, from which the split was initiated, is marked in red above. Here is the final rule, Rule 3:

After splitting an edge from an active_node that is not the root node, we follow the suffix link going out of that node, if there is any, and reset the active_node to the node it points to. If there is no suffix link, we set the active_node to the root. active_edge and active_length remain unchanged.

So the active point is now (node2,'c',1), and node2 is marked in red below:

Since the insertion of abcd is complete, we decrement remainder to 3 and consider the next remaining suffix of the current step, bcd. Rule 3 has set the active point to just the right node and edge so inserting bcd can be done by simply inserting its final character d at the active point.

Doing this causes another edge split, and because of rule 2, we must create a suffix link from the previously inserted node to the new one:

We observe: Suffix links enable us to reset the active point so we can make the next remaining insert at O(1) effort. Look at the graph above to confirm that indeed node at label ab is linked to the node at b (its suffix), and the node at abc is linked to bc.

The current step is not finished yet. remainder is now 2, and we need to follow rule 3 to reset the active point again. Since the current active_node (red above) has no suffix link, we reset to root. The active point is now (root,'c',1).

Hence the next insert occurs at the one outgoing edge of the root node whose label starts with c: cabxabcd, behind the first character, i.e. behind c. This causes another split:

And since this involves the creation of a new internal node,we follow rule 2 and set a new suffix link from the previously created internal node:

(I am using Graphviz Dot for these little graphs. The new suffix link caused dot to re-arrange the existing edges, so check carefully to confirm that the only thing that was inserted above is a new suffix link.)

With this, remainder can be set to 1 and since the active_node is root, we use rule 1 to update the active point to (root,'d',0). This means the final insert of the current step is to insert a single d at root:

That was the final step and we are done. There are number of final observations, though:

In each step we move # forward by 1 position. This automatically updates all leaf nodes in O(1) time.

But it does not deal with a) any suffixes remaining from previous steps, and b) with the one final character of the current step.

remainder tells us how many additional inserts we need to make. These inserts correspond one-to-one to the final suffixes of the string that ends at the current position #. We consider one after the other and make the insert. Important: Each insert is done in O(1) time since the active point tells us exactly where to go, and we need to add only one single character at the active point. Why? Because the other characters are contained implicitly (otherwise the active point would not be where it is).

After each such insert, we decrement remainder and follow the suffix link if there is any. If not we go to root (rule 3). If we are at root already, we modify the active point using rule 1. In any case, it takes only O(1) time.

If, during one of these inserts, we find that the character we want to insert is already there, we don't do anything and end the current step, even if remainder>0. The reason is that any inserts that remain will be suffixes of the one we just tried to make. Hence they are all implicit in the current tree. The fact that remainder>0 makes sure we deal with the remaining suffixes later.

What if at the end of the algorithm remainder>0? This will be the case whenever the end of the text is a substring that occurred somewhere before. In that case we must append one extra character at the end of the string that has not occurred before. In the literature, usually the dollar sign $ is used as a symbol for that. Why does that matter? --> If later we use the completed suffix tree to search for suffixes, we must accept matches only if they end at a leaf. Otherwise we would get a lot of spurious matches, because there are many strings implicitly contained in the tree that are not actual suffixes of the main string. Forcing remainder to be 0 at the end is essentially a way to ensure that all suffixes end at a leaf node. However, if we want to use the tree to search for general substrings, not only suffixes of the main string, this final step is indeed not required, as suggested by the OP's comment below.

So what is the complexity of the entire algorithm? If the text is n characters in length, there are obviously n steps (or n+1 if we add the dollar sign). In each step we either do nothing (other than updating the variables), or we make remainder inserts, each taking O(1) time. Since remainder indicates how many times we have done nothing in previous steps, and is decremented for every insert that we make now, the total number of times we do something is exactly n (or n+1). Hence, the total complexity is O(n).

However, there is one small thing that I did not properly explain: It can happen that we follow a suffix link, update the active point, and then find that its active_length component does not work well with the new active_node. For example, consider a situation like this:

(The dashed lines indicate the rest of the tree. The dotted line is a suffix link.)

Now let the active point be (red,'d',3), so it points to the place behind the f on the defg edge. Now assume we made the necessary updates and now follow the suffix link to update the active point according to rule 3. The new active point is (green,'d',3). However, the d-edge going out of the green node is de, so it has only 2 characters. In order to find the correct active point, we obviously need to follow that edge to the blue node and reset to (blue,'f',1).

In a particularly bad case, the active_length could be as large as remainder, which can be as large as n. And it might very well happen that to find the correct active point, we need not only jump over one internal node, but perhaps many, up to n in the worst case. Does that mean the algorithm has a hidden O(n2) complexity, because in each step remainder is generally O(n), and the post-adjustments to the active node after following a suffix link could be O(n), too?

No. The reason is that if indeed we have to adjust the active point (e.g. from green to blue as above), that brings us to a new node that has its own suffix link, and active_length will be reduced. As we follow down the chain of suffix links we make the remaining inserts, active_length can only decrease, and the number of active-point adjustments we can make on the way can't be larger than active_length at any given time. Since active_length can never be larger than remainder, and remainder is O(n) not only in every single step, but the total sum of increments ever made to remainder over the course of the entire process is O(n) too, the number of active point adjustments is also bounded by O(n).