String similarity: how exactly does Bitap work?

2023-09-11 03:30:03 Author: 寂寞让你更快乐

I'm trying to wrap my head around the Bitap algorithm, but am having trouble understanding the reasons behind the steps of the algorithm.

I understand the basic premise of the algorithm, which is (correct me if I'm wrong):

Two strings:     PATTERN (the desired string)
                 TEXT (the String to be perused for the presence of PATTERN)

Two indices:     i (currently processing index in PATTERN), 1 <= i < PATTERN.SIZE
                 j (arbitrary index in TEXT)

Match state S(x): S(PATTERN(i)) = S(PATTERN(i-1)) && PATTERN[i] == TEXT[j], S(0) = 1

In english terms, PATTERN.substring(0,i) matches a substring of TEXT if the previous substring PATTERN.substring(0, i-1) was successfully matched and the character at PATTERN[i] is the same as the character at TEXT[j].
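For reference, the recurrence above can be written out directly with a list of booleans before any bit tricks are involved. This is only a sketch of the premise as stated in the question; the function name `naive_prefix_match` and the choice to return the index where the match ends are assumptions made for illustration.

```python
def naive_prefix_match(pattern: str, text: str) -> int:
    """Return the index in text where the first match of pattern ends, or -1."""
    m = len(pattern)
    # matched[i] is True when pattern[:i] matches the text ending at the
    # character just consumed; matched[0] (the empty prefix) is always True.
    matched = [True] + [False] * m
    for j, ch in enumerate(text):
        # Walk right to left so matched[i - 1] still holds the previous step's value.
        for i in range(m, 0, -1):
            matched[i] = matched[i - 1] and pattern[i - 1] == ch
        if matched[m]:
            return j
    return -1


print(naive_prefix_match("ababc", "abdabababc"))
```

Bitap packs the `matched` list into the bits of one integer so that the inner loop becomes a single shift and OR.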

What I don't understand is the bit-shifting implementation of this. The official paper detailing this algorithm basically lays it out, but I can't seem to visualize what's supposed to go on. The algorithm specification is only the first 2 pages of the paper, but I'll highlight the important parts:

Here is the bit-shifting version of the concept:

Here is T[text] for a sample search string:

And here is a trace of the algorithm.

Specifically, I don't understand what the T table signifies, and the reason behind ORing an entry in it with the current state.

I'd be grateful if anyone can help me understand what exactly is going on.

Recommended answer

T is slightly confusing because you would normally number positions in the pattern from left to right:

0 1 2 3 4
a b a b c

...whereas bits are normally numbered from right to left.

But writing the pattern backwards above the bits makes it clear:


  bit: 4 3 2 1 0

       c b a b a
T[a] = 1 1 0 1 0

       c b a b a
T[b] = 1 0 1 0 1

       c b a b a
T[c] = 0 1 1 1 1

       c b a b a
T[d] = 1 1 1 1 1

Bit n of T[x] is 0 if x appears in position n, or 1 if it does not.
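As a quick check of that rule, here is a minimal Python sketch that builds the table for the pattern `ababc`. The helper name `build_table` is made up for illustration, and characters that never occur in the pattern (such as `d`) are simply absent from the dictionary and treated as all ones.

```python
def build_table(pattern: str) -> dict[str, int]:
    """Map each pattern character to a mask whose bit n is 0 iff pattern[n] is that character."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    table = {}
    for n, ch in enumerate(pattern):
        # Start from all ones and clear bit n for every position n where ch occurs.
        table[ch] = table.get(ch, all_ones) & ~(1 << n)
    return table


table = build_table("ababc")
for ch in "abcd":
    # Characters missing from the table never match, so they default to all ones.
    print(f"T[{ch}] =", format(table.get(ch, 0b11111), "05b"))
```

The printed masks should be the same as the rows above: `11010`, `10101`, `01111` and `11111`.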

Equivalently, you can think of this as saying that if the current character in the input string is x, and you see a 0 in position n of T[x], then you can only possibly be matching the pattern if the match started n characters previously.

Now to the matching procedure. A 0 in bit n of the state means that we started matching the pattern n characters ago (where 0 is the current character). Initially, nothing matches.

  [start]
1 1 1 1 1

As we consume characters trying to match, the state is shifted left (which shifts a zero into the bottom bit, bit 0) and OR-ed with the table entry for the current character. The first character is a; shifting left and OR-ing in T[a] gives:

        a
1 1 1 1 0

The 0 bit that was shifted in is preserved, because a current character of a can begin a match of the pattern. For any other character, the bit would have been set to 1.
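In code, that single step might look like the sketch below. The function name `step` is made up for illustration, and the explicit mask to `m` bits is an assumption needed only because Python integers are unbounded; in a fixed-width register the top bit would simply fall off.

```python
def step(state: int, char_mask: int, m: int) -> int:
    """Shift a 0 into bit 0, then OR in the table entry for the current character."""
    all_ones = (1 << m) - 1
    # Keep only the low m bits so the state stays as wide as the pattern.
    return ((state << 1) | char_mask) & all_ones


start = 0b11111
after_a = step(start, 0b11010, 5)  # T[a] for "ababc": the shifted-in 0 survives
after_d = step(start, 0b11111, 5)  # T[d]: every bit is forced back to 1
print(format(after_a, "05b"))
print(format(after_d, "05b"))
```

The first line printed should be `11110` and the second `11111`.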

The fact that bit 0 of the state is now 0 means that we started matching the pattern on the current character; continuing, we get:

      a b
1 1 1 0 1

...because the 0 bit has been shifted left - think of it as saying that we started matching the pattern 1 character ago - and T[b] has a 0 in the same position, telling us that seeing a b in the current position is good if we started matching 1 character ago.

    a b d
1 1 1 1 1

d can't match anywhere; all the bits get set back to 1.

  a b d a
1 1 1 1 0

As before.

a b d a b
1 1 1 0 1

As before.

b d a b a
1 1 0 1 0

a is good if the match started either 2 characters ago or on the current character.

d a b a b
1 0 1 0 1

b is good if the match started either 1 or 3 characters ago. The 0 in bit 3 means that we've almost matched the whole pattern...

a b a b a
1 1 0 1 0

...but the next character is a, which is no good if the match started 4 characters ago. However, shorter matches might still be good.

b a b a b
1 0 1 0 1

Still looking good.

a b a b c
0 1 1 1 1

Finally, c is good if the match started 4 characters before. The fact that a 0 has made it all the way to the top bit means that we have a match.
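Putting the pieces together, the whole trace can be reproduced with a short Python sketch. This is an illustration of the exact-match procedure described above, not code from the paper; the name `bitap_search`, the choice to return the start index of the first match, and the handling of an empty pattern are all assumptions.

```python
def bitap_search(pattern: str, text: str) -> int:
    """Return the start index of the first exact match of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0  # assumption: an empty pattern matches at position 0
    all_ones = (1 << m) - 1

    # Bit n of table[x] is 0 iff pattern[n] == x.
    table = {}
    for n, ch in enumerate(pattern):
        table[ch] = table.get(ch, all_ones) & ~(1 << n)

    state = all_ones  # initially, nothing matches
    for j, ch in enumerate(text):
        # Shift a 0 into bit 0, then OR in the entry for the current character.
        state = ((state << 1) | table.get(ch, all_ones)) & all_ones
        print(ch, format(state, f"0{m}b"))
        # A 0 in the top bit means a match that started m - 1 characters ago.
        if (state & (1 << (m - 1))) == 0:
            return j - m + 1
    return -1


print(bitap_search("ababc", "abdabababc"))
```

Run on the text `abdabababc`, the printed states should follow the trace above line by line, ending with `01111` when the final `c` is consumed, and the returned start index is 5.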