Levenshtein距离,分别跟踪插入/缺失/取代缺失、距离、Levenshtein

2023-09-11 05:17:59 作者:Curtain 私念

在 Levenshtein距离维基百科的文章说,在其的可能的修改部分,即[我们]可以单独存放插入,删除和替换的数目。

The Wikipedia article on Levenshtein distance says, in its possible modifications section, that "[w]e can store the number of insertions, deletions, and substitutions separately".

这是怎么做的?我创建了文章,其中概述了基于矩阵的动态规划解决方案的实现矩阵的每个单元具有缺失/插入/替换一个单独的值,但对我的生活中,我似乎无法找出逻辑。

How is this done? I created an implementation of the matrix-based dynamic-programming solution outlined in the article, where each cell of the matrix has a separate value for deletions/insertions/substitutions, but for the life of me I can't seem to work out the logic.

直观地,似乎是,如果我发现需要做的缺失,并在同一步骤的插入,那么这些应该成为一个替代代替

Intuitively, it seems that if I find a need to do a deletion and an insertion in the same step, then those should become a substitution instead.

要做出我想要完全清楚,这里是源字符串垫子,目标字符串抓的样品基质。我预计这将需要一个替代(即改变M为C)和两个插入(即追加CH)。每个小区是缺失/插入/替换

To make what I want completely clear, here's a sample matrix for the source string "mat" and the target string "catch". I expect this to require one substitution (i.e. changing "m" to "c") and two insertions (i.e. appending "ch"). Each cell is "deletions/insertions/substitutions".


           C     A     T     C     H
  +-----+-----+-----+-----+-----+-----+
  |D/I/S|0/1/0|0/2/0|0/3/0|0/4/0|0/5/0|
  +-----+-----+-----+-----+-----+-----+
M |1/0/0|0/0/1|0/1/1|0/2/1|0/3/1|0/4/1|
  +-----+-----+-----+-----+-----+-----+
A |2/0/0|1/0/1|0/0/1|0/1/1|0/2/1|0/3/1|
  +-----+-----+-----+-----+-----+-----+
T |3/0/0|2/0/1|1/0/1|0/0/1|0/1/1|0/2/1|
  +-----+-----+-----+-----+-----+-----+

有没有人制定了这种算法?

Has anyone worked out this algorithm?

推荐答案

对于目标小区x,我们需要找到最小的:

For a target cell x, we need to find the minimum of:

this + substitution | this + deletion
--------------------+----------------
this + insertion    |       x

从左上角的是,当我们还没有处理,无论是价值还没有,所以我们必须同时处理两种,因此,它是一种替代。

From top-left is when we haven't processed to either value yet, so we must process both at the same time, thus it's a substitution.

从左边的是,当我们还没有处理的目标值呢,所以它的插入。

From left is when we haven't processed the target value yet, so it's insertion.

从顶部的时候,我们还没有处理的源值呢,所以它删除。

From top is when we haven't processed the source value yet, so it's deletion.

要单独存放的价值,你需要一个三维数组:

To store the values separately, you'll need a 3D array:

array[targetSize+1][inputSize+1][3]

然后,对于每个所述3 previous细胞,在添加1置换,缺失或插入(如上所示),然后根据对取代,缺失和插入的数量计算总的费用,并且找到最小的3成本。然后从细胞赋予最小到当前小区(与1操作所添加)的值复制

Then, for each of the 3 previous cells, you add 1 substitution, deletion or insertion (as indicated above), then calculate the total cost based on the number of substitutions, deletions and insertions, and find the minimum of the 3 costs. Then copy the values from the cell giving the minimum to the current cell (with the 1 operation added).

因此​​,对于:

0/1/0|0/2/0
-----+-----
0/0/1|  x

让我们假定1为每个操作成本。

Let's assume a cost of 1 for each operation.

我们计算: 0/1/0 + 1替代= 0/1/1 ,成本= 2 0/0/1 + 1插= 0/1/1 ,成本= 2 0/2/0 + 1 =删除 1/2/0 ,成本= 3

We calculate: 0/1/0 + 1 substitution = 0/1/1, cost = 2 0/0/1 + 1 insertion = 0/1/1, cost = 2 0/2/0 + 1 deletion = 1/2/0, cost = 3

然后我们挑选的费用2的任放 0/1/1 进入新的细胞。

Then we pick either of the cost 2's and put 0/1/1 into the new cell.

我希望帮助。