Is there an edit distance algorithm that takes "chunk transposition" into account?

2023-09-11 03:21:08 Author: 道不尽的沧桑

I put "chunk transposition" in quotes because I don't know whether there is a technical term for it, or what that term would be. Just knowing whether there is a technical term for the process would be very helpful.

The Wikipedia article on edit distance gives some good background on the concept.

By taking "chunk transposition" into account, I mean that

Turing, Alan.

should match

Alan Turing

more closely than it matches

Turing Machine

I.e., the distance calculation should detect when substrings of the text have simply been moved within the text. This is not the case with the common Levenshtein distance formula.
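To make the problem concrete, here is a minimal sketch of the textbook Levenshtein distance (insert, delete, substitute) applied to the three example strings. The function name `levenshtein` and the variable names are just illustrative choices, not from any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match when cost is 0
            ))
        previous = current
    return previous[-1]


query = "Turing, Alan."
d_name = levenshtein(query, "Alan Turing")
d_machine = levenshtein(query, "Turing Machine")

# Plain Levenshtein treats the moved chunk as a run of unrelated edits,
# so "Turing Machine" comes out as the closer string.
print(d_name, d_machine, d_machine < d_name)
```

Because the shared prefix "Turing" lines up for free against "Turing Machine" but has to be paid for when it moves to the end of "Alan Turing", the ranking comes out the opposite of what is wanted.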

The strings will be a few hundred characters long at most; they are author names or lists of author names, which could be in a variety of formats. I'm not doing DNA sequencing (though I suspect people who do will know a bit about this subject).

Recommended answer

Have a look at the Jaccard distance metric (JDM). It's an oldie-but-goodie that's pretty adept at token-level discrepancies such as last name first versus first name first. For two string comparands, the JDM calculation is simply the number of unique characters the two strings have in common divided by the total number of unique characters between them (in other words, the intersection over the union). For example, given the two arguments "JEFFKTYZZER" and "TYZZERJEFF", the numerator is 7 and the denominator is 8, yielding a value of 0.875. Strictly speaking, that ratio is the Jaccard similarity; the corresponding distance is 1 minus it, here 0.125. My choice of characters as tokens is not the only one available, BTW: n-grams are often used as well.
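A minimal sketch of the calculation described above, in Python. The helper names (`jaccard_similarity`, `word_tokens`, `char_ngrams`) are made up for illustration, and the word tokenizer is only one possible assumption about how to split author names.

```python
import re


def jaccard_similarity(tokens_a, tokens_b) -> float:
    """Size of the intersection over size of the union of two token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty inputs are treated as identical (a convention)
    return len(set_a & set_b) / len(union)


def word_tokens(text: str):
    """Lowercased runs of word characters; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


def char_ngrams(text: str, n: int = 2):
    """Overlapping character n-grams, as an alternative token choice."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Characters as tokens, as in the example above: 7 shared / 8 total.
print(jaccard_similarity("JEFFKTYZZER", "TYZZERJEFF"))  # 0.875

# Words as tokens: order no longer matters, so the swapped name matches fully.
name_score = jaccard_similarity(word_tokens("Turing, Alan."),
                                word_tokens("Alan Turing"))
machine_score = jaccard_similarity(word_tokens("Turing, Alan."),
                                   word_tokens("Turing Machine"))
print(name_score, machine_score)
```

Since sets ignore order, a moved chunk costs nothing here. The flip side is that repeated tokens and genuine reorderings are ignored too, so which token type suits the data (characters, n-grams, or words) is a judgment call.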