Levenshtein distance: how to better handle words swapping positions?

2023-09-12 21:18:19 Author: 我若不死,尔等永远都是妃

I've had some success comparing strings using the PHP levenshtein function.

However, for two strings which contain substrings that have swapped positions, the algorithm counts those as whole new substrings.

For example:

levenshtein("The quick brown fox", "brown quick The fox"); // 10 differences

would be treated as having less in common than:

levenshtein("The quick brown fox", "The quiet swine flu"); // 9 differences

I'd prefer to see that the first two were more similar.

How could I go about coming up with a comparison function that can identify substrings which have switched position as being distinct from edits?

One possible approach I've thought of is to put all the words in the string into alphabetical order before the comparison. That takes the original order of the words completely out of the comparison. A downside to this, however, is that changing just the first letter of a word can create a much bigger disruption than changing a single letter should cause.
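The sort-then-compare idea can be sketched as follows. This is a minimal illustration in Python rather than PHP; the helper names `levenshtein`, `sorted_words` and `order_insensitive_distance` are made up for the example, and the case-insensitive sort is an assumption, not something the question specifies.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def sorted_words(text):
    """Put the words into alphabetical order (assumption: ignoring case)."""
    return " ".join(sorted(text.split(), key=str.lower))


def order_insensitive_distance(a, b):
    """Levenshtein distance after sorting the words of both strings."""
    return levenshtein(sorted_words(a), sorted_words(b))


# Word order no longer matters:
print(order_insensitive_distance("The quick brown fox", "brown quick The fox"))
# The downside: one changed first letter moves the whole word.
print(order_insensitive_distance("apple zebra", "zpple zebra"))
```

The second call shows the weakness mentioned above: a single-letter edit changes the sort position of the word, so the distance comes out far larger than one.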

What I'm trying to achieve is to compare two facts about people, which are free-text strings, and decide how likely these facts are to indicate the same fact. The facts might be the school someone attended, or the name of their employer or publisher, for example. Two records may have the same school spelled differently, words in a different order, extra words, etc., so the matching has to be somewhat fuzzy if we are to make a good guess that they refer to the same school. So far it is working very well for spelling errors (I am using a phonetic algorithm similar to metaphone on top of all this), but very poorly if you switch the order of words around, which seems common in a school name: "xxx college" vs "college of xxx".

Recommended answer

Use N-grams, which support multiple-character transpositions across the whole text.

The general idea is that you split the two strings in question into all the possible 2-3 character substrings (n-grams) and treat the number of shared n-grams between the two strings as their similarity metric. This can be then normalized by dividing the shared number by the total number of n-grams in the longer string. This is trivial to calculate, but fairly powerful.

Regarding the example sentences:

A. The quick brown fox
B. brown quick The fox
C. The quiet swine flu

A and B share 18 2-grams

A and C share only 8 2-grams

out of 20 total possible.

This has been discussed in detail in the Gravano et al. paper.
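A minimal sketch of the n-gram comparison described above, written in Python for brevity. The function names `ngrams` and `ngram_similarity` are illustrative, and the sketch assumes lower-casing, no padding at the string ends, and counting repeated n-grams with multiplicity. Those choices affect the exact counts, so the numbers it produces need not match the figures quoted above.

```python
from collections import Counter


def ngrams(text, n=2):
    """Multiset of character n-grams (assumption: lower-cased, no padding)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_similarity(a, b, n=2):
    """Shared n-grams divided by the n-gram count of the longer string."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    longest = max(sum(grams_a.values()), sum(grams_b.values()))
    return shared / longest if longest else 0.0  # 0.0 if both are too short


A = "The quick brown fox"
B = "brown quick The fox"
C = "The quiet swine flu"

print(ngram_similarity(A, B))  # high: only the bigrams at the word seams differ
print(ngram_similarity(A, C))  # much lower
```

Under these assumptions each sentence has 18 bigrams; A and B share 17 of them, while A and C share 7, so the swapped-word pair scores as far more similar.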

A not so trivial alternative, but one grounded in information theory, would be to use term frequency–inverse document frequency (tf-idf) to weigh the tokens, construct sentence vectors and then use cosine similarity as the similarity metric.

The algorithm is:

1. Calculate 2-character token frequencies (tf) per sentence.
2. Calculate inverse sentence frequencies (idf): the logarithm of the number of all sentences in the corpus (in this case 3) divided by the number of sentences in which a particular token appears. In this case "th" is in all sentences, so it has zero information content (log(3/3) = 0).
3. Produce the tf-idf matrix by multiplying corresponding cells in the tf and idf tables.
4. Finally, calculate the cosine similarity matrix for all sentence pairs, cos(A, B) = (A · B) / (|A| |B|), where A and B are the vectors of tf-idf weights for the corresponding tokens. The range is from 0 (not similar) to 1 (equal).
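The steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a reference implementation: the names `bigrams`, `tfidf_vectors` and `cosine` are made up for the example, the text is lower-cased, raw counts are used as tf, and idf uses the natural logarithm of (number of sentences / number of sentences containing the token).

```python
import math
from collections import Counter


def bigrams(text):
    """Step 1: 2-character token frequencies (assumption: lower-cased)."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def tfidf_vectors(sentences):
    """Steps 2 and 3: weight each token's tf by its idf."""
    tfs = [bigrams(s) for s in sentences]
    n = len(sentences)
    # Number of sentences in which each token occurs at least once.
    df = Counter(token for tf in tfs for token in tf)
    idf = {token: math.log(n / count) for token, count in df.items()}
    return [{tok: cnt * idf[tok] for tok, cnt in tf.items()} for tf in tfs]


def cosine(u, v):
    """Step 4: cosine similarity of two sparse weight vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vec_a, vec_b, vec_c = tfidf_vectors([
    "The quick brown fox",
    "brown quick The fox",
    "The quiet swine flu",
])
print(vec_a["th"])            # 0.0, because "th" occurs in every sentence
print(cosine(vec_a, vec_b))   # close to 1
print(cosine(vec_a, vec_c))   # 0.0 on this tiny corpus
```

With only three sentences, every token that A shares with C also occurs in B and therefore gets zero weight, which is why that pair scores exactly 0 here. A larger corpus would give less extreme weights.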

Levenshtein modifications and Metaphone

Regarding other answers: the Damerau–Levenshtein modification supports only the transposition of two adjacent characters. Metaphone was designed to match words that sound the same, not for similarity matching.
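To illustrate the limitation, here is a sketch of the optimal string alignment distance, a common restricted variant of Damerau–Levenshtein. It is written in Python, and the function name `osa_distance` is chosen for the example. A swap of two adjacent characters costs one edit, but a swap of whole words gains nothing from the transposition rule.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters only.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("teh", "the"))                        # adjacent swap
print(osa_distance("xxx college", "college of xxx"))     # word swap
```

The first call returns 1, since "eh" and "he" are an adjacent transposition. The second stays large, because moving a whole word is still counted character by character.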