匹配的字符串由ressemblance 2列出字符串、ressemblance

2023-09-11 04:57:19 作者:气势比天高

问题

我有一个字符串2列出。我想找到从我的列表中的最佳匹配对。

I have 2 lists of strings. I want to find the best matching pairs from my lists.

例如,我有那些2列出:

For example, I have those 2 lists:

list1 = {"a1","b1","c1"}
list2 = {"a2","b2","c2"}

我希望得到以下结果:

I want to get the following results:

results = {{"a1,"a2"}, {"b1,"b2"}, {"c1,"c2"}}

其他信息

要比较两个字符串在一起,我想用类似 Levenshtein距离的东西。例如,当我比较A1A2,它给了我比A1与B2,所以A1 + A2将被视为一个更好的匹配。

To compare 2 strings together, I would like to use something similar to the Levenshtein distance. For example, when I compare "a1" with "a2", it gives me a shorter distance than "a1" with "b2", so "a1"+"a2" would be considered a better match.

我变得复杂时,对不同的获取同样的距离的效果。你不能只是采取一个具体项目的最小距离的List1 ,因为的List1 其他项目可能会获得相同的在距离同一项目 list2中

I gets complicated when different pairs gets the same distance results. You can't just take minimum distance for a specific item in list1, because another item in list1 could obtain the same distance with the same item in list2.

问题

你有算法的建议是什么?

Do you have suggestions of algorithms for that?

我在哪里,现在

您最好不要看我的发现第一次,所以你不要被我的工作的影响。

You better not look at my finding first so you don't get influenced by my work.

余计算Levenshtein距离为每个可能的串,并将结果存储在一个2维数组。然后,我建一个一维数组,其中每个元素都有:

I calculate the Levenshtein distance for each possible pair of string and store the results in a 2-dimension array. Then I build a single dimension array where each element has:

的对(我在我的2维数组Ĵ指数) 的距离

然后,我用距离元件进行排序该数组。

Then I sort this array by using distance element.

最后,我经过排序后的数组和解决一个共同的距离物品放在一起(所有距离== 0,再全部距离== 1,等...)。每到这时,我解决的一个元素,我在二维数组标记它,这样我就可以快速跳过我的有序阵列的解决了项目。

Finally, I go through the sorted array and resolve the items with a common distance together (all distance==0 first, then all distance==1, etc...). Every time, I resolve an element, I mark it in my 2D array, so I can quickly skip the resolved items in my sorted array.

我想我可以比这更好的解决方案。它可能不是最有效的时间和空间。

I think I can better than this solution. It may not the most efficient in time and space.

推荐答案

一旦你建立了指标要使用跟踪的两个字符串之间的距离,无论是Levenshtein距离或者另外一个,你可以使用匈牙利算法来解决你的问题。

Once you have established the metric you want to use to keep track of the "distance" between two strings, be it the Levenshtein distance or another one, you can use the Hungarian algorithm to solve your problem.

我个人从来没有实现它,但是维基百科包括几个环节可能会有所帮助。

I personally have never implement it, but Wikipedia includes several links that might be of help.