Similar string algorithm

Tags: string, algorithm, similar

2023-09-10 22:43:15 Author: 敷衍〞怎么演

I'm looking for an algorithm, or at least a theory of operation, for finding similar text in two or more different strings...

Much like the question posed here: Algorithm to find similar text, the difference being that my text strings will only ever be a handful of words.

Like say I have a string: "Into the clear blue sky" and I'm doing a compare with the following two strings: "The color is sky blue" and "In the blue clear sky"

I'm looking for an algorithm that can match the text in the two and decide how closely they match. In my case, spelling and punctuation are NOT going to be important: I don't want them to affect the ability to discover the real text. In the above example, if the color reference is stored as 'sky-blue', I want it to still be able to match. However, the 3rd string listed should be a BETTER match than the second, etc.
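To make the punctuation requirement concrete, here is a minimal sketch of a normalization step that could run before any comparison. The function name `normalize` and the ASCII-only rule (lowercase, then treat every run of non-alphanumeric characters as a separator) are assumptions for illustration, not something taken from the question.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into words, ignoring punctuation.

    Assumption: only ASCII letters and digits count as word characters,
    so quotes, hyphens and other punctuation all act as separators.
    """
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


# The quoted, hyphenated form and the plain form yield the same words.
print(normalize("The color is 'sky-blue'"))
print(normalize("The color is sky blue"))
```

Any similarity measure can then be applied to the normalized words (or to the words joined back with single spaces) rather than to the raw strings.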

I'm sure places like Google probably use something similar with the "Did you mean:" feature...

* EDIT * While talking with a friend, I learned that he had worked with a guy who wrote a paper on this topic. I thought I might share it with everyone reading this, as there are some really good methods and processes described in it...

Here's the link to his paper. I hope it is helpful to those reading this question and to anyone interested in similar-string algorithms.

Recommended answer

I can't mark two answers here, so I'm going to answer and mark my own. The Levenshtein distance appears to be the correct method in most cases for this. But it is worth mentioning j_random_hacker's answer as well. I have used an implementation of LZMA to test his theory, and it proves to be a sound solution. In my original question I was looking for a method for short strings (2 to 200 chars), where the Levenshtein distance algorithm will work. What I did not mention in the question was the need to compare two larger strings (in this case, text files of moderate size) and to perform a quick check to see how similar the two are. I believe that this compression technique will work well there, but I have yet to study at which point one becomes better than the other, in terms of the size of the sample data and the speed/cost of the operation in question. I think a lot of the answers given to this question are valuable and worth mentioning for anyone looking to solve a similar string ordeal like I'm doing here. Thank you all for your great answers, and I hope they can be used to serve others well too.
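For anyone who wants to try both approaches, here is a rough sketch in Python. The Levenshtein part is the textbook dynamic-programming version. The compression part is only my guess at how such a check could be written: it uses the normalized compression distance formula with Python's standard `lzma` module, which is an assumption and not necessarily what j_random_hacker's answer or my own test used. All function names are made up for the example.

```python
import lzma


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character inserts,
    deletes and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale the distance to 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def compressed_size(data: bytes) -> int:
    """Size in bytes of the LZMA-compressed data (default xz container)."""
    return len(lzma.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance (assumed formula).

    Lower values mean the inputs share more content. The fixed container
    overhead makes this unreliable for very short inputs.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    query = "into the clear blue sky"
    for candidate in ("the color is sky blue", "in the blue clear sky"):
        print(candidate, levenshtein_similarity(query, candidate))
```

The edit distance costs time proportional to the product of the two lengths, which is fine for strings of a few hundred characters but gets slow for whole files. The compression check needs one compression pass per input plus one for the concatenation, so it is better suited to the larger files, though where the crossover lies still has to be measured.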