SQL - Similarity between two strings of different lengths

2023-09-11 22:58:12 Author: 自我清欢

I have a SQL Server table of products, and each product has a description that is publicly available on our website. I want to prevent, or at least warn our users, when a description is too similar to another product's description. The length of each product's description can vary greatly.

I'd like to query for products whose descriptions share duplicate or similar paragraphs or blocks of text. For example, string A has a bunch of unique content but shares a similar or identical paragraph with string B. However, I'm not sure which similarity algorithm is best to use:

The Levenshtein distance and Jaro-Winkler distance algorithms appear to work well only with short strings.

I'm not sure the longest common subsequence algorithm handles large differences very well. It appears to ignore any gap between two matching characters, so it will find a common sequence even when the matching characters are scattered far apart.
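To make the concern about gaps concrete, here is a minimal Python sketch (not SQL, and not from the original post) that contrasts the longest common subsequence, which allows gaps, with the longest common contiguous substring, which does not. The function names are made up for illustration.

```python
def lcs_subsequence_len(a: str, b: str) -> int:
    """Length of the longest common subsequence (gaps between matches allowed)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence that ends before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip a character in one string or the other.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def longest_common_substring_len(a: str, b: str) -> int:
    """Length of the longest common contiguous run (no gaps allowed)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # A run only grows if the previous characters matched too.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    a, b = "abcdef", "axbxcxdxexf"
    print(lcs_subsequence_len(a, b))           # 6: every character of a is found, with gaps
    print(longest_common_substring_len(a, b))  # 1: no two characters of a are adjacent in b
```

The subsequence score is the maximum possible here even though the two strings share no run longer than one character, which is the behavior described above. A contiguous measure may be closer to "shares a paragraph", though both versions are quadratic in the string lengths.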

Fuzzy hashing sounds like what I'm looking for, but I'm not just looking for duplicate content with subtle differences. I'm also looking for duplicate content with subtle differences that is embedded within an otherwise unique block of text. And I have no idea how to implement fuzzy hashes in SQL. SOUNDEX() and DIFFERENCE() appear to use fuzzy hashing, but they are quite imprecise for my use case.
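One way to approximate "a duplicate block embedded in unique text" is to compare overlapping word n-grams (often called shingles) instead of whole strings. This is a different technique from the fuzzy hashing mentioned above, and it is only a rough Python sketch: the function names, the shingle size k, and the sample descriptions are all assumptions chosen for illustration.

```python
import re


def word_shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word windows in text, lowercased, punctuation ignored."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def shared_fraction(a: str, b: str, k: int = 3) -> float:
    """Fraction of a's shingles that also occur somewhere in b."""
    shingles_a = word_shingles(a, k)
    if not shingles_a:
        return 0.0
    return len(shingles_a & word_shingles(b, k)) / len(shingles_a)


if __name__ == "__main__":
    shared = "this blender has a powerful motor and five speed settings"
    desc_a = "our red model is compact " + shared
    desc_b = shared + " and it ships in blue only"
    # A high value means a large part of desc_a reappears in desc_b,
    # even though both descriptions also contain unique text.
    print(shared_fraction(desc_a, desc_b))
```

Because the score is computed relative to one description, it does not collapse when the other description is much longer. The threshold for "too similar" would need tuning on real data.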

Ideally the SQL similarity function would be fast, but I could store cached similarity values in another table and schedule a job to update them occasionally.

What is the best algorithm / SQL (or CLR integration) implementation to do this?

Recommended answer

Some time ago I had to join group names by fuzzy string matching. I tried about 40 different algorithms, but none was good enough, even though the group names differed only by some spelling mistakes, missing whitespace, and an occasional _mLF added at the end.

So if you are attempting something similar, I strongly suggest you stop right now and send the data (in my case an Excel file) back to the users for correction, where it belongs.

If you're really just interested in comparing strings, this link may be just what you need: http://anastasiosyal.com/POST/2009/01/11/18.ASPX

I found that the Jaro-Winkler function yielded the best results in my case, but you can test that for yourself.
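If you want to try that before wiring anything into SQL Server, the following is a plain-Python sketch of the textbook Jaro-Winkler definition. It is not the implementation from the linked page, and the prefix weight p = 0.1 with a 4-character prefix cap is the commonly used default, assumed here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both strings' matched characters in order; mismatches are half-transpositions.
    k = 0
    half_transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                half_transpositions += 1
            k += 1
    t = half_transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


if __name__ == "__main__":
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

Note that the matching window and the prefix bonus are both tuned for short strings such as names, which fits the observation in the question that this measure is a poor fit for long descriptions.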