SQL - Similarity between two strings of different lengths

Tags: similarity, string, length, different

2023-09-11 22:58:12 Author: 自我清欢

I have a SQL Server table of products, and each product has a description that is publicly available on our website. I want to prevent, or at least warn our users when, a description is too similar to another product's description. Each product's description length can greatly vary.

I'd like to query for products whose descriptions share duplicate or similar paragraphs or blocks of text with one another. For example, string A has a lot of unique content but shares a similar or identical paragraph with string B. However, I'm not sure which similarity algorithm is best to use:

The Levenshtein distance and Jaro-Winkler distance algorithms appear to only work well with short strings.
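To illustrate the concern with edit distance on strings of very different lengths, here is a minimal sketch in Python (not SQL; the function name `levenshtein` and the sample strings are made up for illustration). When a short text appears verbatim inside a much longer one, the distance is still at least the difference in length, so the shared block barely registers in the score.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical descriptions: the short one is contained verbatim in the long one.
short_text = "free shipping on all orders"
long_text = "Our new blender is great. free shipping on all orders. Buy today!"

distance = levenshtein(short_text, long_text)
# The distance equals the length difference, even though the whole
# short text is present in the long one.
print(distance, len(long_text) - len(short_text))
```

Normalizing by the longer length would therefore report these two texts as mostly different, which is the opposite of what the use case needs.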

I'm not sure the longest common subsequence algorithm handles large differences very well. That is, it appears to ignore any gaps between matching characters, so it will find any similar combination of characters in sequence.
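The gap behavior described above can be seen in a small Python sketch (an illustration only; the helper names `lcs_subsequence_length` and `longest_common_block` are hypothetical). It contrasts the longest common subsequence, which allows gaps, with the longest contiguous shared block, found here with the standard library's `difflib.SequenceMatcher.find_longest_match`.

```python
from difflib import SequenceMatcher


def lcs_subsequence_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (gaps between matches allowed)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def longest_common_block(a: str, b: str) -> str:
    """Longest contiguous run of characters present in both strings."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


# Scattered single-character matches add up for the subsequence...
print(lcs_subsequence_length("a-b-c-d", "abcd"))      # 4
# ...while the longest contiguous block is only one character long.
print(len(longest_common_block("a-b-c-d", "abcd")))   # 1
```

For detecting a copied paragraph, a contiguous (or nearly contiguous) match is closer to the intent than a subsequence that can be assembled from scattered characters.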

Fuzzy hashing sort of sounds like what I'm looking for, but I'm not just looking for duplicate content with subtle differences. I'm also looking for duplicate content with subtle differences that is embedded within an otherwise unique block of text. And I have no idea how to implement fuzzy hashes in SQL. SOUNDEX() and DIFFERENCE() appear to use a kind of fuzzy hashing, but they are quite imprecise for my use case.
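One way to make the requirement above concrete is word shingling with a containment score. This technique is not mentioned in the post; it is swapped in here purely as an illustration, written in Python rather than SQL, and the names `shingles` and `containment` as well as the sample texts are hypothetical. The sketch splits each description into overlapping runs of `k` words and measures what fraction of the smaller set of runs also appears in the other description.

```python
from __future__ import annotations

import re


def shingles(text: str, k: int = 5) -> set[tuple[str, ...]]:
    """Set of overlapping k-word runs, lowercased and stripped of punctuation."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(a: str, b: str, k: int = 5) -> float:
    """Fraction of the smaller shingle set that also occurs in the other text."""
    sa, sb = shingles(a, k), shingles(b, k)
    smaller = min(sa, sb, key=len)
    if not smaller:
        return 0.0  # one text is shorter than k words
    return len(sa & sb) / len(smaller)


# Hypothetical descriptions that share one paragraph.
shared = "this paragraph is copied word for word into both product descriptions"
desc_a = "unique intro about widgets and their many uses " + shared
desc_b = shared + " and some different closing remarks here"

score = containment(desc_a, desc_b, k=3)
print(round(score, 2))
```

Because the score is taken relative to the smaller set, a copied paragraph still stands out when the two descriptions differ greatly in length. The choice of `k` and of any warning threshold is an assumption that would need tuning on real data.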

Ideally the similarity SQL function would be fast, but I could store cached similarity values in another table and schedule a job to occasionally update.

What is the best algorithm or SQL (or CLR integration) implementation to do this?

Recommended answer