How do I detect duplicate data?

2023-09-11 03:01:22 Author: 犯賤的祈求*≡

I have a simple contacts database, but I'm having problems with users entering duplicate data. I have implemented a simple data comparison, but unfortunately the duplicated entries are not exactly the same. For example, names are misspelled, or one person enters 'Bill Smith' and another enters 'William Smith' for the same person.

So is there some sort of algorithm that can give a percentage for how similar an entry is to another?

Solution

So is there some sort of algorithm that can give a percentage for how similar an entry is to another?

Algorithms such as Soundex and edit distance (as suggested in a previous post) can solve some of your problems. However, if you are serious about cleaning your data, they will not be enough. As others have stated, "Bill" does not sound anything like "William".
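To make the "percentage" idea concrete, here is a minimal Python sketch of an edit-distance score. The function names levenshtein and similarity_percent are made up for this illustration, and dividing the distance by the longer string's length is just one common way to turn it into a percentage, not the only one.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b.

    Counts the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,          # delete ca
                current[j - 1] + 1,       # insert cb
                previous[j - 1] + cost,   # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Score two strings from 0 to 100, ignoring case and outer spaces."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are identical
    return 100.0 * (1 - levenshtein(a, b) / longest)


if __name__ == "__main__":
    for left, right in [("Bill", "Billl"), ("Bill", "William")]:
        print(left, right, round(similarity_percent(left, right), 1))
```

A typo like "Billl" scores high against "Bill" (one edit over five characters), while "Bill" against "William" needs four edits over seven characters and lands below 50%. That gap is why a mapping table is still needed for nicknames.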

The best solution I have found is to use a reduction algorithm and a mapping table to reduce each name to its root name.

To your regular contacts table (Person in this example), add root versions of the names, e.g. Person (Firstname, RootFirstName, Surname, RootSurname, ...)

Now create a mapping table: FirstNameMappings (Firstname PRIMARY KEY, Rootname)

Populate your mapping table with: INSERT IGNORE INTO FirstNameMappings (Firstname, Rootname) SELECT Firstname, 'UNDEFINED' FROM Person

This adds every first name in your Person table to the mapping table with a Rootname of 'UNDEFINED'. Because Firstname is the primary key, IGNORE skips names that are already mapped.
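The table setup and the populate step can be tried out with a small Python sketch using the standard sqlite3 module. Some assumptions: the Id column, the sample rows and the helper names (build_demo_db, populate_mapping, unmapped_names) are invented for the demo, and SQLite spells the statement INSERT OR IGNORE where MySQL uses INSERT IGNORE.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the two tables and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Person (
            Id            INTEGER PRIMARY KEY,  -- assumed surrogate key
            Firstname     TEXT NOT NULL,
            RootFirstName TEXT,
            Surname       TEXT NOT NULL,
            RootSurname   TEXT
        );
        CREATE TABLE FirstNameMappings (
            Firstname TEXT PRIMARY KEY,
            Rootname  TEXT NOT NULL
        );
    """)
    conn.executemany(
        "INSERT INTO Person (Firstname, Surname) VALUES (?, ?)",
        [("Bill", "Smith"), ("William", "Smith"),
         ("Billl", "Smith"), ("Bill", "Jones")],
    )
    return conn


def populate_mapping(conn: sqlite3.Connection) -> None:
    """Add every first name not yet in the mapping table as 'UNDEFINED'."""
    conn.execute("""
        INSERT OR IGNORE INTO FirstNameMappings (Firstname, Rootname)
        SELECT DISTINCT Firstname, 'UNDEFINED' FROM Person
    """)


def unmapped_names(conn: sqlite3.Connection) -> list[str]:
    """List the first names that still need a root name assigned."""
    rows = conn.execute("""
        SELECT Firstname FROM FirstNameMappings
        WHERE Rootname = 'UNDEFINED'
        ORDER BY Firstname
    """)
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    populate_mapping(conn)
    print(unmapped_names(conn))
```

Running populate_mapping again after new contacts arrive only adds the first names that are not in the table yet, so names that have already been mapped by hand are left alone.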

Now, sadly, you will have to go through all the unique first names and map each one to a Rootname by hand. For example, "Bill", "Billl" and "Will" should all be translated to "William". This is very time-consuming, but if data quality really is important to you, I think it's one of the best ways.

Now use the newly created mapping table to update the RootFirstName field in your Person table. Repeat for surname and address. Once this is done, you should be able to detect duplicates by comparing the root fields, without suffering from spelling errors.
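As a rough sketch of this last step, the Python snippet below copies root names into Person and then groups on them to list likely duplicates. It uses the standard sqlite3 module; the Id column, the sample data and the helper names are assumptions made for the demo, and for brevity it groups on the raw Surname instead of a RootSurname built the same way.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """In-memory database with a mapping table that is already filled in."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Person (
            Id            INTEGER PRIMARY KEY,  -- assumed surrogate key
            Firstname     TEXT NOT NULL,
            RootFirstName TEXT,
            Surname       TEXT NOT NULL
        );
        CREATE TABLE FirstNameMappings (
            Firstname TEXT PRIMARY KEY,
            Rootname  TEXT NOT NULL
        );
    """)
    conn.executemany(
        "INSERT INTO Person (Firstname, Surname) VALUES (?, ?)",
        [("Bill", "Smith"), ("William", "Smith"),
         ("Billl", "Smith"), ("Will", "Jones")],
    )
    # Stands in for the manual mapping work described above.
    conn.executemany(
        "INSERT INTO FirstNameMappings (Firstname, Rootname) VALUES (?, ?)",
        [("Bill", "William"), ("Billl", "William"),
         ("Will", "William"), ("William", "William")],
    )
    return conn


def apply_mappings(conn: sqlite3.Connection) -> None:
    """Copy each person's root first name from the mapping table.

    People whose first name has no mapping row end up with NULL.
    """
    conn.execute("""
        UPDATE Person
        SET RootFirstName = (
            SELECT Rootname FROM FirstNameMappings
            WHERE FirstNameMappings.Firstname = Person.Firstname
        )
    """)


def find_duplicates(conn: sqlite3.Connection) -> list[tuple]:
    """Return (root first name, surname, count) for groups of 2 or more."""
    return conn.execute("""
        SELECT RootFirstName, Surname, COUNT(*)
        FROM Person
        WHERE RootFirstName IS NOT NULL
          AND RootFirstName <> 'UNDEFINED'
        GROUP BY RootFirstName, Surname
        HAVING COUNT(*) > 1
        ORDER BY RootFirstName, Surname
    """).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    apply_mappings(conn)
    print(find_duplicates(conn))
```

The query only flags candidates. Two different people can share a root name and surname, so a human check (or a comparison of address and phone fields) is still sensible before merging rows.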