String comparison algorithm

2023-09-11 01:47:13 Author: 钢铁小伙伴

I am building a CSV import tool for the project I'm working on. The client needs to be able to enter data in Excel, export it as CSV, and upload it to the database. For example, I have this CSV record:

   1,   John Doe,     ACME Comapny   (the typo is on purpose)

Of course, the companies are kept in a separate table and linked by a foreign key, so I need to find the correct company ID before inserting. I plan to do this by comparing the company names in the database with the company names in the CSV. The comparison should return 0 if the strings are exactly the same, and a value that grows as the strings become more different, but strcmp doesn't cut it here because:

"Acme Company" and "Acme Comapny" should have a very small difference index, but "Acme Company" and "Cmea Mpnyaco" should have a very large one. "Acme Company" and "Acme Comp." should also have a small difference index, even though the character counts differ. Also, "Acme Company" and "Company Acme" should return 0.

So if the client makes a typo while entering data, I could prompt him to choose the name he most probably wanted to insert.

Is there a known algorithm to do this, or maybe we can invent one :) ?

Recommended answer

You might want to check out the Levenshtein distance algorithm as a starting point. It rates the "distance" between two strings as the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other.
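To make the idea concrete, here is a minimal Python sketch of the classic dynamic-programming form of Levenshtein distance. The function name `levenshtein` is just an illustrative choice, and the comparison is case-sensitive, so in practice you would probably lowercase and trim both names first.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible

    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Acme Company", "Acme Company"))  # 0
print(levenshtein("Acme Company", "Acme Comapny"))  # 2
print(levenshtein("Acme Company", "Acme Comp."))    # 3
```

Note that plain Levenshtein counts the swapped "pa"/"ap" as two edits, and it will not return 0 for "Company Acme", since word order matters to it. That requirement needs an extra normalization step on top.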

This SO thread on implementing a Google-style "Do you mean...?" system may provide some ideas as well.
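As a rough sketch of how such a "did you mean" prompt could look for the company lookup, the snippet below uses Python's standard `difflib.get_close_matches`. Be aware that `difflib` scores similarity with its own ratio (based on matching subsequences), not Levenshtein distance, so this is a swapped-in technique rather than the algorithm named above. The helper names `normalize` and `suggest_companies`, the sample company list, and the 0.6 cutoff are all assumptions made for illustration. Sorting the words before comparing is one simple way to make "Company Acme" match "Acme Company" exactly.

```python
import difflib


def normalize(name: str) -> str:
    """Lowercase the name, split it on whitespace, and sort the words
    so that word order and extra spaces no longer matter."""
    return " ".join(sorted(name.lower().split()))


def suggest_companies(csv_name, known_names, limit=3, cutoff=0.6):
    """Return up to `limit` known company names that resemble csv_name,
    best match first. `cutoff` is difflib's similarity threshold (0..1)."""
    by_key = {normalize(n): n for n in known_names}
    keys = difflib.get_close_matches(
        normalize(csv_name), list(by_key), n=limit, cutoff=cutoff
    )
    return [by_key[k] for k in keys]


# Hypothetical company table contents, for illustration only.
known = ["Acme Company", "Globex Corporation", "Initech"]

print(suggest_companies("   ACME Comapny   ", known))
print(suggest_companies("Company Acme", known))
```

If exactly one candidate comes back with a normalized name identical to the input, the import could use that company ID directly; otherwise the candidates can be shown to the user to pick from.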