什么算法给出了一个拼写检查的建议?出了、算法、建议

2023-09-10 22:28:10 作者:烟如女、是我忘不掉的伤

实施一个拼写检查器是伴随着字的建议时,通常用什么算法?

What algorithm is typically used when implementing a spell checker that is accompanied with word suggestions?

起初我以为这可能是有意义的检查每一个新词和输入(如果在字典中未找到)反对的 Levenshtein距离< /一>从字典中的每一个其它字和返回顶端的结果。然而,这似乎是这将是非常低效的,不必重复评估整个字典。

At first I thought it might make sense to check each new word typed (if not found in the dictionary) against it's Levenshtein distance from every other word in the dictionary and returning the top results. However, this seems like it would be highly inefficient, having to evaluate the entire dictionary repeatedly.

这是如何典型的做法?

推荐答案

有好文章由彼得·诺维格如何实现一个拼写校正。这是basicly蛮力方式试图与给定的编辑距离的候选串。 (这里有一些提示如何使用的布鲁姆过滤器和的快候选散列。)

There is good essay by Peter Norvig how to implement a spelling corrector. It's basicly a brute force approach trying candidate strings with a given edit distance. (Here are some tips how you can improve the spelling corrector performance using a Bloom Filter and faster candidate hashing.)

是一个拼写检查器的需求较弱。你只需要找出一个单词不在字典中。您可以使用布隆过滤器建立一个拼写检查器,消耗更少的内存。一个古老版本的描述下由乔恩·宾利 编程珠玑使用64KB的一本英语词典。

The requirements for a spell checker are weaker. You have only to find out that a word is not in the dictionary. You can use a Bloom Filter to build a spell checker which consumes less memory. An ancient versions is decribed in Programming Pearls by Jon Bentley using 64kb for an english dictionary.

一个 BK-树是一种替代方法。一个很好的文章是这里。

A BK-Tree is an alternative approach. A nice article is here.

Levenshstein距离是不完全正确的编辑距离的一个拼写检查器。它仅知道插入,删除和置换。换位缺失,并产生2的1个字符(它的1删除和插入1)换位。 Damerau-Levenshtein距离是正确的编辑距离。

Levenshstein distance is not exactly the right edit distance for a spell checker. It knows only insertion, deletion and substitution. Transposition is missing and produces 2 for a transposition of 1 character (it's 1 delete and 1 insertion). Damerau–Levenshtein distance is the right edit distance.