有这种算法得到正确实施?算法、正确

2023-09-12 21:18:14 作者:我是路人却爱你过深

我目前正在实施一个 BK-树,以使拼写检查器。我有工作的字典是非常大的(数百万字的),这就是为什么我不能承受任何低效率的。但是,我知道,查找函数,我写(整个程序中可以说是最重要的部分)可更好。我希望能找到一些关于相同的帮助。下面是我写的查询:

I am currently implementing a BK-Tree to make a spell checker. The dictionary I am working with is very large (millions of words), which is why I cannot afford any inefficiencies at all. However, I know that the lookup function that I wrote (arguably the most important part of the entire program) can be made better. I was hoping to find some help regarding the same. Here's the lookup that I wrote:

public int get(String query, int maxDistance)
{
    calculateLevenshteinDistance cld = new calculateLevenshteinDistance();
    int d = cld.calculate(root, query);
    int tempDistance=0;

    if(d==0)
        return 0;

    if(maxDistance==Integer.MAX_VALUE)
        maxDistance=d;

    int i = Math.max(d-maxDistance, 1);
    BKTree temp=null;

    for(;i<=maxDistance+d;i++)
    {
        temp=children.get(i);
        if(temp!=null)
        {
            tempDistance=temp.get(query, maxDistance);
        }
        if(maxDistance<tempDistance)
            maxDistance=tempDistance;
    }

    return maxDistance;
}

我知道,我正在运行的循环不必要大量的时间,我们可以修剪搜索空间,使查找更快。我只是不知道如何最好地做到这一点。

I know that I am running the loop an unnecessarily large number of times and that we can trim the search space to make the lookup faster. I'm just not sure how to best do that.

推荐答案

您环路看起来通常是正确的,如果有点拜占庭式的。您尝试完善停止条件(有tempdistance / maxdistance)不正确,但是:在BK-树的结构需要你探索在Levenshtein距离DK的所有节点到D + K当前节点,如果你想找到所有的结果,所以你不能修剪它这样。

Your loop looks generally correct, if a little byzantine. Your attempt to refine the stopping condition (with tempdistance/maxdistance) is incorrect, however: the structure of the BK-tree requires that you explore all nodes within levenshtein distance d-k to d+k of the current node if you want to find all the results, so you can't prune it like that.

是什么让你觉得你探索过多的树?

What makes you think you're exploring too much of the tree?

您可能会发现对L- evenshtein自动机启发,因为他们比BK树更有效。如果你正在构建一个拼写检查器,不过,我建议以下Favonius的建议,并检查出这篇文章关于如何写一个。这是更好的适合拼写校正不是一个天真的字符串距离检查。

You may find my followup post on Levenshtein Automata instructive, as they're more efficient than BK-trees. If you're building a spelling checker, though, I'd recommend following Favonius' suggestion and checking out this article on how to write one. It's much better suited to spelling correction than a naive string-distance check.