优化哈罗 - 温克勒算法哈罗、算法、克勒

2023-09-11 03:17:47 作者:念多次不如忘一世

我有这个code从采取哈罗 - 温克勒算法,该网站。我需要运行15万次拿到的差异之间的距离。这需要很长的时间,因为我跑的Andr​​oid移动设备上。

I have this code for Jaro-Winkler algorithm taken from this website. I need to run 150,000 times to get distance between differences. It takes a long time, as I run on an Android mobile device.

它可以优化更多?

public class Jaro {
    /**
     * gets the similarity of the two strings using Jaro distance.
     *
     * @param string1 the first input string
     * @param string2 the second input string
     * @return a value between 0-1 of the similarity
     */
    public float getSimilarity(final String string1, final String string2) {

        //get half the length of the string rounded up - (this is the distance used for acceptable transpositions)
        final int halflen = ((Math.min(string1.length(), string2.length())) / 2) + ((Math.min(string1.length(), string2.length())) % 2);

        //get common characters
        final StringBuffer common1 = getCommonCharacters(string1, string2, halflen);
        final StringBuffer common2 = getCommonCharacters(string2, string1, halflen);

        //check for zero in common
        if (common1.length() == 0 || common2.length() == 0) {
            return 0.0f;
        }

        //check for same length common strings returning 0.0f is not the same
        if (common1.length() != common2.length()) {
            return 0.0f;
        }

        //get the number of transpositions
        int transpositions = 0;
        int n=common1.length();
        for (int i = 0; i < n; i++) {
            if (common1.charAt(i) != common2.charAt(i))
                transpositions++;
        }
        transpositions /= 2.0f;

        //calculate jaro metric
        return (common1.length() / ((float) string1.length()) +
                common2.length() / ((float) string2.length()) +
                (common1.length() - transpositions) / ((float) common1.length())) / 3.0f;
    }

    /**
     * returns a string buffer of characters from string1 within string2 if they are of a given
     * distance seperation from the position in string1.
     *
     * @param string1
     * @param string2
     * @param distanceSep
     * @return a string buffer of characters from string1 within string2 if they are of a given
     *         distance seperation from the position in string1
     */
    private static StringBuffer getCommonCharacters(final String string1, final String string2, final int distanceSep) {
        //create a return buffer of characters
        final StringBuffer returnCommons = new StringBuffer();
        //create a copy of string2 for processing
        final StringBuffer copy = new StringBuffer(string2);
        //iterate over string1
        int n=string1.length();
        int m=string2.length();
        for (int i = 0; i < n; i++) {
            final char ch = string1.charAt(i);
            //set boolean for quick loop exit if found
            boolean foundIt = false;
            //compare char with range of characters to either side

            for (int j = Math.max(0, i - distanceSep); !foundIt && j < Math.min(i + distanceSep, m - 1); j++) {
                //check if found
                if (copy.charAt(j) == ch) {
                    foundIt = true;
                    //append character found
                    returnCommons.append(ch);
                    //alter copied string2 for processing
                    copy.setCharAt(j, (char)0);
                }
            }
        }
        return returnCommons;
    }
}

我提到,在整个过程中我做的脚本只是实例,所以只有一次

I mention that in the whole process I make just instance of the script, so only once

jaro= new Jaro();

如果你要测试和需要的例子,这样不会破坏脚本,你​​会发现它here,在另一个线程的蟒蛇优化

If you are going to test and need examples so not break the script, you will find it here, in another thread for python optimization

推荐答案

是的,但你是不是要去享受它。替换所有 ED StringBuffers与被分配在构造函数和永远不会再使用整数索引来跟踪什么是在他们的字符数组。

Yes, but you aren't going to enjoy it. Replace all those newed StringBuffers with char arrays that are allocated in the constructor and never again, using integer indices to keep track of what's in them.

这未决下议院郎补丁会给你一些味道。