Pearson correlation in PHP

2023-09-11 02:20:22 Author: 烂命

I'm trying to implement the calculation of the Pearson correlation coefficient between two sets of data in PHP. I'm simply porting the Python script that can be found at this URL: http://answers.oreilly.com/topic/1066-how-to-find-similar-users-with-python/

My implementation is the following:

class LB_Similarity_PearsonCorrelation implements LB_Similarity_Interface{
    public function similarity($user1, $user2){

        $sharedItem = array();
        $pref1 = array();
        $pref2 = array();

        $result1 = $user1->fetchAllPreferences();
        $result2 = $user2->fetchAllPreferences();

        // Index each user's ratings by item_id
        foreach($result1 as $pref){
            $pref1[$pref->item_id] = $pref->rate;
        }

        foreach($result2 as $pref){
            $pref2[$pref->item_id] = $pref->rate;
        }

        // Collect the items rated by both users
        foreach ($pref1 as $item => $preferenza){
            if(array_key_exists($item,$pref2)){
                $sharedItem[$item] = 1;
            }
        }

        // No items in common: no correlation can be computed
        $n = count($sharedItem);
        if ($n == 0) return 0;

        $sum1 = 0;
        $sum2 = 0;
        $sumSq1 = 0;
        $sumSq2 = 0;
        $pSum = 0;

        // Accumulate sums, sums of squares and the sum of products
        foreach ($sharedItem as $item_id => $pre) {
            $sum1 += $pref1[$item_id];
            $sum2 += $pref2[$item_id];

            $sumSq1 += pow($pref1[$item_id],2);
            $sumSq2 += pow($pref2[$item_id],2);

            $pSum += $pref1[$item_id] * $pref2[$item_id];
        }

        // Pearson score: n * covariance over n * sqrt(variance1 * variance2)
        $num = $pSum - (($sum1 * $sum2) / $n);
        $den = sqrt(($sumSq1 - pow($sum1,2)/$n) * ($sumSq2 - pow($sum2,2)/$n));
        if ($den == 0) return 0;
        return $num/$den;

    }
}

A clarification to make the code easier to follow: the method fetchAllPreferences returns a set of objects that are actually the items, and the first two loops turn them into arrays for ease of handling.

I'm not sure that this implementation is correct; in particular, I have some doubts about the calculation of the denominator.

Any advice is welcome.

Thanks in advance!

Recommended answer

Your algorithm looks mathematically correct but is numerically unstable. Computing the sums of squares explicitly is a recipe for disaster. What if you have numbers like array(10000000001, 10000000002, 10000000003)? Their squares are around 10^20, well beyond the roughly 16 significant digits a double-precision float can hold, so the subtraction in each variance term cancels almost all the digits that matter. A numerically stable one-pass algorithm for calculating the variance can be found on Wikipedia, and the same principle can be applied to computing the covariance.
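As a rough illustration of the one-pass idea, here is a minimal sketch using Welford-style running updates for the variances and the covariance. It assumes the ratings have already been collected into two arrays keyed by item_id (like $pref1 and $pref2 in the question); the function name pearsonOnePass is hypothetical and not part of the original class.

```php
<?php
// Hypothetical helper: one-pass Pearson correlation with running updates.
// $x and $y are assumed to be arrays of ratings keyed by item_id.
function pearsonOnePass(array $x, array $y): float
{
    $n = 0;
    $meanX = 0.0;
    $meanY = 0.0;
    $m2X = 0.0; // running sum of squared deviations of x
    $m2Y = 0.0; // running sum of squared deviations of y
    $cXY = 0.0; // running sum of products of deviations

    foreach ($x as $key => $xi) {
        if (!array_key_exists($key, $y)) {
            continue; // only items rated by both users count
        }
        $yi = $y[$key];
        $n++;

        // Deviations from the means *before* this item is included
        $dx = $xi - $meanX;
        $dy = $yi - $meanY;

        $meanX += $dx / $n;
        $meanY += $dy / $n;

        // Old deviation times new deviation keeps the sums well scaled
        $m2X += $dx * ($xi - $meanX);
        $m2Y += $dy * ($yi - $meanY);
        $cXY += $dx * ($yi - $meanY);
    }

    $den = sqrt($m2X * $m2Y);
    if ($n == 0 || $den == 0.0) {
        return 0.0; // same convention as the question's code
    }
    return $cXY / $den;
}
```

Because every update works with deviations from the current mean rather than with raw squares, values such as 10000000001 should not lose their low-order digits the way they do in the sum-of-squares formula.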

Easier still, if you don't care much about speed, you could just use two passes: find the means in the first pass, then compute the variances and the covariance with the textbook formula in the second pass.
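The two-pass approach could look roughly like the sketch below. It assumes the same input shape as the question's $pref1 and $pref2 (arrays of ratings keyed by item_id); the function name pearsonTwoPass is hypothetical.

```php
<?php
// Hypothetical helper: two-pass Pearson correlation.
// $x and $y are assumed to be arrays of ratings keyed by item_id.
function pearsonTwoPass(array $x, array $y): float
{
    // Keys present in both arrays, i.e. items rated by both users
    $keys = array_keys(array_intersect_key($x, $y));
    $n = count($keys);
    if ($n == 0) {
        return 0.0;
    }

    // Pass 1: means of the shared ratings
    $meanX = 0.0;
    $meanY = 0.0;
    foreach ($keys as $k) {
        $meanX += $x[$k];
        $meanY += $y[$k];
    }
    $meanX /= $n;
    $meanY /= $n;

    // Pass 2: sums of squared deviations and of deviation products
    $sxx = 0.0;
    $syy = 0.0;
    $sxy = 0.0;
    foreach ($keys as $k) {
        $dx = $x[$k] - $meanX;
        $dy = $y[$k] - $meanY;
        $sxx += $dx * $dx;
        $syy += $dy * $dy;
        $sxy += $dx * $dy;
    }

    $den = sqrt($sxx * $syy);
    if ($den == 0.0) {
        return 0.0; // same convention as the question's code
    }
    return $sxy / $den;
}
```

The common factor 1/n in the variances and the covariance cancels in the ratio, so the sketch works with the raw sums of deviations. Inside the question's similarity() method, a call would presumably look like pearsonTwoPass($pref1, $pref2).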