Applying a function to a distance matrix in R

2023-09-11 04:38:38 Author: 麋鹿恠亽

This question came up today on the manipulatr mailing list.

http://groups.google.com/group/manipulatr/browse_thread/thread/fbab76945f7cba3f

I am rephrasing it here.

Given a distance matrix (calculated with dist), apply a function to its rows.

code:

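library(plyr)
N <- 100
a <- data.frame(b = 1:N, c = runif(N))
d <- dist(a, diag = TRUE, upper = TRUE)   # "dist" object: stores only N*(N-1)/2 values
sumd <- adply(as.matrix(d), 1, sum)       # as.matrix() expands d to a full N x N matrix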

The problem is that to apply the function by row you have to store the whole matrix (instead of just the lower triangular part), so it uses too much memory for large matrices. It fails on my computer for matrices of dimension ~10000.
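To put rough numbers on the memory cost, here is a minimal back-of-the-envelope sketch. It assumes 8 bytes per double and ignores object headers and any temporary copies made along the way, so real usage will be higher. The variable names are illustrative only.

```r
# Rough estimate of the storage needed for N = 10000 points.
N <- 10000
bytes_per_double <- 8                              # assumption: standard R numeric

full_bytes <- N^2 * bytes_per_double               # full N x N matrix from as.matrix(d)
dist_bytes <- N * (N - 1) / 2 * bytes_per_double   # one triangle, as kept by dist()

# Sizes in MiB; the full matrix needs about twice what the dist object does,
# and both exist at the same time once as.matrix(d) has been called.
print(c(full = full_bytes, dist = dist_bytes) / 2^20)
```

Under these assumptions the full matrix takes about 763 MiB against about 381 MiB for the dist object.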

Any ideas?

Recommended answer

My solution is to compute the indexes into the distance vector, given a row and the size of the matrix. I got this from codeguru:

int Trag_noeq(int row, int col, int N)
{
   //assert(row != col);    //You can add this in if you like
   if (row<col)
      return row*(N-1) - (row-1)*((row-1) + 1)/2 + col - row - 1;
   else if (col<row)
      return col*(N-1) - (col-1)*((col-1) + 1)/2 + row - col - 1;
   else
      return -1;
}
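Written out as a formula, the `row < col` branch of the function above simplifies as follows. The symbols $k$, $r$ and $c$ are notation introduced here for illustration; both indexes are 0-based and the triangle is stored row by row.

```latex
k(r, c) = r\,(N-1) - \frac{r\,(r-1)}{2} + c - r - 1,
\qquad 0 \le r < c \le N-1 .
```

For example, with $N = 4$ the pairs $(0,1), (0,2), (0,3), (1,2), (1,3), (2,3)$ map to $k = 0, 1, 2, 3, 4, 5$, covering each of the $N(N-1)/2 = 6$ stored distances exactly once.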

After translating to R, assuming indexes start at 1 and a lower-triangular instead of an upper-triangular matrix, I got the following. Edit: this uses the vectorized version contributed by rcs.

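noeq.1 <- function(i, j, N) {
    # Shift to 0-based indexes, as in the C version.
    i <- i-1
    j <- j-1
    # Position in the dist vector; the diagonal (i == j) maps to 0,
    # which R silently drops when it is used as a subscript.
    ix <- ifelse(i < j,
                 i*(N-1) - (i-1)*((i-1) + 1)/2 + j - i,
                 j*(N-1) - (j-1)*((j-1) + 1)/2 + i - j) * ifelse(i == j, 0, 1)
    ix
}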

## To get the indexes of a row, the following one-liner works:

getrow <- function(z, N) noeq.1(z, 1:N, N)

## To get the row sums (or any other function f of each row):

getsum <- function(d, f=sum) {
    N <- attr(d, "Size")
    sapply(1:N, function(i) {
        if (i%%100==0) print(i)
        f(d[getrow(i,N)])
    })
}

So, with the example:

sumd2 <- getsum(d)
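A quick way to gain confidence in this kind of index arithmetic is to compare against as.matrix on an input small enough that the full matrix is harmless. The sketch below is self-contained and deliberately uses an independent helper, `dist_index`, which is a hypothetical name introduced here (not part of the original answer). It is built on the 1-based position formula given in the `?dist` help page, so it also serves as a cross-check of the mapping.

```r
# Hypothetical helper (not from the original answer): 1-based position of the
# (i, j) distance inside a "dist" vector, following the formula in ?dist.
# Only valid for i != j.
dist_index <- function(i, j, N) {
  lo <- pmin(i, j)
  hi <- pmax(i, j)
  N * (lo - 1) - lo * (lo - 1) / 2 + hi - lo
}

set.seed(42)                       # arbitrary seed, only for reproducibility
N <- 6
a <- data.frame(b = 1:N, c = runif(N))
d <- dist(a)
m <- as.matrix(d)                  # fine here: the matrix is tiny

# Row sums taken directly from the dist vector, skipping the diagonal.
sums_from_dist <- sapply(1:N, function(i) sum(d[dist_index(i, (1:N)[-i], N)]))

# Both approaches should agree up to floating-point rounding.
ok <- isTRUE(all.equal(sums_from_dist, unname(rowSums(m))))
print(ok)
```

With the functions defined earlier in this answer loaded, the same comparison can be made on a small `d` with `all.equal(getsum(d), unname(rowSums(as.matrix(d))))`.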

Before vectorizing, this was much slower than as.matrix for small matrices; after vectorizing it is only about 3x as slow. On an Intel Core2Duo at 2 GHz, applying sum by row to the size-10000 matrix took just over 100 s, whereas the as.matrix method fails. Thanks, rcs!