I was reviewing some material on matrices and the Mahalanobis distance, and it occurred to me to write a small function that ranks the observations in each column of a matrix. Here is the code:
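test <- matrix(c(78.17, 70.25, 75.33, 86.08, 54.97, 43.63, 18.04,
                 0.3, 1.4, 0.5, 1.5, 0.7, 0.2, 0.1,
                 3, 5, 5, 8, 9, 10, 2), ncol = 3)
test

# Rank the observations within each column of a matrix
rank_columns <- function(x) {
  # The result has the same dimensions as the input
  y <- matrix(ncol = ncol(x), nrow = nrow(x))
  for (j in seq_len(ncol(x))) {
    y[, j] <- rank(x[, j])
  }
  return(y)
}

rank_columns(test)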
The function returns a matrix with the same dimensions as the input, containing the ranked observations:
rank_columns(test)
     [,1] [,2] [,3]
[1,]    6    3  2.0
[2,]    4    6  3.5
[3,]    5    4  3.5
[4,]    7    7  5.0
[5,]    3    5  6.0
[6,]    2    2  7.0
[7,]    1    1  1.0
As many of you know, I'm not very good with the apply family of functions, so I was wondering whether there is a way to vectorize the function to improve its performance on larger matrices. Thank you very much in advance.
Alejandro, first of all, the answer Javier Ascunce gave you is undoubtedly an adequate way to solve this, but I want to expand on the explanation a little.
In R you hear over and over again that you should not use explicit loops (for, while, repeat) but the *apply functions instead, that is, implicit loops. This is because:

Let me clarify that, strictly speaking, we are not talking about "vectorization" here. Your function is already "vectorized" and close to optimal: rank() handles a whole column in a single call, so the loop only runs once per column.
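To make the distinction concrete, here is a minimal sketch. The matrix m and the wrapper counted_rank() are hypothetical names introduced only for illustration. It contrasts a call that is vectorized over every element (sqrt) with column-wise ranking, where something still has to visit each column once.

```r
m <- matrix(c(4, 9, 16, 25, 36, 49), ncol = 2)

# Vectorized in the strict sense: a single call processes every element,
# with no loop written at the R level.
roots <- sqrt(m)

# Ranking is done within each column, so something has to visit the columns.
# counted_rank() is a hypothetical wrapper that counts how often it is called.
n_calls <- 0
counted_rank <- function(v) {
  n_calls <<- n_calls + 1
  rank(v)  # rank() itself processes the whole vector in one call
}

ranked <- apply(m, 2, counted_rank)
n_calls  # one call per column, however many rows the matrix has
```

Whether the loop over columns is written with for or hidden inside apply(), the amount of work per column is the same.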
In your example, where you want to "apply" the function rank to each column of a matrix, and assuming you want to get back a matrix shaped like the original, the easiest way to write the implicit loop is apply(test, 2, rank), or in its most explicit form, apply(X = test, MARGIN = 2, FUN = rank).

That is, we pass in the matrix (test in this case), and MARGIN = 2 says that we take the columns (MARGIN = 1 would be by row); to each column we then apply (FUN = rank) the function rank().
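A minimal sketch of the column-wise call described above, using the test matrix from the question. The row-wise variant is added only for contrast and is not part of the original answer. The names by_col and by_row are illustrative.

```r
test <- matrix(c(78.17, 70.25, 75.33, 86.08, 54.97, 43.63, 18.04,
                 0.3, 1.4, 0.5, 1.5, 0.7, 0.2, 0.1,
                 3, 5, 5, 8, 9, 10, 2), ncol = 3)

# Column-wise: each column is passed to rank() as a vector
by_col <- apply(X = test, MARGIN = 2, FUN = rank)
dim(by_col)  # same shape as the input

# Row-wise: apply() stores each row's result as a column, so the output
# comes back transposed and needs t() to match the input shape
by_row <- t(apply(X = test, MARGIN = 1, FUN = rank))
dim(by_row)
```

The transposition with MARGIN = 1 is easy to miss, so it is worth checking dim() on the result when switching from columns to rows.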
Another way is to use sapply(), which is closer to what you're already doing. In that case we iterate over each column of the matrix and apply rank to the slice of the matrix corresponding to that column.

What about performance? Let's run a test with a matrix of 10,000 rows, trying each function 1000 times.
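The benchmark code itself is not reproduced here. A rough sketch of such a comparison might look like the following, under several assumptions: the sapply() version is a guess at "iterating over each column", the matrix has 3 columns of random normal data, timing uses base R's system.time() rather than whatever tool the answer used, and the repetition count is lowered so the example runs quickly.

```r
set.seed(1)  # arbitrary seed, only to make the example repeatable
big <- matrix(rnorm(10000 * 3), ncol = 3)  # 3 columns is an assumption

rank_columns <- function(x) {
  y <- matrix(ncol = ncol(x), nrow = nrow(x))
  for (j in seq_len(ncol(x))) y[, j] <- rank(x[, j])
  y
}
rank_apply <- function(x) apply(x, 2, rank)
# Hypothetical sapply() version: loop over the column indices
rank_sapply <- function(x) sapply(seq_len(ncol(x)), function(j) rank(x[, j]))

# All three should produce the same matrix
stopifnot(isTRUE(all.equal(rank_columns(big), rank_apply(big))),
          isTRUE(all.equal(rank_columns(big), rank_sapply(big))))

n_reps <- 100  # the answer mentions 1000; lowered here to keep the run short
time_it <- function(f) {
  system.time(for (i in seq_len(n_reps)) f(big))[["elapsed"]]
}
timings <- c(rank_columns = time_it(rank_columns),
             apply        = time_it(rank_apply),
             sapply       = time_it(rank_sapply))
timings  # elapsed seconds; the values depend on the machine
```

A single system.time() total per function hides the run-to-run spread, so a dedicated benchmarking package would be needed to compare the dispersion of the timings.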
Interesting: the three ways of doing the same thing perform very similarly. Apart from apply() showing a greater dispersion of values, there is no significant winner; in fact, your rank_columns() may even be a touch faster.

Beyond performance, collapsing several lines of code into a single one is without a doubt a significant improvement, and one worth taking advantage of whenever possible.
test_ranked <- apply(test, 2, rank)
The 2 passed as the second argument (MARGIN) makes apply() work column by column.