I am trying to apply a simple normalization function to the numeric variables of R's iris dataset, using a for loop and lapply, in order to obtain a new data frame containing only the normalized variables:
data(iris)
normal <- function (x) {
num <- x - min(x)
den <- max(x) - min(x)
return (num/den)
}
iris_n <- data.frame()
for (i in 1:length(iris)){
if (is.numeric(iris[,i])) {
}
iris_n[,i] <- as.data.frame(lapply(iris[,i], normal))
}
Error in Summary.factor(1L, na.rm = FALSE) :
‘min’ not meaningful for factors
In addition: There were 50 or more warnings (use warnings() to see the first 50)
iris_n
[1] NaN. NaN..1 NaN..2 NaN..3
<0 rows> (or 0-length row.names)
Next, I tried calling lapply directly:
iris_n <- as.data.frame(lapply(iris, function(x) {if (is.numeric(x)) normal(x)}))
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 150, 0
No matter how I look at it, I can't find the error. Any guidance will be greatly appreciated. (Note: I am interested in solving this specific problem; I know there are other ways to achieve what I need.)
The options Patricio gives are very good, and they explain well how the *apply() functions work. Two alternatives:
If you don't know in advance which variables are non-numeric, so you can't simply discard them in the function call, you can apply the function only to the numeric ones, selecting them with sapply() before calling lapply().
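A minimal sketch of that idea, reusing the normal() function from the question; the exact call is my reconstruction of the approach described here, not necessarily the only way to write it:

```r
data(iris)

# Min-max normalization, as defined in the question
normal <- function(x) {
  num <- x - min(x)
  den <- max(x) - min(x)
  num / den
}

# sapply() returns a named logical vector: TRUE for numeric columns
is_num <- sapply(iris, is.numeric)

# Keep only the numeric columns, normalize each one, rebuild a data.frame
iris_n <- as.data.frame(lapply(iris[is_num], normal))
```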
sapply() is a relative of lapply(), only it returns a vector instead of a list. In this case it is used to ask iris which columns are numeric (those return TRUE; the ones that are not return FALSE). Because that call sits inside the square brackets, it subsets the data, leaving only the numeric columns. lapply() then takes care of applying normal to them, and you get a list that is easy to coerce into a data.frame. Mind you, you lose the factor Species.
If you have no problem using a separate package, there is a very neat syntax option in purrr. The function that replaces lapply() or apply() for this case is modify_if(). As its name implies, it modifies an element of a list (iris is a list because all data.frames are lists, although not all lists are data.frames) if a condition is met; in this case, that the column is numeric. The interesting thing is that it leaves intact the columns for which the condition is not met.
Another particular characteristic of modify_if(), and of the whole modify_*() family, is that it tries to return the data with the same structure it received. That is, if the function receives a data.frame (in this case, iris), it will try to return another data.frame, so there is no need to call as.data.frame() afterwards. Very elegant: the result is normalized and keeps all the original columns.
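As a sketch of the purrr approach, assuming the package is installed and reusing normal() from the question:

```r
library(purrr)
data(iris)

# Min-max normalization, as defined in the question
normal <- function(x) {
  num <- x - min(x)
  den <- max(x) - min(x)
  num / den
}

# Apply normal() only where is.numeric() is TRUE; other columns pass through
iris_n <- modify_if(iris, is.numeric, normal)
```

Here Species should come back untouched as a factor, and iris_n is still a data.frame.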
However, I think the best option, if you are willing to depend on an extra package, is dplyr::mutate_if(). It guarantees that the result is either a data.frame or an error, which may be better than it sounds...
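A short sketch of the dplyr version, again assuming the package is installed and reusing normal() from the question. Note that in dplyr 1.0 and later mutate_if() is marked as superseded in favor of mutate() with across(), although it still works:

```r
library(dplyr)
data(iris)

# Min-max normalization, as defined in the question
normal <- function(x) {
  num <- x - min(x)
  den <- max(x) - min(x)
  num / den
}

# Normalize every numeric column; Species is left as it is
iris_n <- mutate_if(iris, is.numeric, normal)

# Equivalent spelling in newer dplyr versions:
# iris_n <- mutate(iris, across(where(is.numeric), normal))
```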
First of all, you have the "'min' not meaningful for factors" error. That is because in your code lapply() is also being applied to the values of the Species column, which is a factor. The call to lapply() should be made inside the if (is.numeric(iris[,i])) block.
The other problem is the NaN values that are generated. This is due to a somewhat debatable behavior of R when subsetting objects with [] indices. When you subset a matrix or similar object by taking a single row or column, R by default coerces the return value to a simpler type, in this case a vector. As a result, lapply() is applied to each element of the vector separately; for a single value, both x - min(x) and max(x) - min(x) are 0, so you get a division of 0 by 0 and consequently the NaN.
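The coercion can be checked directly in the console. This is only an illustration of the behavior described above, using normal() from the question:

```r
data(iris)

# Min-max normalization, as defined in the question
normal <- function(x) {
  num <- x - min(x)
  den <- max(x) - min(x)
  num / den
}

class(iris[, 1])                # "numeric": the column was dropped to a vector
class(iris[, 1, drop = FALSE])  # "data.frame": drop = FALSE keeps the structure

# lapply() over a vector visits one number at a time, so each call is 0 / 0
res <- lapply(iris[, 1], normal)
res[[1]]                        # NaN
```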
To avoid this while keeping your code much as you have it written, there are a couple of simple options.
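Two possible ways to write the corrected loop, as a sketch. Collecting the results in a list called cols and converting it at the end is my own assumption here, made to avoid assigning columns into an empty data.frame:

```r
data(iris)

# Min-max normalization, as defined in the question
normal <- function(x) {
  num <- x - min(x)
  den <- max(x) - min(x)
  num / den
}

# Option 1: call normal() on the whole column vector, inside the if block
cols <- list()
for (i in seq_along(iris)) {
  if (is.numeric(iris[, i])) {
    cols[[names(iris)[i]]] <- normal(iris[, i])
  }
}
iris_n <- as.data.frame(cols)

# Option 2: keep lapply(), but stop R from dropping the column to a vector
cols2 <- list()
for (i in seq_along(iris)) {
  if (is.numeric(iris[, i])) {
    cols2 <- c(cols2, lapply(iris[, i, drop = FALSE], normal))
  }
}
iris_n2 <- as.data.frame(cols2)
```

Both versions should yield the same four normalized columns.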
Lastly, the problem with your direct lapply() call is that nothing is returned for one of the columns, Species (an if without an else yields NULL), so you end up with four elements of length 150 and one of length 0, which is why you get the "arguments imply differing number of rows: 150, 0" error. You solve it by returning normal(x) for the numeric columns and x itself for the non-numeric ones, which reduces your code to a single call.
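Put together, a sketch of that single call, with normal() taken from the question:

```r
data(iris)

# Min-max normalization, as defined in the question
normal <- function(x) {
  num <- x - min(x)
  den <- max(x) - min(x)
  num / den
}

# Normalize numeric columns; return every other column unchanged
iris_n <- as.data.frame(
  lapply(iris, function(x) if (is.numeric(x)) normal(x) else x)
)
```

Because every column now returns 150 values, as.data.frame() no longer complains, and Species is kept.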