When working with databases, we often find various encodings for unregistered or unreported data. For example, it can be zeros, -999 or -99, among others, which, for data processing purposes, we can convert to NA.
With that in mind, I made a small function that changes such unregistered values to NA:
'is.na.m<-' <- function(x, value, ...) {
  x[c(value)] = NA
  x
}
We create a test vector:
x <- c(1:3,2:5,1:10, -99, -999, -98)
is.na.m(x) = x%in%c(-99, -999, -98)
x
Output:
[1] 1 2 3 2 3 4 5 1 2 3 4 5 6 7 8 9 10 NA NA NA
Now, I've tried to get this to work on a dataframe, but to no avail:
b <-data.frame(animal=c("perro", "gato", -999), num=c(1,-98,3))
is.na.m(b) = b[b%in%c(-98, -999)]
b
As can be seen, there are no changes in the dataframe:
animal num
1 perro 1
2 gato -98
3 -999 3
Note that is.na.m(b) = b%in%c(-98, -999) did not work either.
I tried to use the indexing function, but it didn't work either:
is.na.m(b[,1:ncol(b)]) = b[,1:ncol(b)%in%c(-98, -999)]
b
Now, when I try to use lapply, it gives me an error:
b <- unlist(lapply(b, is.na.m(b)))
Error in is.na.m(b) : could not find function "is.na.m"
The question is: What are the adjustments that I must make in the function so that it operates correctly in all the columns of a dataframe?
I thank you in advance for any guidance.
Alejandro, the main problem I see is a confusion about the %in% operator. It is a binary operator that calls the function match(), and if we look at its documentation, the first input parameter is described as a vector. That is, the expected input is a vector, even a two-dimensional one (an array), but not a data.frame. So x %in% c(-99, -999, -98) works as you expect and returns a logical vector the same size as the input vector, but b %in% c(-98, -999) does not, because b is a data.frame. The interesting and confusing thing is that it does not give us an error: it returns data, but not what we expect. The result is a vector of FALSE with one element per column of the data.frame.
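To make the difference concrete, here is a small sketch comparing the two calls. The vector and data.frame are shortened versions of the ones in the question; `in_vec` and `in_df` are names introduced only for this illustration.

```r
x <- c(1:3, -99, -999)
b <- data.frame(animal = c("perro", "gato", -999), num = c(1, -98, 3))

# On an atomic vector: one logical value per element
in_vec <- x %in% c(-99, -999)
in_vec

# On a data.frame: one logical value per column, not per cell,
# and none of them is TRUE
in_df <- b %in% c(-98, -999)
in_df
length(in_df) == ncol(b)
```

The second result has as many elements as the data.frame has columns, so it cannot be used to locate individual cells.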
I owe you the explanation of this behavior: match is an internal function written in C, and I lack a lot of background on the R <-> C API. Anyway, the bottom line is that you can't use match the way you do.
The other problem is that there are two situations, which you may not have noticed, that deserve attention:
a. This: c("perro", "gato", -999) will, by automatic coercion, become a vector of strings; the number -999 is promoted to the most general data type, in this case a string.
b. The other issue is that by default data.frame() treats strings as factors (this was the default before R 4.0.0), which adds a bit more complexity. If we don't want this behavior we should use stringsAsFactors = FALSE.
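Both points can be checked directly. The sketch below passes stringsAsFactors explicitly in both cases so that it behaves the same regardless of the R version; `v`, `b_chr` and `b_fac` are names introduced only for this illustration.

```r
# a. Mixing strings and a number in c() coerces everything to character
v <- c("perro", "gato", -999)
class(v)   # "character"
v[3]       # "-999", a string, no longer a number

# b. The same vector stored as character or as factor
b_chr <- data.frame(animal = v, num = c(1, -98, 3), stringsAsFactors = FALSE)
b_fac <- data.frame(animal = v, num = c(1, -98, 3), stringsAsFactors = TRUE)
class(b_chr$animal)   # "character"
class(b_fac$animal)   # "factor"
```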
I mention this because you are trying to do a match() with numeric values, so: what should the behavior be when we compare with strings? In fact, since this is a data.frame, the same question should be asked for each possible data type.
Now, suppose the following scenario:
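One possible version of this scenario, as a sketch: the data.frame from the question, built with stringsAsFactors = FALSE (an assumption, chosen so that the text column stays character in any R version).

```r
# Same data as in the question; stringsAsFactors = FALSE is an assumption
b <- data.frame(
  animal = c("perro", "gato", -999),
  num = c(1, -98, 3),
  stringsAsFactors = FALSE
)

# One column of strings (including "-999") and one numeric column
sapply(b, class)
```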
Let's solve the first problem: how to replace the values -98 and "-999", a number and a string respectively. We have already seen that %in% does not work for us, so what we can do is compare with == for each value we are looking for. This will generate a list where each element is a logical array the size of the data.frame, one for each value sought. Additionally, this operator does automatic coercion, so we can successfully compare (if that is what you are looking for) the string "-999" with the number -999. The idea then is to combine these arrays into one, where each TRUE marks a place that we want to replace with NA.
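The comparison step described above could look roughly like this. It is a sketch, not necessarily the exact code of the answer; `sentinels` and `masks` are names introduced here.

```r
b <- data.frame(animal = c("perro", "gato", -999), num = c(1, -98, 3),
                stringsAsFactors = FALSE)

# Values that encode "no data" (taken from the question)
sentinels <- c(-98, -999)

# One logical matrix per sentinel, each with the dimensions of b
masks <- lapply(sentinels, function(v) b == v)

masks[[1]]   # TRUE only where num is -98
masks[[2]]   # TRUE only where animal is the string "-999"
```

Note that the second matrix marks a cell in the character column, which is the automatic coercion at work: the number -999 is compared as the string "-999".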
Now, using Reduce and combining the arrays with a logical or, we obtain the places that have to be replaced. With that, your example should work.
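A minimal sketch of this idea follows. It assumes a replacement function whose `value` is the vector of sentinel values rather than a logical index, which is a change from the interface in the question and may not match the answer's original code exactly. The extra step that turns NA comparisons into FALSE is also an addition, so that cells that are already NA are left alone.

```r
# Replacement function: value holds the sentinel values to turn into NA
`is.na.m<-` <- function(x, value) {
  # One logical mask per sentinel (vector or matrix, depending on x)
  masks <- lapply(value, function(v) x == v)
  # Combine all masks with a logical OR
  hit <- Reduce(`|`, masks)
  # Comparisons against existing NA give NA; treat those as "no match"
  hit <- hit & !is.na(hit)
  x[hit] <- NA
  x
}

# Works on a data.frame ...
b <- data.frame(animal = c("perro", "gato", -999), num = c(1, -98, 3),
                stringsAsFactors = FALSE)
is.na.m(b) <- c(-98, -999)
b

# ... and on a plain vector
x <- c(1:3, -99, -999, -98)
is.na.m(x) <- c(-99, -999, -98)
x
```

After this, the third animal and the second num are NA, and the last three elements of x are NA.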
Comments:
is.na.m(b) <- NA, but your example is totally valid.
x[c(value)]: directly x[value] is enough.
is.na.m(b) without the assignment: you're calling another function, not is.na.m<-() but is.na.m(), so if it's not defined, that's where you get the error when you try to use it with lapply.
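The last comment can be verified with a short sketch: defining `is.na.m<-` does not create a function called is.na.m. If you want to keep the original function from the question and still use lapply, one option (an assumption about what you are after, not the only way) is to do the assignment inside an anonymous function applied to each column.

```r
# The original replacement function, using x[value] directly
`is.na.m<-` <- function(x, value) {
  x[value] <- NA
  x
}

exists("is.na.m<-")   # TRUE: the replacement function is defined
exists("is.na.m")     # FALSE, unless something else defines it

b <- data.frame(animal = c("perro", "gato", -999), num = c(1, -98, 3),
                stringsAsFactors = FALSE)

# Column by column, %in% receives a vector, so it behaves as expected
b[] <- lapply(b, function(col) {
  is.na.m(col) <- col %in% c(-98, -999)
  col
})
b
```

Assigning into b[] keeps b as a data.frame instead of flattening it, which is what unlist() would do.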