First of all, I want to say that I searched exhaustively before asking, without finding a satisfactory answer. The variable names in the data set I want to use are quite long and confusing, so I want to rename them, but by referring to their position. Some of the variable names are:
[1] "Fecha"
[2] "Delegación"
[3] "Clave.INEGI.AGEE"
[4] "Código.Penal.Federal..CPF._Delitos.contra.la.salud_.Producción"
[5] "Código.Penal.Federal..CPF._Delitos.contra.la.salud_Transporte"
[6] "Código.Penal.Federal..CPF._Delitos.contra.la.salud_Tráfico"
[7] "Código.Penal.Federal..CPF._Delitos.contra.la.salud_Comercio"
[8] "Código.Penal.Federal..CPF._Delitos.contra.la.salud_Suministro"
[9] "Código.Penal.Federal..CPF._Delitos.contra.la.salud_Posesión"
[10] "Código.Penal.Federal..CPF._Delitos.contra.la.salud_Otros"
It's easy to change the name with rename using:
rename(nuevo_nombre = viejo_nombre)
when the names are not so long. However, that is not the case here. I found solutions like:
rename(!!produccion := names[4], !!transporte:= names[5],
!!trafico :=names[6], !!comercio:= names[7], !!suministro := names[8],
!!posesion := names[9], !!otros:= names[10])
Error in quos(...) : objeto 'produccion' no encontrado
or:
rename("produccion" = names[4], "transporte"= names[5],
"trafico" =names[6], "comercio"= names[7], "suministro" = names[8],
"posesion" = names[9], "otros"= names[10])
Error: Expressions are currently not supported in `rename()`
The second error apparently means that this form was valid in some earlier version of dplyr but no longer is. My question is: is it no longer possible to rename a variable by position using dplyr's rename()? What alternatives are there, particularly if I want to keep using the %>% operator, since it gives a clear reading of how the data set is being manipulated?
I appreciate any feedback and guidance.
Have you tried the other functions in the rename family, such as rename_at or rename_all, which can take a function (like the ones @mpaladino recommended)? The example is more descriptive:
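As a minimal sketch of that idea, and assuming dplyr 1.0 or later is installed, rename_with() (the current counterpart of rename_at and rename_all) takes a function and a column selection, and the selection can be given by position. The data frame and its column names below are hypothetical stand-ins, not the asker's data.

```r
library(dplyr)

# Hypothetical toy data with long, prefixed names (not the real data set).
datos <- data.frame(
  Fecha = "2015-01",
  Prefijo.largo..X._Grupo_Transporte = 1,
  Prefijo.largo..X._Grupo_Comercio = 2
)

# The function receives the names of the selected columns (here columns 2
# and 3, chosen by position) and must return one new name per column.
renombrado <- datos %>%
  rename_with(~ c("transporte", "comercio"), .cols = 2:3)

names(renombrado)
```

Because the new names are matched to the selected columns in order, the same caveat about positions shifting inside a chain of functions applies here.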
I frequently have the same problem when working with survey data sets: the variable names are the questions themselves, so they are long, with spaces and special characters. It is awful to keep retyping them, no matter how good RStudio's autocomplete is.
It is a somewhat long answer, hopefully it will serve for your case and for us to share strategies to deal with this problem.
As you mention, dplyr::rename() doesn't work with index numbers. This has to do in part with Wickham's philosophy, which comes more from dealing with relational databases, where index numbers have no place, and also with the non-standard evaluation used by the tidyverse. Those functions are easy to use: you don't need quotes around strings and you don't need to explicitly create vectors with c() to make multiple calls, but for that reason they have trouble mixing numbers and strings in function arguments. That's why in the tidyverse column names are always used and almost never index numbers. select() is an exception.
Preamble
First, a more general comment: I don't think it's good practice to use index numbers to subset a data.frame or for manipulations like renaming. Why? Because that only works if your data structure is stable and the index numbers and names always match. But if you are using the %>% operator you are treating your data structure dynamically (within the chain of functions). So you run the risk that, when you drop a variable, the following ones shift down by one position, and a change based on index numbers will not give the result you are looking for. I may be exaggerating, and there are contexts in which it is not so risky, but I prefer never to use what is potentially a bad practice, to "forget" how to do it, and not to resort to that "shortcut" in another context in which it is risky.
Preferred option: a dictionary and short names
The alternative I use is to generate a set of short names for the columns of my data set, so that I refer to them by short name while manipulating them. For that I build (or import, if the source of the data provides one) a dictionary of variables, diccionario: a simple data.frame with at least two columns, Variable and Etiqueta. Variable is the short name, which I use to refer to that column during data manipulation; Etiqueta is a long and very descriptive name that I use in charts or tables.
If you want, you can build it in Excel or similar, export it to .csv and import it into the R session where you need it, or generate it directly in R. By convention I always call it diccionario in the R environment.
In practice I use the dictionary together with two simple ad hoc functions:
nombrar_largo() and nombrar_corto(). The first makes the match and changes names from short to long; the second does the reverse. As written, they don't fail when there is a mismatch in the number of columns (i.e. you could use them after a select()) and they also don't fail if there are new columns (they leave them unchanged), but they do require that the order of the column names in x and that of the rows in df be the same.
What they allow you to do is carry out all the processing and manipulation with the short names and, when you want to generate, for example, a chart showing the long names, simply add nombrar_largo() %>% at the appropriate place in your chain of functions, and that's it.
Here is a practical example with a data set that I am working with, which you can download here: http://portalanterior.ine.mx/archivos2/s/DECEYEC/EducacionCivica/Base_datos_Informe_Pais.xlsx
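As a stand-in that does not need the spreadsheet, here is a minimal sketch of how a dictionary and the two helpers might look. It is not the original code: the dictionary contents and the toy data are hypothetical, and the lookup uses match() instead of %in%, so this version does not depend on the order of columns and rows.

```r
# Toy dictionary (hypothetical values): short name and long label per column.
diccionario <- data.frame(
  Variable = c("fecha", "transporte", "comercio"),
  Etiqueta = c("Fecha de registro",
               "Delitos contra la salud: transporte",
               "Delitos contra la salud: comercio"),
  stringsAsFactors = FALSE
)

# Short -> long. Columns missing from the dictionary keep their current name.
nombrar_largo <- function(x, df = diccionario) {
  pos <- match(names(x), df$Variable)
  names(x)[!is.na(pos)] <- df$Etiqueta[pos[!is.na(pos)]]
  x
}

# Long -> short: the reverse lookup.
nombrar_corto <- function(x, df = diccionario) {
  pos <- match(names(x), df$Etiqueta)
  names(x)[!is.na(pos)] <- df$Variable[pos[!is.na(pos)]]
  x
}

# Toy data: two columns covered by the dictionary and one new column.
datos <- data.frame(fecha = "2015-01", comercio = 3, extra = TRUE)

largo <- nombrar_largo(datos)
names(largo)                 # long labels, "extra" left unchanged
names(nombrar_corto(largo))  # back to the short names
```

Since both helpers take the data.frame as their first argument and return a data.frame, they can be dropped into a %>% chain in the way described above.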
These functions could be improved, since they require some fixed names (diccionario, Etiqueta, Variable) and they also have the ordering problem. They are also a bit slow for very wide data sets, because %in% is rather slow. But since they work for what I need them for, I haven't gotten around to improving them.
Alternative method with the function nombrar()
If building the dictionary is too much work and you prefer a quicker solution (in the short term), you can use a variation of the function setNames(), which I called nombrar(). setNames() takes two arguments, a data.frame and a vector of names; the problem is that it doesn't handle index numbers well. So I modified it a bit so that it takes a third argument: a numeric vector with the indices. I also changed the order of the arguments: the first is the data.frame, so when using %>% you can leave it empty; the second is a numeric vector with the indices; and the third is a character vector with whatever names you like for the variables selected by number.
Of course, for each use you have to make sure that the index vector and the name vector have the same length and match by position. But because it takes a data.frame as input and returns the same structure as output, you can use it with the %>% operator.
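A minimal sketch of a function with that argument order could look like the following. It is a reconstruction from the description, not the original nombrar(); the argument names and the new column names are hypothetical.

```r
# Hypothetical reconstruction: rename the columns found at positions `indices`.
# datos:   a data.frame (first argument, so a pipe can supply it)
# indices: numeric vector with the positions of the columns to rename
# nombres: character vector with the new names, in the same order
nombrar <- function(datos, indices, nombres) {
  # Both vectors must pair up one to one.
  stopifnot(length(indices) == length(nombres))
  names(datos)[indices] <- nombres
  datos
}

# Rename columns 1, 2 and 4 of mtcars (mpg, cyl and hp).
renombrado <- nombrar(mtcars, c(1, 2, 4), c("millas", "cilindros", "caballos"))
names(renombrado)[1:4]
```

With magrittr or dplyr loaded, the same call can be written as mtcars %>% nombrar(c(1, 2, 4), c("millas", "cilindros", "caballos")).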
Option 3: an ad hoc gsub()
Sometimes the short name we are looking for is contained in the long, awkward name, and it is just a matter of removing the leftover part with a regular expression or a literal string. You can use gsub() on the names() vector of your data set; in this case I use setNames() to wrap the process in a function.
In your case it would be something like:
Cheers!
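For the gsub() option, a minimal sketch might look like this. The helper name acortar_nombres() and the pattern are assumptions, and the toy column names imitate the ones in the question with the accents dropped, so the pattern would need adjusting for the real data.

```r
# Hypothetical helper: remove everything up to the last underscore in each
# column name, then put the cleaned names back with setNames().
acortar_nombres <- function(datos, patron = "^.*_") {
  setNames(datos, gsub(patron, "", names(datos)))
}

# Toy data imitating the long names in the question (accents dropped).
datos <- data.frame(
  Fecha = "2015-01",
  Codigo.Penal.Federal..CPF._Delitos.contra.la.salud_Transporte = 1,
  Codigo.Penal.Federal..CPF._Delitos.contra.la.salud_Comercio = 2
)

# Names without an underscore, such as Fecha, are left as they are.
names(acortar_nombres(datos))
```

Since the function takes and returns a data.frame, it can also sit inside a %>% chain.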
Going through some old R notes, I came up with an answer that may be useful to those who read this post. The solution is to use the %>% operator and base R functions.
I will use the mtcars data set as an example.
First, let's look at the variable names of the data set.
Suppose I am interested in selecting the variables mpg, cyl and hp (remember that the point of the problem is to rename variables whose names are not very simple, but it works the same way anyway).
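A sketch of those two steps, using only base R; the new names are hypothetical, and the calls are nested here so that the block runs without loading any package.

```r
# Step 1: look at the variable names of the data set.
names(mtcars)

# Step 2: keep mpg, cyl and hp, then rename them in the order selected.
seleccion <- setNames(
  subset(mtcars, select = c(mpg, cyl, hp)),
  c("millas_por_galon", "cilindros", "caballos")
)
head(seleccion, 3)

# With magrittr or dplyr loaded, the same steps read as a chain:
# mtcars %>%
#   subset(select = c(mpg, cyl, hp)) %>%
#   setNames(c("millas_por_galon", "cilindros", "caballos"))
```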
As can be seen, the result is the desired one. The crux of all this is to use subset and setNames, which are base R functions. It is an alternative solution that is also quite effective. I hope someone finds it useful.
With the function colnames() it is very easy to do what you ask. For example, if I want the first column of my data set to be called "Fecha", the instruction is colnames(data)[1] = "Fecha", where the 1 indicates that it is the first column whose name you want to change.
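The same assignment also accepts several positions at once. A short sketch on a hypothetical toy data frame:

```r
# Hypothetical toy data frame with placeholder names.
datos <- data.frame(a = 1:2, b = 3:4, c = 5:6)

# Rename a single column by position.
colnames(datos)[1] <- "Fecha"

# Rename several columns at once; positions and names pair up in order.
colnames(datos)[2:3] <- c("produccion", "transporte")

colnames(datos)
```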