Whether it is when asking a question on this site or when we need to share an example with a colleague, what elements should we take into account to ensure the reproducibility of the example? (information, data, structures, etc.)
We are going to translate and lightly adapt Joris Meys's excellent answer on the English site.
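A good minimal, reproducible example should consist of the following elements:

- a minimal data set, needed to reproduce the problem;
- the minimal executable code needed to reproduce the problem on that data set;
- the additional information needed about the packages, the R version, and the operating system;
- for random processes, a seed (set with `set.seed()`) so that the results can be reproduced.

It is often useful to examine the examples in the help files of the functions used. In general, all the code given there meets the requirements of a minimal reproducible example: data is provided, minimal code is provided, and everything is executable.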
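As a small sketch of how help-file examples can be inspected, the base function `example()` can either return the example code or run it. The choice of `median()` here is only an assumption for illustration.

```r
# Show the example code from the help page of median() without running it
ex_lines <- example("median", package = "stats", give.lines = TRUE)
head(ex_lines)

# Run the same examples; in an interactive session example(median) is enough
example("median", package = "stats", ask = FALSE)
```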
Producing a minimal data set
In most cases, this can easily be done by providing a vector, data frame, matrix, etc. with some example values. Alternatively, you can point to one of the built-in `datasets`, which are supplied with most packages. A full list of the built-in data sets can be viewed with the command `library(help = "datasets")`. It gives a brief description of each data set, and more information can be obtained with, for example, `?mtcars`, where `mtcars` is one of the listed data sets. Other packages may contain additional data sets.

Creating a vector is easy. Sometimes you need to add some randomness, and there is a whole range of functions for that. With `sample()` you can shuffle a vector, or build a random vector from only a few values. `letters` is a useful vector containing the alphabet, which can be used to construct factors.

Some examples:
- `x <- rnorm(10)` for a normal distribution, `x <- runif(10)` for a uniform distribution, ...
- `x <- sample(1:10)` for the vector `1:10` in random order.
- `x <- sample(letters[1:4], 20, replace = TRUE)` for a vector of 20 values drawn at random from the first four letters.
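Because these functions draw random values, a seed is needed for others to get the same numbers. A minimal sketch combining the calls above with `set.seed()`; the seed value and the object names are arbitrary choices for illustration.

```r
# Fix the seed so everyone gets the same "random" values
set.seed(42)

x_norm <- rnorm(10)      # normal distribution
x_unif <- runif(10)      # uniform distribution on [0, 1]
x_perm <- sample(1:10)   # 1:10 in random order

# Factor built from 20 random draws of the first four letters
x_fact <- factor(sample(letters[1:4], 20, replace = TRUE))

# Re-setting the same seed reproduces the same draws
set.seed(42)
identical(rnorm(10), x_norm)   # TRUE
```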
For matrices, you can use `matrix()`, for example: `matrix(1:10, ncol = 2)`
Data frames can be created with `data.frame()`. Take care when building one: do not make the data frame too complicated, and do not add variables that are not going to be used. An example:
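A possible sketch of such a small data frame follows. The object name `my_data`, the column names, and the values are all made up for illustration.

```r
set.seed(1)  # make the random column reproducible

# Small illustrative data frame with only the columns the question needs
my_data <- data.frame(
  id    = 1:6,
  group = factor(rep(c("a", "b"), each = 3)),
  date  = as.Date("2019-01-01") + 0:5,
  value = rnorm(6)
)

# Compact overview of the column types
str(my_data)
```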
In some cases, the specific format of each variable/column must be preserved. For this, you can use any of the provided `as.<SomeType>` functions, such as `as.factor`, `as.Date`, `as.xts`, etc.

Copying your own data
If you have data that would be too difficult to construct with these methods, or that is needed to understand the problem (e.g. to diagnose a problem converting a date from a string, you have to "see" the format of the actual data, not an example that is surely correct), you can always take a subset of your original data using, for example, `head()`, `subset()`, or indices. Then you can use `dput()` to give us something that can be put into R immediately:

In some cases, a data frame can have many columns that are handled as factors. Taking a `subset` or a `head` gives a smaller sample, but it still carries along the factor levels that are not used in that sample. What we can do in these cases is drop the levels that are not used in the sample with `droplevels()`, for example:

Note that `Species` now has only one level, `.Label = "setosa"`, because those are effectively the only values in the sample: `Species = structure(c(1L, 1L, 1L, 1L)`
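The subsetting, `droplevels()`, and `dput()` steps can be sketched on the built-in `iris` data set. The object name `sample_df` and the choice of four rows are assumptions for illustration. The exact text printed by `dput()` can vary between R versions, so it is not reproduced here.

```r
# Take a small sample of a built-in data set
sample_df <- head(iris, 4)

# The unused factor levels are still carried along
levels(sample_df$Species)   # "setosa" "versicolor" "virginica"

# Drop the levels that do not occur in the sample
sample_df <- droplevels(sample_df)
levels(sample_df$Species)   # "setosa"

# Print a representation that can be pasted straight into R
dput(sample_df)
```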
Another caveat for `dput()` is that it does not work for keyed `data.table` objects or for grouped `tbl_df` objects (class `grouped_df`) from `dplyr`. In these cases, you can convert the object to a plain data frame before sharing it: `dput(as.data.frame(my_data))`.

In the worst case, you can give a textual representation that can be read with `read.table()`:

If the data is such that it is impractical to share it in any of the ways above, consider using an external service; for example, pastebin.com could be used for up to 0.5 MB:

d <- read.table("http://pastebin.com/raw.php?i=m1ZJuKLH")

or some other service. Remember that you can save any object with `save(df, file = "archivo.Rda")` and then load it with `load("archivo.Rda")`.

If the data cannot be shared in any way, we should at least be able to report the structure and class of the objects. Some of these functions usually provide relevant information:
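A sketch of base R functions commonly used for this purpose, shown on the built-in `iris` data set. This particular selection of functions is an assumption, not a list taken from the original answer.

```r
# Summaries that describe an object without sharing its full contents
str(iris)               # compact structure: column types and first values
class(iris)             # "data.frame"
dim(iris)               # 150 5
sapply(iris, class)     # class of each column
summary(iris$Species)   # counts per factor level
```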
Sharing the minimal code
This should be the easy part, but it often isn't. What you shouldn't do is:
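What you should do is:

- Indicate which packages are used, for example with `library("randomForest")` or `require("ggplot2")`.
- If you create files or open connections, add code to delete or close them (for example with `unlink()`).
- If you change options, restore them afterwards (e.g. `op <- par(mfrow = c(1, 2))` ... some code ... `par(op)`).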
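A minimal sketch of code that cleans up after itself. The temporary file and the `digits` option are hypothetical stand-ins; the same save-and-restore pattern applies to graphical parameters with `op <- par(...)` and `par(op)`.

```r
# Load the packages the example needs at the top (a base package here)
library(utils)

# If the example creates a file, delete it again when done
tmp_file <- tempfile(fileext = ".csv")   # hypothetical temporary file
write.csv(head(mtcars), tmp_file)
file.exists(tmp_file)   # TRUE
unlink(tmp_file)
file.exists(tmp_file)   # FALSE

# If the example changes options, keep the old values and restore them
old_opts <- options(digits = 3)
print(pi)               # 3.14
options(old_opts)
```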
Giving additional information
In most cases, the R version and the operating system will suffice. When conflicts between packages arise, the output of `sessionInfo()` can be of great help. When connections to other applications are involved (whether via ODBC or anything else), their version numbers should also be provided and, if possible, the relevant configuration information as well.

If you are running R in RStudio, `rstudioapi::versionInfo()` can be useful for reporting your RStudio version.

If you have a problem with a specific package, you may want to give its version by providing the output of `packageVersion("package name")`.
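A short sketch of how this information can be collected in a plain R session. The package `stats` is only a stand-in for whichever package the question is about.

```r
# R version as a single string
R.version.string

# Full session details: R version, operating system, attached packages
info <- sessionInfo()
print(info)

# Version of one specific package ("stats" is only a stand-in here)
pkg_version <- packageVersion("stats")
pkg_version
```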
Reprex
This package, which you can install on demand with `install.packages("reprex")` or which, if you use RStudio, you already have built in as an addin, does something very simple and tremendously useful. Let's say you have code like the following:

You select it, copy it to the clipboard, and call `reprex::reprex()` or, in RStudio, go to `Addins -> Reprex selection`, and it will magically generate the full code to paste as an example, for instance here on SOes.

Which would end up being:
Created on 2019-05-21 by the reprex package (v0.3.0)
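As an illustration of this workflow, here is a hypothetical snippet (not the one from the original post) in the form reprex renders it, with each line of output attached as a `#>` comment:

```r
# Hypothetical code to share: copy the two lines below and call reprex::reprex()
x <- c(2, 4, 6)
mean(x)
#> [1] 4
```

Because the output lines are comments, the whole block can be pasted back into R and run unchanged.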