I have a database of this type (here is a minimal sample containing data from two simulations for each Round, Level and Condition):
Edited data:
df <- data.frame(
  Sim = c(1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2),
  Ronda = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
  Condicion = c('A1','A1','A2','A2','A1','A1','A2','A2','B1','B1','B2','B2','B1','B1','B2','B2',
                'A1','A1','A2','A2','A1','A1','A2','A2','B1','B1','B2','B2','B1','B1','B2','B2'),
  Nivel = c(1,1,1,1,2,2,2,2,1,1,1,1,2,2,2,2,1,1,1,1,2,2,2,2,1,1,1,1,2,2,2,2),
  Salida = c(3,2.5,2.1,1.9,2.8,2.3,2.0,1.6,2.6,2.7,1.3,1.2,2.4,2.3,1,1.1,
             2,1.3,1.3,0.9,2,2.1,2.1,1.2,2,1.7,1.2,1,2,1.3,0.5,0.4)
)
I would now like to manipulate this database to obtain the relative reduction that the "Salida" (Output) corresponding to A2 produces with respect to A1, i.e. 1 - (A2/A1), and that B2 produces with respect to B1, i.e. 1 - (B2/B1), for each Simulation, Round and Level. In other words, I am looking for the reduction percentage that n2 produces in n1.
So, what I want to generate is something like this:
Ronda Condición Nivel Resultado
1 1-(A2/A1) 1 0.3
1 1-(A2/A1) 1 0.24
...
1 1-(B2/B1) 2 0.5
1 1-(B2/B1) 2 0.56
...
2 1-(A2/A1) 1
2 1-(A2/A1) 1
...
2 1-(B2/B1) 1
2 1-(B2/B1) 1
...
My attempts so far have been aimed at subsetting the data with `subset` and trying to do the calculations with `tapply`. I would appreciate any help.
Second edit:
Here is an example of the real data. The goal is the same. In this case, Condition is a factor with 6 levels, so the aim is to create a new data frame containing Simulation, Round, Cofactor and two new columns: "Result" and "New Condition" (a condition with 3 levels). For example:
A = 1 - (Heterogeneity OTA / Homogeneity OTA)
B = 1 - (Heterogeneity C / Homogeneity C)
C = 1 - (Heterogeneity PR / Homogeneity PR)
This would allow you to easily graph the data later.
One solution uses `dplyr` and `tidyr`. The idea is to move to wide format, so that there is one column for each Condition (A1, A2, B1, B2); then the division and subtraction are easy to do. The problem I found is that there was no identifier for the simulations: in your example data, each Round, Level and Condition group was left with two rows.
I assumed each row corresponded to a different simulation and gave them a unique identifier. If in your real data you have a simulation/patient/whatever identifier, you should use it; the idea is that each group is unique.
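A minimal sketch of that idea (my reconstruction, not this answer's original code), using the edited example data from the question and taking the `Sim` column as the simulation identifier:

```r
library(dplyr)
library(tidyr)

# example data from the question (compact but equivalent construction)
df <- data.frame(
  Sim = rep(1:2, 16),
  Ronda = rep(1:2, each = 16),
  Condicion = rep(rep(c("A1","A2","A1","A2","B1","B2","B1","B2"), each = 2), 2),
  Nivel = rep(rep(1:2, each = 4), 4),
  Salida = c(3,2.5,2.1,1.9,2.8,2.3,2.0,1.6,2.6,2.7,1.3,1.2,2.4,2.3,1,1.1,
             2,1.3,1.3,0.9,2,2.1,2.1,1.2,2,1.7,1.2,1,2,1.3,0.5,0.4)
)

resultado <- df %>%
  mutate(grupo = substr(Condicion, 1, 1),               # A or B
         n = paste0("n", substr(Condicion, 2, 2))) %>%  # n1 or n2
  select(-Condicion) %>%
  spread(n, Salida) %>%          # wide: one column per sub-condition
  mutate(Resultado = 1 - n2 / n1)
```

`spread()` keeps `Sim`, `Ronda`, `Nivel` and the new `grupo` column as the row identifier, so each of the 16 groups ends up on one row with an `n1` and an `n2` column, and the reduction is a single `mutate()`.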
If your data is very clean and you can trust the alphabetical order of `Condición`, the following alternative would work for an arbitrary number of groups of `Condición`, although you should be careful with the separator on the 5th line, which is completely ad hoc.
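A sketch of that alternative with the same example data (my reconstruction, not the original code; here the ad hoc separator is the `sep = 1` in `separate()`, an assumption that splits `Condición` after its first character):

```r
library(dplyr)
library(tidyr)

# example data from the question (compact but equivalent construction)
df <- data.frame(
  Sim = rep(1:2, 16),
  Ronda = rep(1:2, each = 16),
  Condicion = rep(rep(c("A1","A2","A1","A2","B1","B2","B1","B2"), each = 2), 2),
  Nivel = rep(rep(1:2, each = 4), 4),
  Salida = c(3,2.5,2.1,1.9,2.8,2.3,2.0,1.6,2.6,2.7,1.3,1.2,2.4,2.3,1,1.1,
             2,1.3,1.3,0.9,2,2.1,2.1,1.2,2,1.7,1.2,1,2,1.3,0.5,0.4)
)

reducciones <- df %>%
  separate(Condicion, into = c("grupo", "sub"), sep = 1) %>%  # "A1" -> "A", "1"
  group_by(Sim, Ronda, Nivel, grupo) %>%
  arrange(sub, .by_group = TRUE) %>%                # relies on alphabetical order
  mutate(Resultado = 1 - Salida / lag(Salida)) %>%  # n2 relative to n1
  ungroup() %>%
  filter(!is.na(Resultado))                         # keep only the n2 rows
```

Because `lag()` is applied within each group after sorting, this works for any number of sub-conditions per group, but only as long as the sort order really puts the baseline first.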
Answer with real data
Diagnosis
Reviewing the data, I see that the rows are octuplicated. Perhaps the data you uploaded is missing a column that separates those groups of 8 but, as it stands, that information is redundant. In the answer I use `distinct()` to remove the repeated rows; otherwise `spread()` will fail (and rightly so).

Solution
With real data it is difficult to use the `x/lag(x)` approach, because it depends on alphabetical or numerical order, and in this case there is no "natural" order, as there would be when working with dates. Considering that you already know the calculations you want to do, I think the best solution is to:

1. go to wide data: one column for each level of the Condition factor;
2. use a `mutate()` that performs the calculations on those columns;
3. since the result will be "wide", go back to long format with `gather()`.
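Since the real data is not included here, the following is only a sketch with a hypothetical stand-in (the column names `sim`, `ronda`, `cofactor`, `condicion`, `salida` and all values are assumptions); the pipeline follows the steps just described, with the `mutate()` on line 5:

```r
library(dplyr)
library(tidyr)
library(janitor)

# hypothetical stand-in for the real data (names and values assumed)
datos <- data.frame(
  sim = rep(1:2, each = 6),
  ronda = 1,
  cofactor = 1,
  condicion = rep(c("Heterogeneity OTA", "Homogeneity OTA",
                    "Heterogeneity C",  "Homogeneity C",
                    "Heterogeneity PR", "Homogeneity PR"), 2),
  salida = c(2, 4, 1, 2, 3, 4, 1, 4, 2, 8, 1, 5)
)

resultado <- datos %>%
  distinct() %>%                 # drop repeated rows, or spread() fails
  spread(condicion, salida) %>%  # one column per Condition level
  clean_names() %>%              # "Heterogeneity OTA" -> heterogeneity_ota
  mutate(a = 1 - heterogeneity_ota / homogeneity_ota,  # line 5: the calculations
         b = 1 - heterogeneity_c / homogeneity_c,
         c = 1 - heterogeneity_pr / homogeneity_pr) %>%
  select(sim, ronda, cofactor, a, b, c) %>%
  gather("nueva_condicion", "resultado", a, b, c)  # back to long
```

Note that `clean_names()` is applied after `spread()`, because it is the spread step that turns the Condition levels (which contain spaces) into column names.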
I use the `janitor::clean_names()` function to avoid having to wrap names with spaces in backticks. It is optional, but if you don't use it you should correct the column names in the `mutate()` on line 5.

I think the result is what you are looking for; however, reviewing the results with `tail()`, some `-Inf` (negative infinities) appear, probably because there are divisions by 0 and R treats them as infinite numbers. With `filter(nueva_condicion == -Inf)` you can see that there are 2303 potentially problematic rows, which matters especially if you plan to graph later.

First of all, I understand the same as @mpaladino: there is duplicated information, with 8 identical rows per observation. First we remove these duplicates (technically it is not necessary), but it is important to regenerate an `id` for each group of observations, as follows.

Now, one way to solve it is to think about the problem from the perspective of relational databases. The idea is to start from a master table that records the existing relationships. Since there are three relationships, it is not difficult to define them manually:
With this "master table" we only need to set up `join`s so that each observation is brought together in a single row, and then simply do the calculations. This is quite handy because, if we eventually display all the variables, we can check the result manually.
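Putting the pieces together, here is a sketch of this approach. The real data is not available, so the data frame, its column names (`Sim`, `Ronda`, `Cofactor`, `Condicion`, `Salida`) and its values are all hypothetical; only the master-table-plus-joins mechanics are the point:

```r
library(dplyr)
library(tibble)

# hypothetical stand-in for the real data, with each row octuplicated
base <- expand.grid(
  Sim = 1:2, Ronda = 1:2, Cofactor = 1:2,
  Condicion = c("Heterogeneity OTA", "Homogeneity OTA",
                "Heterogeneity C",  "Homogeneity C",
                "Heterogeneity PR", "Homogeneity PR"),
  stringsAsFactors = FALSE
)
set.seed(1)
base$Salida <- round(runif(nrow(base), 1, 3), 2)
datos <- base[rep(seq_len(nrow(base)), each = 8), ]

# remove the octuplicates; with a real simulation id you would instead
# regenerate it here, e.g. group_by(...) %>% mutate(id = row_number())
datos <- distinct(datos)

# master table: each new condition with its numerator and denominator
maestra <- tribble(
  ~nueva_condicion, ~numerador,          ~denominador,
  "A",              "Heterogeneity OTA", "Homogeneity OTA",
  "B",              "Heterogeneity C",   "Homogeneity C",
  "C",              "Heterogeneity PR",  "Homogeneity PR"
)

# two joins bring the numerator and denominator of each observation
# together on one row; the calculation is then a plain mutate()
resultado <- maestra %>%
  inner_join(datos, by = c("numerador" = "Condicion")) %>%
  inner_join(datos,
             by = c("Sim", "Ronda", "Cofactor", "denominador" = "Condicion"),
             suffix = c("_num", "_den")) %>%
  mutate(resultado = 1 - Salida_num / Salida_den)
```

Because all the intermediate columns (`numerador`, `denominador`, `Salida_num`, `Salida_den`) are still present in `resultado`, each computed value can be checked by hand before dropping them.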