I have this DataFrame, named df:
samples | target | Ct | Ct Mean |
---|---|---|---|
C41 | B2M | 20.642399 | 20.680149 |
C41 | B2M | 20.717901 | 20.680149 |
C42 | ULK1 | 29.097883 | 29.110802 |
C42 | ULK1 | 29.123722 | 29.110802 |
C43 | TBP | 22.126412 | 21.4221565 |
C43 | TBP | 20.717901 | 21.4221565 |
The Ct Mean column is the average of the two Ct values for the same Sample and Target (it comes by default in the .xlsx file that I imported into pandas).
My intention is to check that the difference between the two Ct values for the same sample and target is no greater than ±1, taking the rows two at a time. For example, for C41 (B2M) the difference between 20.642399 and 20.717901 is less than 1, so the Ct values should be returned as they are. For C43 (TBP), however, the difference between 22.126412 and 20.717901 is greater than 1, so the two Ct values for C43 should be replaced with "Undetermined". The result should be this:
samples | target | Ct | Ct Mean |
---|---|---|---|
C41 | B2M | 20.642399 | 20.680149 |
C41 | B2M | 20.717901 | 20.680149 |
C42 | ULK1 | 29.097883 | 29.110802 |
C42 | ULK1 | 29.123722 | 29.110802 |
C43 | TBP | Undetermined | Undetermined |
C43 | TBP | Undetermined | Undetermined |
I have tried various ways to subtract two elements of the same column of a dataframe, but without success. The first was to apply a loop to that column that takes the difference between the two values and then jumps ahead to do the same for the following two:
    def loop(i):
        for i in range(0,96,2):
            if i-(i+1)>1 or i-(i+1)<(-1):
                i=="Undetermined"
            else:
                return i
    prueba = df["Ct"].apply(loop)
    prueba
Output:
0 0
1 0
2 0
3 0
4 0
..
91 0
92 0
93 0
94 0
95 0
Name: Ct, Length: 96, dtype: int64
NOTE: My dataframe has 96 rows; I have only shown the first 6 for the example. When printed, the result is all 0. While searching I saw that there is a method, .diff, that subtracts the previous element from each element, but I don't know how to apply it here. Another approach I thought of is:
df["Ct"].sub(df[0,len(df),2], axis=0)
Obviously it gives an error and the syntax is not correct either.
Solution
If `df` initially contains the data from the question, the code above produces a result in which the C43 rows are marked "Undetermined".
How it works
As you can see, everything is resolved in a single line: a groupby on "Sample" and "Target" followed by `.apply(myfunc)`.
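A minimal sketch of this approach, written to match the explanation that follows. The column names ("Sample", "Target", "Ct", "Ct Mean") are assumed from the question, and the body of `myfunc` is a reconstruction, not necessarily the original code. The explanation mentions only the "Ct Mean" column; this sketch also overwrites "Ct", since that is what the expected output in the question shows.

```python
import pandas as pd

# Sample data from the question (column names are an assumption).
df = pd.DataFrame({
    "Sample": ["C41", "C41", "C42", "C42", "C43", "C43"],
    "Target": ["B2M", "B2M", "ULK1", "ULK1", "TBP", "TBP"],
    "Ct": [20.642399, 20.717901, 29.097883, 29.123722, 22.126412, 20.717901],
    "Ct Mean": [20.680149, 20.680149, 29.110802, 29.110802, 21.4221565, 21.4221565],
})

def myfunc(g):
    # Work on a copy so the original frame is never modified.
    g = g.copy()
    # True if any consecutive pair of Ct values in the group differs by more than 1.
    if (g["Ct"].diff().abs() > 1).any():
        g["Ct"] = "Undetermined"
        g["Ct Mean"] = "Undetermined"
    return g

# group_keys=False keeps the original row index in the result.
result = df.groupby(["Sample", "Target"], group_keys=False).apply(myfunc)
print(result)
```

Depending on the pandas version, `apply` may warn about the grouping columns or leave them out of the result; the logic for "Ct" and "Ct Mean" is the same either way.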
This groups the dataframe by "Sample" and "Target", gathering the rows that share the same values in both columns into separate "sub-dataframes". The function `myfunc` is then applied to each of them.

Therefore, this function receives a group in its parameter `g`. The group is really a dataframe, but "filtered" so that it only contains rows with the same Sample and Target; in this case that is just a pair of rows. More generally, it receives a dataframe with an arbitrary number of rows, the same columns as the original `df`, and the same value of "Sample" and "Target" in every row.

The function decides whether the "Ct Mean" column of that group must be changed to "Undetermined" or left as it was. It then returns the group, so that `.apply()` can concatenate it with the remaining groups to build the resulting dataframe.

The key to deciding whether to write "Undetermined" is the condition of the `if` inside the function, which chains `g.Ct`, `.diff()`, `.abs()`, `> 1` and `.any()`.
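To make that condition concrete, here is a small trace of each step on the C43/TBP pair from the question. The one-column frame `g` is a stand-in for the group that the function would receive, and the intermediate variable names are only for illustration.

```python
import pandas as pd

# Stand-in for one group: the two Ct values of C43/TBP.
g = pd.DataFrame({"Ct": [22.126412, 20.717901]})

diffs = g["Ct"].diff()       # NaN for the first row, then current minus previous
abs_diffs = diffs.abs()      # drop the sign of each difference
over_limit = abs_diffs > 1   # boolean column; the NaN compares as False
flag = over_limit.any()      # True if at least one difference exceeds 1

print(diffs.tolist())
print(over_limit.tolist(), bool(flag))
```

For this pair the only real difference is about -1.41, so the boolean column is `[False, True]` and `any()` gives `True`.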
`g.Ct` is the Ct column of the received group. Applying `.diff()` to it subtracts the previous element of the column from each element. The first element has no predecessor, so its result is NaN, but for the following ones the result is the difference. This gives a column of differences. `.abs()` is then applied to that column to keep the absolute value, so that the sign does not matter. The result is a column of numbers (the first of them NaN).

That column is compared with `> 1`, which gives a new column, this time of booleans. There is a `True` for each difference greater than 1, and `False` for the rest. The first element, which is `NaN`, always gives `False`. In your case there is only one more element (because each group has only two rows), but in general the column could hold many booleans.

That column is passed to `any()`, which returns `True` if at least one of the elements is `True`. It returns `False` only if all of them are `False`.

As a result, if any of the values returned by `.diff()` is greater than 1 in absolute value, the body of the `if` runs and assigns "Undetermined" to the entire column, that is, to every row in that group. If the `if` condition is not met, `g` is left untouched.

Finally, the function returns `g` (whether it has been modified or not).

You could use `diff`, but then you would have to retrieve the second element of each group to see the difference and substitute the values of each of them. Conceptually that is perhaps a bit messier, so I will first propose an alternative and then explain how to do it with `diff`.

Option 1: Group by
What you are looking for is to group by the sample column, calculate the difference, and return a mask indicating whether the difference is greater than 1. With that mask you can then modify the values of your dataframe in the rows where the condition is met:
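A rough sketch of the grouping and mask step, assuming the columns are named "Sample", "Target" and "Ct". It groups by both Sample and Target (as the question asks) and measures the difference as max minus min within each group, which for a pair of rows is the absolute difference. The merge is an assumption inferred from the Ct_x / Ct_y column names: merging two frames that both have a "Ct" column produces those suffixes.

```python
import pandas as pd

# Sample data from the question (column names are an assumption).
df = pd.DataFrame({
    "Sample": ["C41", "C41", "C42", "C42", "C43", "C43"],
    "Target": ["B2M", "B2M", "ULK1", "ULK1", "TBP", "TBP"],
    "Ct": [20.642399, 20.717901, 29.097883, 29.123722, 22.126412, 20.717901],
    "Ct Mean": [20.680149, 20.680149, 29.110802, 29.110802, 21.4221565, 21.4221565],
})

keys = ["Sample", "Target"]
grouped = df.groupby(keys)["Ct"]

# One row per (Sample, Target): True when the Ct values span more than 1.
mask = ((grouped.max() - grouped.min()) > 1).reset_index()

# Both frames have a "Ct" column, so merge renames them:
# Ct_x holds the measured value, Ct_y holds the boolean mask.
merged = df.merge(mask, on=keys, how="left")
print(merged)
```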
At this point you have this dataframe:
The Ct_y column tells us whether the difference is greater than 1, so we can now replace the values:
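The replacement could look something like the sketch below. The small `merged` frame is hypothetical data in the shape just described (Ct_x for the measured value, Ct_y for the mask). The value columns are cast to `object` first so that a string can be stored alongside floats without a dtype conflict.

```python
import pandas as pd

# Hypothetical merged frame: Ct_x is the measured value, Ct_y the boolean mask.
merged = pd.DataFrame({
    "Sample": ["C42", "C42", "C43", "C43"],
    "Target": ["ULK1", "ULK1", "TBP", "TBP"],
    "Ct_x": [29.097883, 29.123722, 22.126412, 20.717901],
    "Ct Mean": [29.110802, 29.110802, 21.4221565, 21.4221565],
    "Ct_y": [False, False, True, True],
})

value_cols = ["Ct_x", "Ct Mean"]

# Allow strings and floats to share a column.
merged = merged.astype({col: object for col in value_cols})

# Overwrite the flagged rows only.
merged.loc[merged["Ct_y"], value_cols] = "Undetermined"

# Drop the helper column and restore the original column name.
result = merged.drop(columns="Ct_y").rename(columns={"Ct_x": "Ct"})
print(result)
```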
Our dataframe would look like this:
You could simplify this by skipping the True/False column and returning the difference directly, which would look something like this:
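One possible shape for that simplification is sketched here. Note that it uses `groupby(...).transform` to broadcast the per-group difference back onto every row, which is my own substitution and avoids the merge entirely; the column names are again assumed.

```python
import pandas as pd

# Two pairs from the question (column names are an assumption).
df = pd.DataFrame({
    "Sample": ["C42", "C42", "C43", "C43"],
    "Target": ["ULK1", "ULK1", "TBP", "TBP"],
    "Ct": [29.097883, 29.123722, 22.126412, 20.717901],
    "Ct Mean": [29.110802, 29.110802, 21.4221565, 21.4221565],
})

grouped = df.groupby(["Sample", "Target"])["Ct"]

# Difference within each group, repeated on every row of that group.
spread = grouped.transform("max") - grouped.transform("min")

# Cast to object so "Undetermined" can replace float values.
result = df.astype({"Ct": object, "Ct Mean": object})
result.loc[spread > 1, ["Ct", "Ct Mean"]] = "Undetermined"
print(result)
```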
Option 2: Diff
First we calculate diff (assuming the dataframe is sorted) and remove duplicates to keep the last element:
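A sketch of that step, under two assumptions: the rows of each pair are adjacent (the dataframe is sorted), and each (Sample, Target) has exactly two rows. The `diff` and `over_limit` column names are hypothetical helpers.

```python
import pandas as pd

# Sample data from the question, already sorted (column names are an assumption).
df = pd.DataFrame({
    "Sample": ["C41", "C41", "C42", "C42", "C43", "C43"],
    "Target": ["B2M", "B2M", "ULK1", "ULK1", "TBP", "TBP"],
    "Ct": [20.642399, 20.717901, 29.097883, 29.123722, 22.126412, 20.717901],
})

work = df.copy()

# Difference from the previous row. On the first row of each pair this compares
# against a different sample, so only the second row of each pair is meaningful.
work["diff"] = work["Ct"].diff().abs()

# keep="last" retains the second row of each pair, whose diff is the
# within-pair difference.
last = work.drop_duplicates(subset=["Sample", "Target"], keep="last")

flags = last.assign(over_limit=last["diff"] > 1)[["Sample", "Target", "over_limit"]]
print(flags)
```

The `flags` frame has one row per pair and plays the same role as the mask in option 1.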
With this dataframe we can now proceed exactly as in option 1.