I have the following Dataframe
prueba =
   M1   M2   M3   M4
0   1    1    1  NaN
1   2    3    3  NaN
2   3    2    2    1
3   4  NaN    1  NaN
4   1  NaN  NaN  NaN
5   1    3    2    2
6   3    3  NaN    1
7   2    2    3  NaN
8   1    3  NaN    1
9   6    4    5    5
I need to do two tasks for each of the rows:
If a column is empty (NaN) and a later column has a value, that value should move into the first empty column, leaving NaN in the others. In other words, the values are shifted to the left.
If a value is repeated within a row, it should remain only in the first column where it appears. For example, if M1 and M2 are equal, the value stays in M1 and M2 becomes NaN; if the value is repeated in several M columns, it stays only in the first one and the others become NaN.
I have tried the following options:
For the first task I tried a pairwise comparison. For example, for M2 and M3:
for row in prueba.itertuples():
prueba['M2']= prueba.where((prueba['M2'].isnull() & prueba['M3'].notnull()), prueba['M3'])
but it generates an error.
For the second task (this part works):
prueba.loc[prueba['M1']== prueba['M2'] , 'M2'] = 'NaN'
prueba.loc[prueba['M1']== prueba['M3'] , 'M3'] = 'NaN'
prueba.loc[prueba['M1']== prueba['M4'] , 'M4'] = 'NaN'
prueba.loc[prueba['M2']== prueba['M3'] , 'M3'] = 'NaN'
prueba.loc[prueba['M2']== prueba['M4'] , 'M4'] = 'NaN'
prueba.loc[prueba['M3']== prueba['M4'] , 'M4'] = 'NaN'
I am new to programming, and I would appreciate help solving the two tasks above. The running time of the solution matters because there is a lot of data.
The processed Dataframe should come out like this:
   M1   M2   M3   M4
0   1  NaN  NaN  NaN
1   2    3  NaN  NaN
2   3    2    1  NaN
3   4    1  NaN  NaN
4   1  NaN  NaN  NaN
5   1    3    2  NaN
6   3    1  NaN  NaN
7   2    3  NaN  NaN
8   1    3  NaN  NaN
9   6    4    5  NaN
For the second point, a general approach is pandas.Series.drop_duplicates, passing the argument keep="first" to keep only the first occurrence. A boolean mask built with pandas.Series.duplicated would also work.
For the first point I can't think of a vectorized form. It can be done by using pandas.DataFrame.apply over rows (axis=1) and calling, for each row, a Python function that uses the method pandas.Series.dropna to construct the new row.
We start from a DataFrame that allows us to reproduce your example:
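A minimal sketch of how the example data could be built, with the values copied from the table in the question. The name prueba comes from the question; using np.nan for the empty cells is an assumption about how the real data is stored.

```python
import numpy as np
import pandas as pd

# Values copied from the question's table; np.nan marks the empty cells.
prueba = pd.DataFrame(
    {
        "M1": [1, 2, 3, 4, 1, 1, 3, 2, 1, 6],
        "M2": [1, 3, 2, np.nan, np.nan, 3, 3, 2, 3, 4],
        "M3": [1, 3, 2, 1, np.nan, 2, np.nan, 3, np.nan, 5],
        "M4": [np.nan, np.nan, 1, np.nan, np.nan, 2, 1, np.nan, 1, 5],
    }
)
print(prueba)
```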
Now let's apply the idea explained before:
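One way this step might look, as a sketch rather than the exact code of the answer: a hypothetical helper compact_row drops the NaN of a row, keeps the first occurrence of each value, and renumbers the positions from 0. A few rows from the question are enough to show the behaviour.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Rows 0, 2, 3 and 6 of the question's table.
prueba = pd.DataFrame(
    [[1, 1, 1, nan], [3, 2, 2, 1], [4, nan, 1, nan], [3, 3, nan, 1]],
    columns=["M1", "M2", "M3", "M4"],
)

def compact_row(row):
    # Remove NaN first, then keep only the first occurrence of each value.
    values = row.dropna().drop_duplicates(keep="first")
    # Positional index 0, 1, 2... so the surviving values line up on the left.
    return values.reset_index(drop=True)

# axis=1 calls compact_row once per row; shorter rows are padded with NaN.
result = prueba.apply(compact_row, axis=1)
print(result)
```

Because apply with axis=1 runs a Python function per row, this is not vectorized and may be slow on very large frames.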
With this we obtain something that is quite close:
We just need to add the missing columns (columns with all NaN values) and rename the rest:
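A small sketch of that last step, assuming the intermediate result has positional column labels and fewer columns than the original. The names compacted, final and original_columns are illustrative, and the two rows stand in for the real intermediate frame.

```python
import numpy as np
import pandas as pd

original_columns = ["M1", "M2", "M3", "M4"]  # column names of the question's DataFrame

# Stand-in for the intermediate result: columns 0..2, one fewer than the original.
compacted = pd.DataFrame([[1, np.nan, np.nan], [3, 2, 1]])

# Reuse the first N original names for the N columns that exist...
compacted.columns = original_columns[: compacted.shape[1]]
# ...then reindex so the missing columns come back filled with NaN.
final = compacted.reindex(columns=original_columns)
print(final)
```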
Result:
I don't know if I understood correctly, but I think it's about:
Although this statement does not coincide with the one you have put, I think that in the end the result is the same, and expressed in this way it is clearer.
In fact, this suggests another way to calculate that result without using Pandas, but instead extracting the two-dimensional array underlying the dataframe. In this array it is a matter of going through it by rows and building with each one a set with its elements (the sets automatically eliminate the duplicates). After this transformation there will be rows with only two elements, others with four, etc.
Finally, a new dataframe can be built from all those sets. Since pandas makes all the rows the same length when building a dataframe, it will fill the missing elements with NaN.
The problem with the previous idea is that sets have no internal order, so the first row, for example, would give rise to a set with the elements 1, NaN, or perhaps with the elements NaN, 1. That is, the order in which the elements were added to the set is not preserved, and this is not convenient for us because we want to respect that order when they are expanded again into rows.
A solution to this consists of the following trick. Instead of a set, we use an OrderedDict(), which is a dictionary that preserves the order in which keys are added to it. We use the elements of each row as the keys of that dictionary (the values are irrelevant and I'll use True). A duplicate key is stored under the key that already existed. If at the end we take the keys of the resulting dictionary (.keys()), we will have the ordered set of the numbers of that row, in the order in which they were inserted, which is the column order from left to right.
That is, getting to the point, this is my idea:
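To illustrate the trick on its own, here is a minimal sketch on plain Python lists, leaving NaN handling aside. The helper name unique_in_order and the sample rows are illustrative; the rows are the non-NaN values of the first three rows of the question.

```python
from collections import OrderedDict

import pandas as pd

def unique_in_order(values):
    seen = OrderedDict()
    for v in values:
        seen[v] = True  # a repeated value lands on the key that already exists
    return list(seen.keys())  # keys come back in insertion order

rows = [[1, 1, 1], [2, 3, 3], [3, 2, 2, 1]]
r = [unique_in_order(row) for row in rows]

# Rows of different length are padded with NaN on the right.
df = pd.DataFrame(r)
print(df)
```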
Result:
Update
@Carolina points out to me in a comment that row 3 is out of spec. Indeed, in that row the NaN values are not all together on the right.
The error comes from the fact that the loop that puts each element in an ordered dictionary also puts the NaN in it. We really only want to preserve the original order of the numbers, not of the NaN. Therefore, it is enough not to put those NaN in the dictionary, that is:
The problem is that now the list r will contain the resulting rows without any NaN, so when creating a DataFrame from them, the final number of columns may be less than what we originally had if, for example (and as is the case here), the rightmost columns contained only NaN.
To fix this second problem, I'll use the following trick. First I create a dataframe with the data I have in r. This dataframe will generally have N columns, where N<=4. I then name those columns by copying the names from the original dataframe, but only the first N names. Finally I use reindex() on the columns to expand the number of columns to the ones the dataframe originally had. This fills with NaN any extra columns that need to be added. Namely:
And now yes:
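Putting the update together, a sketch of the whole approach might look like the following. It assumes the data from the question's table and reuses the names prueba and r from the text; skipping NaN with pd.isna is one possible way to keep them out of the dictionary.

```python
from collections import OrderedDict

import numpy as np
import pandas as pd

nan = np.nan
prueba = pd.DataFrame(
    [[1, 1, 1, nan], [2, 3, 3, nan], [3, 2, 2, 1], [4, nan, 1, nan],
     [1, nan, nan, nan], [1, 3, 2, 2], [3, 3, nan, 1], [2, 2, 3, nan],
     [1, 3, nan, 1], [6, 4, 5, 5]],
    columns=["M1", "M2", "M3", "M4"],
)

r = []
for row in prueba.to_numpy():
    seen = OrderedDict()
    for value in row:
        if not pd.isna(value):  # keep the NaN out of the dictionary
            seen[value] = True
    r.append(list(seen.keys()))

result = pd.DataFrame(r)                            # N columns, N <= 4
result.columns = prueba.columns[: result.shape[1]]  # first N original names
result = result.reindex(columns=prueba.columns)     # missing columns filled with NaN
print(result)
```

The loop still visits every cell in Python, so on a very large frame it is worth timing it against the apply-based alternative before choosing one.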