Question

Asked: 2020-05-21 13:13:42 +0800 CST

Delete and replace values in pandas python using conditionals

I have the following Dataframe

prueba = 

     M1    M2    M3    M4
0     1     1     1   NaN
1     2     3     3   NaN
2     3     2     2     1
3     4   NaN     1   NaN
4     1   NaN   NaN   NaN
5     1     3     2     2
6     3     3   NaN     1
7     2     2     3   NaN
8     1     3   NaN     1
9     6     4     5     5

I need to do two tasks for each of the rows:

1. If a column is empty (NaN) and a later column has a value, that value should move into the first empty column, leaving NaN in the columns to its right. In other words, shift the values to the left.
2. If a value appears more than once in a row, keep it only in the first column where it appears. For example, if M1 and M2 are equal, the value stays in M1 and M2 becomes NaN; if the value is repeated across several M columns, only the first keeps it and the others become NaN.

I have tried the following options:

For the first task I tried a pairwise comparison, for example between M2 and M3:

for row in prueba.itertuples():
    prueba['M2'] = prueba.where((prueba['M2'].isnull() & prueba['M3'].notnull()), prueba['M3'])

but it raises an error.

For the second task (this part works):

prueba.loc[prueba['M1']== prueba['M2'] , 'M2'] = 'NaN'
prueba.loc[prueba['M1']== prueba['M3'] , 'M3'] = 'NaN'
prueba.loc[prueba['M1']== prueba['M4'] , 'M4'] = 'NaN'
prueba.loc[prueba['M2']== prueba['M3'] , 'M3'] = 'NaN'
prueba.loc[prueba['M2']== prueba['M4'] , 'M4'] = 'NaN'
prueba.loc[prueba['M3']== prueba['M4'] , 'M4'] = 'NaN'

I am new to programming and would appreciate help with the two tasks above. The running time of the solution matters because there is a lot of data.

The processed Dataframe should come out like this:

     M1     M2      M3    M4
0     1     NaN     NaN   NaN
1     2     3       NaN   NaN
2     3     2       1     NaN
3     4     1       NaN   NaN
4     1     NaN     NaN   NaN
5     1     3       2     NaN
6     3     1       NaN   NaN
7     2     3       NaN   NaN
8     1     3       NaN   NaN
9     6     4       5     NaN

2 Answers

FJSevilla · Answer 1 · 2020-05-21T18:45:32+08:00

The second point can be handled in a general way with pandas.Series.drop_duplicates, passing the argument keep="first" to keep only the first occurrence. A boolean mask built with pandas.Series.duplicated would also work.
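As a small illustration of the difference between the two methods mentioned above, here is a sketch on a single made-up row (the variable names are only for this example). drop_duplicates shortens the row, while a duplicated mask keeps its length and puts NaN where the repeats were:

```python
import numpy as np
import pandas as pd

# One example row, labelled like the columns in the question
row = pd.Series([3.0, 2.0, 2.0, 1.0], index=["M1", "M2", "M3", "M4"])

# drop_duplicates removes the repeated entries, so the result is shorter
shorter = row.drop_duplicates(keep="first")

# duplicated flags every repeat after the first occurrence
is_repeat = row.duplicated(keep="first")

# mask replaces the flagged positions with NaN and keeps the row length
deduplicated = row.mask(is_repeat)

print(shorter)
print(deduplicated)
```

Either form removes the repeats; neither moves the remaining values to the left, which is the first point of the question.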

For the first point I can't think of a vectorized form. It can be done by using pandas.DataFrame.apply over the rows (axis=1) and calling, for each row, a Python function that uses the method pandas.Series.dropna to build the new row.

import io
import pandas as pd
import numpy as np


data = io.StringIO('''\
M1,M2,M3,M4
1,1,1,NaN
2,3,3,NaN
3,2,2,1
4,NaN,1,NaN
1,NaN,NaN,NaN
1,3,2,2
3,3,NaN,1
2,2,3,NaN
3,3,NaN,1
6,4,5,5
''')

df = pd.read_csv(data, dtype="f")

With the above we obtain a DataFrame that allows us to reproduce your example:

>>> df

    M1   M2   M3   M4
0  1.0  1.0  1.0  NaN
1  2.0  3.0  3.0  NaN
2  3.0  2.0  2.0  1.0
3  4.0  NaN  1.0  NaN
4  1.0  NaN  NaN  NaN
5  1.0  3.0  2.0  2.0
6  3.0  3.0  NaN  1.0
7  2.0  2.0  3.0  NaN
8  3.0  3.0  NaN  1.0
9  6.0  4.0  5.0  5.0

Now let's apply the idea explained before:

res = df.apply(lambda row: pd.Series(row.drop_duplicates(keep="first")
                                        .dropna()
                                        .values
                                     ),
                axis=1
              )

With this we obtain something that is quite close:

>>> res

     0    1    2
0  1.0  NaN  NaN
1  2.0  3.0  NaN
2  3.0  2.0  1.0
3  4.0  1.0  NaN
4  1.0  NaN  NaN
5  1.0  3.0  2.0
6  3.0  1.0  NaN
7  2.0  3.0  NaN
8  3.0  1.0  NaN
9  6.0  4.0  5.0

We just need to add the missing columns (columns with all NaN values) and rename the rest:

p_cols, m_cols = df.columns[:res.shape[1]], df.columns[res.shape[1]:] 
res.columns = p_cols

for col in m_cols:
    res[col] = np.nan

Result:

>>> res

    M1   M2   M3  M4
0  1.0  NaN  NaN NaN
1  2.0  3.0  NaN NaN
2  3.0  2.0  1.0 NaN
3  4.0  1.0  NaN NaN
4  1.0  NaN  NaN NaN
5  1.0  3.0  2.0 NaN
6  3.0  1.0  NaN NaN
7  2.0  3.0  NaN NaN
8  3.0  1.0  NaN NaN
9  6.0  4.0  5.0 NaN
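Since the question mentions that there is a lot of data, a NumPy-based variant may be worth trying. The following is only a sketch and not part of the original answer: the function name dedupe_and_shift_left is hypothetical, it assumes all columns are numeric, and it loops over columns (few) rather than rows (many). Repeats are blanked by comparing each column with the columns to its left, and the values are then moved left with a stable argsort on the NaN mask:

```python
import numpy as np
import pandas as pd


def dedupe_and_shift_left(df):
    """Hypothetical helper: blank repeated values per row, then push NaN to the right."""
    values = df.to_numpy(dtype=float, copy=True)

    for j in range(1, values.shape[1]):
        # True where column j repeats a value from an earlier column of the same row
        repeated = (values[:, :j] == values[:, [j]]).any(axis=1)
        values[repeated, j] = np.nan

    # A stable sort on the NaN mask puts non-NaN entries first, in their original order
    order = np.argsort(np.isnan(values), axis=1, kind="stable")
    shifted = np.take_along_axis(values, order, axis=1)

    return pd.DataFrame(shifted, index=df.index, columns=df.columns)


# Three rows taken from the question (rows 2, 3 and 6)
sample = pd.DataFrame(
    [[3, 2, 2, 1], [4, np.nan, 1, np.nan], [3, 3, np.nan, 1]],
    columns=["M1", "M2", "M3", "M4"],
)
result = dedupe_and_shift_left(sample)
print(result)
```

Whether this is faster than apply on a given dataset would need to be measured.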

abulafia · Answer 2 · 2020-05-22T12:44:45+08:00

I don't know if I understood correctly, but I think it's about:

In each of the rows of the dataframe:
1. Remove duplicates and keep only one instance of each number that appears
2. Fill the rest of the row with NaN

Although this statement is not worded the same way as yours, I think the end result is the same, and put this way it is clearer.

In fact, it suggests another way to compute the result without using Pandas, working instead on the two-dimensional array underlying the dataframe. The idea is to go through this array row by row and build a set from the elements of each row (sets eliminate duplicates automatically). After this transformation some rows will have only two elements, others four, and so on.

Finally, a new dataframe can be built from all those sets. Since Pandas makes all rows the same length when converting them to a dataframe, it will fill the missing elements with NaN.

The problem with this idea is that sets have no internal order, so the first row, for example, could produce a set with the elements 1, NaN or just as well NaN, 1. That is, the order in which the elements were added to the set is not preserved, which does not suit us, because we want to respect that order when the sets are expanded into rows again.

One solution is the following trick. Instead of a set, we use an OrderedDict(), which is a dictionary that preserves the order in which keys are added to it. We use the elements of each row as the keys of that dictionary (the values are irrelevant, so I'll use True). Adding a key that already exists does not create a new entry; it reuses the existing one. If at the end we take the keys of the resulting dictionary (.keys()), we get the ordered set of the numbers in that row, in the order in which they were inserted, which is the column order from left to right.
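A minimal sketch of that trick on a plain Python list (the list here is just an example row, not taken from the dataframe):

```python
from collections import OrderedDict

row = [3, 2, 2, 1]

# Each element becomes a key; a repeated key does not create a new entry
unique_in_order = list(OrderedDict.fromkeys(row))
print(unique_in_order)  # [3, 2, 1]

# Since Python 3.7 a plain dict also preserves insertion order
same_with_dict = list(dict.fromkeys(row))
```

Here fromkeys is used instead of a dict comprehension with True values; the keys end up the same.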

That is, going to the point, this is my idea:

import io
import pandas as pd
from collections import OrderedDict

datos = """\
     M1    M2    M3    M4
0     1     1     1   NaN
1     2     3     3   NaN
2     3     2     2     1
3     4   NaN     1   NaN
4     1   NaN   NaN   NaN
5     1     3     2     2
6     3     3   NaN     1
7     2     2     3   NaN
8     1     3   NaN     1
9     6     4     5     5"""

# Read the dataframe in question
df = pd.read_table(io.StringIO(datos), sep=r'\s+')

# Build the list of new rows
r = []
for fila in df.values:
    r.append(OrderedDict({k: True for k in fila}).keys())

# Convert the resulting list back into a dataframe
resultado = pd.DataFrame(r, columns=df.columns)

Result:

    M1   M2   M3  M4
0  1.0  NaN  NaN NaN
1  2.0  3.0  NaN NaN
2  3.0  2.0  1.0 NaN
3  4.0  NaN  1.0 NaN
4  1.0  NaN  NaN NaN
5  1.0  3.0  2.0 NaN
6  3.0  NaN  1.0 NaN
7  2.0  3.0  NaN NaN
8  1.0  3.0  NaN NaN
9  6.0  4.0  5.0 NaN

Update

@Carolina points out in a comment that row 3 does not meet the spec: its NaNs are not all grouped together on the right.

The error comes from the fact that the loop that puts each element into an ordered dictionary also puts in the NaN. We really only want to preserve the original order of the numbers, not of the NaN. Therefore it is enough not to put the NaN values in the dictionary, that is:

import numpy as np

r = []
for fila in df.values:
    # Skip NaN so that only the actual numbers are kept, in order
    r.append(OrderedDict({k: True for k in fila if not np.isnan(k)}).keys())

The problem is that the list r now contains the resulting rows without any NaN, so when a DataFrame is created from them, the final number of columns may be smaller than the original one. This happens if, for example (and as is the case here), the rightmost columns would contain only NaN.

To fix this second problem, I'll use the following trick. First I create a dataframe with the data in r. This dataframe will generally have N columns, where N<=4. I then name those columns by copying the names from the original dataframe, but only the first N names. Finally I use reindex() on the columns to expand them to the ones the original dataframe had. This fills any added columns with NaN. Namely:

result = pd.DataFrame(r)
result.columns = df.columns[:result.shape[1]]
# Actually, if it is enough for the resulting dataframe to have only
# the columns M1, M2 and M3 (since M4 would be all NaN), we could leave it
# like this. If you want the result to have the same number of columns, then...

result = result.reindex(columns=df.columns)
print(result)

And now the result is correct:

    M1   M2   M3  M4
0  1.0  NaN  NaN NaN
1  2.0  3.0  NaN NaN
2  3.0  2.0  1.0 NaN
3  4.0  1.0  NaN NaN
4  1.0  NaN  NaN NaN
5  1.0  3.0  2.0 NaN
6  3.0  1.0  NaN NaN
7  2.0  3.0  NaN NaN
8  1.0  3.0  NaN NaN
9  6.0  4.0  5.0 NaN
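The column-restoring step can also be seen in isolation. The following is a small sketch with made-up data (the names original_columns, narrow and wide exist only in this example): a frame with three positional columns gets the first three original names, and reindex adds the missing M4 filled with NaN.

```python
import numpy as np
import pandas as pd

original_columns = pd.Index(["M1", "M2", "M3", "M4"])

# A frame that ended up with only three columns after removing NaN
narrow = pd.DataFrame([[1.0, np.nan, np.nan], [3.0, 2.0, 1.0]])

# Borrow as many names as there are columns
narrow.columns = original_columns[: narrow.shape[1]]

# reindex adds any missing column and fills it with NaN
wide = narrow.reindex(columns=original_columns)
print(wide)
```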
