Question

Asked: 2020-04-28 12:39:59 +0800 CST

How to override a For loop in pandas python to reduce execution time?

I have the following simplified DataFrame (the real one has 26 million records):

df:
     id D1   D2   D3   D4
0   111  A    B    C    D
1   222  B    C    D  NaN
2   333  C    A  NaN  NaN
3   444  A  NaN  NaN  NaN
4   111  A    E    C    M
5   333  C    M  NaN  NaN
6   555  D    E  NaN  NaN
7   111  E    A    B  NaN
8   444  F  NaN  NaN  NaN
9   333  G    A  NaN  NaN
10  666  H    N  NaN  NaN

For each id I need to obtain the following:

1. The number of records in the base.
2. The number of distinct values across the D columns.
3. The total number of values in the D columns.

The code used is the following:

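import pandas as pd

iden = df['id'].unique().tolist()
rows = []
for i in iden:
    c = df[df['id'] == i]  # scans the whole DataFrame once per id
    # Stack the four D columns into one Series
    # (Series.append and DataFrame.append were removed in pandas 2.0)
    d = pd.concat([c['D1'], c['D2'], c['D3'], c['D4']])
    d = d.dropna()
    d_dif = d.drop_duplicates()
    rows.append({'ide': i, 'n_c': len(c), 'n_dif': len(d_dif), 'n_total': len(d)})
reporte = pd.DataFrame(rows, columns=['ide', 'n_c', 'n_dif', 'n_total'])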

and the result obtained is:

   ide n_c n_dif n_total
0  111   3     6      11
1  222   1     3       3
2  333   3     4       6
3  444   2     2       2
4  555   1     2       2
5  666   1     2       2

I need a way in pandas to replace the for loop that looks up the records of each id in the DataFrame, since the lookup and calculation take about a second per id, which is very inefficient given the size of the base.

I would appreciate any help.

2 Answers

Patricio Moracho · Answer 1 · 2020-05-01T07:44:12+08:00

What you can do is build a function that processes each group:

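import pandas as pd

def stats(x):
    # Flatten every cell of the group's D columns into a single list
    complete_list = [item for sublist in x.values.tolist() for item in sublist]
    # pd.notna is safer than `item is not np.nan`, which relies on object identity
    without_nan = [item for item in complete_list if pd.notna(item)]
    d = {}
    d['n_c'] = len(x)                   # rows in the group
    d['n_dif'] = len(set(without_nan))  # distinct non-NaN values
    d['n_total'] = len(without_nan)     # all non-NaN values

    return pd.Series(d)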

As you can see, the function stats flattens all the values of the received columns into a single list, which we then use for the calculations, and returns a Series whose entries become the new columns.

Then we simply apply the function to each group:

grouped = df.groupby(['id'])[['D1', 'D2', 'D3', 'D4']].apply(stats)

print(grouped)
     n_c  n_dif  n_total
id                      
111    3      6       11
222    1      3        3
333    3      4        6
444    2      2        2
555    1      2        2
666    1      2        2

At least with this dataset, it takes about a third of the time of your current loop, but you should try it with the full dataset.
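One way to try this on something larger than the sample, before running it on the real data, is to time both approaches on synthetic data. The following is only a sketch: the sizes, the random data, and the names df_test, with_loop and with_groupby are assumptions made for illustration, and the timings it prints will depend on your machine and on how the real ids are distributed.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_ids = 5_000, 500  # hypothetical sizes, raise them gradually
cols = ["D1", "D2", "D3", "D4"]

# Random letters with roughly 30% of the cells blanked out as NaN
letters = np.array(list("ABCDEFGH"), dtype=object)
df_test = pd.DataFrame(rng.choice(letters, size=(n_rows, 4)), columns=cols)
df_test = df_test.mask(rng.random((n_rows, 4)) < 0.3)
df_test.insert(0, "id", rng.integers(0, n_ids, size=n_rows))


def with_loop(df):
    # Per-id filtering, in the spirit of the loop from the question
    rows = []
    for i in df["id"].unique():
        c = df[df["id"] == i]
        d = pd.concat([c[col] for col in cols]).dropna()
        rows.append({"id": i, "n_c": len(c), "n_dif": d.nunique(), "n_total": len(d)})
    return pd.DataFrame(rows).set_index("id").sort_index()


def stats(x):
    # Same idea as the stats function above: flatten, drop NaN, count
    flat = [item for row in x.values.tolist() for item in row if pd.notna(item)]
    return pd.Series({"n_c": len(x), "n_dif": len(set(flat)), "n_total": len(flat)})


def with_groupby(df):
    return df.groupby("id")[cols].apply(stats)


for fn in (with_loop, with_groupby):
    start = time.perf_counter()
    fn(df_test)
    print(f"{fn.__name__}: {time.perf_counter() - start:.3f} s")
```

Checking that both functions return the same numbers before comparing their timings is a cheap safeguard against measuring two different computations.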

FJSevilla · Answer 2 · 2020-05-03T16:52:30+08:00

Starting from groupby, two of the columns are very simple to compute in vectorized form:

For n_c, just use DataFrameGroupBy.size.
For n_total, we can use DataFrameGroupBy.count followed by sum to add up each row. count excludes NaN values.

The harder part is getting the unique values of each group across all the columns. The best I have come up with is to apply pandas.unique to a flattened view of the group's array, which yields an array of the unique items, and then pandas.Series.count, which counts those items while discarding NaN.

groups = df.groupby("id")
reporte = pd.DataFrame({
    'n_c': groups.size(),
    'n_dif': groups[['D1', 'D2', 'D3', 'D4']].apply(
        lambda group: pd.Series(pd.unique(group.values.ravel())).count()),
    'n_total': groups.count().sum(axis=1)}
    )
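To make the n_dif step easier to follow, here is a small sketch of what the lambda does for a single group, using the three rows of id 333 from the question. The variable names group, flat and uniques are only illustrative.

```python
import numpy as np
import pandas as pd

# Rows of id 333 from the question, D columns only
group = pd.DataFrame({
    "D1": ["C", "C", "G"],
    "D2": ["A", "M", "A"],
    "D3": [np.nan, np.nan, np.nan],
    "D4": [np.nan, np.nan, np.nan],
})

flat = group.values.ravel()         # 1-D array holding all 12 cells
uniques = pd.unique(flat)           # distinct items; NaN is still among them
n_dif = pd.Series(uniques).count()  # count() skips NaN

print(uniques)
print(n_dif)
```

The value of n_dif here is 4 (C, A, M and G), which matches the row for id 333 in the expected result.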

The full reproducible example:

import io
import pandas as pd



data = io.StringIO("""\
  id D1   D2   D3   D4
111  A    B    C    D
222  B    C    D  NaN
333  C    A  NaN  NaN
444  A  NaN  NaN  NaN
111  A    E    C    M
333  C    M  NaN  NaN
555  D    E  NaN  NaN
111  E    A    B  NaN
444  F  NaN  NaN  NaN
333  G    A  NaN  NaN
666  H    N  NaN  NaN 
""")


# Raw string avoids the invalid-escape warning that "\s+" triggers on recent Python versions
df = pd.read_csv(data, sep=r"\s+", engine="python")


groups = df.groupby("id")
reporte = pd.DataFrame({
    'n_c': groups.size(),
    'n_dif': groups[['D1', 'D2', 'D3', 'D4']].apply(
        lambda group: pd.Series(pd.unique(group.values.ravel())).count()),
    'n_total': groups.count().sum(axis=1)}
    )

>>> reporte
     n_c  n_dif  n_total
id                      
111    3      6       11
222    1      3        3
333    3      4        6
444    2      2        2
555    1      2        2
666    1      2        2
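As a possible alternative that avoids apply altogether, the D columns can be reshaped to long format so that n_dif and n_total both come from built-in group aggregations. This is a sketch that does not appear in either answer; it assumes the same sample data as above, and the names long and per_id are illustrative. Whether it is faster on 26 million rows would have to be measured, since melt creates a copy with four times as many rows.

```python
import io

import pandas as pd

data = io.StringIO("""\
id D1 D2 D3 D4
111 A B C D
222 B C D NaN
333 C A NaN NaN
444 A NaN NaN NaN
111 A E C M
333 C M NaN NaN
555 D E NaN NaN
111 E A B NaN
444 F NaN NaN NaN
333 G A NaN NaN
666 H N NaN NaN
""")
df = pd.read_csv(data, sep=r"\s+")

# Long format: one row per (id, cell) pair, NaN cells included
long = df.melt(id_vars="id", value_vars=["D1", "D2", "D3", "D4"],
               value_name="value")

# nunique and count both ignore NaN by default
per_id = long.groupby("id")["value"].agg(n_dif="nunique", n_total="count")

# Add the row count per id and put the columns in the requested order
reporte = per_id.assign(n_c=df.groupby("id").size())[["n_c", "n_dif", "n_total"]]

print(reporte)
```

On the sample data this gives the same three columns per id as the approaches above.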
