I have the following simplified DataFrame (the real one has 26 million records):
df:
         id D1   D2   D3   D4
    0   111  A    B    C    D
    1   222  B    C    D  NaN
    2   333  C    A  NaN  NaN
    3   444  A  NaN  NaN  NaN
    4   111  A    E    C    M
    5   333  C    M  NaN  NaN
    6   555  D    E  NaN  NaN
    7   111  E    A    B  NaN
    8   444  F  NaN  NaN  NaN
    9   333  G    A  NaN  NaN
    10  666  H    N  NaN  NaN
For each id I need to obtain the following:
1. The number of records in the dataset
2. The number of distinct D values
3. The total number of D values
The code used is the following:
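    import pandas as pd

    # Note: Series.append and DataFrame.append were removed in pandas 2.0,
    # so this loop runs as written only on older versions.
    iden = df['id'].unique().tolist()
    reporte = pd.DataFrame(columns=['ide', 'n_c', 'n_dif', 'n_total'])
    for i in iden:
        # Select the rows of this id, then stack its D columns into one Series.
        c = df[df['id'] == i]
        d = c['D1'].append([c['D2'], c['D3'], c['D4']])
        d = d.dropna()
        d_dif = d.drop_duplicates()
        reporte = reporte.append({'ide': i, 'n_c': len(c), 'n_dif': len(d_dif), 'n_total': len(d)}, ignore_index=True)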
and the result obtained is:
       ide  n_c  n_dif  n_total
    0  111    3      6       11
    1  222    1      3        3
    2  333    3      4        6
    3  444    2      2        2
    4  555    1      2        2
    5  666    1      2        2
I need a way in pandas to replace the for loop that looks up the records of each id in the dataset, since the lookup and calculation take about a second per id, which is very inefficient given the size of the dataset.
I would appreciate any help.
What you can do is build a function that processes each group. The function, stats, flattens the values of all the columns it receives into a single list, uses that list for the calculations, and returns a Series whose entries become the new columns. We then apply the function to each group.
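A minimal sketch of such a function, assuming the value columns are D1 to D4 as in the question. The name D_COLS and the exact body of stats are illustrative and not necessarily the code this answer originally used:

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "id": [111, 222, 333, 444, 111, 333, 555, 111, 444, 333, 666],
    "D1": ["A", "B", "C", "A", "A", "C", "D", "E", "F", "G", "H"],
    "D2": ["B", "C", "A", np.nan, "E", "M", "E", "A", np.nan, "A", "N"],
    "D3": ["C", "D", np.nan, np.nan, "C", np.nan, np.nan, "B",
           np.nan, np.nan, np.nan],
    "D4": ["D", np.nan, np.nan, np.nan, "M", np.nan, np.nan, np.nan,
           np.nan, np.nan, np.nan],
})

D_COLS = ["D1", "D2", "D3", "D4"]  # assumed names of the value columns


def stats(group):
    """Summarize one id: row count, distinct D values, total D values."""
    # Flatten every D column of the group into one list, skipping NaN.
    values = [v for v in group[D_COLS].to_numpy().ravel() if pd.notna(v)]
    return pd.Series({
        "n_c": len(group),
        "n_dif": len(set(values)),
        "n_total": len(values),
    })


# Apply the function to each group of rows that share an id.
reporte = df.groupby("id")[D_COLS].apply(stats).reset_index()
print(reporte)
```

On the sample data this should reproduce the counts in the question's result table, for example 3, 6 and 11 for id 111. The id column keeps its original name here rather than being renamed to ide.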
At least with this dataset, this takes about a third of the time of your current approach, but you should try it with the full dataset.
Starting from groupby, there are two columns that are very simple to compute in vectorized form:
- n_c: just use DataFrameGroupBy.size.
- n_total: use DataFrameGroupBy.count followed by sum to add up each row. count discards NaN by default.
The problem is getting the unique values for each group across all the columns. The best approach I have come up with is to apply pandas.unique to a view of the flattened array, which produces an array of the unique items, followed by pandas.Series.count, which counts the items while discarding NaN.
The full reproducible example:
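A sketch of these steps, reconstructed from the description above and assuming the value columns are D1 to D4 as in the question. Names such as cols, grouped and report are illustrative:

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "id": [111, 222, 333, 444, 111, 333, 555, 111, 444, 333, 666],
    "D1": ["A", "B", "C", "A", "A", "C", "D", "E", "F", "G", "H"],
    "D2": ["B", "C", "A", np.nan, "E", "M", "E", "A", np.nan, "A", "N"],
    "D3": ["C", "D", np.nan, np.nan, "C", np.nan, np.nan, "B",
           np.nan, np.nan, np.nan],
    "D4": ["D", np.nan, np.nan, np.nan, "M", np.nan, np.nan, np.nan,
           np.nan, np.nan, np.nan],
})

cols = ["D1", "D2", "D3", "D4"]  # assumed names of the value columns
grouped = df.groupby("id")[cols]

# n_c: number of rows per id.
n_c = grouped.size()

# n_total: non-NaN values per column, then summed across the columns.
n_total = grouped.count().sum(axis=1)

# n_dif: unique items of each group's flattened values;
# Series.count leaves out the NaN entry that pandas.unique may keep.
n_dif = grouped.apply(
    lambda g: pd.Series(pd.unique(g.to_numpy().ravel())).count()
)

report = pd.DataFrame(
    {"n_c": n_c, "n_dif": n_dif, "n_total": n_total}
).reset_index()
print(report)
```

Only n_dif still goes through apply here; n_c and n_total are computed by built-in groupby aggregations, which is where most of the gain over a Python-level loop would be expected on a large dataset.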