I have a DataFrame with people's ages:
df=
identificador edad
1 50
2 10
3 22
4 60
5 45
6 2
7 27
8 30
9 14
10 55
I have defined 4 groups: less than or equal to 5 years, between 5 and 20 years, between 20 and 40 years, and between 40 and 60 years. I need to create a DataFrame that has the same number of records from each group. That number is defined by the size of the smallest group (the number of records that meet its condition). In the example, the group of 5 years or younger has the smallest size, 1: only one record meets the condition. The new DataFrame would be:
identificador edad
1 50
2 10
3 22
6 2
What I did was create a new column with a code for each group, and use it to get the size of each group.
def ciclodevida(edad):
    if edad <= 5: return 1
    if (edad > 5 and edad <= 20): return 2
    if (edad > 20 and edad <= 40): return 3
    if (edad > 40 and edad <= 60): return 4

df['ciclo'] = df['edad'].apply(ciclodevida)
ciclodevida = df.groupby('ciclo').size()
Then I create a separate DataFrame for each group:
ciclo1 = df[df['ciclo'] == 1]
ciclo1 =ciclo1.reset_index(drop=True)
ciclo2 = df[df['ciclo'] == 2]
ciclo2 =ciclo2.reset_index(drop=True)
ciclo3 = df[df['ciclo'] == 3]
ciclo3 =ciclo3.reset_index(drop=True)
ciclo4 = df[df['ciclo'] == 4]
ciclo4 =ciclo4.reset_index(drop=True)
Finally, I drop the surplus records from each group based on the size of group 1 (the smallest), and then join all the DataFrames using concat:
C2= ciclo2.drop(range(len(ciclo1), len(ciclo2), 1), axis=0)
C3= ciclo3.drop(range(len(ciclo1), len(ciclo3), 1), axis=0)
C4= ciclo4.drop(range(len(ciclo1), len(ciclo4), 1), axis=0)
final = pd.concat([ciclo1, C2, C3, C4])
The code works, but I would like to make it more efficient: as you can see, it takes many steps (and more as the number of groups grows). In addition, I would like the records from each group that go into the new DataFrame to be selected randomly.
Can someone help me with an idea? Thank you!
Conceptually, what you are looking for is the first n rows of each group (each group defined by an interval), where n is the minimum number of rows across all groups.
First, we reproduce your data:
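A minimal sketch of one way to rebuild the sample data from the question, assuming pandas is imported as pd:

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "identificador": list(range(1, 11)),
    "edad": [50, 10, 22, 60, 45, 2, 27, 30, 14, 55],
})

print(df)
```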
And now, we can do the following:
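A sketch of what these steps might look like, assuming rangos holds the inclusive upper bound of each group and minimo the size of the smallest group; the data is rebuilt here so the snippet runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    "identificador": list(range(1, 11)),
    "edad": [50, 10, 22, 60, 45, 2, 27, 30, 14, 55],
})

# Inclusive upper bound of each group (assumed from the question's intervals)
rangos = [5, 20, 40, 60]

# Code of the first range whose upper bound is >= the age
df["ciclo"] = df["edad"].apply(
    lambda x: [i for i, v in enumerate(rangos, 1) if x <= v][0]
)

# Size of the smallest group
minimo = min(df.groupby("ciclo").size())

# First `minimo` rows of each group, in their original order
final = df.groupby("ciclo").head(minimo)

print(final[["identificador", "edad"]])
```

For the sample data, minimo is 1 and final contains identificador 1, 2, 3 and 6, which matches the expected result in the question. Note that an age above the last bound in rangos would raise an IndexError in this sketch.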
ciclo is generated with a list comprehension, [i for i, v in enumerate(rangos, 1) if x <= v][0]: we simply take the first range whose upper bound is greater than or equal to the age, a more compact approach that avoids the chain of if statements.
min(df.groupby('ciclo').size()) is the size of the smallest group.
head(minimo) keeps the first n rows of each group.
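Since the question also asks for the records to be selected randomly, one possible variation is to swap head for a per-group sample. This is only a sketch: it assumes pandas 1.1 or later, where sample is available on groupby objects; muestra is an illustrative name, and random_state is fixed here only to make the result repeatable.

```python
import pandas as pd

df = pd.DataFrame({
    "identificador": list(range(1, 11)),
    "edad": [50, 10, 22, 60, 45, 2, 27, 30, 14, 55],
})

# Inclusive upper bound of each group (assumed from the question's intervals)
rangos = [5, 20, 40, 60]
df["ciclo"] = df["edad"].apply(
    lambda x: [i for i, v in enumerate(rangos, 1) if x <= v][0]
)

# Size of the smallest group, as a plain int
minimo = int(df.groupby("ciclo").size().min())

# Draw `minimo` random rows from every group (requires pandas >= 1.1)
muestra = df.groupby("ciclo").sample(n=minimo, random_state=0)

print(muestra.sort_values("ciclo"))
```

Dropping random_state gives a different selection on each run. Every group still contributes exactly minimo rows, so the smallest group is always included in full.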