
Question

Asked: 2020-03-30 05:36:39 +0800 CST

How to generate a DataFrame with the same number of elements from different subgroups?

I have a DataFrame with people's ages:

df= 
identificador   edad
1                50
2                10
3                22
4                60
5                45
6                2
7                27
8                30
9                14
10               55

I have defined 4 groups: 5 years or younger, over 5 and up to 20 years, over 20 and up to 40 years, and over 40 and up to 60 years. I need to create a DataFrame that has the same number of records from each group. That number is the size of the smallest group (the number of records that meet its condition). In the example, the smallest group is the one for 5 years or younger, with size 1: only one record meets the condition. The new DataFrame would be:

identificador       edad
    1                50
    2                10
    3                22
    6                2

What I did was create a new column with a code for each group, and use it to get the size of each group:

def ciclodevida(edad):
    # Map an age to its group code (1 to 4)
    if edad <= 5: return 1
    if (edad > 5 and edad <= 20): return 2
    if (edad > 20 and edad <= 40): return 3
    if (edad > 40 and edad <= 60): return 4

df['ciclo'] = df['edad'].apply(ciclodevida)
# Size of each group (named so it does not overwrite the function above)
tamanos = df.groupby('ciclo').size()

Then I create a separate DataFrame for each group:

ciclo1 = df[df['ciclo'] == 1]
ciclo1 =ciclo1.reset_index(drop=True)
ciclo2 = df[df['ciclo'] == 2]
ciclo2 =ciclo2.reset_index(drop=True)
ciclo3 = df[df['ciclo'] == 3]
ciclo3 =ciclo3.reset_index(drop=True)
ciclo4 = df[df['ciclo'] == 4]
ciclo4 =ciclo4.reset_index(drop=True)

Finally, I drop the extra records in each group based on the size of group 1 (the smallest), and then join all the DataFrames with concat:

# Group 1 is the smallest here, so it is kept whole
C1 = ciclo1
C2 = ciclo2.drop(range(len(ciclo1), len(ciclo2)), axis=0)
C3 = ciclo3.drop(range(len(ciclo1), len(ciclo3)), axis=0)
C4 = ciclo4.drop(range(len(ciclo1), len(ciclo4)), axis=0)

final = pd.concat([C1, C2, C3, C4])

The code works, but I would like to make it more efficient: as you can see, it takes many steps, and the number grows with the number of groups. In addition, I would like the records from each group that go into the new DataFrame to be selected randomly.

Can someone help me with an idea? Thank you!

1 Answer

Patricio Moracho · Answer 1 · 2020-04-02T12:13:54+08:00

Conceptually, what you are looking for is the first n rows of each group (each group being defined by an interval), where n is the minimum number of rows across all groups.

First, we reproduce your data:

from io import StringIO
import pandas as pd

TESTDATA = StringIO("""identificador;edad
1;50
2;10
3;22
4;60
5;45
6;2
7;27
8;30
9;14
10;55
""")

df = pd.read_csv(TESTDATA, sep=";")

And now, we can do the following:

# Generate the new ciclo column
rangos = [5, 20, 40, 60]
df['ciclo']= df['edad'].apply(lambda x: [i for i, v in enumerate(rangos, 1) if x <= v][0])

# Find the minimum group size
minimo = min(df.groupby('ciclo').size())

# Keep the rows of each group up to the computed minimum
print(df.groupby('ciclo').head(minimo))

   identificador  edad  ciclo
0              1    50      4
1              2    10      2
2              3    22      3
5              6     2      1

We generate ciclo with a list comprehension: [i for i, v in enumerate(rangos, 1) if x <= v][0]. It simply picks the first range whose upper bound is greater than or equal to the age, a more compact way that avoids a chain of if statements.
Then we get the minimum with min(df.groupby('ciclo').size()), i.e. the size of the smallest group.
Finally, we use head(minimo) to keep the first n rows of each group.
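
The question also asks for the rows of each group to be chosen at random, which head() does not do. A minimal sketch of one way to get that, assuming pandas 1.1 or later (where groupby(...).sample is available). The names limites and muestra, the use of pd.cut in place of the list comprehension, and the random_state value are illustrative choices, not part of the original answer:

```python
import pandas as pd

# Same data as in the question
df = pd.DataFrame({
    "identificador": range(1, 11),
    "edad": [50, 10, 22, 60, 45, 2, 27, 30, 14, 55],
})

# Right-closed bins: (-inf, 5], (5, 20], (20, 40], (40, 60]
# labels=False returns the bin position (0..3), so add 1 to get codes 1..4.
# Ages above 60 would fall outside every bin and become NaN.
limites = [float("-inf"), 5, 20, 40, 60]
df["ciclo"] = pd.cut(df["edad"], bins=limites, labels=False) + 1

# Size of the smallest group
minimo = int(df.groupby("ciclo").size().min())

# Draw `minimo` random rows from each group; random_state only makes
# the draw repeatable and can be dropped
muestra = df.groupby("ciclo").sample(n=minimo, random_state=0)
print(muestra)
```

Because the rows are drawn at random, the result generally differs from the head(minimo) output shown above, but every group still contributes exactly minimo rows.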
