I want to build a frequency table from my data frame: in the gender column (RIAGENDR) I want to select only females (coded as 2; males are 1) and count the number of observations in each category of DMDMARTL (marital status: 1 ==> married, 2 ==> widowed, 3 ==> divorced, 4 ==> separated, 5 ==> never married, 6 ==> living with a partner).
The data frame is available at: https://www.kaggle.com/ramendrapandey/nhanes-2015-2016
My code to select the two variables (RIAGENDR and DMDMARTL) and compute the frequency count, keeping only women (2) for RIAGENDR, is as follows:
pd.crosstab(index = da["RIAGENDR"] == 2, columns = da["DMDMARTL"])
With that code it shows me the following:
DMDMARTL | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 |
---|---|---|---|---|---|---|
RIAGENDR | | | | | | |
False | 1477 | 100 | 229 | 68 | 484 | 265 |
True | 1303 | 296 | 350 | 118 | 520 | 262 |
I assume True corresponds to the value 2 (female) and False to 1 (male). However, instead of those boolean values I would like the table to show the number 2 with its counts in each category of DMDMARTL (marital status). How can I modify my code to do that?
If you only need the counts for women, there is a simpler way. Use .loc[] to select all the rows in which "RIAGENDR" equals 2, together with the "DMDMARTL" column, and then apply .value_counts() to that selection.
So:
da.loc[da["RIAGENDR"] == 2, "DMDMARTL"].value_counts()
and that produces the result:
(by the way, in the dataframe I downloaded from Kaggle there seems to be a case where DMDMARTL takes the value 77)
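The selection described above can be sketched on a small made-up frame. The rows below are invented for illustration and are not the NHANES data; the name female_counts is also just an assumption for this example. One row with code 77 is included to mirror the stray value mentioned above.

```python
import pandas as pd

# Toy stand-in for the NHANES frame; these rows are invented for illustration.
da = pd.DataFrame({
    "RIAGENDR": [1, 2, 2, 1, 2, 2, 1, 2],
    "DMDMARTL": [1.0, 1.0, 5.0, 3.0, 1.0, 77.0, 5.0, 5.0],
})

# Keep only the rows where RIAGENDR == 2 and only the DMDMARTL column,
# then count how often each marital-status code occurs.
female_counts = da.loc[da["RIAGENDR"] == 2, "DMDMARTL"].value_counts()

# value_counts() orders by frequency; sort_index() orders by status code instead.
print(female_counts.sort_index())
```

If the stray code should not appear in the table, it can be filtered out before counting, for example by also requiring da["DMDMARTL"] to be one of the six documented codes.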
Using crosstab
If you prefer to do it with crosstab, because you want the counts for both men and women, the issue is that you are using a boolean as the index of the result. It is True when RIAGENDR is 2 and False when it is different from 2, because da["RIAGENDR"] == 2 is a comparison that yields True or False element by element.
To get 1 and 2 directly instead, skip the comparison and use the values of the "RIAGENDR" column as the index:
pd.crosstab(index = da["RIAGENDR"], columns = da["DMDMARTL"])
And the result is now:
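As a rough illustration of the difference, here is a sketch on a small made-up frame (the rows are invented and are not the NHANES data). It contrasts the boolean index with the raw-value index; the names by_flag, by_value and women_only are hypothetical and chosen only for this example.

```python
import pandas as pd

# Toy stand-in for the NHANES frame; these rows are invented for illustration.
da = pd.DataFrame({
    "RIAGENDR": [1, 2, 2, 1, 2, 2, 1, 2],
    "DMDMARTL": [1.0, 1.0, 5.0, 3.0, 1.0, 77.0, 5.0, 5.0],
})

# Comparing first gives a boolean Series, so the table is indexed by False/True.
by_flag = pd.crosstab(index=da["RIAGENDR"] == 2, columns=da["DMDMARTL"])

# Passing the column itself keeps the original codes 1 and 2 as the index.
by_value = pd.crosstab(index=da["RIAGENDR"], columns=da["DMDMARTL"])

# Optional: keep only the row for women, still labelled 2.
women_only = by_value.loc[[2]]

print(by_flag)
print(by_value)
print(women_only)
```

Selecting with .loc[[2]] (a list) rather than .loc[2] keeps the result as a one-row table, so the label 2 stays visible in the index.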