I have created a function to remove the outliers from a column of a dataframe; it takes the dataframe and the column name as parameters:
    import numpy as np

    def outlier(df, col_name):
        # First and third quartiles of the column
        q1 = np.percentile(np.array(df[col_name].tolist()), 25)
        q3 = np.percentile(np.array(df[col_name].tolist()), 75)
        IQR = q3 - q1
        # Outer fences: 3 * IQR above q3 and below q1
        Q3 = q3 + (3 * IQR)
        Q1 = q1 - (3 * IQR)
        # Count the values that fall outside the fences
        outlier_num = 0
        for value in df[col_name].values.tolist():
            if (value < Q1) or (value > Q3):
                outlier_num += 1
        return Q1, Q3, outlier_num
The problem appears when I call the function to filter the dataframe:

    df_covtype = df_covtype[(df_covtype['column_name'] > outlier(df_covtype, 'column_name')[0]) &
                            (df_covtype['column_name'] < outlier(df_covtype, 'column_name')[1])]

It raises the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-122-e4962bb5c2b0> in <module>()
----> 1 df_covtype = df_covtype[(df_covtype['column_name'] > outlier(df_covtype, 'column_name')[0]) &
2 (df_covtype['column_name'] < outlier(df_covtype, 'column_name')[1])]
3 df_covtype.shape
1 frames
<ipython-input-119-f1e12f2fd893> in outlier(df, col_name)
2 import numpy as np
3 def outlier(df, col_name):
----> 4 q1 = np.percentile(np.array(df[col_name].tolist()), 25)
5 q3 = np.percentile(np.array(df[col_name].tolist()), 75)
6 IQR = q3 - q1
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in __getattr__(self, name)
5485 ):
5486 return self[name]
-> 5487 return object.__getattribute__(self, name)
5488
5489 def __setattr__(self, name: str, value) -> None:
AttributeError: 'DataFrame' object has no attribute 'tolist'
If anyone can give me a hand, I'd appreciate it. Regards, and thank you.
Good day,
There is a slightly easier way to do what you are looking for, using pandas.DataFrame.quantile and pandas.DataFrame.apply.
First I'll generate a dataframe for the example, with 3 columns and 100 rows of random values. Now we get the low and high bounds used to detect the outliers.
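The code for these two steps is not shown here, so the following is only a sketch of how they might look. The column names, the random seed, and the 5th/95th percentile cut-offs are assumptions chosen for illustration, not values taken from the answer.

```python
import numpy as np
import pandas as pd

# Example data: 100 rows x 3 columns of random values.
# The seed and the column names "A", "B", "C" are assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["A", "B", "C"])

# Assumed cut-offs: the 5th and 95th percentiles of each column.
low, high = 0.05, 0.95
quant_df = df.quantile([low, high])

# One row per quantile, one column per dataframe column.
print(quant_df)
```

The result is indexed by the quantile values, so a single limit can be read with `quant_df.loc[low, "A"]`.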
This gives us a dataframe with the lower and upper limits for each column.
Finally, we use apply() on the original dataframe with the limits obtained in the previous dataframe.

Note: If you want to include the limits, you must use >= and <= respectively; in this example the limits are excluded.

Note 2: If you want to exclude a column, you must do it before applying the limits.
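As a rough sketch of the filtering step and of Note 2 together: the excluded column "C", the percentile cut-offs, and the use of `Series.where` to blank out values are all assumptions for illustration, since the answer's own expression is not shown.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["A", "B", "C"])
low, high = 0.05, 0.95  # assumed cut-offs

# Exclude column "C" (a hypothetical choice) before computing the limits.
filt_df = df.loc[:, df.columns != "C"]
quant_df = filt_df.quantile([low, high])

def mask_outliers(col):
    """Replace values outside the column's limits with NaN."""
    lower = quant_df.loc[low, col.name]
    upper = quant_df.loc[high, col.name]
    # Strict comparison excludes the limits; use >= and <= to include them.
    return col.where((col > lower) & (col < upper))

# apply() passes each column to mask_outliers as a Series.
filtered = filt_df.apply(mask_outliers, axis=0)
print(filtered.isna().sum())
```

Values outside the limits come back as NaN, so the filtered dataframe keeps its original shape.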
In that case we would call apply on filt_df instead of the original dataframe.

Finally, if you want to delete the rows that contain any NaN introduced by apply, you can drop them afterwards.

Full example:
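The original listing is missing, so what follows is a minimal sketch that puts the steps together under the same assumptions as before (made-up column names, a fixed seed, and 5th/95th percentile limits). It uses `DataFrame.dropna` to remove the rows left with NaN.

```python
import numpy as np
import pandas as pd

# Example dataframe: 100 rows x 3 columns of random values (names assumed).
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["A", "B", "C"])

# Lower and upper limits per column (assumed 5th and 95th percentiles).
low, high = 0.05, 0.95
quant_df = df.quantile([low, high])

def mask_outliers(col):
    """Replace values outside the column's limits with NaN (limits excluded)."""
    lower = quant_df.loc[low, col.name]
    upper = quant_df.loc[high, col.name]
    return col.where((col > lower) & (col < upper))

# Blank out the outliers column by column, then drop rows with any NaN.
masked = df.apply(mask_outliers, axis=0)
clean = masked.dropna()

print(df.shape, "->", clean.shape)
```

For the dataframe in the question, the same pattern would be applied to `df_covtype`, restricted to its numeric columns.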