Hi all, I am developing some queries and, for each 'pfr_Fault_Code', I want a group of dates in which
consecutive dates are no more than 30 days apart.
If you run the following code, you get a grouping by 'pfr_Fault_Code'
and date, with a count of events:
from collections import defaultdict
import pandas as pd
from datetime import datetime, timedelta
from time import time
df = pd.read_csv('FDE_before_delay_aviso_previo.csv', header=0, sep=',', usecols=[1,16,20,27,28],parse_dates = ['fault_date'])
df2 = df.sort_values(by=['CASS_ID', 'pfrs_pfr_date', 'fault_date'])  # DataFrame.sort() was removed from pandas
df2 = df2.reset_index(drop=True)
df2=df2.groupby(['pfr_Fault_Code','fault_date']).count()
print(df2)
I want the same count, but only over dates that are no more than 30 days apart within each 'pfr_Fault_Code',
working backwards from the most recent date. The dates are already sorted. Is it possible to do this directly in pandas?
Example:
For one 'pfr_Fault_Code'=XXXXXX
I have the following 'fault_date'=[2017/12/01],[2017/11/29],[2017/11/10],[2017/09/30],[2017/09/15]
Working back from the most recent date, I look for a gap of more than 30 days between consecutive dates. The first such gap falls between [2017/11/10]
and [2017/09/30],
so the dates I want to keep are 'fault_date'=[2017/12/01],[2017/11/29],[2017/11/10].
I was thinking of adding something like this, a new column with the difference in days, but it doesn't work for me:
df2['diferencia'] =df2.groupby('pfr_Fault_Code')['fault_date'].transform(pd.Series.diff).fillna(df2['fault_date'])
TypeError: unsupported operand type(s) for -: 'str' and 'str'
The CSV file needed to run the code is here:
https://drive.google.com/file/d/1f8IFNvMA0Zbm_t0whweZjt4QSHR-ahVy/view?usp=sharing
You can use pandas.DataFrame.groupby.diff() together with count() to create a new column that marks, within each value of pfr_Fault_Code, each group of dates that are no more than 30 days apart from the one preceding them. Once this is done, it is enough to keep only the first group for each value of pfr_Fault_Code.
Let's look at a reproducible, simplified example:
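A minimal sketch of such an example, under stated assumptions: the sample data is invented (the dates for XXXXXX come from the question, the code YYYYYY is hypothetical), the column names gap_days and block are made up for illustration, and the blocks are numbered with a cumulative sum of the "gap greater than 30 days" flags, which is one possible way to do it and not necessarily the original one:

```python
import pandas as pd

# Hypothetical sample data: "XXXXXX" reuses the dates from the question,
# "YYYYYY" is invented for illustration.
df = pd.DataFrame({
    "pfr_Fault_Code": ["XXXXXX"] * 5 + ["YYYYYY"] * 3,
    "fault_date": pd.to_datetime([
        "2017-12-01", "2017-11-29", "2017-11-10", "2017-09-30", "2017-09-15",
        "2017-12-05", "2017-10-01", "2017-09-20",
    ]),
})

# Sort newest first within each code so that we walk backwards in time.
df = df.sort_values(["pfr_Fault_Code", "fault_date"],
                    ascending=[True, False]).reset_index(drop=True)

# Days between each row and the more recent row before it, per code.
# The first row of each code has no predecessor, so its gap is NaN.
df["gap_days"] = -df.groupby("pfr_Fault_Code")["fault_date"].diff().dt.days

# A new block starts whenever the gap exceeds 30 days; the running sum of
# those flags numbers the blocks 0, 1, 2, ... within each code.
df["block"] = (df["gap_days"].gt(30).astype(int)
               .groupby(df["pfr_Fault_Code"]).cumsum())

print(df)
```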
With this we obtain the following DataFrame:
Now it is very simple to obtain, for each value of pfr_Fault_Code, the rows up to the point where one of the fault_date values differs by more than 30 days from the one preceding it:
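This filtering step could be sketched as follows. It is self-contained, so it rebuilds the same invented sample data (YYYYYY is a hypothetical code), and the names gap, breaks, recent and counts are illustrative only. Rows are kept while no gap of more than 30 days has yet been seen for their code:

```python
import pandas as pd

# Hypothetical sample data: "XXXXXX" reuses the dates from the question.
df = pd.DataFrame({
    "pfr_Fault_Code": ["XXXXXX"] * 5 + ["YYYYYY"] * 3,
    "fault_date": pd.to_datetime([
        "2017-12-01", "2017-11-29", "2017-11-10", "2017-09-30", "2017-09-15",
        "2017-12-05", "2017-10-01", "2017-09-20",
    ]),
})

# Newest date first within each code.
df = df.sort_values(["pfr_Fault_Code", "fault_date"], ascending=[True, False])

# Days between each row and the more recent row before it (NaN for the
# first row of each code, which never counts as a break).
gap = -df.groupby("pfr_Fault_Code")["fault_date"].diff().dt.days

# Number of breaks (> 30 days) seen so far within each code.
breaks = gap.gt(30).astype(int).groupby(df["pfr_Fault_Code"]).cumsum()

# Keep only the rows that come before the first break of their code.
recent = df[breaks.eq(0)]

# Event count per code over that most recent run of dates.
counts = recent.groupby("pfr_Fault_Code")["fault_date"].count()

print(recent)
print(counts)
```

For the real CSV, the same steps should apply once fault_date holds datetimes rather than strings (for example via pd.to_datetime), since subtracting strings is what raises the TypeError shown in the question.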