I want to write a nested for loop in Python that compares all the records of a dataframe with each other, record by record. This is what I have tried so far (the condition checks whether the record in the outer loop has a null date and the record in the inner loop does too):
    for i in df.index:
        for j in df.index:
            if (i != j and df.loc[i, 'idDCL'] == df.loc[j, 'idDCL']
                    and pd.isnull(df.loc[i, 'fechaFin']) and pd.isnull(df.loc[j, 'fechaFin'])):
                pass  # loop body not shown in the question
The data
There is a dataframe whose columns include two that are of interest, called "idDCL" and "fechaFin". A piece of this dataframe would be:
"idDCL"
The same value can appear in this column many times. For example, the value "557DGQ" appears ten times.
In some cases fechaFin is NaN. Some idDCL values have one NaN and others more than one; for example, "534OUT" appears 13 times, and in two of those rows fechaFin is NaN.
The problem posed
(As I understood it, correct me in comments if I'm wrong)
Find all the values of idDCL for which there are two or more NaNs in the fechaFin column.
The solution
Group by idDCL and, for each group, count how many NaN values there are, keeping only (filtering) the groups that have more than one NaN. From the result we keep the idDCL column, which we convert to a set to remove duplicates.
This gives us the set of ids (idDCL) that have two or more NaNs in fechaFin. Taking a look at the result, there are 58 cases.
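The group, count and filter steps described above can be sketched as follows. This is a minimal sketch, not necessarily the exact code used on the real data: the toy dataframe is made up for illustration, and it assumes fechaFin is a datetime column whose missing values are NaT (`isna()` treats NaN and NaT alike).

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataframe.
df = pd.DataFrame({
    "idDCL": ["557DGQ", "557DGQ", "534OUT", "534OUT", "534OUT", "111AAA"],
    "fechaFin": pd.to_datetime(
        ["2020-01-01", None, None, None, "2021-05-01", None]
    ),
})

# Keep only the groups with more than one missing fechaFin,
# then collect their idDCL values into a set to drop duplicates.
ids = set(
    df.groupby("idDCL")
      .filter(lambda g: g["fechaFin"].isna().sum() > 1)["idDCL"]
)
print(ids)  # {'534OUT'}
```

In this toy data only "534OUT" has two missing dates, so it is the only id in the set; `len(ids)` gives the number of cases.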
If we now want to see the entire rows where the problem occurs, we can use these ids to filter the entire dataframe (along with the condition that fechaFin is NaN).
This is the dataframe you were looking for. You can dump it to CSV or do whatever you want with it. For example, you can look at its first rows, ordering by the idDCL column so that rows with the same idDCL come out together.
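A minimal sketch of that row filter, again on made-up toy data: the name `problem_rows` is hypothetical, and `ids` is recomputed here only so that the snippet runs on its own.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataframe.
df = pd.DataFrame({
    "idDCL": ["557DGQ", "534OUT", "557DGQ", "534OUT", "534OUT", "557DGQ", "111AAA"],
    "fechaFin": pd.to_datetime(
        [None, None, None, None, "2021-05-01", "2020-01-01", None]
    ),
})

# Set of ids with more than one missing fechaFin.
ids = set(
    df.groupby("idDCL")
      .filter(lambda g: g["fechaFin"].isna().sum() > 1)["idDCL"]
)

# Rows whose idDCL is in that set AND whose fechaFin is missing.
problem_rows = df[df["idDCL"].isin(ids) & df["fechaFin"].isna()]

# Sort so that rows with the same idDCL come out together.
print(problem_rows.sort_values("idDCL").head())
```

From here, `problem_rows.to_csv(...)` would write the result to a file.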
As you can see, this avoids a for loop, let alone two nested for loops, which would have O(n^2) complexity and, on a dataframe as big as this one, would take several seconds of processing (my solution finishes in less than 1 s).
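As a sanity check, the nested loops from the question and a vectorized count can be compared on a small made-up dataframe. This is only a sketch: the data and the names `loop_ids`, `nan_counts` and `vector_ids` are hypothetical, and the vectorized variant shown here (summing a boolean mask per group) is just one of several equivalent ways to count NaNs per idDCL.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataframe.
df = pd.DataFrame({
    "idDCL": ["557DGQ", "534OUT", "557DGQ", "534OUT", "534OUT", "111AAA"],
    "fechaFin": pd.to_datetime(
        [None, None, "2020-01-01", None, "2021-05-01", None]
    ),
})

# Nested-loop version: O(n^2) pairwise comparisons.
loop_ids = set()
for i in df.index:
    for j in df.index:
        if (i != j and df.loc[i, "idDCL"] == df.loc[j, "idDCL"]
                and pd.isnull(df.loc[i, "fechaFin"])
                and pd.isnull(df.loc[j, "fechaFin"])):
            loop_ids.add(df.loc[i, "idDCL"])

# Vectorized version: count missing dates per idDCL in a single pass.
nan_counts = df["fechaFin"].isna().groupby(df["idDCL"]).sum()
vector_ids = set(nan_counts[nan_counts > 1].index)

print(loop_ids == vector_ids)  # True
```

Both approaches flag the same ids; the difference is that the loop does n*n comparisons while the grouped count touches each row once.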