I have a first dataframe to which I need to add rows from a second dataframe.
The first one looks more or less like this:
QID Questions B Answer1 Answer2 Answer3 F G H I J
0 3 a 4.0 a a a a e g i l
1 4 b 5.0 b b b a r h m p
2 5 d 5.0 NaN e d b u e i z
3 6 e 5.0 d h r b c z i 3
...
And the second:
QID Questions B Answer1 Answer2 Answer3 F ...
0 1 a 4.0 a a a a
1 2 b 5.0 b k b a
2 2_1 z 5.0 b k b a
3 2_2 w 4.0 b k b c
4 3 d 5.0 NaN e d b
5 4 e 5.0 d h r b
...
I would like to get:
QID Questions B Answer1 Answer2 Answer3 F G H I J
0 3 a 4.0 a a a a e g i l
1 4 b 5.0 b b b a r h m p
2 4_1 z 5.0 b k b a r h m p
3 4_2 w 4.0 b k b c r h m p
4 5 d 5.0 NaN e d b u e i z
5 6 e 5.0 d h r b c z i 3
...
As you can see, both dataframes share the question b, so the rows of the second dataframe whose QID contains _ have been added to the new dataframe.
More precisely: take a text t1 from the "Questions" column of the first dataframe and a text t2 from the "Questions" column of the second. For any pair (t1, t2) where t1 == t2, if the rows directly below that row in the second dataframe have a _ in their QID, I want to add those rows after the matching row in the first dataframe, using the first dataframe's QID as the prefix.
So far I have tried:
rows_to_add = pd.DataFrame()
for i, row1 in df.iterrows():
    for j, row2 in df2.iterrows():
        if row1['Questions'] == row2['Questions']:
            # here I want to test if the next row has _ in its QID
            # if so I add all the lines with the same QID before _ but with row1 QID
            k = 0
            for _, next_row_df2 in df2[j+1:].iterrows():
                if "_" in str(next_row_df2['QID']):
                    next_row_df2['QID'] = str(row1['QID']) + '_' + k
                    rows_to_add += next_row_df2 # but I need to change the QID
                else:
                    break # exit this loop and add the lines to the dataframe
                k += 1
            df = pd.concat([df.iloc[:i], rows_to_add, df.iloc[i:]]).reset_index(drop=True)
            rows_to_add = pd.DataFrame()
But it doesn't add the rows. Is there a more efficient way to do this, for example by iterating only over the df2 rows whose QID contains _? Or with map-reduce?
Although your explanation is a bit confusing, I think I understand what you are asking for; correct me if I have misunderstood.
For each question in dataframe_1, check whether the same question appears in dataframe_2.
If it does, look for its subquestions in dataframe_2, that is, the rows that have an underscore in the QID.
Replace the QID prefix of those subquestions from dataframe_2 with the QID of the question in dataframe_1.
We can write a function that performs these steps.
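A minimal sketch of such a function, assuming pandas, the QID and Questions column names from the question, and that a subquestion's QID is its parent's QID followed by _ and a suffix. The name get_subquestions and the trimmed sample data are hypothetical.

```python
import pandas as pd

def get_subquestions(question, new_qid, df2):
    """Return the df2 subquestion rows of `question`, re-keyed to `new_qid`."""
    qids = df2["QID"].astype(str)
    # The parent row has the same question text and no underscore in its QID.
    is_parent = (df2["Questions"] == question) & ~qids.str.contains("_", regex=False)
    if not is_parent.any():
        return df2.iloc[0:0].copy()  # question not found: empty frame
    parent_qid = qids[is_parent].iloc[0]
    # Assumption: subquestions are the rows whose QID starts with "<parent QID>_".
    subs = df2[qids.str.startswith(parent_qid + "_")].copy()
    # Swap the df2 prefix for the df1 QID and keep the "_n" suffix.
    subs["QID"] = str(new_qid) + subs["QID"].astype(str).str[len(parent_qid):]
    return subs

# Hypothetical sample data, trimmed from the question.
df2 = pd.DataFrame({
    "QID": ["1", "2", "2_1", "2_2", "3", "4"],
    "Questions": ["a", "b", "z", "w", "d", "e"],
    "B": [4.0, 5.0, 5.0, 4.0, 5.0, 5.0],
})

subs = get_subquestions("b", 4, df2)
print(subs)
```

Matching on the QID prefix rather than on "the next rows" avoids depending on the row order of dataframe_2.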
Finally, we call this function for every question in dataframe_1.
You can then concatenate the resulting dataframe with dataframe_1, reset the index, sort it, and so on.
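Regarding the efficiency concern in the question, here is a hedged, vectorized sketch of the whole procedure that only touches the dataframe_2 rows with an underscore. It is not the per-question loop described above but a swapped-in merge-based approach. The name insert_subquestions and the trimmed sample frames are hypothetical, and it assumes that question texts are unique in dataframe_1 and that columns missing from dataframe_2 (G here) should be copied from the parent row, as in the expected output.

```python
import pandas as pd

def insert_subquestions(df1, df2):
    """Insert df2 subquestion rows right after their matching df1 row."""
    df1 = df1.reset_index(drop=True)
    # Split each df2 QID at the first underscore: (parent, "_", suffix).
    parts = df2["QID"].astype(str).str.partition("_")
    is_sub = parts[1] == "_"
    # Map each df2 parent QID to its question text.
    qid_to_question = dict(zip(parts.loc[~is_sub, 0], df2.loc[~is_sub, "Questions"]))
    subs = df2[is_sub].copy()
    subs["suffix"] = parts.loc[is_sub, 2]
    subs["parent_q"] = parts.loc[is_sub, 0].map(qid_to_question)
    # Columns that only df1 has are copied from the parent row.
    extra_cols = [c for c in df1.columns if c not in df2.columns]
    lookup = df1[["QID", "Questions"] + extra_cols].rename(
        columns={"QID": "df1_qid", "Questions": "parent_q"})
    lookup["pos"] = range(len(df1))
    # Keep only subquestions whose parent question exists in df1.
    merged = subs.merge(lookup, on="parent_q", how="inner")
    merged["QID"] = merged["df1_qid"].astype(str) + "_" + merged["suffix"]
    merged["sub_order"] = range(len(merged))
    # Parents sort first (sub_order -1), then their subquestions.
    base = df1.assign(pos=range(len(df1)), sub_order=-1)
    out = pd.concat([base, merged[list(base.columns)]])
    return (out.sort_values(["pos", "sub_order"])
               .drop(columns=["pos", "sub_order"])
               .reset_index(drop=True))

# Hypothetical sample data, trimmed from the question.
df1 = pd.DataFrame({
    "QID": ["3", "4", "5", "6"],
    "Questions": ["a", "b", "d", "e"],
    "B": [4.0, 5.0, 5.0, 5.0],
    "G": ["e", "r", "u", "c"],
})
df2 = pd.DataFrame({
    "QID": ["1", "2", "2_1", "2_2", "3", "4"],
    "Questions": ["a", "b", "z", "w", "d", "e"],
    "B": [4.0, 5.0, 5.0, 4.0, 5.0, 5.0],
})

result = insert_subquestions(df1, df2)
print(result)
```

The helper columns pos and sub_order exist only to restore the row order after the concatenation and are dropped before returning.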
I hope this helps.