I have the following two DataFrames, both with a column named iden:
df1:
iden c A1 A2 A3
11 1 1 1 NaN
23 2 3 3 NaN
11 3 2 2 1
74 4 NaN 1 NaN
74 1 NaN NaN NaN
df2:
iden caso
74 A
77 B
11 C
25 A
48 B
What I need is to replace all the values of the iden column in both DataFrames so that any value that appears in both DataFrames is assigned the same number (the values are identifiers). In the example, the result would be:
df1:
iden c A1 A2 A3
1 1 1 1 NaN
2 2 3 3 NaN
1 3 2 2 1
3 4 NaN 1 NaN
3 1 NaN NaN NaN
df2:
iden caso
3 A
4 B
1 C
5 A
6 B
I thought about creating a new column in each DataFrame, using isin to generate the numbers:
df1['new_iden'] = list(".." if x else ".." for x in df1.iden.isin(df2.iden))
and then delete the original column.
But I don't know what value to put in the if so that it generates the numbers as required.
I would appreciate any help.
One possibility is to iterate over both columns (first df1.iden and then df2.iden) and assign new values in that order, using a dictionary as an intermediary to store the "old value": "new value" pairs. Then just use loc/at to assign each cell its new value according to the dictionary.
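A minimal sketch of this approach, assuming Python 3.8+ and the sample data from the question (only some of its columns are reproduced here). The names mapping and next_id are illustrative, and a unique index is assumed so that at returns a single cell:

```python
import pandas as pd

# Sample data from the question (A1..A3 omitted for brevity)
df1 = pd.DataFrame({"iden": [11, 23, 11, 74, 74], "c": [1, 2, 3, 4, 1]})
df2 = pd.DataFrame({"iden": [74, 77, 11, 25, 48], "caso": list("ABCAB")})

mapping = {}   # "old value": "new value"
next_id = 1    # next identifier to hand out

for df in (df1, df2):              # df1 first, then df2
    for idx in df.index:
        old = df.at[idx, "iden"]
        # Assignment expression: look up and bind in one step
        if (new := mapping.get(old)) is None:
            new = mapping[old] = next_id
            next_id += 1
        df.at[idx, "iden"] = new

print(df1)
print(df2)
```

With this data, df1.iden becomes 1, 2, 1, 3, 3 and df2.iden becomes 3, 4, 1, 5, 6, which matches the result requested in the question.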
On Python < 3.8 (which lacks assignment expressions), the same logic has to be written without them.
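A sketch of the dictionary approach that avoids assignment expressions, so it should also run on older Python versions. As before, the sample data is taken from the question, and mapping and next_id are illustrative names:

```python
import pandas as pd

# Sample data from the question (A1..A3 omitted for brevity)
df1 = pd.DataFrame({"iden": [11, 23, 11, 74, 74], "c": [1, 2, 3, 4, 1]})
df2 = pd.DataFrame({"iden": [74, 77, 11, 25, 48], "caso": list("ABCAB")})

mapping = {}   # "old value": "new value"
next_id = 1

for df in (df1, df2):
    for idx in df.index:
        old = df.at[idx, "iden"]
        if old not in mapping:       # first time this identifier is seen
            mapping[old] = next_id
            next_id += 1
        df.at[idx, "iden"] = mapping[old]
```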
There are, as always, more possibilities. Another option is to use collections.defaultdict to generate the dictionary, together with itertools.count (as a generator of the new identifiers) and pandas.Series.replace to substitute the values based on the dictionary.
For DataFrames with a relatively large number of rows and replacements, it is in principle more efficient than the previous version.
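A sketch of this variant under the same assumptions about the sample data. The names counter and mapping are illustrative, and the defaultdict is converted to a plain dict before calling replace purely as a precaution:

```python
from collections import defaultdict
from itertools import count

import pandas as pd

df1 = pd.DataFrame({"iden": [11, 23, 11, 74, 74], "c": [1, 2, 3, 4, 1]})
df2 = pd.DataFrame({"iden": [74, 77, 11, 25, 48], "caso": list("ABCAB")})

counter = count(1)                             # 1, 2, 3, ...
mapping = defaultdict(lambda: next(counter))   # unseen key -> next number

# Touching each key once, in order (df1 first, then df2), fills the dictionary
for value in pd.concat([df1["iden"], df2["iden"]]):
    mapping[value]

df1["iden"] = df1["iden"].replace(dict(mapping))
df2["iden"] = df2["iden"].replace(dict(mapping))
```

Here the new identifiers (1 to 6) do not collide with any of the old ones. If they could overlap in your data, it is worth checking the result of replace on a small sample first.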
You can create a generator function that produces the next successive number on each iteration.
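For instance, a minimal sketch of such a generator (the name successive_numbers and the start parameter are hypothetical), consumed one value at a time with next:

```python
def successive_numbers(start=1):
    """Yield start, start + 1, start + 2, ... without ever stopping."""
    n = start
    while True:
        yield n
        n += 1

gen = successive_numbers()
first_three = [next(gen) for _ in range(3)]  # pull exactly three values
print(first_three)
```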
But be careful: if you try to exhaust the generator by iterating over it in a for loop, or by using it to initialize other containers such as lists or sets, you will end up in an infinite loop.
If you know the range of indices you need, you can make it safer with a for loop.
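A sketch of the bounded version, assuming the identifiers needed are 1 through 6 as in the example; the function name and its parameters are hypothetical. Because the generator ends by itself, it can safely be turned into a list:

```python
def successive_numbers(start, stop):
    """Yield start, start + 1, ..., stop - 1 and then finish."""
    for n in range(start, stop):
        yield n

ids = list(successive_numbers(1, 7))  # safe: the generator is finite
print(ids)
```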