I have the following code in Pyspark:
%%time
lst_tablas = [tabla_pivote_0_15, tabla_pivote_15_30, tabla_pivote_30_60, tabla_pivote_60_90, tabla_pivote_90_120,
tabla_pivote_120_150, tabla_pivote_150_180,
tabla_pivote_180_210, tabla_pivote_210_240, tabla_pivote_240_270,
tabla_pivote_270_300, tabla_pivote_300_330, tabla_pivote_330_360]
trans = ['COMPRAS_CREDITO', 'RETIROS_CREDITO','RETIROS_DEBITO','COMPRAS_MADRUGADA','COMPRAS_MANANA','COMPRAS_TARDE','COMPRAS_NOCHE','RETIROS_MADRUGADA','RETIROS_MANANA','RETIROS_TARDE','RETIROS_NOCHE','COMPRAS_ENTRE_SEMANA','RETIROS_ENTRE_SEMANA','MAX_DEBITO_DISPONIBLE','MAX_PORC_ENDEUDAMIENTO','PORCENTAJE_COMPRAS_ECOMMERCE','razon_comercios','razon_transacciones','razon_compras','razon_retiros','razon_ecomerce']
i = 1
for t in lst_tablas:
    for var in trans:
        nombre = var + '_m' + str(i)
        t = t.withColumnRenamed(var, nombre)
    i += 1
    #display(t.show())

tabla_pivote_30_60.show()
The list holds the tables with the information for the 12 months of the year, with the first month split into 0 to 15 days and 15 to 30 days.
What I am trying to do is add a suffix to each of these tables (they have different fields), but only to the fields that appear in the list trans; that is the purpose of var + '_m' + str(i).
When I check the results and do a show() on the table tabla_pivote_30_60, I see that the variable names I wanted to change were not modified.
However, if I uncomment the line display(t.show()) to check what is happening inside the for loop, I see that within the loop the renaming works, but once the loop finishes the tables keep their old column names, as if the changes were only temporary.
Could someone tell me what is going on and help me modify the tables permanently? Thanks in advance.
Your bug has nothing to do with pyspark; it has to do with how Python works. I will first explain it with pure Python, without using pyspark.

Explanation
In Python, the = operator binds a name to an object. Two things can happen:

If the name does not exist yet, it is created and bound to the new object, so it is completely new and unrelated to anything that existed before.
If the name already exists, it is rebound: the name now points at the new object, and the old object itself is left untouched (it is only destroyed if nothing else references it).
In your case, on each pass through the loop you overwrite the variable t: Python rebinds the name t to a new object, which has nothing to do with the PySpark DataFrame stored in your list lst_tablas. Let's demonstrate this:
Great, now let's reproduce the mapping you are doing with pyspark, using a plain list:
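Again, the original snippet seems to have been stripped out; a reconstruction of the loop that mirrors your pyspark code:

```python
lista_de_listas = [[1], [1]]

for lista in lista_de_listas:
    # 'lista + [2]' builds a NEW list; '=' rebinds the name 'lista' to it,
    # leaving the element inside lista_de_listas unchanged
    lista = lista + [2]

print(lista_de_listas)  # [[1], [1]]
```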
Exactly what was described above has happened: when iterating with for, the name lista is bound to each element of lista_de_listas in turn, and when I overwrite lista it is rebound to a brand-new object, so the items inside lista_de_listas are never touched.
Solution
There are several solutions; here are a couple, following the previous example:
use enumerate
The idea is to index into the list when assigning, so that instead of rebinding a new variable we overwrite the element inside the existing list.
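The original snippet was lost in formatting; a reconstruction that produces the output shown below:

```python
lista_de_listas = [[1], [1]]

for i, lista in enumerate(lista_de_listas):
    # assigning through the index overwrites the element INSIDE the list
    lista_de_listas[i] = lista + [2]

print(lista_de_listas)
```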
Output:
[[1, 2], [1, 2]]
enumerate returns an iterator over pairs of (position, element), in that order. We could do the same with range() and len().
Your case would be:
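A sketch of the fix applied to your case. Note that a tiny FakeDF class (my invention, not part of PySpark) stands in for a DataFrame so the snippet runs without Spark; a real DataFrame.withColumnRenamed also returns a new DataFrame, so with PySpark the loop body is identical:

```python
# Minimal stand-in for a PySpark DataFrame, for illustration only.
class FakeDF:
    def __init__(self, cols):
        self.columns = list(cols)

    def withColumnRenamed(self, old, new):
        # Returns a NEW object, just like PySpark does
        return FakeDF([new if c == old else c for c in self.columns])

trans = ['COMPRAS_CREDITO', 'RETIROS_CREDITO']        # shortened list
lst_tablas = [FakeDF(trans + ['ID']), FakeDF(trans)]  # two sample tables

i = 1
for idx, t in enumerate(lst_tablas):
    for var in trans:
        t = t.withColumnRenamed(var, var + '_m' + str(i))
    lst_tablas[idx] = t   # the key line: store the renamed table back
    i += 1

print(lst_tablas[0].columns)  # ['COMPRAS_CREDITO_m1', 'RETIROS_CREDITO_m1', 'ID']
print(lst_tablas[1].columns)  # ['COMPRAS_CREDITO_m2', 'RETIROS_CREDITO_m2']
```

One caveat: this updates the entries inside lst_tablas, but the variables you built the list from (tabla_pivote_30_60, etc.) still point at the original, unrenamed DataFrames, which is why tabla_pivote_30_60.show() keeps showing the old names. Read the renamed tables from the list, or reassign those variables afterwards.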
use range
range() creates a sequence of integers up to (but not including) the number you give it, and len() returns the length of the list.
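A reconstruction of the lost snippet, producing the output below:

```python
lista_de_listas = [[1], [1]]

for i in range(len(lista_de_listas)):
    # index into the list on BOTH sides, so the element itself is replaced
    lista_de_listas[i] = lista_de_listas[i] + [2]

print(lista_de_listas)
```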
Output:

[[1, 2], [1, 2]]
Your case would be:
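A sketch with range()/len() applied to your case; as above, a FakeDF class (not a real PySpark API, just a stand-in for illustration) lets the snippet run without Spark:

```python
# Minimal stand-in for a PySpark DataFrame, for illustration only.
class FakeDF:
    def __init__(self, cols):
        self.columns = list(cols)

    def withColumnRenamed(self, old, new):
        # Returns a NEW object, just like PySpark does
        return FakeDF([new if c == old else c for c in self.columns])

trans = ['COMPRAS_CREDITO', 'RETIROS_CREDITO']   # shortened list
lst_tablas = [FakeDF(trans), FakeDF(trans)]      # two sample tables

i = 1
for idx in range(len(lst_tablas)):
    for var in trans:
        # rename through the index, so the list entry itself is replaced
        lst_tablas[idx] = lst_tablas[idx].withColumnRenamed(var, var + '_m' + str(i))
    i += 1

print(lst_tablas[0].columns)  # ['COMPRAS_CREDITO_m1', 'RETIROS_CREDITO_m1']
print(lst_tablas[1].columns)  # ['COMPRAS_CREDITO_m2', 'RETIROS_CREDITO_m2']
```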