I need to fix a warning that appears when replacing cell values one by one within a loop in pandas.
Using the following data:
import pandas as pd
data = pd.read_csv('iEC7R76C.txt', sep=",")
data.head(2)
I want to fill in the year based on the title, and I extract it like this:
import re
cadena = data.title[0]
cadena
# 'Nicosia 2013 VulkÃ\xa0 Bianco (Etna)'
cadena = re.sub(r"\D", "", cadena)
cadena
# '2013'
Now if I assign that string to the year at position 0, the value does change, but I get a warning that I don't understand, even after a lot of googling:
data.anio[0] = cadena
data.head(2)
With something more or less like this I can go through all the data and fill it in, but the warning keeps coming up:
for i in range(0, len(data)):
    cadena = data.title[i]
    cadena = re.sub(r"\D", "", cadena)
    if cadena != '':
        if 1900 <= int(cadena) <= 2020:
            data['anio'][i] = cadena
I added the empty-string check because sometimes the title does not contain a year. I check that the year is between 1900 and 2020 because some titles look like "Name of wine 2001 batch 2", and when I keep only the digits of that string I get 20012.
I followed the link shown in the warning, but I do not understand it.
First, what does the error mean?
It's a warning that when you do something like:
dataframe[columna][indice] = valor
the value may not be assigned. This is because the selection dataframe[columna] sometimes returns a reference to the actual column of your dataframe, in which case the subsequent access with [indice] does modify the column. In other circumstances, though, dataframe[columna] may return a copy of your column, in which case the subsequent assignment would modify that copy but not the original dataframe.
It seems that pandas sometimes returns a copy, although I'm not clear on the circumstances. I guess it depends on what kind of expression you use to select the column. In any case, to be safe, it warns you not to do that.
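To make the "copy" case concrete, here is a minimal sketch under stated assumptions: the toy dataframe and its values are hypothetical, and it uses a boolean-mask selection rather than the plain column selection from the question, because that is one case where the intermediate object is certain to be a copy. On the pandas versions I am aware of, the chained line also triggers a warning (typically SettingWithCopyWarning).

```python
import pandas as pd

# Hypothetical toy data, not the wine dataset from the question.
df = pd.DataFrame({"title": ["a 2013", "b", "c 2001"], "anio": [0, 0, 0]})

# The boolean-mask selection builds a new object, so the second
# indexing step assigns into that temporary copy, not into df.
# pandas normally emits a warning on this line.
df[df["title"] != "b"]["anio"] = 2013

print(df["anio"].tolist())  # the original dataframe is unchanged
```

The assignment runs without raising, which is exactly why the warning exists: the write is silently lost.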
What to do then?
The correct way to reference a cell to change its value would be:
This way of accessing the cell will always modify the original dataframe, and therefore will not generate the warning.
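As an illustration of a single-step cell assignment, the sketch below uses `.loc` with a row label and a column name, which is the form the pandas warning message itself suggests. The dataframe is a hypothetical stand-in, and the year is stored as a number (an assumption on my part, chosen so the value is compatible with an empty numeric column).

```python
import pandas as pd

# Hypothetical stand-in for the question's data; "anio" starts out empty (NaN).
df = pd.DataFrame({
    "title": ["Nicosia 2013 Bianco (Etna)", "Wine without a year"],
    "anio": [float("nan"), float("nan")],
})

# One indexing operation on the dataframe itself: row label 0, column "anio".
df.loc[0, "anio"] = 2013

print(df)
```

Because row and column are given in one call, there is no intermediate object that could turn out to be a copy.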
In your code it would translate to:
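A sketch of how the loop from the question might look with `.loc`, assuming the dataframe has the default 0..n-1 index (so the loop counter is also the row label) and that storing the year as a number is acceptable. The sample titles are hypothetical.

```python
import re
import pandas as pd

# Hypothetical sample rows standing in for the CSV from the question.
data = pd.DataFrame({
    "title": [
        "Nicosia 2013 Bianco (Etna)",
        "Name of wine 2001 batch 2",
        "Wine without a year",
    ],
    "anio": [float("nan")] * 3,
})

for i in range(len(data)):
    # Keep only the digits of the title.
    cadena = re.sub(r"\D", "", data.loc[i, "title"])
    # Skip titles without digits and values outside the plausible range.
    if cadena != "" and 1900 <= int(cadena) <= 2020:
        data.loc[i, "anio"] = int(cadena)

print(data)
```

The second title still yields 20012 and is skipped, so this removes the warning but not the digit-concatenation problem described in the question.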
A better way
Whenever possible, avoid loops when doing pandas operations and replace them with vector operations that act on the entire dataframe in a single line. Naturally, pandas still loops internally, but those loops are done in C and are much more efficient than anything you can write in Python.
In this case all of your previous loop can be reduced to one line:
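A minimal sketch of such a one-liner, assuming the pattern is a capturing group of exactly four digits as described in this answer; the sample titles are hypothetical. `expand=False` is my own choice here so that a single capturing group comes back as a Series rather than a one-column dataframe.

```python
import pandas as pd

# Hypothetical sample rows standing in for the CSV from the question.
data = pd.DataFrame({
    "title": [
        "Nicosia 2013 Bianco (Etna)",
        "Name of wine 2001 batch 2",
        "Wine without a year",
    ],
})

# Extract the first run of four digits from every title in one step.
# Rows without a match get a missing value.
data["anio"] = data["title"].str.extract(r"(\d{4})", expand=False)

print(data)
```

Note that the extracted values are strings; convert them afterwards if you need numbers.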
The .str accessor on a column exposes vectorized string methods such as .extract() (and many more, like .startswith(), .strip(), etc.) that work on all the values "at once", so to speak.
In this case the extract() method expects a regular expression with at least one capturing group indicating which part of the text you want to extract. Here I have asked for a group of 4 digits. This way you avoid the problem of the result including extra digits that were not part of the year.
Update
The user indicates in a comment that some rows contain other four-digit numbers, such as 1840, in addition to the year sought. My solution with .extract() extracts only the first match, so if there are two it keeps the first one. The user asks whether it would be possible to check that the extracted year is between 1900 and 2019 and, if not, leave the result empty.
I have a better solution. You can refine the regular expression so that instead of saying "any sequence of four digits" it says "the digits 19 followed by two more digits, or the digits 20 followed by a 0 or a 1 and then another digit".
The regular expression that says that would be:
(19\d\d|20[01]\d)
Note the use of | to separate the two options. The first, 19\d\d, matches any year in the last century. The second, 20[01]\d, matches a year whose first two digits are "20", whose third digit is a 0 or a 1, and whose last digit is any digit, that is, the years between 2000 and 2019.
So now you would put:
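A sketch of the extraction with the refined pattern, built from the two alternatives described above (19\d\d and 20[01]\d). The first title is the example used in this answer; the other rows are hypothetical.

```python
import pandas as pd

data = pd.DataFrame({
    "title": [
        "18401 Cellars 2013 Proprietary Red (Walla Walla Valley (OR))",
        "Name of wine 2001 batch 2",
        "Old barrel 1840",  # hypothetical row with no year in range
    ],
})

# Only four-digit groups of the form 19xx or 2000-2019 can match.
data["anio"] = data["title"].str.extract(r"(19\d\d|20[01]\d)", expand=False)

print(data["anio"].tolist())
```

One thing to keep in mind: the range in the question went up to 2020, while this pattern stops at 2019.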
Thanks to this, in cells where you had two possible groups of four digits, such as "18401 Cellars 2013 Proprietary Red (Walla Walla Valley (OR))" only one of the groups will match the regular expression, and that will be the one returned.
Note that a case like "19235 Cellars 2013 Proprietary Red (Walla Walla Valley (OR))" could still appear, in which case 1923 would match, which would be wrong. You can be even more precise and force the group of digits to be a "whole word", that is, require a word boundary on both sides of the number. Then "19235" would not fit, because there is no word boundary after the 3. In a regular expression, \b means just that (word boundary), so the following regular expression would be even safer against cases like that:
\b(19\d\d|20[01]\d)\b
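A sketch of the word-boundary version in use, assuming the pattern is the previous one wrapped in \b on both sides. The first title is the example from this answer; the second is hypothetical.

```python
import pandas as pd

data = pd.DataFrame({
    "title": [
        "19235 Cellars 2013 Proprietary Red (Walla Walla Valley (OR))",
        "Lot 19235",  # hypothetical row: five digits, no standalone year
    ],
})

# \b on both sides requires the four digits to stand alone as a "word",
# so the 1923 inside 19235 is rejected.
data["anio"] = data["title"].str.extract(r"\b(19\d\d|20[01]\d)\b", expand=False)

print(data["anio"].tolist())
```

The raw-string prefix matters here: without it, Python would interpret "\b" as a backspace character instead of passing the word-boundary token to the regex engine.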