I am working with a CSV file. One of the columns is stored as strings and I want to convert it to float. I have written a function that removes commas and currency symbols. However, the values returned by the function also overwrite the variable that holds the whole contents of the CSV file, and I get this warning:
A value is trying to be set on a copy of a slice from a DataFrame
With the code it will be clearer:
import pandas as pd
import numpy as np
import os
mainPath = 'C:/Users/Marcos/Documents/'
filePath = './resultados.csv'
fullPath = os.path.join(mainPath, filePath)
# Read the file and assign the 2017 column to the variable 'parse'
data = pd.read_csv(fullPath, sep = '\t')
parse = data['2017'] # Assigning data.loc[:, '2017'] instead still triggers the warning
data.head()
The first result, using head, gives the following. To clarify: in this example I am working with the 2017 column, which I want to convert to float:
Now I show the result of the parse variable
The function I have created is the following. It checks whether each character is a digit or a period, builds a new string from those characters, and writes that string back into the column:
def parseFloat(col):
    length = len(col)
    for i in range(length):
        iElement = col[i]
        newElement = ''
        for j in range(len(col[i])):
            if iElement[j].isdigit():
                newElement += str(iElement[j])
            elif iElement[j] in ['.']:
                newElement += '.'
        col[i] = newElement
    return col

parsedResult = parseFloat(parse)
print(parsedResult)
The result of the print is the following (warning included):
If it weren't for the warning, the result would be as expected. But if I inspect the original data variable with .head(), the values of the 2017 column have been replaced there as well.
The first thing to understand is that some operations in Pandas that select part of the original data return a view of the data, while others return a copy. A view lets us work with a subset of the data as if it were a separate DataFrame/Series, but this is just an illusion: the selected data is never actually copied. It is like laying a perforated template over the page of a book so that only some words show through. If we cross out a word through the template, we cross it out on the page of the book. In other words, modifying data through the view modifies it in the DataFrame the view came from, which is what happens in your case.
Having said that, almost everyone who uses Pandas runs into this warning sooner or later. An example makes this clearer:
We are going to create a stupid function that iterates over the column with a for loop and adds 2 to each cell. I say "stupid" because it makes no sense to do this with an inefficient Python loop when it is trivial to vectorize in Pandas/NumPy, but it keeps the example very simple.

Now let's run it, passing column A:
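A minimal sketch of such an experiment, assuming a small hypothetical DataFrame with an integer column A and a string column B (the data and the function name add_two are illustrative, not taken from the original). Which warning categories are emitted, and whether df itself ends up modified, depends on the installed pandas version, so the sketch records the warnings instead of assuming a particular one.

```python
import warnings

import pandas as pd

# Hypothetical sample data; the mixed column types (int and str) are an assumption
df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})


def add_two(columna):
    """Add 2 to every cell with a deliberately inefficient Python loop."""
    for i in range(len(columna)):
        columna[i] = columna[i] + 2  # element-wise assignment on the object received
    return columna


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning instead of printing it
    result = add_two(df["A"])

print(result.tolist())
# Names of the warning categories raised, if any (version dependent)
print(sorted({w.category.__name__ for w in caught}))
# Whether df was modified as well is also version dependent
print(df["A"].tolist())
```

The returned column always holds the incremented values; the interesting part is the last two lines, whose output is not the same across pandas versions.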
Our dear warning... Why? Well, because we have done what is known as a chained assignment:

1. First we did df["A"], which selected column A of the DataFrame, but Pandas didn't return a copy of the data, it returned a view.
2. Then, inside the function, we assigned to columna[i]. Since the previous step returned a view, this is equivalent to df["A"][i] = ...

We are chaining two indexing operations and performing an assignment on the result; this is known as a chained assignment.

These two chained operations are executed independently and sequentially, one after the other. The first selects a column from among the others in the DataFrame using __getitem__. The second selects a specific index of that column and assigns a value, which is done through the method __setitem__. At best, this indexing sequence makes the code less efficient.

It is not easy to determine a priori when Pandas returns a view and when it returns a copy, although there are some rules of thumb. For example, indexing operations on an object with several data types will generally return a copy. On an object with a single type, however, it almost always returns a view for efficiency reasons; the "almost always" is because it depends on the memory layout of the object. At this point anyone would ask: why doesn't Pandas predictably and clearly return a view or a copy in every situation?
This is because Pandas uses NumPy under the hood to represent and operate on the data while trying to offer versatile indexing methods. Views are inherited from NumPy, where they are predictable mainly because a NumPy array has only one type. Pandas always tries to minimize memory use and processing time when storing a complex DataFrame, with multiple levels and types, on top of NumPy. To that end it applies a set of complex rules whose aim is to find the best possible NumPy array structure to represent a given data set. Sections of a DataFrame that contain a single data type can be returned as a view onto a single NumPy array, which is a highly efficient way of handling the operation.
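The NumPy behaviour that Pandas inherits can be checked directly. The sketch below uses only NumPy and an illustrative array: basic slicing yields a view that shares memory with the original, while indexing with a list of integers yields a copy.

```python
import numpy as np

arr = np.arange(6)

basic = arr[1:4]        # basic slicing: NumPy returns a view
fancy = arr[[1, 2, 3]]  # integer-list ("fancy") indexing: NumPy returns a copy

print(np.shares_memory(arr, basic))  # True
print(np.shares_memory(arr, fancy))  # False

basic[0] = 100  # writes through to arr
fancy[1] = -1   # only changes the copy
print(arr.tolist())  # [0, 100, 2, 3, 4, 5]
```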
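Well, at this point we know two things:

1. An indexing operation may return a view or a copy, and it is hard to predict which.
2. Chained indexing operations of the form df[...][...][...] are performed independently and sequentially.

What does this have to do with the damn warning?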
Well, if one of the indexing operations returns a view and not a copy, the result can be unpredictable, since the effects will also be reflected in the data set that the view came from. We may or may not want this side effect, but given its unpredictable and inconsistent nature, the warning is raised to alert us to it. It may seem harmless, but inadvertently modifying the original data can cause one of the worst kinds of bug there is: wrong results that go completely unnoticed, with no exceptions and no obvious outliers...
In your case, if you really wanted to parse the column in the DataFrame itself, there is no problem. But when this happens unintentionally it can be a disaster, with unpredictable and inconsistent results.
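The two-step nature of chained assignment can be made visible with a toy container that logs its indexing calls. The class Traced is hypothetical and has nothing to do with the Pandas internals; it only shows which special methods Python invokes for each syntax.

```python
class Traced:
    """Toy container (hypothetical, not pandas) that logs indexing calls."""

    def __init__(self, data, log, name="obj"):
        self.data = data
        self.log = log
        self.name = name

    def __getitem__(self, key):
        self.log.append(f"{self.name}.__getitem__({key!r})")
        # Wrap the selected inner object so its own calls are logged too
        return Traced(self.data[key], self.log, name=f"{self.name}[{key!r}]")

    def __setitem__(self, key, value):
        self.log.append(f"{self.name}.__setitem__({key!r})")
        if isinstance(key, tuple):  # single-step form: obj[col, row] = value
            col, row = key
            self.data[col][row] = value
        else:
            self.data[key] = value


log = []
obj = Traced({"A": [1, 2, 3]}, log)

obj["A"][0] = 10  # chained: __getitem__ on obj, then __setitem__ on its result
obj["A", 1] = 20  # single step: one __setitem__ call on obj itself

for entry in log:
    print(entry)
```

In the chained form the assignment is received by whatever the first call returned, so the outer object never learns that an assignment happened. In the single-step form the outer object handles the whole operation itself.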
The solution is to learn to always detect and avoid chained assignment. The same operation can be done in a single step using loc.

With loc there is no sign of the warning. Why? Because by doing loc[i, col_name] we allow the indexing and the assignment to be done in one step, in a single call to __setitem__, thus avoiding the problem of not being able to determine whether the first indexing returns a view or a copy. This code is therefore predictable: it modifies the value of the cell in place.

If we can't or don't want to do the above, we can create a copy explicitly.
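Both options can be sketched as follows on a hypothetical DataFrame (the data and the variable name columna are assumptions for illustration). The loops are kept only to mirror the example; a vectorized expression would be preferable in real code.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

# Option 1: modify df in place, row and column selected in a single loc call
for i in df.index:
    df.loc[i, "A"] = df.loc[i, "A"] + 2

# Option 2: work on an explicit copy, so df is left untouched
columna = df["A"].copy()
for i in columna.index:
    columna.loc[i] = columna.loc[i] * 10

print(df["A"].tolist())   # [3, 4, 5]
print(columna.tolist())   # [30, 40, 50]
```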
In both cases we state explicitly whether or not the original data is going to be modified, so we know deterministically what is going to happen. If we mess up, it is because we made a mistake, not because NumPy/Pandas decided for itself, based on confusing rules, whether to return a view or a copy in a given case depending on data type, available memory or mood... :).
We must be clear about the concept of chained indexing: it is not limited to df[...][...][...]. We can also cause it with loc, iloc or other operations that involve data selection if they are applied in a chained way.

The examples are a bit forced but I hope they serve to understand the concept.
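One case of chained indexing through loc, sketched on a hypothetical DataFrame with a boolean mask (data and names are assumptions). Selecting rows with a boolean mask builds a new object, so the chained assignment lands on that temporary and the original is left unchanged; the warning this triggers is silenced here only to keep the output short.

```python
import warnings

import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})
mask = df["A"] > 1

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silenced only for this sketch
    # Chained: two separate loc calls, the assignment hits a temporary object
    df.loc[mask].loc[:, "A"] = 0
print(df["A"].tolist())  # [1, 2, 3]

# Single step: rows and column selected in one loc call
df.loc[mask, "A"] = 0
print(df["A"].tolist())  # [1, 0, 0]
```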
That said, if you want to convert the column in the original DataFrame directly, you can just do:
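A possible vectorized version, assuming the column holds strings such as "$1,234.50" in which commas are thousands separators and the period is the decimal mark. The sample data below is a hypothetical stand-in for the CSV.

```python
import pandas as pd

# Hypothetical stand-in for the CSV; the formatting of the values is assumed
data = pd.DataFrame({"2017": ["$1,234.50", "$987.00", "12,000"]})

# Drop everything that is not a digit or a period, then cast to float
data["2017"] = data["2017"].str.replace(r"[^\d.]", "", regex=True).astype(float)

print(data["2017"].tolist())  # [1234.5, 987.0, 12000.0]
```

The whole column is replaced in a single assignment to data["2017"], so no chained indexing is involved.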
Finally, I was able to fix the error.
Inside the function I created a new DataFrame and then assigned it to the 2017 column.
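A sketch of what such a fix might look like, not necessarily the code actually used. The function parse_float (hypothetical name) builds a new Series instead of writing into the object it receives, and the caller assigns the result back to the column. The sample data is assumed, and a Series is used here where the description mentions a DataFrame.

```python
import pandas as pd


def parse_float(col):
    """Return a new float Series; the input column is never written to."""
    cleaned = []
    for text in col:
        # Keep digits and periods only (assumes each cell has at least one digit)
        kept = "".join(ch for ch in text if ch.isdigit() or ch == ".")
        cleaned.append(float(kept))
    return pd.Series(cleaned, index=col.index, name=col.name)


# Hypothetical stand-in for the CSV
data = pd.DataFrame({"2017": ["$1,234.50", "$987.00"]})
data["2017"] = parse_float(data["2017"])

print(data["2017"].tolist())  # [1234.5, 987.0]
```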
Even so, I would appreciate it if someone could explain what triggered the warning, so that I can avoid future complications.