
Question

gmarsi

Asked: 2020-08-12 12:00:31 +0800 CST

Avoid overwriting variables, SettingWithCopyWarning


I am working with a CSV file. One of the columns holds strings and I want to convert it to float. I have written a function that removes commas and currency symbols. However, the values returned by the function are also written directly into the variable that holds the whole CSV file, and I get this warning:

A value is trying to be set on a copy of a slice from a DataFrame

With the code it will be clearer:

import pandas as pd
import numpy as np
import os

mainPath = 'C:/Users/Marcos/Documents/'
filePath = './resultados.csv'
fullPath = os.path.join(mainPath, filePath)

# Read the file and assign the 2017 column to the variable 'parse'
data = pd.read_csv(fullPath, sep = '\t')
parse = data['2017'] # Using data.loc[:,'2017'] instead still triggers the warning
data.head()

The first result, using head, is the following. To clarify, in this example I am working with the 2017 column, which I want to convert to float:

Now I show the result of the parse variable

The function I have created is the following. It checks whether each character is a digit or a period, builds a new string from those characters and writes it back into the array:

def parseFloat(col):
    length = len(col)
    for i in range(length):
        iElement = col[i]
        newElement = ''
        for j in range(len(col[i])) :
            if iElement[j].isdigit():
                newElement += str(iElement[j])
            elif iElement[j] in ['.']:
                newElement += '.'            
        col[i] = newElement
    return col

parsedResult = parseFloat(parse)

print(parsedResult)

The result of the print is the following (warning included):

If it weren't for the warning, the result would be as expected. But if I inspect the original data variable with .head(), the values in the 2017 column have been replaced there as well.

2 Answers


FJSevilla · Answer 1 · 2020-08-12T15:52:17+08:00

The first thing to understand is that some Pandas operations that select part of the original data return a view of the data, while others return a copy. A view lets us work with a subset of the data as if it were a separate DataFrame/Series, but this is only an illusion: no copy of the selected data is ever made. It is like laying a perforated template over the page of a book so that only some words are visible. If we cross out a word with the template on, we cross it out on the page of the book itself. In other words, modifying data through the view also modifies the DataFrame that produced it, which is what happens in your case.

Having said that, almost everyone who uses Pandas runs into this warning sooner or later. An example makes it clearer:
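The view/copy distinction is easiest to see in plain NumPy, where the rules are predictable. This is a minimal sketch, not Pandas behaviour: the array `page` and the names `template` and `photocopy` are made up to match the analogy above.

```python
import numpy as np

page = np.array([10, 20, 30, 40])

# Basic slicing returns a view: it shares memory with `page`.
template = page[1:3]
# Indexing with a list of positions ("fancy indexing") returns a copy.
photocopy = page[[1, 2]]

template[0] = -1    # writes through to `page`
photocopy[1] = -2   # only changes the copy

print(page)                               # [10 -1 30 40]
print(np.shares_memory(page, template))   # True
print(np.shares_memory(page, photocopy))  # False
```

Writing through `template` changes `page`, while writing to `photocopy` leaves it alone.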

import pandas as pd


data = {"A": [1, 2, 3], "B": [2.2, 3.4, 1.3]}
df = pd.DataFrame(data)

>>> df
   A    B
0  1  2.2
1  2  3.4
2  3  1.3

We are going to create a stupid function that iterates over the column with a for loop and adds 2 to each cell. I say "stupid" because it makes no sense to do this with an inefficient Python loop when it is simple to vectorize in Pandas/NumPy, but it keeps the example very simple:

def sumar2(columna):
    for i in range(columna.size):
        columna[i] += 2

Now let's run it passing column A:

>>> sumar2(df["A"])

SettingWithCopyWarning: 
   A value is trying to be set on a copy of a slice from a DataFrame...

Our dear warning... Why? Well, because we have done what is known as a chained assignment :

It all started at df["A"], which selected column A of the DataFrame, but Pandas didn't return a copy of the data, it returned a view.
In the function we perform an assignment (modify a value) using columna[i]. Since the previous step returned a view, this is equivalent to df["A"][i] = ... We are chaining two indexing operations and assigning through them, which is what is known as a chained assignment.

These two chained operations are executed independently, one after the other. First, __getitem__ selects one column of the DataFrame; then __setitem__ selects a specific index in that column and assigns the value. At best, this sequence of indexing operations makes the code less efficient.
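The two-call sequence can be made visible with a toy container. This is only a sketch: `LoggingTable` is a hypothetical class written for this illustration, not part of Pandas, and it simply records which special method is called on it.

```python
class LoggingTable:
    """Hypothetical 2-D container that records every indexing call made on it."""

    def __init__(self, columns):
        self.columns = columns   # dict: column name -> list of values
        self.calls = []

    def __getitem__(self, key):
        self.calls.append(("__getitem__", key))
        return self.columns[key]

    def __setitem__(self, key, value):
        self.calls.append(("__setitem__", key))
        row, col = key
        self.columns[col][row] = value


table = LoggingTable({"A": [1, 2, 3]})

table["A"][0] = 10   # chained: __getitem__ on table, then the write happens on the returned list
table[1, "A"] = 20   # single step: one __setitem__ on table itself

print(table.calls)     # [('__getitem__', 'A'), ('__setitem__', (1, 'A'))]
print(table.columns)   # {'A': [10, 20, 3]}
```

In the chained form the container never sees the assignment; it only sees a read. Whether the write reaches the original data depends entirely on what that read returned.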

It is not easy to determine a priori when Pandas returns a view and when it returns a copy, although there are some rules of thumb. For example, indexing operations on an object with several data types generally return a copy. For efficiency, as mentioned before, indexing an object with a single type almost always returns a view; the "almost always" is because it depends on the memory layout of the object. At this point anyone would ask: why doesn't Pandas generate a view or a copy predictably and clearly in every situation?

This is because Pandas uses NumPy under the hood to represent and operate on the data, while trying to offer versatile indexing methods. Views are inherited from NumPy, where they are predictable mainly because a NumPy array has only one type. Pandas always tries to minimize memory use and processing time when storing a complex DataFrame, with multiple levels and types, on top of NumPy. To do so it follows a set of complex rules whose aim is to find the best possible NumPy array structure to represent a given data set. Sections of a DataFrame that contain a single data type can be returned as a view on a single NumPy array, which is a highly efficient way of handling the operation.
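A small sketch of why mixed types get in the way of a single shared array. The frames `same_type` and `mixed` are made up for this illustration; the point is only that a NumPy array has one dtype, so mixed columns must be converted to a common type before they fit in one array.

```python
import pandas as pd

same_type = pd.DataFrame({"A": [1, 2, 3], "C": [4, 5, 6]})
mixed = pd.DataFrame({"A": [1, 2, 3], "B": [2.2, 3.4, 1.3]})

# Two integer columns fit in one integer array as they are.
print(same_type.to_numpy().dtype.kind)   # i

# An integer column and a float column only fit in one array
# after the integers are converted to float.
print(mixed.to_numpy().dtype)            # float64
```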

Well, at this point we know two things:

When we index a DataFrame we don't know a priori whether Pandas returns a view or a copy.
If we chain multiple indexing operations using df[...][...][...], they are performed independently, one after the other.

What does this have to do with the damn warning?

Well, if one of the indexing operations returns a view and not a copy, the result can be unpredictable, since the changes will also be reflected in the data set that generated that view. We may or may not want this side effect, but given its unpredictable and inconsistent nature, the warning is generated to alert us to it. It may seem innocent, but inadvertently modifying the original data can cause one of the worst mistakes in this world: wrong results that go completely unnoticed, with no exceptions and no obvious outliers...

In your case, if you wanted to parse the column in the DataFrame itself, there is no problem; but when this happens unintentionally it can be a disaster, with unpredictable and inconsistent results.

The solution is to learn to always detect and avoid chained assignment. The same operation can be done using loc in the following way:

import pandas as pd


data = {"A": [1, 2, 3], "B": [2.2, 3.4, 1.3]}


def sumar2(data, col_name):
    for i in range(data[col_name].size):
        # Row label and column name go in the same .loc call
        data.loc[i, col_name] += 2


df = pd.DataFrame(data)

>>> sumar2(df, "A")
>>> df
   A    B
0  3  2.2
1  4  3.4
2  5  1.3

No sign of the warning. Why? Because with loc[i, col_name] the indexing and the assignment are done in one step, in a single call to __setitem__. This avoids the problem of not knowing whether the first indexing operation returns a view or a copy. The code is therefore predictable: it modifies the value of the cell in place.
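The same one-step idea also covers filtered updates without a Python loop. A minimal sketch, reusing the example data; the condition `df["A"] > 1` is chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2.2, 3.4, 1.3]})

# The row selection (the mask) and the column selection ("A") go into one
# .loc call, so the assignment is a single __setitem__ on df itself.
df.loc[df["A"] > 1, "A"] += 2

print(df["A"].tolist())   # [1, 4, 5]
```

Column B is left untouched, and the rows that do not match the mask keep their values.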

If we can't or don't want to do the above, we can create a copy explicitly:

def sumar2(col):
    new_col = col.copy()
    for i in range(new_col.size):
        new_col[i] += 2
    return new_col


df = pd.DataFrame(data)
df["A"] = sumar2(df["A"])

In both cases we state explicitly whether or not the original data is going to be modified, so we know exactly and deterministically what is going to happen. If we mess up, we mess up because we are wrong, not because NumPy/Pandas decides for itself, based on confusing rules, whether to return a view or a copy in a given case depending on data type, available memory or mood... :).

We must be clear about the concept of chained indexing. It is not limited to df[...][...][...]: we can also cause it with loc, iloc or other operations that involve data selection if they are applied in a chained way:

import pandas as pd


data = {"A": [1, 2, 3], "B": [2.2, 3.4, 1.3]}


def sumar2(columna):
    columna.loc[0:] += 1


df = pd.DataFrame(data)
sumar2(df.loc[df.A>1])

The examples are a bit forced but I hope they serve to understand the concept.

In my opinion the warning should never be ignored, even if we fully understand its cause, and even if we don't give a damn whether the original DataFrame is modified or whether this is inconsistent between executions. As the Zen of Python says, "Explicit is better than implicit": since there are ways to force either in-place modification or copying, it is better to be explicit from the beginning than to ignore the warning.

That said, if you want to convert the column in the original DataFrame directly you can just do:

>>> import pandas as pd


>>> data = {"2017": ("1,172.90 €", "53,963.87 €")}

>>> df = pd.DataFrame(data)
>>> df["2017"] = df["2017"].replace(r'[€,\s]', '', regex=True).astype(float)

>>> df
       2017
0   1172.90
1  53963.87
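If the original column should stay untouched, the cleaning can be wrapped in a function that returns a new Series. This is only a sketch: `parse_currency` is a hypothetical helper name, and the "n/a" row is invented to show how unparseable values come out as NaN.

```python
import pandas as pd


def parse_currency(col):
    """Hypothetical helper: return a NEW float Series, leaving `col` untouched."""
    # Drop everything except digits, the decimal point and a minus sign.
    cleaned = col.str.replace(r"[^\d.\-]", "", regex=True)
    # Values that still cannot be parsed become NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce")


df = pd.DataFrame({"2017": ["1,172.90 €", "53,963.87 €", "n/a"]})
parsed = parse_currency(df["2017"])

print(parsed.iloc[0], parsed.iloc[1])   # 1172.9 53963.87
print(df["2017"].iloc[0])               # 1,172.90 €  (original is unchanged)
```

Assigning the result back with df["2017"] = parse_currency(df["2017"]) makes the overwrite an explicit choice.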

gmarsi · Answer 2 · 2020-08-12T12:11:56+08:00


Finally, I was able to fix the error.

Inside the function I now build a new DataFrame that holds the 2017 column, instead of writing into the original:

def parseFloat(col):
    length = len(col)
    newCol = {}
    newCol['2017'] = []
    for i in range(length):
        iElement = col[i]
        newElement = ''
        for j in range(len(col[i])) :
            if iElement[j].isdigit():
                newElement += str(iElement[j])
            elif iElement[j] in ['.']:
                newElement += '.'            
        newCol['2017'].append(newElement)
    return pd.DataFrame(newCol)

parsedResult = parseFloat(parse)
print(parsedResult)

Even so, if someone could tell me what caused the warning, I would appreciate it, to avoid future complications.
