I have the following dataframe, tabla:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 75 28 74 76 38 82 28 90 51 54 64 61 48 70 59 52 84 45 60 30
1 49 54 87 18 66 63 56 23 59 100 69 59 24 33 61 15 29 33 75 74
2 45 66 41 51 87 49 63 41 51 35 95 53 40 22 29 76 34 19 48 75
3 60 69 73 75 60 56 52 74 88 38 66 30 49 56 45 59 79 61 67 27
I prepare it for the calculations as follows:
import pandas as pd
import numpy as np

# Sort the data
datos = np.sort(tabla, axis=None)
lista_valores = tabla.values.tolist()
lista_ordenada = np.sort(tabla, axis=None)
# Create an ndarray with the sorted values
array_datos_ordenados = np.array(lista_ordenada.reshape(4, 20))
# Build the "valores" dataframe
valores = pd.DataFrame(lista_ordenada)
valores.columns = ["Valores"]
valores_ordenados = pd.DataFrame(array_datos_ordenados)
Using these tables and lists, I calculate various descriptive statistics with the equivalent functions from pandas, NumPy and Python's statistics module, and I notice differences in some of the results. For example, the standard deviation:
# With pandas: calculation on a column
valores["Valores"].std()
Returns 19.925494133288044
# With pandas: calculation on a series
lista_ordenada.std()
Returns 19.80056817366613
# Standard deviation with NumPy
import numpy as np
np.std(lista_ordenada)
Returns 19.80056817366613
And the variance:
# With pandas: calculation on a column
valores["Valores"].var()
Returns 397.0253164556962
import statistics as st
st.variance(lista_ordenada)
Returns 397
import numpy as np
np.var(lista_ordenada)
Returns 392.0625
What could be the reason for these differences, and which of these options should we consider more accurate?
First, let's clarify that valores["Valores"] is a Series (a column of a DataFrame is still a Series). As for lista_ordenada, it is a NumPy array, so np.std(lista_ordenada) is the same as lista_ordenada.std().
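For instance, a quick check (assuming the valores and lista_ordenada variables defined in the question):

import numpy as np
print(type(valores["Valores"]))   # <class 'pandas.core.series.Series'>
print(type(lista_ordenada))       # <class 'numpy.ndarray'>
# Both expressions end up in the same NumPy routine, so they match exactly
print(np.std(lista_ordenada) == lista_ordenada.std())   # True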
There is no discrepancy in accuracy between pandas and NumPy. The difference is that pandas by default uses 1 delta degree of freedom when calculating the variance, taking n - 1 as the denominator, while NumPy by default uses 0 degrees of freedom, taking n directly (where n is the sample size). This is controlled by the ddof argument, and the n - 1 denominator is what is known as Bessel's correction. That is, the variance NumPy calculates by default is

    var(x) = sum((x_i - mean(x))**2) / n

while pandas by default uses

    var(x) = sum((x_i - mean(x))**2) / (n - 1)
This matters when you want to estimate the standard deviation of a population from a sample of it: NumPy by default gives a biased estimate of the population variance (without Bessel's correction), while pandas gives an unbiased estimate of the variance of a hypothetical infinite population (with Bessel's correction).
If you want to obtain the same result as in NumPy, simply adjust the ddof parameter, which is accepted by pandas.Series.var()/pandas.Series.std() as well as by numpy.var()/numpy.std(), depending on whether or not you want to correct the estimate (ddof=1 -> Bessel's correction, unbiased estimate; ddof=0 -> no correction, biased estimate):
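A minimal sketch, reusing the valores DataFrame and the lista_ordenada array from the question:

import numpy as np
# pandas with ddof=0 reproduces NumPy's default (biased) result
valores["Valores"].std(ddof=0)    # 19.80056817366613
valores["Valores"].var(ddof=0)    # 392.0625
# NumPy with ddof=1 reproduces pandas' default (unbiased) result
np.std(lista_ordenada, ddof=1)    # 19.925494133288044
np.var(lista_ordenada, ddof=1)    # 397.0253164556962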
The standard library statistics module, like pandas, applies Bessel's correction in both statistics.variance (sample variance) and statistics.stdev (sample standard deviation). In both cases there is no direct way to change this, as there is in NumPy/pandas. On the other hand, statistics also provides the functions statistics.pstdev (population standard deviation) and statistics.pvariance (population variance), which obviously use N and not n - 1 as their sample counterparts do.
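A quick illustration with the question's data converted to plain Python ints (to sidestep the np.int64 truncation discussed in the edit below):

import statistics as st
datos_python = [int(x) for x in lista_ordenada]   # plain Python ints
st.variance(datos_python)    # ~397.03, n - 1 denominator, like pandas' default
st.stdev(datos_python)       # ~19.93
st.pvariance(datos_python)   # ~392.06, N denominator, like NumPy's default
st.pstdev(datos_python)      # ~19.80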
Edit
There is an apparent discrepancy between NumPy/pandas (with ddof=1) and statistics.variance/statistics.stdev. Doing some research, I found the cause: it is not a precision issue, just a side effect of the internal function responsible for converting the data type returned by the different statistics functions. Since the original data is of type np.int64, in return T(value) (where T is the type) the result is cast to np.int64, which truncates it by dropping the decimals. T is not changed in the conditional because numpy.int64 is not a subclass of int (the standard Python type), so inside the try block the line return T(value) ends up being np.int64(value). statistics is not really meant to be used with NumPy types but with the standard Python types (int, float, Decimal and Fraction), so we may run into this sort of thing, but we can solve the problem in several ways. The first is to cast to int or float beforehand:
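A possible sketch of that first fix, reusing lista_ordenada from the question:

import statistics as st
# Convert each np.int64 element to a standard Python type before calling statistics
st.variance([float(x) for x in lista_ordenada])   # ~397.0253164556962
# .tolist() also yields plain Python ints, so this works as well
st.variance(lista_ordenada.tolist())              # ~397.0253164556962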
The second is to make the original array use a float dtype (numpy.float64) from the start:
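For example, something along these lines (assuming tabla is the original dataframe):

import numpy as np
import statistics as st
# Build the sorted array with a float dtype instead of np.int64
lista_ordenada_float = np.sort(tabla, axis=None).astype(float)
st.variance(lista_ordenada_float)   # ~397.0253164556962, no truncation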
However, if we are already using NumPy/pandas, their own functions should be used, since they allow the operations to be vectorized.
You say "# With pandas: calculation on a series" about lista_ordenada.std(), but that calculation does not use pandas, since lista_ordenada was created with numpy. That calculation therefore uses numpy, which is why it gives the same result as np.std(lista_ordenada). Why doesn't it give the same result as valores["Valores"].std()? Under the hood pandas does use numpy for its calculations; the point is that numpy.std() accepts an optional parameter called ddof, the correction applied to the denominator, which by default is 0, but when pandas invokes this function it passes 1 in that parameter. If we do the same, we get the same result:
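For example, with the array from the question:

np.std(lista_ordenada, ddof=1)   # 19.925494133288044, same as valores["Valores"].std()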
The same goes for the variance:
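Again reusing lista_ordenada:

np.var(lista_ordenada, ddof=1)   # 397.0253164556962, same as valores["Valores"].var()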
This correction should be applied (i.e. pass ddof=1, as pandas does) especially when the amount of data is small.

I was left wondering what the ddof parameter was for, and found this explanation, which I think is worth adding. To calculate, for example, the variance and standard deviation of a data set, we use the x.std() and x.var() methods. In the variance and standard deviation formulas, the data set is assumed by default to be the entire population. To change this behavior we can use the ddof (delta degrees of freedom) argument: the denominator in the variance formula is the number of elements in the array minus ddof, so to compute the unbiased estimate of the variance and standard deviation of a sample we need to set ddof = 1.
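As a small self-contained check of that denominator relationship (made-up data, not the question's table):

import numpy as np
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
desv = x - x.mean()
# ddof=0: denominator is n, the population (biased) estimate
print(np.var(x, ddof=0), (desv ** 2).sum() / len(x))          # 4.0 4.0
# ddof=1: denominator is n - 1, the sample (unbiased) estimate
print(np.var(x, ddof=1), (desv ** 2).sum() / (len(x) - 1))    # 4.571428571428571 4.571428571428571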