I have the following dataframe, tabla:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 75 28 74 76 38 82 28 90 51 54 64 61 48 70 59 52 84 45 60 30
1 49 54 87 18 66 63 56 23 59 100 69 59 24 33 61 15 29 33 75 74
2 45 66 41 51 87 49 63 41 51 35 95 53 40 22 29 76 34 19 48 75
3 60 69 73 75 60 56 52 74 88 38 66 30 49 56 45 59 79 61 67 27
I prepare it for the calculations as follows:
import pandas as pd
import numpy as np

# Sort the data
datos = np.sort(tabla, axis=None)
lista_valores = tabla.values.tolist()
lista_ordenada = np.sort(tabla, axis=None)
# Create an ndarray with the sorted values
array_datos_ordenados = np.array(lista_ordenada.reshape(4, 20))
# Build the "valores" dataframe
valores = pd.DataFrame(lista_ordenada)
valores.columns = ["Valores"]
valores_ordenados = pd.DataFrame(array_datos_ordenados)
Using these tables and lists, I calculate various descriptive statistics with the equivalent functions from pandas, NumPy and Python's statistics module, and I notice differences in some of the results. For example, the standard deviation:
# With pandas: calculation on a column
valores["Valores"].std()
Returns 19.925494133288044
# With pandas: calculation on a series
lista_ordenada.std()
Returns 19.80056817366613
# Standard deviation with NumPy
import numpy as np
np.std(lista_ordenada)
Returns 19.80056817366613
And the variance:
# With pandas: calculation on a column
valores["Valores"].var()
Returns 397.0253164556962
import statistics as st
st.variance(lista_ordenada)
Returns 397
import numpy as np
np.var(lista_ordenada)
Returns 392.0625
What could be the reason for these differences, and which of these options should we consider more accurate?
First, let's clarify that valores["Valores"] is a Series (a column of a DataFrame is still a Series). As for lista_ordenada, it is a NumPy array, so np.std(lista_ordenada) is the same as lista_ordenada.std().
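For instance, a quick check (assuming the valores and lista_ordenada variables defined in the question):

import numpy as np
print(type(valores["Valores"]))   # <class 'pandas.core.series.Series'>
print(type(lista_ordenada))       # <class 'numpy.ndarray'>
# Both expressions end up in the same NumPy routine, so they match exactly
print(np.std(lista_ordenada) == lista_ordenada.std())   # True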
There is no discrepancy in accuracy between pandas and NumPy. The difference is that pandas by default uses 1 delta degree of freedom when calculating the variance, taking n - 1 as the denominator, while NumPy by default uses 0 degrees of freedom, taking n directly (where n is the sample size). This is controlled by the ddof argument, and the n - 1 denominator is what is known as Bessel's correction. That is, the variance NumPy calculates by default is

    var(x) = sum((x_i - mean(x))**2) / n

while pandas by default uses

    var(x) = sum((x_i - mean(x))**2) / (n - 1)
This matters when you want to estimate the standard deviation of a population from a sample of it: NumPy by default gives a biased estimate of the population variance (without Bessel's correction), while pandas gives an unbiased estimate of the variance of a hypothetical infinite population (with Bessel's correction).
If you want to obtain the same result as in NumPy, simply adjust the ddof parameter, which is accepted by pandas.Series.var()/pandas.Series.std() as well as by numpy.var()/numpy.std(), depending on whether or not you want to correct the estimate (ddof=1 -> Bessel's correction, unbiased estimate; ddof=0 -> no correction, biased estimate):
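A minimal sketch, reusing the valores DataFrame and the lista_ordenada array from the question:

import numpy as np
# pandas with ddof=0 reproduces NumPy's default (biased) result
valores["Valores"].std(ddof=0)    # 19.80056817366613
valores["Valores"].var(ddof=0)    # 392.0625
# NumPy with ddof=1 reproduces pandas' default (unbiased) result
np.std(lista_ordenada, ddof=1)    # 19.925494133288044
np.var(lista_ordenada, ddof=1)    # 397.0253164556962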
The standard library statistics module, like pandas, applies Bessel's correction in both statistics.variance (sample variance) and statistics.stdev (sample standard deviation). In both cases there is no direct way to change this, as there is in NumPy/pandas. On the other hand, statistics also provides the functions statistics.pstdev (population standard deviation) and statistics.pvariance (population variance), which obviously use N and not n - 1 as their sample counterparts do.
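A quick illustration with the question's data converted to plain Python ints (to sidestep the np.int64 truncation discussed in the edit below):

import statistics as st
datos_python = [int(x) for x in lista_ordenada]   # plain Python ints
st.variance(datos_python)    # ~397.03, n - 1 denominator, like pandas' default
st.stdev(datos_python)       # ~19.93
st.pvariance(datos_python)   # ~392.06, N denominator, like NumPy's default
st.pstdev(datos_python)      # ~19.80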
Edit
There is an apparent discrepancy between NumPy/pandas (with ddof=1) and statistics.variance/statistics.stdev. Doing some research, I found the cause: it is not a precision issue, just a side effect of the internal function responsible for converting the data type returned by the different statistics functions. Since the original data is of type np.int64, in return T(value) (where T is the type) the result is cast to np.int64, which truncates it by dropping the decimals. T is not changed in the conditional because numpy.int64 is not a subclass of int (the standard Python type), so inside the try block the line return T(value) ends up being np.int64(value). statistics is not really meant to be used with NumPy types but with the standard Python types (int, float, Decimal and Fraction), so we may run into this sort of thing, but we can solve the problem in several ways. The first is to cast to int or float beforehand:
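A possible sketch of that first fix, reusing lista_ordenada from the question:

import statistics as st
# Convert each np.int64 element to a standard Python type before calling statistics
st.variance([float(x) for x in lista_ordenada])   # ~397.0253164556962
# .tolist() also yields plain Python ints, so this works as well
st.variance(lista_ordenada.tolist())              # ~397.0253164556962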
The second is to make the original array use a float dtype (numpy.float64) from the start:
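For example, something along these lines (assuming tabla is the original dataframe):

import numpy as np
import statistics as st
# Build the sorted array with a float dtype instead of np.int64
lista_ordenada_float = np.sort(tabla, axis=None).astype(float)
st.variance(lista_ordenada_float)   # ~397.0253164556962, no truncation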
However, if we are already using NumPy/pandas, their own functions should be used, since they allow the operations to be vectorized.
You say "# With pandas: calculation on a series" about lista_ordenada.std(), but that calculation does not use pandas, since lista_ordenada was created with numpy. That calculation therefore uses numpy, which is why it gives the same result as np.std(lista_ordenada). Why doesn't it give the same result as valores["Valores"].std()? Under the hood pandas does use numpy for its calculations; the point is that numpy.std() accepts an optional parameter called ddof, the correction applied to the denominator, which by default is 0, but when pandas invokes this function it passes 1 in that parameter. If we do the same, we get the same result:
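For example, with the array from the question:

np.std(lista_ordenada, ddof=1)   # 19.925494133288044, same as valores["Valores"].std()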
The same goes for the variance:
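Again reusing lista_ordenada:

np.var(lista_ordenada, ddof=1)   # 397.0253164556962, same as valores["Valores"].var()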
This correction should be applied (i.e. pass ddof=1, as pandas does) especially when the amount of data is small.

I was left wondering what the ddof parameter was for, and found this explanation, which I think is worth adding. To calculate, for example, the variance and standard deviation of a data set, we use the x.std() and x.var() methods. In the variance and standard deviation formulas, the data set is assumed by default to be the entire population. To change this behavior we can use the ddof (delta degrees of freedom) argument: the denominator in the variance formula is the number of elements in the array minus ddof, so to compute the unbiased estimate of the variance and standard deviation of a sample we need to set ddof = 1.
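As a small self-contained check of that denominator relationship (made-up data, not the question's table):

import numpy as np
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
desv = x - x.mean()
# ddof=0: denominator is n, the population (biased) estimate
print(np.var(x, ddof=0), (desv ** 2).sum() / len(x))          # 4.0 4.0
# ddof=1: denominator is n - 1, the sample (unbiased) estimate
print(np.var(x, ddof=1), (desv ** 2).sum() / (len(x) - 1))    # 4.571428571428571 4.571428571428571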