Comparing two ways of computing variance in Python
The default variance functions in the "NumPy" and "statistics" packages use different formulas
In Python, there are two common functions for calculating the variance of a dataset.
The “NumPy” package has the function var().
The “statistics” module, which is part of Python's standard library, has the function variance().
However, their default settings produce two different outputs. Why does this happen?
It turns out that they are calculating two different variances.
By default, numpy.var() calculates the population variance, which divides the sum of the squared deviations from the mean by "n", the number of observations.
statistics.variance() calculates the sample variance, which divides that same sum by "n-1".
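Written out as formulas, the two definitions differ only in the denominator. The notation here is my own assumption rather than anything taken from either package's documentation: x_i are the observations, \bar{x} is their mean, \sigma^2 denotes the population variance and s^2 the sample variance.

```latex
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
\qquad
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
```

Because the numerators are identical, s^2 is always \sigma^2 multiplied by n/(n-1), so the gap between the two shrinks as n grows.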
Here is an example to illustrate this contrast. (Make sure that you install NumPy first; the “statistics” module ships with Python, so it needs no installation. If you want to skip that step and jump straight into coding, you can use Google Colab.)
import numpy
import statistics

# Sample data: the integers 0 through 9
x = range(10)
x_mean = numpy.mean(x)

# Squared deviation of each value from the mean
# (convert to a NumPy array explicitly so the subtraction is element-wise)
x_squared_diff = (numpy.asarray(x) - x_mean)**2

# Population variance divides by n; sample variance divides by n - 1
x_population_variance = sum(x_squared_diff) / len(x)
x_sample_variance = sum(x_squared_diff) / (len(x) - 1)

print('The calculated population variance is', x_population_variance)
print('The calculated sample variance is', x_sample_variance)
print("By default, the variance function from the NumPy package computes", numpy.var(x))
print("The variance function from the statistics package computes", statistics.variance(x))
If you run this in Python, you will get the following output.
The calculated population variance is 8.25
The calculated sample variance is 9.166666666666666
By default, the variance function from the NumPy package computes 8.25
The variance function from the statistics package computes 9.166666666666666
As you can see, numpy.var() calculated the population variance, while statistics.variance() calculated the sample variance.
If you want to use the “NumPy” package to calculate the sample variance, you need to run numpy.var(x, ddof=1).
If you want to use the “statistics” package to calculate the population variance, you need to use an entirely different function: statistics.pvariance(x).
I encourage you to run these two functions yourself to verify their outputs.
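Here is a minimal sketch of that check, assuming the same data as in the example above. The variable names are my own choices for illustration; only numpy.var(), its ddof argument, statistics.variance() and statistics.pvariance() come from the packages themselves.

```python
import math
import statistics

import numpy

# Same sample data as before: the integers 0 through 9
x = list(range(10))

# ddof=1 makes NumPy divide by n - 1 instead of n (sample variance)
numpy_sample_variance = numpy.var(x, ddof=1)

# pvariance() divides by n (population variance)
statistics_population_variance = statistics.pvariance(x)

print("numpy.var(x, ddof=1) computes", numpy_sample_variance)
print("statistics.pvariance(x) computes", statistics_population_variance)

# Each result should agree with the other package's default behaviour
print("Matches statistics.variance(x):",
      math.isclose(numpy_sample_variance, statistics.variance(x)))
print("Matches numpy.var(x):",
      math.isclose(statistics_population_variance, numpy.var(x)))
```

The comparisons use math.isclose() rather than == because the two packages may compute the result in a different order of floating-point operations, so the last digit is not guaranteed to agree.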
There is an important lesson here: In data & analytics, do NOT use Python functions blindly. Make sure that you understand what they are doing and choose the appropriate one for a given problem.