Verifying the Central Limit Theorem¶

The Central Limit Theorem (CLT) states that the sampling distribution of the sample mean approaches a normal distribution as the sample size grows, regardless of the shape of the population distribution, provided the population has finite variance. A sample size of 30 or more is a common rule of thumb for the approximation to be good. In practice: draw many samples, compute the mean of each, and plot those means. The larger each sample, the more the plot resembles a normal curve centered on the population mean.
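To make the role of sample size concrete, here is a minimal sketch that is not part of the original notebook. It assumes an exponential population (strongly right-skewed) as a stand-in for "any shape", and measures how the skewness of the sample means shrinks toward 0 as the sample size grows. The names `skewness` and `skew_by_size`, the seed, and the sizes are illustrative choices.

```python
import numpy as np

# Assumption: an exponential population (strongly right-skewed) stands in
# for "any shape"; the seed and sizes are arbitrary illustrative choices.
rng = np.random.default_rng(0)
population = rng.exponential(scale=1.0, size=100_000)

def skewness(values):
    """Sample skewness: close to 0 for a symmetric, bell-shaped distribution."""
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

skew_by_size = {}
for sample_size in (2, 30, 200):
    # 5000 samples of the given size, drawn with replacement
    samples = rng.choice(population, size=(5000, sample_size))
    sample_means = samples.mean(axis=1)
    skew_by_size[sample_size] = skewness(sample_means)
    print(sample_size, round(skew_by_size[sample_size], 2))
```

The printed skewness should fall as the sample size increases, which is the "looks more like a normal distribution" effect expressed as a number.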

In [11]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Generate 1k random integers¶

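Let us use NumPy to generate 1000 random integers in the range 0 to 99 (the upper bound passed to np.random.randint is exclusive). This array is our population. Our objective is to calculate the population mean and check whether the mean obtained from repeated sampling, as described by the CLT, comes close to it.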

In [2]:

rand_1k = np.random.randint(0,100,1000)
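The call above is unseeded, so every run produces a different population and different numbers downstream. A possible reproducible variant, assuming NumPy's newer Generator API is available, is sketched below. The name `rand_1k_seeded` and the seed 42 are hypothetical choices, not from the notebook.

```python
import numpy as np

# Assumption: seed 42 is arbitrary; any fixed seed gives repeatable draws.
rng = np.random.default_rng(42)

# Like np.random.randint, Generator.integers excludes the upper bound by
# default, so the values fall in 0..99.
rand_1k_seeded = rng.integers(0, 100, size=1000)

print(rand_1k_seeded.size, rand_1k_seeded.min(), rand_1k_seeded.max())
```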

In [3]:

rand_1k.size

Out[3]:

1000

In [12]:

sns.distplot(rand_1k)

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x1a19f2c048>

[Figure: histogram with density curve of rand_1k]

The plot shows that the population follows a roughly uniform distribution, not a normal distribution. Even so, we will see that the distribution of the sample means is approximately normal.
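As a reference point, the mean and standard deviation can be worked out by hand if we assume an ideal discrete uniform distribution over the 100 integers 0 to 99. This is an idealization: the 1000 values actually drawn will deviate from it slightly.

```latex
\mu = \frac{0 + 99}{2} = 49.5, \qquad
\sigma^2 = \frac{100^2 - 1}{12} = 833.25, \qquad
\sigma = \sqrt{833.25} \approx 28.87
```

Under that assumption, the mean of the 1000 generated integers should land near 49.5.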

Calculate population mean¶

In [4]:

np.mean(rand_1k)

Out[4]:

48.826

Try creating a subset and finding its mean¶

In [5]:

subset_100 = np.random.choice(rand_1k, size=100, replace=False)
subset_100.size

Out[5]:

100

In [6]:

np.mean(subset_100)

Out[6]:

43.2

The mean of this subset of 100 integers is 43.2, about 5.6 below the population mean of 48.826. A single sample mean can be this far off.

Apply the CLT¶

We will generate 50 samples with 100 items each and find their means.

In [7]:

# generate 50 random samples of size 100 each
subset_means = []
for i in range(0,50):
    current_subset = np.random.choice(rand_1k, size=100, replace=False)
    subset_means.append(np.mean(current_subset))
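The loop above can also be wrapped in a small reusable helper, which makes it easy to vary the number of samples and the sample size. The sketch below is one possible form. The function name `sample_means`, its parameters, and the seeded stand-in population are hypothetical, not part of the original notebook.

```python
import numpy as np

def sample_means(population, n_samples, sample_size, rng):
    """Return the mean of each of n_samples subsets drawn without replacement."""
    return np.array([
        rng.choice(population, size=sample_size, replace=False).mean()
        for _ in range(n_samples)
    ])

# Assumption: a stand-in population generated with an arbitrary seed.
rng = np.random.default_rng(7)
population = rng.integers(0, 100, size=1000)

means = sample_means(population, n_samples=50, sample_size=100, rng=rng)
print(means.shape, means.mean())
```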

Calculate the mean of means (it's meta :))

In [33]:

clt_mean = np.mean(subset_means)
clt_mean

Out[33]:

48.9768

Calculate the SD of the means

In [34]:

subset_sd = np.std(subset_means)
subset_sd

Out[34]:

2.657234983963594
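The SD of the sample means is usually called the standard error. For samples of size n drawn without replacement from a finite population of size N with standard deviation σ, theory predicts a standard error of σ/√n multiplied by the finite population correction √((N − n)/(N − 1)). The sketch below compares that prediction with an empirical value. It uses a seeded stand-in population rather than the notebook's rand_1k, and 2000 samples instead of 50 to reduce noise. All names in it are illustrative.

```python
import numpy as np

# Assumption: stand-in population with an arbitrary seed, not the notebook's rand_1k.
rng = np.random.default_rng(1)
population = rng.integers(0, 100, size=1000)

N, n = population.size, 100
sigma = population.std()  # population SD (ddof=0)

# Standard error predicted for samples drawn without replacement
predicted_se = sigma / np.sqrt(n) * np.sqrt((N - n) / (N - 1))

# Empirical SD of many sample means (2000 instead of 50 to reduce noise)
means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(2000)
])
observed_se = means.std()

print(round(predicted_se, 3), round(observed_se, 3))
```

For an ideal uniform population on 0 to 99 (σ ≈ 28.87), the predicted value is roughly 2.74, which is in line with the 2.657 measured above from only 50 samples.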

In [37]:

ax = sns.distplot(subset_means, bins=10)
# draw mean in black
ax.axvline(clt_mean, color='black', linestyle='dashed')

# draw mean +- 1 SD
ax.axvline(clt_mean + subset_sd, color='red', linestyle='dotted')
ax.axvline(clt_mean - subset_sd, color='red', linestyle='dotted')

Out[37]:

<matplotlib.lines.Line2D at 0x1a1ac5f908>
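The dotted lines mark one SD on either side of the mean. If the sample means are approximately normal, about 68% of them should fall between those lines and about 95% within two SDs. A rough numeric check is sketched below, assuming a seeded stand-in population and 2000 samples (with only 50 samples the fractions are noisy). The names `within_1sd` and `within_2sd` are illustrative.

```python
import numpy as np

# Assumption: stand-in population and seed; 2000 samples rather than 50 so the
# fractions below are stable.
rng = np.random.default_rng(3)
population = rng.integers(0, 100, size=1000)
means = np.array([
    rng.choice(population, size=100, replace=False).mean()
    for _ in range(2000)
])

center, spread = means.mean(), means.std()

# Fraction of sample means within 1 and 2 SDs of the mean of means
within_1sd = np.mean(np.abs(means - center) <= spread)
within_2sd = np.mean(np.abs(means - center) <= 2 * spread)
print(round(within_1sd, 3), round(within_2sd, 3))
```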

Difference between mean of means and the population mean

In [38]:

np.mean(rand_1k) - clt_mean

Out[38]:

-0.15079999999999671
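To judge whether a gap of about 0.15 is small, it can be compared with the expected spread of the mean of means. As a rough worked estimate, assuming the 50 samples are independent and using the SD of the sample means measured earlier (2.657):

```latex
\mathrm{SE}(\text{mean of means}) \approx \frac{2.657}{\sqrt{50}} \approx 0.38,
\qquad |{-0.15}| < 0.38
```

The observed difference is well within one standard error, so the mean of means agrees with the population mean to within sampling noise.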