# Estimation with Small Samples

Here’s another installment in *Data Q&A: Answering the real questions with Python*. Previous installments are available from the *Data Q&A* landing page.

# Estimation with Small Samples¶

Here’s a question from the Reddit statistics forum.

Hey, so imagine I only have 6 samples from a value that has a normal distribution. Can I estimate the range of likely distributions from those 6?

Let’s be more specific. I’m considering the accuracy of a blood testing device. I took 6 samples of my blood at the same time from the same vein and gave them to the machine. The results are not all the same (as expected), indicating the device’s inherent level of imprecision.

So, I’m wondering if there’s a way to estimate the range of possibilities of what I would see if I could give 100 or 1000 samples?

I’m comfortable assuming a normal distribution around the “true” value. Is there any stats method to guesstimate the range of likely values for sigma? Or would I just need to drain my blood dry to get 1000 samples to figure that out?

Fyi, not a statistician.

Because the sample size is so small, this question cries out for a Bayesian approach. Why?

Bayesian methods do a good job of taking advantage of background information and extracting as much information as possible from the data,

With a small sample size, the uncertainty of the result is large, so it is important to quantify it, and

The motivating question is explicitly about making probabilistic predictions, which is what Bayesian methods do and classical methods don’t.

As an example, OP suggested looking at blood potassium (K+) levels, with the following data:

4.0, 3.9, 4.0, 4.0, 4.7, 4.2 (mmol/L)

OP also explained that the blood samples were collected “continuously from a single puncture … within seconds, not minutes,” so we don’t expect the true level to change much between samples. In that case, the variation in the measurements would largely reflect the precision of the testing device.

With that assumption, let’s see what we can do.

I’ll download a utilities module with some of my frequently-used functions, and then import the usual libraries.

```
from os.path import basename, exists
def download(url):
filename = basename(url)
if not exists(filename):
from urllib.request import urlretrieve
local, _ = urlretrieve(url, filename)
print("Downloaded " + str(local))
return filename
download('https://github.com/AllenDowney/DataQnA/raw/main/nb/utils.py')
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from utils import decorate
```

```
# install the empiricaldist library, if necessary
try:
import empiricaldist
except ImportError:
!pip install empiricaldist
```

## Bayesian estimation¶

Here’s the data.

```
data = [4.0, 3.9, 4.0, 4.0, 4.7, 4.2]
np.mean(data), np.std(data)
```

(4.133333333333334, 0.2687419249432851)

Now we need prior distributions for `mu`

and `sigma`

.
For the prior distribution of `mu`

we can use the distribution in the general population.
The “normal range” for potassium — which is usually the 5th and 95th percentiles of the population — is 3.5 to 5.4 mmol/L.
So we can construct a normal distribution with these percentiles.

```
from scipy.stats import norm
hyper_mu = (3.5 + 5.4) / 2
hyper_sigma = 0.6
dist = norm(hyper_mu, hyper_sigma)
dist.ppf([0.05, 0.95])
```

array([3.46308782, 5.43691218])

Now I’ll make a `Pmf`

object that represents this distribution.

```
mus = np.linspace(2, 7, 51)
ps = dist.pdf(mus)
```

```
from empiricaldist import Pmf
prior_mu = Pmf(ps, mus)
prior_mu.index.name = 'mu'
prior_mu.normalize()
```

9.99977690541445

```
prior_mu.plot(label='prior')
decorate(ylabel='PDF')
```

For the prior distribution of `sigma`

, we could use some background information about the device.
For example, based on similar devices, what values of `sigma`

would be expected?
I’ll use a gamma distribution to construct a prior PMF, but it could be anything.

```
from scipy.stats import gamma
alpha = 2
beta = 0.5
sigmas = np.linspace(0.01, 5, 101)
ps = gamma(alpha, scale=beta).pdf(sigmas)
```

```
prior_sigma = Pmf(ps, sigmas)
prior_sigma.index.name = 'sigma'
prior_sigma.normalize()
```

20.03019853279729

```
prior_sigma.plot(label='prior')
decorate(ylabel='PDF')
```

This distribution represents the background knowledge that we expect the standard deviation to be less than 2, and we’d be surprised by anything much higher than that.

I’ll use this function to make a Pandas `DataFrame`

to represent the joint prior.

```
def make_joint(s1, s2):
"""Compute the outer product of two Series.
First Series goes across the columns;
second goes down the rows.
s1: Series
s2: Series
return: DataFrame
"""
X, Y = np.meshgrid(s1, s2, indexing='ij')
return pd.DataFrame(X*Y, index=s1.index, columns=s2.index)
```

```
prior = make_joint(prior_mu, prior_sigma)
prior.shape
```

(51, 101)

I’ll use the following function to make a contour plot of the prior.

```
def plot_contour(joint):
"""Plot a joint distribution.
joint: DataFrame representing a joint PMF
"""
low = joint.to_numpy().min()
high = joint.to_numpy().max()
levels = np.linspace(low, high, 6)
levels = levels[1:]
cs = plt.contour(joint.columns, joint.index, joint, levels=levels, linewidths=1)
decorate(xlabel=joint.columns.name,
ylabel=joint.index.name)
return cs
```

```
plot_contour(prior)
decorate()
```

## The update¶

To use the data to update the prior, we have to compute the likelihood of the data for each possible pair of `mu`

and `sigma`

.
Can can do that by creating a 3-D mesh with the possible values of `mu`

and `sigma`

, and the observed values of the data.

```
MU, SIGMA, DATA = np.meshgrid(prior_mu.index, prior_sigma.index, data,
indexing='ij')
MU.shape
```

(51, 101, 6)

Now we can evaluate the normal distribution for each data point and each pair of `mu`

and `sigma`

.

```
densities = norm(MU, SIGMA).pdf(DATA)
densities.shape
```

(51, 101, 6)

The likelihood of each pair is the product of the densities for the data points.

```
likelihood = densities.prod(axis=2)
likelihood.shape
```

(51, 101)

The unnormalized posterior is the product of the prior and the likelihood.

```
posterior = prior * likelihood
```

We can normalize it like this.

```
def normalize(joint):
"""Normalize a joint distribution.
joint: DataFrame
"""
prob_data = joint.to_numpy().sum()
joint /= prob_data
return prob_data
```

```
normalize(posterior)
posterior.shape
```

(51, 101)

And here’s what the result looks like.

```
plot_contour(posterior)
decorate()
```

Here’s the posterior distribution of `sigma`

compared to the prior.

```
def marginal(joint, axis):
"""Compute a marginal distribution.
axis=0 returns the marginal distribution of the first variable
axis=1 returns the marginal distribution of the second variable
joint: DataFrame representing a joint distribution
axis: int axis to sum along
returns: Pmf
"""
return Pmf(joint.sum(axis=axis))
```

```
posterior_sigma = marginal(posterior, 0)
```

```
prior_sigma.plot(color='gray', label='prior')
posterior_sigma.plot(label='posterior')
decorate(xlabel='sigma', ylabel='PDF')
```

The posterior mean of `sigma`

is about 0.4, somewhat higher than the standard deviation of the data.

```
posterior_sigma.mean(), np.std(data)
```

(0.40440859064493323, 0.2687419249432851)

## Predictions¶

The last part of OP’s question is “So, I’m wondering if there’s a way to estimate the range of possibilities of what I would see if I could give 100 or 1000 samples?”

We can simulate a future experiment with larger sample size by drawing random pairs from the joint posterior distribution and generating simulated measurements. That’s easier to do if we stack the posterior PMF.

```
posterior_pmf = Pmf(posterior.stack())
posterior_pmf.head()
```

probs | ||
---|---|---|

mu | sigma | |

2.0 | 0.0100 | 0.0 |

0.0599 | 0.0 | |

0.1098 | 0.0 |

Now we can use NumPy’s choice function to choose a random pair of `mu`

and `sigma`

.
For each random pair, we can generate 1000 simulated measurements from a normal distribution, and plot the distribution of the results.

```
for i in range(11):
mu, sigma = np.random.choice(posterior_pmf.index, p=posterior_pmf)
sample = np.random.normal(mu, sigma, size=1000)
sns.kdeplot(sample, alpha=0.3)
decorate(xlabel='K+ (mmol/L)')
```

Each line shows the distribution of a possible sample of 1000 measurements. You can see that they vary in both location and spread, due to the uncertainty represented by the posterior distribution of `mu`

and `sigma`

.

## Discussion¶

The normal distribution is probably a good enough model for this data, but for some pairs of `mu`

and `sigma`

in the prior, the normal distribution extends into negative values with non-negligible probability — and for measurements like these, negative values are nonsensical.

An alternative would be to run the whole calculation with the logarithms of the measurements. In effect, this would model the distribution of the data as lognormal, which is generally a good choice for measurements like these.

In this example, the sample size is small, so the choice of the prior is important.
I suspect there is more background information we could use to make a better choice for the prior distribution of `sigma`

.

This example demonstrates the use of grid methods for Bayesian statistics. For many common statistical questions, grid methods like this are simple and fast to compute, especially with libraries like NumPy and SciPy. And they make it possible to answer many useful probabilistic questions.