There is variation in measurement. Some of that variation is random. This is the entire reason for the existence of the discipline of Statistics!

# Randomness and statistics

# Variability of statistics

In the web page "Statistics are random," we found out that sample statistics vary in some random way. If we can’t say anything useful about the variation of our sample statistics, we won’t be able to say anything useful at all! Let us examine the effect of population size and sample size on the variability of our sample statistics.

**Effect of sample size:** I took seven numbers out of the thin blue air, and pretended they were a population. They were 69, 78, 75, 90, 88, 64, and 76. I made a histogram of the seven measurements and found their mean. The histogram was nothing special. (Click here for a PDF document that contains the graphics for this web page.) The mean was 77.14286.

Next, I wrote down every sample of 2 measurements that could possibly be made from the seven measurements. I found the sample means of all those samples. The collection of all those sample means is the **sampling distribution of the sample means for samples of size 2**. To see what the sampling distribution was like, I created a histogram of the sample means (which is also found in this PDF file). I saw that the sample means seemed more “evenly” distributed than the original measurements did. Very pleasant. I also calculated the mean of the sample means and got 77.14286, same as the mean of the original measurements.

I did the same for samples of size three, four, five, and six. Each time, I found the mean of the sample means to be 77.14286. Fascinating! But what about the histograms? For n=3, the histogram was mound-shaped, and had fewer extreme values in it. For n=4, it was bell-shaped, and narrower than the other histograms. Hmm…Bell-shaped... As in "normal..." This is not a coincidence.
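The enumeration described above can be reproduced with a short script. This is a minimal sketch in Python, assuming that "every sample" means every unordered selection of distinct members of the population; the function name `sample_means` is made up for illustration.

```python
from itertools import combinations
from statistics import mean

# The seven made-up measurements that play the role of the population.
population = [69, 78, 75, 90, 88, 64, 76]

def sample_means(values, size):
    """Return the mean of every possible sample of `size` distinct members."""
    return [mean(sample) for sample in combinations(values, size)]

for size in range(2, 7):
    means = sample_means(population, size)
    # Columns: sample size, number of samples, mean of the sample means,
    # and the smallest and largest sample mean (a rough view of the spread).
    print(size, len(means), round(mean(means), 5),
          round(min(means), 2), round(max(means), 2))
```

The smallest and largest sample means are printed so that the narrowing of the histograms can be seen in numbers as well as in pictures.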

As the sample size increases, the distribution of sample means gets closer and closer to being a normal distribution.

This very handy fact is known as "the Central Limit Theorem."

The fact that the sample means occupied less and less of the number line indicated that the variation in the sample means was decreasing as the sample size increased. This is not a coincidence.

The variation in the sample means decreases as the sample size increases.

There is a specific formula that expresses the *manner* in which the variation decreases. It's

(standard deviation of sample means) = (standard deviation of individual measurements) (divided by) (the square root of the sample size).

(I'll put this in symbols when I get a chance.) This is a *tremendously* important fact. *It lets us control the variability of our sample means!* It also gives us some hope that using “large enough” samples will result in sample means that are quite close to the population means they estimate. It gives us an excuse for using sample statistics to estimate parameters, as the same sort of thing turns out to be true for a variety of useful statistics. This gets discussed at some length in my classes.
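The formula can be checked by brute force on the seven-number population. The sketch below makes one assumption that is not spelled out on this page: the samples are drawn *with replacement*, so the same measurement may appear more than once in a sample. Under that assumption the formula holds exactly. When samples are drawn without replacement from a population this small, the standard deviation of the sample means comes out somewhat smaller than the formula says. The function name `sd_of_sample_means` is made up for illustration.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

population = [69, 78, 75, 90, 88, 64, 76]

# Standard deviation of the individual measurements (population form).
sigma = pstdev(population)

def sd_of_sample_means(values, size):
    """SD of the sample means over all ordered samples drawn WITH replacement."""
    means = [mean(sample) for sample in product(values, repeat=size)]
    return pstdev(means)

for size in (1, 2, 3, 4):
    enumerated = sd_of_sample_means(population, size)
    predicted = sigma / sqrt(size)  # the formula stated in words above
    print(size, round(enumerated, 4), round(predicted, 4))
```

The two printed columns should agree for every sample size, and both shrink as the sample size grows.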

To see the samples of each size, and their means, click here. (If your computer does not have a compatible spreadsheet program on it, this link may not work for you. Sorry!)

**Effect of population size:** Presumably, the larger your population, the more variety there could be in the measurements you might possibly make. It seems like increasing the population size should increase the variability of a statistic. This is true, but it turns out that the bigger the population is, the less noticeable the effect is. In effect, if a population is large enough, it might as well be infinite. Believe it or not, assuming your population is infinite can actually simplify arithmetic. To see an example of this, click here.

So how large is “large enough?” Statisticians have examined this question and decided that it depends on the statistic you’re using. For the statistics we will discuss in this course, here’s a safe rule of thumb: **If the population is at least 100 times as big as the sample, the variability in the statistic does not depend on the population size**. At least, not enough to notice in practice. Here are two more ways of saying the same thing:

- If the sample size is less than one percent of the population size, the variability in the statistic does not depend on the population size.
- If N > 100n, then the variability in the statistic does not depend on the population size.
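For the sample mean, the rule of thumb can be illustrated with the standard finite population correction, which this page does not state. When samples are drawn without replacement from a population of size N, the standard deviation of the sample means is σ/√n multiplied by √((N − n)/(N − 1)). The sketch below assumes that formula; the function name `finite_population_factor` and the sample size of 25 are arbitrary choices for illustration.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

def finite_population_factor(population_size, sample_size):
    """Factor that multiplies sigma / sqrt(n) when sampling without replacement."""
    return sqrt((population_size - sample_size) / (population_size - 1))

# How close is the factor to 1 as the population grows relative to the sample?
sample_size = 25
for multiple in (2, 10, 100, 1000):
    population_size = multiple * sample_size
    print(multiple, round(finite_population_factor(population_size, sample_size), 4))

# Check against the seven-number population from earlier on this page,
# using every sample of 2 distinct measurements.
population = [69, 78, 75, 90, 88, 64, 76]
enumerated = pstdev([mean(s) for s in combinations(population, 2)])
predicted = pstdev(population) / sqrt(2) * finite_population_factor(7, 2)
print(round(enumerated, 4), round(predicted, 4))
```

Once the population is 100 times the sample size, the factor is roughly 0.995, so treating the population as infinite changes the standard deviation of the sample means by only about half a percent.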

For a discussion of bias in this context, click here.