## (Sample) Size Matters

### Random sampling distributions are really something delightful

Published: Thursday, September 9, 2010 - 08:51

Last month I wrote about how the random sampling distributions (RSDs) of various sample statistics are the basis for pretty much everything in statistics. If you understand RSDs, you understand a lot about why we do what we do in hypothesis testing, inferential statistics, and estimation of confidence intervals. Understanding RSDs gives you a huge advantage as you seek to use data in business, so let's take a closer look.

First, a brief refresher from last month. We have the big honkin' distribution (BHD), which is the distribution of the entire population of interest as shown in figure 1:

To be able to make decisions in less than the amount of time necessary to evolve life on Earth, we only want to take a sample of some size n from the population. However, we have to understand how the sample statistics are distributed to make a decision, and that is where the RSD comes in. The RSD is the distribution of a sample statistic over all possible samples of size n. For example, the RSD of the means for n = 10 for the population above looks something like figure 2 on the same scale as the BHD:

If the RSD of the means is somehow related back to the population of the individuals, we can make some statements about the population from a single sample mean. That relationship is in fact known from the central limit theorem, the one piece of real magic in the universe.

The central limit theorem states:

• The average of the RSD of the means is the mean of the individuals

• The standard deviation of the RSD of the means is related to the standard deviation of the individuals by the inverse of the square root of the sample size: $\sigma_{\bar{x}} = \sigma / \sqrt{n}$

• The shape of the RSD of the means of any distribution tends to become more normal as the sample size increases.
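The first two properties are easy to check by brute force. Here is a minimal simulation sketch, assuming a normal population with a mean of 100 and a standard deviation of 10 (illustrative values chosen for this sketch, not taken from a real process): draw many samples of size 10, average each one, and look at how those averages are distributed.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

MU, SIGMA = 100.0, 10.0  # assumed population mean and standard deviation
N = 10                   # sample size
DRAWS = 20_000           # number of simulated samples

# Approximate the RSD of the means: draw many samples of size N
# and keep the average of each one.
sample_means = [
    statistics.fmean(random.gauss(MU, SIGMA) for _ in range(N))
    for _ in range(DRAWS)
]

rsd_mean = statistics.fmean(sample_means)
rsd_sd = statistics.stdev(sample_means)
predicted_sd = SIGMA / N ** 0.5  # sigma / sqrt(n)

print(f"mean of the RSD:      {rsd_mean:.2f} (population mean {MU})")
print(f"std. dev. of the RSD: {rsd_sd:.3f} (sigma / sqrt(n) = {predicted_sd:.3f})")
```

The simulated RSD should center on the population mean, with a spread close to σ/√n.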

The theorem's third attribute allows us to use some more powerful statistics if we are investigating averages, because we can count on a normal distribution. I'll discuss more about that in a later article.

So how can we use the idea of an RSD to help us make decisions?

Let's say we are trying to increase the strength of a weld by changing welding parameters. Bobby Jo is fresh from welding school and thinks she has a procedure that will make it stronger.

The first step is to have a chat with management about risks to determine the correct sample size. You need to decide what minimum increase in strength is needed before you are going to change the current process. If the weld strength goes up by 0.0000001 ksi, we probably don't care to make any changes because it will cost more to do the experiment than we would ever make back from such a small increase. This minimum effect size is often called Δ, and it might be an engineering decision or a management calculation based on needed financial benefit and required return on investment. (Hmm, another idea for a future article.) Let's say that we do the calculations, and we need to have at least a 5 ksi improvement over the current weld strength to justify the expense of changing the process.

Ask the managers what probability they can live with of concluding that Bobby Jo's welding procedure increases the weld strength, when in fact it does not. (This is Type I or α error.) When your managers say, "Well, zero percent; I want you to be right!" you get to say, "Well then, we have to have an infinite sample size." And then you get an opportunity to teach them about statistical risks, which they should have learned in business school but because most business schools are data-averse, they learned about outsourcing instead. Once you get through that, they decide on α = 0.05.

Next they should decide what probability they are willing to tolerate of missing an actual improvement in weld strength of the minimum amount (Δ). (This is Type II or β error.) But the managers interrupt your fascinating discussion and say they want to have a sample size of 10. (Presumably because they have 10 fingers and can't count higher without taking off their shoes.)

What does that mean for your experiment? What does it have to do with RSDs?

When we calculate sample size, we are actually using the RSD of the means (or whatever statistic we are testing). But now we have to consider the size of the difference that we want to see. Let's assume normality for simplicity's sake and graph it out (see figure 3):

The top distributions are three possibilities that the BHD might be. The blue one is the distribution if Bobby Jo's procedure has absolutely no effect on weld strength (known as the null hypothesis, or H0). The greenish teal and salmon ones (anyone know why Excel chooses difficult-to-name colors?) are what the weld strength would look like if her procedure increases or decreases the strength by the minimum amount necessary to consider changing our standard operating procedure (or making sure we never do it, if it is a shift down). Looking at the individuals, there is a lot of overlap, so any one measurement is not going to tell you if there is a change—we are going to have to take a sample and get an average to tell.

The middle distribution is the RSD of the means *if* there is no effect due to Bobby Jo's welding procedure. The red areas sum up to our α of 0.05 on that RSD (0.025 on each side), and if we get an average of our 10 welds that is less than 93.80205 or more than 106.1980, we are going to say that the chances of that average coming from the blue BHD are pretty small; in fact, it would happen less than 5 percent of the time. So, as we explained to our managers, we have to tolerate some chance of saying that there is a change when in fact there is not, to be able to make any decision at all.
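Those two critical values can be reproduced in a few lines. This sketch assumes a current mean weld strength of 100 ksi and a standard deviation of 10 ksi for individual welds. Neither number is stated outright in the text. Both are back-solved from the critical values quoted above, so treat them as assumptions.

```python
from statistics import NormalDist

MU_0 = 100.0   # assumed current mean weld strength, ksi
SIGMA = 10.0   # assumed standard deviation of individual welds, ksi
N = 10         # sample size
ALPHA = 0.05   # two-sided Type I error, 0.025 in each tail

# The RSD of the means under H0 is normal with standard deviation sigma / sqrt(n).
se = SIGMA / N ** 0.5
rsd_null = NormalDist(MU_0, se)

lower = rsd_null.inv_cdf(ALPHA / 2)
upper = rsd_null.inv_cdf(1 - ALPHA / 2)

print(f"reject H0 if the average is below {lower:.5f} or above {upper:.5f}")
```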

That is how we are going to make our decision, but what if the new weld procedure in fact makes a difference? That is what the bottom graph shows: two RSDs of the means if the weld procedure increases or decreases the weld strength by at least Δ. We already know how we are making our decision: We conclude there was a change when we see an average in one of the red areas on the middle graph. But is it possible that there was a change of Δ even though we got an average in the blue area between our two critical values? You bet it is, and that is shown by the purple area on each of those bottom distributions. (You can see that the purple areas are lined up with the blue area on the middle graph.) That is the probability that, *if* the new welding procedure increases or decreases the weld strength by Δ, we miss it and say that there was no difference. This is known as a β or Type II error. In this case, if there is a difference of Δ, we run a 64.8-percent chance of missing it. That seems high if this is something that we want to notice to improve the process.
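The purple area is just the part of the shifted RSD that falls between the two critical values. A minimal sketch of that calculation, again assuming a current mean of 100 ksi and σ = 10 ksi (values inferred from the critical values, not stated in the text), with Δ = 5 ksi and n = 10:

```python
from statistics import NormalDist

MU_0, SIGMA = 100.0, 10.0        # assumed current mean and std. dev., ksi
N, ALPHA, DELTA = 10, 0.05, 5.0  # sample size, Type I error, minimum shift

se = SIGMA / N ** 0.5

# Decision limits come from the RSD under H0 (no change).
null = NormalDist(MU_0, se)
lower = null.inv_cdf(ALPHA / 2)
upper = null.inv_cdf(1 - ALPHA / 2)

# Beta is the area of the shifted RSD that still lands between those limits.
shifted = NormalDist(MU_0 + DELTA, se)
beta = shifted.cdf(upper) - shifted.cdf(lower)

print(f"beta  = {beta:.3f}")
print(f"power = {1 - beta:.3f}")
```

Under these assumptions the result should land within rounding of the 64.8 percent quoted above. By symmetry, a shift of −Δ gives the same β.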

You can see the importance of each of these inputs into the sample size. Type I error determines where I make my decision as to whether there was a change or not, and choosing a smaller Δ makes it harder to see differences that are there because those two bottom distributions get closer together.

Given where we are, what can we do to increase our ability to see changes that are there? Well, we can change where we make our decision by increasing our α, say to 0.10, as shown in figure 4:

That is still a lot of purple, though—about 52.5-percent chance of missing a real change of ±Δ. And of course, we bought that reduction with an increased chance of saying that there is a difference when there isn't (the red areas are larger than before).

We could change Δ, say to 20, but we should have had a really good reason to have selected 5 in the first place, so that is probably out. If we did, the BHDs showing the effects of the new procedure would move farther apart, and the RSDs would move with them (see figure 5):

We obviously decreased our β error (the purple is so tiny you can't even see it), but all this is saying is if there is an enormous difference, we can detect it. As the vernacular would have it, "Well, duhh!"
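The two levers tried so far, loosening α and enlarging Δ, can be compared side by side. This is a sketch under the same assumptions as before (normality, a current mean of 100 ksi, σ = 10 ksi, n = 10). The helper `beta_error` is a hypothetical name introduced for this example.

```python
from statistics import NormalDist


def beta_error(alpha, delta, sigma=10.0, n=10, mu_0=100.0):
    """Chance of missing a true shift of +delta with a two-sided test on the mean.

    The defaults for sigma, n, and mu_0 are illustrative assumptions.
    """
    se = sigma / n ** 0.5
    null = NormalDist(mu_0, se)
    lower = null.inv_cdf(alpha / 2)
    upper = null.inv_cdf(1 - alpha / 2)
    shifted = NormalDist(mu_0 + delta, se)
    return shifted.cdf(upper) - shifted.cdf(lower)


# Baseline, a looser alpha, and a much larger delta.
for alpha, delta in [(0.05, 5.0), (0.10, 5.0), (0.05, 20.0)]:
    beta = beta_error(alpha, delta)
    print(f"alpha = {alpha:.2f}, delta = {delta:4.1f} ksi -> beta = {beta:.4f}")
```

Raising α only trims β modestly, while quadrupling Δ all but eliminates it, which is the "Well, duhh!" result.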

I guess we have to make those managers reexamine their sample size of 10. Otherwise, why even run the experiment if it only has a 50–50 chance of detecting the minimum change you want to detect?

As we increase the sample size, the width of those RSDs decreases, because you are dividing by the square root of the sample size. Because Δ isn't changing, the averages of the distributions stay the same. Let's take a look at sample sizes of 20, 30, and 40 shown in figure 6:

As you can see, each time the sample size goes up, the RSDs get skinnier (if only it worked that way for diets). As the RSDs get skinnier, the purple areas of β error get smaller.
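To put numbers on the skinnier RSDs, here is a short sketch that repeats the β calculation for each sample size, holding α = 0.05 and Δ = 5 ksi fixed. The current mean of 100 ksi and σ = 10 ksi are again assumptions, not values given in the text.

```python
from statistics import NormalDist

MU_0, SIGMA = 100.0, 10.0  # assumed current mean and std. dev., ksi
ALPHA, DELTA = 0.05, 5.0   # Type I error and minimum shift of interest

z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)  # two-sided critical z

betas = {}
for n in (10, 20, 30, 40):
    se = SIGMA / n ** 0.5                   # RSD width shrinks with sqrt(n)
    shifted = NormalDist(MU_0 + DELTA, se)  # RSD if the true shift is +delta
    betas[n] = shifted.cdf(MU_0 + z_crit * se) - shifted.cdf(MU_0 - z_crit * se)
    print(f"n = {n:2d}: RSD std. dev. = {se:.3f} ksi, beta = {betas[n]:.3f}")
```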

Instead of iterating to a sample size, we usually get the computer to do it for us by entering in α, β, Δ, and σ. Now that the managers understand β error, they meekly tell you that they would like no more than a β = 10-percent chance of that, please. If we continue our assumption of normality, the sample size required would be 42 (which, as it turns out, really is the answer to life, the universe, and everything—or at least our experiment).
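For the curious, the computer is doing roughly the following. This sketch uses the standard normal-approximation formula for a two-sided test on a mean, n = ((z₁₋α/₂ + z₁₋β) · σ / Δ)², which ignores the negligible far tail. The value σ = 10 ksi is an assumption inferred from the earlier critical values. The formula is a common textbook shortcut and is not necessarily what any particular software package implements.

```python
import math
from statistics import NormalDist

ALPHA = 0.05   # two-sided Type I error
BETA = 0.10    # Type II error the managers will tolerate
DELTA = 5.0    # minimum shift worth detecting, ksi
SIGMA = 10.0   # assumed std. dev. of individual welds, ksi

z = NormalDist()  # standard normal
z_alpha = z.inv_cdf(1 - ALPHA / 2)
z_beta = z.inv_cdf(1 - BETA)

# Normal-approximation sample size for a two-sided test on a mean.
n_raw = ((z_alpha + z_beta) * SIGMA / DELTA) ** 2

print(f"unrounded n:       {n_raw:.2f}")
print(f"nearest integer:   {round(n_raw)}")
print(f"strict round-up n: {math.ceil(n_raw)}")
```

The unrounded answer sits a hair above 42, so the nearest integer matches the 42 quoted above. A strict round-up, which guarantees β stays at or below 10 percent, would call for one more weld.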

So we have seen how the RSD is the basis of sample size calculations, and I sneaked in how they are also the basis of hypothesis testing (making the decision to accept or reject the null hypothesis). Next month I thought I'd take a look at that bit in the central limit theorem about the RSDs becoming more normal as the sample size increases, regardless of the distribution of the individuals.

RSDs are really something delightful, aren't they?

By the way, if you want to play with the spreadsheet I used to generate the RSD and the effect of sample size, you can download it here:

## Comments

## thanks great article

Just got a chance to read your article and wanted to thank you for making something I've learned over and over and over again the easiest to learn over again. I've been teaching a clinical research methods course for the last 5 years and I always received at least one comment that someone doesn't like how I explained power and sample size. I give them an article on the topic, a lecture, and we go to a web site to fool with changing alpha, effect size, etc., and the result ends with some not getting it. (I always consider one complaint means more than one has that same complaint.) I always consider whether I know the subject well enough so I think your article solidified some points and look forward to sharing your excel sheet in class. Maybe we (me and the students) will make some progress this next time out, thanks again.

## Thanks Karen!

Honestly Karen, that is the highest accolade I can receive for an article like this. In my experience, sample size is not well understood by applied researchers (including Black Belts and Master Black Belts) and the cost of decisions made in the absence of such understanding must be staggering.

Good luck, and let me know how it goes!

## Hi again ADB, and thanks for reading.

But then again, you could be wrong.

## Six Sigma Heretic Strikes Back!

Hey! A reader of my first columns! (I used to sign all my columns that way...until it got too annoying even for me.)

Indeed, as the scientific method has taught us, everyone has to humbly admit that some amount of what they think they know is wrong. Being wrong is an opportunity to learn something new! Just because someone writes an article or a book, does not put them above question.

Reality is a harsh place to live, but it beats the alternatives...

Keep on learnin'!

## Weld Improvement Starting Point

Rather than designing an experiment and evaluating distributions, wouldn't a better approach begin with plotting the weld strength data in the order of production to see if only one distribution actually exists?

Rich

## Absolutely Right!

Yep, if we were doing this in "real life" you would need four things for each data set before doing any experiment: shape, spread, location, and through-time variability. The first three are shown in various ways and described by statistics, the last one by a control chart. In the absence of control (significant changes in shape/spread/location through time), it becomes much more difficult (though not impossible) to perform and analyze an experiment. If it is out of control, you'll end up with larger sample sizes, possibly some blocking, maybe variable limitation (and its threat to external validity), and always run the risk that the effect you think is so great was actually a special cause you don't understand. So while you can run an experiment on an out-of-control process, you had better know about that beforehand to try to mitigate.

Buuuuut, since I am really only using this process as a vehicle for showing you how to use RSDs, and since my editors already give me grief about the huge articles I write... :) But understanding BHDs and RSDs is critical in understanding how to design and analyze any experiment.

Actually, if this were real life, I probably would also be looking at existing data (if it existed) and doing some data mining to see if I could find some factors to put in my experiment, rather than just trying Bobby Jo's technique, but again, I am trying to show you RSDs. And maybe Bobby Jo is the CEO's daughter...

## Product distribution

Your first assumption is meaningless. We never know the product distribution. A vast amount of data is required to even roughly estimate it and the process will change during any attempt to estimate it.

## Meaningless...or Masterful?

Hi again ADB, and thanks for reading.

This is what is known technically as a "simplification for instructional purposes." :) No actual products were harmed during the writing of this article, which was intended as an introduction to RSDs and why they are so important. So if my pretend process happens to be normally distributed, so it is.

However, to your point. Real processes do tend to have distributions of some sort, at least at the level where they are, in Box's words, useful models. Thankfully, the physical limitations of the universe mean that we don't often find processes that are totally random walks. There is often a confluence of influences that result in knowable distributions: the "memorylessness" of the exponential and its relation to the Poisson, the binomial distribution of a Bernoulli process, the many small sources of random variability that result in a normal distribution. All of these can be useful models for real process behavior, as long as you don't forget that they are models. But in business, we often have to make assumptions in order to make decisions. Otherwise we end up in analysis paralysis, never making a decision because nothing is *exactly* normally distributed, or we make decisions based on our gut in the absence of any data or analysis. (We do need to test the validity of our assumptions, though.)

A common error made about hypothesis testing is that we are saying a process is *really* distributed as something mathematical. We are only using useful models (hopefully with some theoretical basis in reality) to help reduce the error rate of our decisions and give us a repeatable way of making decisions, rather than relying on "I believe it must be so" reasoning. When I teach statistics, I test, in order, "shape, spread, location." If the shape is reasonably approximated by some known distribution, or can be transformed as such, then the probability of us making a bad decision due to that assumption is low. If it cannot, then maybe we can use some other non-parametric test to still help with our decision-making. If I took all the decisions I have made using statistics and knew which ones were "right" and "wrong" I would grant you that the number wrong would be some amount higher than the declared alpha error level, due to differences in assumed distributions, unknown systematic errors and threats to external validity, uncontrolled factors, etc. But not much higher, and much more correct than if managers used their gut to make their decisions for them.

Of course, if we have a process where we have control charts on all critical process variables, and we react as they go out of control, then it is pretty likely that we do see a nice, model-able, useful distribution of some sort. And a cool thing about RSDs of means is that, regardless of the distribution of the individuals and given a large enough sample size, the RSD tends to be normally distributed. So we are probably not far off when we make decisions based on a sample with a t-test, for example.

When we get stuck modeling a process empirically (like using the Johnson family to fit a distribution to sample statistics) then things do get nasty, and that is my least-preferred way to make decisions. But even so, sometimes decisions must be made, and at least I have a repeatable, gut-feel-independent process I followed in making them. I just need to keep in mind that that model is highly provisional and subject to rather large errors.