
## Making Decisions in a Non-Normal World

### The power of the central limit theorem

Published: Sunday, October 10, 2010 - 09:26

Throughout the last couple of articles, I have explained and illustrated that understanding the random sampling distribution (RSD) of a statistic is key to understanding the entire basis of inferential statistics. Which is just a fancy way of saying “avoiding career-terminating decisions.” This month I’ll show you how the central limit theorem is your best friend, statistically speaking.

As I have mentioned before, there are four characteristics that we need to know (and often test) about any data set: shape, spread, location, and behavior over time.  In the article, “The Omnipotence of Random Sampling Distribution,” I showed you the RSDs for some of the statistics we use to measure shape, spread and location.  (The behavior over time is what control charts monitor.)  In my last article, “(Sample) Size Matters,” we made an assumption about the shape of the population we were testing and said it was normally distributed.  Although there is a strong theoretical basis for the normal distribution showing up, it certainly is not the only distribution you will see in the real world.  So what happens if it is some other distribution?

Let’s say we are interested in the population in figure 1:

Figure 1: Slightly skewed population

The skewness and kurtosis numbers are tested against those expected from a sample size of 450,000 and are highlighted in yellow if the probability of getting that statistic is less than 5 percent, in which case we reject the normal as a good approximation. The population is moderately skewed and leptokurtic, so it is certainly not normal, and we wouldn’t want to approximate it as normal, say for purposes of a capability study, because we would get the wrong results, as illustrated in figure 2. As we see there, the UPL and LPL are where you would say the process natural tolerance is if you assumed normality, but they would clearly give you incorrect probabilities:

Figure 2: Bogus normal limits on the skewed distribution

Now there is nothing wrong with a non-normal distribution—it is the nature of this process and might even be preferred to a normal distribution. But what if we were interested in testing to see if a new vendor might be able to shift that average up by five points? To calculate the sample size, we need to know the RSD of the means for this distribution, as I described in “(Sample) Size Matters.”

Obviously, the mean of all possible means of size n is going to be the mean of the individuals—it’s just the same numbers rearranged. But what shape will the distribution of means be?  It turns out that as the sample size increases, the distribution of the means gets more and more normal. So with non-normal distributions, the sample size needed to detect the change in the average, which we are looking for, also has to be large enough so that the RSD is reasonably approximated by the normal distribution. (If you can’t get that many, you can always use a nonparametric test like the sign test for location at the cost of some power to detect the shift.)
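To make that claim concrete, here is a minimal simulation sketch using only the Python standard library. The population behind figure 1 is not given in the article, so the gamma population below (shape 4, scale 2.5, mean 10) is a hypothetical stand-in for "moderately skewed," and the function names are illustrative.

```python
import random
import statistics

def skewness(values):
    """Moment-based sample skewness (third standardized moment)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values, mu=m)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)

def simulate_means(n, draws=20_000, seed=1):
    """Means of `draws` random samples of size n from a right-skewed population."""
    rng = random.Random(seed)
    # Hypothetical stand-in population: gamma with shape 4, scale 2.5 (mean 10).
    return [statistics.fmean(rng.gammavariate(4.0, 2.5) for _ in range(n))
            for _ in range(draws)]

if __name__ == "__main__":
    # Columns: sample size, mean of the means, skewness of the means.
    for n in (1, 5, 15):
        means = simulate_means(n)
        print(n, round(statistics.fmean(means), 2), round(skewness(means), 2))
```

Under these assumptions, the mean of the means should stay near 10 at every sample size, while the skewness of the means should shrink as n goes up.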

How big does the sample size used to calculate the average need to be for the RSD of the means to be normal? The statisticians’ favorite answer… it depends.

For moderately non-normal distributions, five, 10, or 15 samples should be just fine. For the most extreme distributions, you may need more.

In our example above, figure 3 shows what happens to the skewness and kurtosis of the RSD of the means as the sample size goes up:

Figure 3: What happens to the skewness and kurtosis of the RSD of means as the sample size goes up

As you can see, pretty quickly the kurtosis is eliminated as a problem for using the normal approximation. Skewness lingers on in ever-decreasing amounts. However, we are failing the statistical tests at a pretty massive sample size. Is the normal approximation a reasonable (if not exact) model for this RSD?  Let’s look at normality tests on a random sample of 1,000 from this RSD, as shown in figure 4:

Figure 4: Normality tests on a random sample of 1,000

So within 15 samples, even 1,000 data points look normal (though of course, the RSD really is still a teensy bit skewed).

Figure 5 shows a histogram at n = 15 for those 1,000:

Figure 5: Random sampling distribution of the mean of sample size 15 for the skewed population
with the normal distribution superimposed

The upshot is that if I am making a decision based on assuming the RSDs of the means are normal, I probably am not far off.

What about the most extreme example?

Here in figure 6 is a population that is exponentially distributed:

Figure 6: Exponentially distributed population

(You run into the exponential distribution fairly frequently in real life. For example, a serious injury rate might follow a Poisson distribution, and if so, the time between serious injuries would be exponential.)

You can’t see it, but with a sample of 1.05 million, the high is 153, so that is pretty skewed. The average is 10, as is the standard deviation. (Weird fact: An exponential distribution with a lower bound of zero has a standard deviation equal to the mean.)  Obviously, assuming this is a normal distribution would be a bad idea (see  figure 7):

Figure 7: Exponentially distributed population totally bogusly approximated by the normal distribution

It would be a bad idea because I would have a fairly high proportion of negative times between injuries, which I guess means that injuries are occurring before they happen. This means either I have invented a dangerous time machine, or I screwed up the assumption of normality.
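How high is "fairly high"? A short sketch, assuming the normal model is fitted with the population's own mean and standard deviation (both 10), puts numbers on it. The variable names are illustrative.

```python
import math
from statistics import NormalDist

MEAN = 10.0  # mean of the exponential population in the article
SD = 10.0    # for an exponential with lower bound zero, SD equals the mean

normal = NormalDist(mu=MEAN, sigma=SD)

# Share of "negative times" the normal model would predict.
p_negative_normal = normal.cdf(0.0)

# Upper 3-sigma limit under the normal assumption.
upper_limit = MEAN + 3 * SD  # 40

# Tail beyond that limit: normal model vs. the exponential itself.
p_above_normal = 1.0 - normal.cdf(upper_limit)
p_above_exponential = math.exp(-upper_limit / MEAN)

print(f"P(X < 0), normal model:   {p_negative_normal:.4f}")
print(f"P(X > 40), normal model:  {p_above_normal:.5f}")
print(f"P(X > 40), exponential:   {p_above_exponential:.5f}")
```

The normal model puts roughly 16 percent of the times below zero, and it understates the tail beyond 40 by more than a factor of ten (about 0.1 percent versus about 1.8 percent for the exponential).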

To get an RSD of the means that is normal, we are going to have to take more than five samples, I bet.
In figure 8, we see a similar pattern for the RSD of the means as we did before:

Figure 8: A similar pattern as before for the RSD of the means

I even made an animation (in figure 9) to watch the changes in the shape of the RSD as the sample size increases. (OK, I am so totally a stats geek.)

Figure 9: The effect of increasing sample size on the RSD of the means from an exponential distribution

Let’s do a similar exercise as before. Let’s take 100 random samples off of the RSD of the means for different sample sizes and test for normality. (If we take 1,000 like before, nothing passes the skewness test.)  We get the following results for distribution shape (see figure 10):

Figure 10: Results for the distribution shape

Yellow boxes fail at α = 0.05, which is the Type I error rate I advise for distribution shape testing. So for these random samples, we pass the normality tests when we calculate the mean from 10 samples. That is probably too small a sample size to be reliable, though, so I would take means from 20 or more samples in real life. Usually by a sample size of 30, pretty much any distribution’s RSD of the means looks normal enough to be helpful.
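For the exponential case, the lingering skewness can also be worked out without simulation. The mean of n independent exponential draws follows a gamma distribution with shape n, whose skewness is 2/√n and whose excess kurtosis is 6/n. The sketch below tabulates those values; note that "excess" kurtosis (zero for a normal distribution) is an assumption here, since the article does not say which convention its figures use.

```python
import math

def mean_shape_exponential(n):
    """Theoretical skewness and excess kurtosis of the mean of n exponential draws."""
    return 2.0 / math.sqrt(n), 6.0 / n

for n in (1, 5, 10, 20, 30):
    skew, ex_kurt = mean_shape_exponential(n)
    print(f"n={n:>2}  skewness={skew:.3f}  excess kurtosis={ex_kurt:.3f}")
```

Kurtosis falls off as 1/n while skewness falls off only as 1/√n, which is consistent with the kurtosis clearing up quickly while the skewness lingers. At n = 20 the theoretical skewness is still about 0.45, small enough to slip past a test on 100 means but large enough to be caught with a much bigger sample.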

Here in figure 11 is the distribution of the population and the RSD of the means with n = 20:

Figure 11: The exponentially distributed population and its RSD of the means of 20 samples (to scale)

If I were to take more than 100 random samples from those RSDs, we would still see remnants of the skewness, which would result in failing the tests for normality, first on the RSDs of the smaller sample sizes and then on the larger ones. But as George Box famously (famous among statisticians, anyway) said, “Essentially, all models are wrong, but some are useful.”  If an RSD of the means is normal enough to pass the skewness and kurtosis tests with 100 samples, it is probably close enough to be useful in making decisions.

For example, if I am interested in finding out if my project has significantly decreased the average time between serious injuries, I can use the relationship of the RSD of the means back to the actual population average to find out.

Because the averages of n = 20 fall on a nice (almost) normal distribution, all I have to do is take a single sample of 20 (presuming 20 is a large enough sample size to notice the effect size that I am looking for) and test that mean against the original mean using a t-test, even though the individuals are distributed as an exponential.
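As a rough sketch of that test, the snippet below computes a one-sample t statistic by hand. The 20 "after" values are made up purely for illustration (they are not data from the article), the function name is hypothetical, and the critical value is the standard two-sided 5 percent table value for 19 degrees of freedom.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean == mu0."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)  # sample SD (n - 1 denominator)
    return (mean - mu0) / (sd / math.sqrt(n)), n - 1

# Hypothetical times between injuries after the project (illustrative only).
after = [3.1, 0.4, 12.7, 6.2, 1.8, 9.5, 4.4, 0.9, 15.3, 2.6,
         7.7, 5.0, 1.2, 3.9, 10.8, 0.6, 8.1, 2.2, 6.9, 4.7]

t_stat, df = one_sample_t(after, mu0=10.0)
T_CRIT = 2.093  # two-sided 5 percent critical value for 19 degrees of freedom (t table)
print(f"t = {t_stat:.2f}, df = {df}, reject H0: {abs(t_stat) > T_CRIT}")
```

The comparison against the t table is only as good as the normal approximation to the RSD of the means, which is why the sample size behind the mean matters even before power is considered.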

Check this out. The final prediction of the central limit theorem is that the standard deviation of the RSD of the means is related to the standard deviation of the individuals like this:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

The standard deviation of the population was 10, so with means of n = 20 we would expect to see a standard deviation of the RSD of the means like so:

$$\sigma_{\bar{x}} = \frac{10}{\sqrt{20}} \approx 2.236$$

And we see 2.2107 with a sample of 15,000. Not too far off for a population that started off as exponential. Larger sample sizes, of course, end up being closer to the theoretical value.
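This comparison is easy to repeat. The sketch below assumes the figures refer to means of n = 20 (as in figure 11) and draws 15,000 such means from an exponential population with mean 10. The seed is arbitrary, so the simulated value will not match the article's 2.2107 exactly.

```python
import math
import random
import statistics

SIGMA = 10.0    # population standard deviation from the article
N = 20          # sample size behind each mean (assumed from figure 11)
DRAWS = 15_000  # number of means, matching the article's sample of the RSD

theoretical = SIGMA / math.sqrt(N)

rng = random.Random(2010)  # arbitrary seed, for repeatability only
# expovariate takes the rate (1 / mean), so 0.1 gives a mean of 10.
means = [statistics.fmean(rng.expovariate(0.1) for _ in range(N))
         for _ in range(DRAWS)]
observed = statistics.stdev(means)

print(f"theoretical: {theoretical:.4f}, simulated: {observed:.4f}")
```

The relationship itself does not depend on normality. It holds for any population with a finite variance, which is why it works here even though the individuals are exponential.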

This is massively useful in real life, since we do often encounter non-normal distributions, but we still have to make decisions based on samples from them. Of course, if the population could be normally distributed, we first check to see if that is a reasonable approximation. But even if it is not, the power of the central limit theorem is there to help us make reasonable decisions.

### Steven Ouellette

Steve Ouellette, ME, CMC started his career as a metallurgical engineer. It was during this time that his eyes were opened to find that business was not a random series of happenings, but that it could be a knowable enterprise. This continues to fascinate him to this day.

He started consulting in 1996 and is a Certified Management Consultant through the Institute for Management Consulting. He has worked in heavy and light industry, service, aerospace, higher education, government, and non-profits. He is the President of The ROI Alliance, LLC. His website can be found at steveouellette.com.

Steve has a black belt in aikido, a non-violent martial art, and spent a year in Europe on a Thomas J. Watson Fellowship studying the “Evolution, Fabrication, and Social Impact of the European Sword.”

### Normality?

As Dr. Deming said in a four-day seminar I attended in Cincinnati (1991)..."Normal distribution?....I never saw one." This leads me to wonder.... Why are we so obsessed with "checking for normality"? The so-called Goodness of Fit tests should actually be called Lack of Fit tests (and sometimes they are called this), and, given enough data, virtually any data set collected from a "real" process will show a lack of fit for the normal distribution, even if the histogram "looks" like a normal bell curve; in which case, the distribution generally shows a lack of fit near the tails of the distribution. So, the world in general IS non-normal.

### Deming Knew from Normality

Hi Steve, and thanks for reading!

Yep, Deming said that, but he was very aware of the theoretical basis for making decisions using statistics and testing for normality - my colleague Dr. Jeff Luftig worked for Deming at Ford (and is mentioned by Deming in Out of the Crisis) and he is a fanatic on testing for normality. You have merely misinterpreted what Deming's comment was intended to demonstrate. (If I recall correctly, he made that comment when talking about the difference between enumerative and analytical statistics and would not have meant to convey that testing for normality was unimportant in decision-making.)

As I quoted Box in the article, “Essentially, all models are wrong, but some are useful.” We test for normality in order to determine if it is *useful* to use the normal distribution as an approximation. The normal distribution has a sound theoretical basis for occurring, at least some of the time, and it is incredibly useful as an approximation since it allows you to do a lot of powerful tests with it. (Power here means "able to see effects for a lot less time, trouble, and money than otherwise.")

If it fails a test for normality, by definition it is too different from a normal distribution for that approximation to be useful. Using the normal distribution as an approximation would mislead us a lot more frequently than a test's stated alpha and beta error rates. Don't be worried about that, though, since there are a number of ways we can handle distributions that are not normal. We just don't want to use the normal approximation when it might mislead us (as in Figure 7 above).

As to real process data, that is why I recommend the use of an alpha of 0.05 with the Anderson-Darling test for n<25 and the skewness and kurtosis moment tests for larger sample sizes. This is a good balance between falsely rejecting a distribution that is reasonably normal (alpha error), and missing a significant deviation from the normal probabilities (beta error). In my experience, about 30-40% of real processes are reasonably approximated by the normal distribution (higher when dealing with measurement error).

But that is only part of the story too - a process that is out of control (as determined by a control chart) might fail (or heck, even pass) a normality test too. That is why you need to understand all four characteristics of a data set: shape, spread, location, and through-time behavior in order to understand what is going on.

You say, "The world in general IS non-normal," to which I say, "Yep. Thank goodness for the CLT and the RSD, which allow us to use the heuristic of statistics to help us make (some) decisions!"

### Stats Geek

Steven,

Most of the readers of QD are not "totally stats geeks," which may be the problem. As Quality Engineers we are tasked to improve processes/systems. We frankly don't like statistics or have no time to play with numbers and histograms. What we need, and what Dr. Wheeler provides, are simple tools to help us understand the UNDERLYING PROCESS that produces the numbers so we can improve these processes. Examples include:

- If a process is stable, perform capability or Wheeler's EP&U metric
- If a process is not stable, hunt down those special causes first before performing a Capability analysis
- If a process is not stable, hunt down those special causes first before running a DOE

Now for a QA Engineer that's useful stuff. As a degreed stats guy you could contribute much in this needed area of applied data analysis. You would be shocked how poorly data analysis is performed in industry. It's awful.

You are right that not everyone agrees with Dr. Wheeler but the same could be said about Deming.

Rich

### Stats Geek? Me??? :-)

Hi Rich!

Control charts are very powerful tools for understanding a process - I love using them and teaching them and they are essential tools in the toolkit. No one ever said that you can't use a control chart if the process is non-normal, or even if a process is out of control. After all, that is the whole purpose of the thing! Anyone saying either of those things is presenting a strawman argument. But this article wasn't about that.

There are simple tools to understand and improve a process: e.g. the Seven Basic Tools. Nothing wrong with that, but they will only take you so far. If you are in health-care, you can probably make HUGE improvements just using those tools to pick the low-hanging fruit. If you are in manufacturing, you have probably been using them for years. To get the "higher, sweeter fruit" you are going to have to learn a new level of understanding about the strengths (and limitations) of statistics, including experimental design. Statistics is just another word for data-based decision-making. By the way, I would not be shocked at the level of analysis in industry - I have been in it consulting and teaching since 1991. I know just how bad it is, even at big companies with lots of Master Black Belts. I like to think my little articles might help influence a change in that. Oh and don't forget, I started off as a process and product engineer, so I know whereof I speak, and it ain't from no ivory stats tower my friend! I tell you from experience - learning to love that extra level of knowledge will in fact make a QE's job easier and make them more effective.

No lie - most of what I see stats-lacking Black Belts, Master Black Belts, and Quality Engineers do as part of their job is 1) not needed and 2) misleading them. (I won't say "stats-hating" since I haven't had a chance to teach them yet!) I particularly dislike what I call "black box" statistics, where people are taught to just put the data into the software and push the button, without understanding anything about what they are doing. This is where ADB and I agree.

If a process is stable and non-normally distributed, calculating the capability of the process to meet customer requirements will give you the wrong answer for two reasons: first, the estimate of the variance will be wrong since there is a non-robust assumption of normality to use the dispersion metric to estimate the true process variability; and second, the capability indices assume normality the way most people calculate them. So you have two big errors that result in a capability index that is just not all that related to what you actually have. The capability from such processes can easily be calculated, you just can't use what pops out of some software's control chart for it.

You *can* run a DOE on a process that is out of control...but it will be more expensive (due to a larger sample size, process controls implemented for the experiment, etc.) and your results will obviously not confirm as frequently as you would expect from the alpha level you chose. Still, it can be (and has been) done.

Actually, Wheeler and I probably agree on more than we disagree. But what is the fun in that? Or another way of saying it, what do you learn from that? It is points of disagreement that lead to new knowledge. Otherwise I would write a bunch of articles saying, "Yeah me too!" Bleahh.

### Back to the dark ages

This sort of nonsense, which is a product of the even greater rubbish of "Six Sigma", has sent quality backwards a century, to the days before Shewhart. You would do well to read Dr Wheeler's excellent book "Normality and the Process Behaviour Chart". He examines 1143 distributions in detail and shows that Shewhart charts work well for the entire range of skewness and kurtosis in this article. (see page 88)