## Are You Sure We Don’t Need Normally Distributed Data?

### More about the misuses of probability theory

Published: Monday, November 1, 2010 - 05:00

Last year I discussed the problems of transforming data prior to analysis (see my August 2009 column, “Do You Have Leptokurtophobia?”; my September 2009 column, “Transforming the Data Can Be Fatal to Your Analysis”; and my October 2009 column, “Avoiding Statistical Jabberwocky”). There I demonstrated how nonlinear transformations distort the data, hide the signals, and change the results of the analysis. However, some are still not convinced. While I cannot say why they prefer to transform the data in spite of the disastrous results described in those three columns, I can address the fallacious reasoning contained in the current argument used to justify transforming the data. Since the question of transforming the data is closely associated with the idea that the data have to be normally distributed, we shall discuss these two questions together.

### An argument for transforming the data

This argument begins with the assumption that three-sigma limits are based on a normal distribution, and therefore the normal theory coverage of **P** = 0.9973 for three-sigma limits must be a characteristic of the process behavior chart. When three-sigma limits are used with other probability models we immediately find that we end up with different coverages **P**. So what do these different values of **P** mean? By analogy with other statistical procedures which are characterized by the risk of a false alarm, or alpha level, these different coverages **P** are converted into the sequential equivalent: the average run length between false alarms, or *ARL*_{0}. These average run length values are then interpreted as a characterization of how the three-sigma limits will work with different probability models.

When signals consist of points outside the three-sigma limits a well-known result establishes that the theoretical *ARL*_{0} values will simply be the inverse of [1–**P**]. Thus, the normal theory value for **P** of 0.9973 corresponds to an *ARL*_{0} of 370. See figure 1 for the results for some other probability models.
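The arithmetic behind the normal theory value can be checked in a few lines of Python. This is only a sketch: the function name `arl0` is made up for illustration, and the normal coverage comes from the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist

def arl0(p):
    """Average run length between false alarms for a coverage p, with 0 < p < 1."""
    return 1.0 / (1.0 - p)

# Coverage of three-sigma limits under a standard normal model.
std_normal = NormalDist()
p_normal = std_normal.cdf(3.0) - std_normal.cdf(-3.0)

print(round(p_normal, 4))      # 0.9973
print(round(arl0(p_normal)))   # 370
```

The same function can be applied to any other coverage **P** to see how quickly the inverse grows as **P** approaches 1.00.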

The smaller *ARL*_{0} values for the skewed models are cited as evidence that the process behavior chart does not work with skewed data. These values are interpreted to mean that the chart will give too many false alarms. Therefore, skewed data need to be transformed to make them look “more normal.” This is said to reduce the number of false alarms and make the chart work more like it does in the normal theory case. Thus, it is said that any analysis should begin with a test for normality. If there is a detectable lack of fit, then a normalizing transformation is needed before the data are placed on a process behavior chart.

### An early critique

Many years ago Francis Anscombe, a fellow of the American Statistical Association, summarized the strength of the argument above when he said, “Testing the data for normality prior to placing them on a control chart is like setting to sea in a rowboat to see if the Queen Mary can sail.” Those of us who have published articles on lack-of-fit tests in refereed journals know about the weaknesses that are inherent in all such tests. However, there are more fundamental problems with the argument above than just the weakness of lack-of-fit tests. To fully explain these problems, it is necessary to go into some detail about the origins of three-sigma limits and the inappropriateness of using the *ARL*_{0} values in the manner they are used in the preceding argument.

### Shewhart’s argument

On pages 275–277 of *Economic Control of Quality of Manufactured Product*, Walter Shewhart discussed the problem of “establishing an efficient method for detecting the presence” of assignable causes of exceptional variation. He began with a careful statement of the problem, which can be paraphrased as follows: If we know the probability model that characterizes the original measurements, *X*, when the process satisfies the differential equation of statistical control, then we can usually find a probability model, *f(y,n)*, for a statistic *Y* calculated from a sample of size *n* such that the integral

$$\int_{\mathbf{A}}^{\mathbf{B}} f(y,n)\,dy = \mathbf{P}$$

will define the probability **P** that the statistic *Y* will have a value that falls in the interval defined by **A** and **B**. A good name for this approach would be the statistical approach.

**The Statistical Approach:**

Choose a fixed value for **P** that is close to 1.00,

then for a specific probability distribution *f(y,n),*

find the critical values **A** and **B**.

When *Y* falls outside the interval **A** to **B** the observation may be said to be inconsistent with the conditions under which **A** and **B** were computed. This approach is used in all sorts of statistical inference techniques. And as the argument says, once you fix the value for **P**, the values for **A** and **B** will depend upon the probability model *f(y,n)*.
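As a rough illustration of this dependence, the sketch below fixes **P** at 0.9973 and finds **A** and **B** for two models. The choice of models (a standard normal and an exponential with mean 1), the equal split of the leftover area between the two tails, and the helper name `exp_quantile` are all assumptions made for the example, not part of Shewhart's discussion.

```python
import math
from statistics import NormalDist

P = 0.9973                 # fixed coverage, chosen close to 1.00
tail = (1.0 - P) / 2.0     # assumption: equal area left in each tail

# Model 1: standard normal, mean 0 and standard deviation 1.
normal = NormalDist(0.0, 1.0)
a_normal = normal.inv_cdf(tail)
b_normal = normal.inv_cdf(1.0 - tail)

# Model 2: exponential with mean 1, whose quantile function is -ln(1 - q).
def exp_quantile(q):
    return -math.log(1.0 - q)

a_exp = exp_quantile(tail)
b_exp = exp_quantile(1.0 - tail)

print(f"normal:      A = {a_normal:.3f}, B = {b_normal:.3f}")
print(f"exponential: A = {a_exp:.3f}, B = {b_exp:.3f}")
```

With the same **P**, the normal model gives critical values near -3 and +3, while the exponential model gives values near 0.001 and 6.6. The limits follow from the model, so a different model means different limits.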

But Shewhart did not stop here. He went on to observe that: “For the most part, however, we never know *f(y,n)* in sufficient detail to set up such limits.” Therefore, Shewhart effectively reversed the whole argument above and proceeded as follows:

**Shewhart’s Approach:**

Choose GENERIC values for **A** and **B** such that,

for ANY probability model *f(y,n),*

the value of **P** will be reasonably close to 1.00.

Such generic limits will still allow a reasonable judgment that the process is unlikely to be predictable when *Y* is outside the interval **A** to **B**. Ultimately, we do not care about the exact value of **P**. As long as **P** is reasonably close to 1.00 we will end up making the correct decision virtually every time. After consideration of the economic issues of this decision process, Shewhart summarized with the following:

“For these reasons we usually choose a symmetrical range characterized by limits *Average* ± *t sigma* symmetrically spaced in reference to [the Average]. Chebyshev’s theorem tells us that the probability **P** that an observed value of *Y* will lie within these limits so long as the quality standard is maintained satisfies the inequality

$$\mathbf{P} > 1 - \frac{1}{t^{2}}$$

“We are still faced with the choice of *t*. Experience indicates that *t* = 3 seems to be an acceptable economic value.”

*Thus, Shewhart’s approach to the problem of detecting the presence of assignable causes is the complete opposite of the approach used with techniques for statistical inference.* The very idea that the data have to be normally distributed before they can be placed on a process behavior chart is an expression of a misunderstanding of the relationship between the process behavior chart and the techniques of statistical inference. As Shewhart later noted, “We are not concerned with the functional form of the universe, but merely with the assumption that a universe exists.” The question of homogeneity is the fundamental question of data analysis.

If the data are homogeneous, then the process behavior chart will demonstrate that homogeneity and it will also provide reasonable estimates for both the process location and dispersion. At the same time the histogram will show how the process is performing relative to the specifications. Thus, virtually all of the interesting questions will be answered by the homogeneous data without having to make reference to a probability model. So while it might be plausible to fit a probability model to a set of homogeneous data, it is, in every practical sense, unnecessary.

If the data are not homogeneous, then the process behavior chart will demonstrate that the process is subject to the effects of assignable causes of exceptional variation. When this happens the histogram will be a mixture of the process outcomes under two or more conditions. While the histogram will still describe the past, it loses any ability to characterize the future. Here it is not probability models or computations that are needed. Rather, you will need to take action to find the assignable causes. Fitting a probability model to nonhomogeneous data, or determining which nonlinear transform to use on nonhomogeneous data, is simply a triumph of computation over common sense. It is also indicative of a fundamental lack of understanding about statistical analysis.

The software will give you lack-of-fit statistics and the statistics for skewness and kurtosis (all of which are numbers that you would never have computed by hand). However, just because they are provided by the software does not mean that these values are appropriate for your use. The computation of any lack-of-fit statistic, just like the computation of global skewness and kurtosis statistics, is predicated upon having a homogeneous data set. *When the data are nonhomogeneous, all such computations are meaningless.* Any attempt to use these values to decide if the data are detectably non-normal, or to decide which transformation to use, is complete nonsense.

Thus, fitting a probability model to your data is simply a red herring. When the data are homogeneous it will only be a waste of time. When the data are nonhomogeneous it will be meaningless. Moreover, whenever a lack-of-fit test leads you to use a nonlinear transformation on your data you will end up distorting your whole analysis.

### The coverage of three-sigma limits

In the passage above we saw how Shewhart used the Chebyshev inequality to establish the *existence* of some number *t* that will result in values for **P** that are reasonably close to 1.00. Following this existence theorem, he then turned to empirical evidence to justify his choice of *t *= 3. While the Chebyshev inequality will only guarantee that three-sigma limits will cover at least 89 percent of the area under the probability model, the reality is that three-sigma limits will give values for **P** that are much closer to 1.00 in practice.
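The gap between the Chebyshev guarantee and the actual coverage can be seen with a small sketch. The three models used here (standard normal, uniform on [0, 1], and exponential with mean 1) are chosen only because their coverage has a simple closed form; they are illustrative assumptions, and the function names are made up for the example.

```python
import math
from statistics import NormalDist

def chebyshev_lower_bound(t):
    """Minimum coverage of mean +/- t*sd guaranteed by the Chebyshev inequality."""
    return 1.0 - 1.0 / t**2

def normal_coverage(t):
    z = NormalDist()
    return z.cdf(t) - z.cdf(-t)

def uniform_coverage(t):
    # Uniform on [0, 1]: mean 1/2, standard deviation 1/sqrt(12).
    mean, sd = 0.5, 1.0 / math.sqrt(12.0)
    lo, hi = max(0.0, mean - t * sd), min(1.0, mean + t * sd)
    return hi - lo

def exponential_coverage(t):
    # Exponential with mean 1: sd is also 1, and the cdf is 1 - exp(-x) for x >= 0.
    lo, hi = max(0.0, 1.0 - t), 1.0 + t
    return (1.0 - math.exp(-hi)) - (1.0 - math.exp(-lo))

t = 3.0
print(f"Chebyshev bound: {chebyshev_lower_bound(t):.4f}")
for name, cover in [("normal", normal_coverage),
                    ("uniform", uniform_coverage),
                    ("exponential", exponential_coverage)]:
    print(f"{name:12s} {cover(t):.4f}")
```

For *t* = 3 the bound is 8/9, or about 0.889, while the three models give coverages of about 0.9973, 1.0000, and 0.9817 respectively.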

To illustrate what happens in practice, I decided to look at different probability models and to compute the area under the curve included within the “three-sigma interval” defined by the parameters:

*MEAN(X)* ± 3 *SD(X)*

I began with 100 binomial distributions. Next I looked at 112 Poisson distributions, 43 gamma distributions, 100 F-distributions, 398 beta distributions, 41 Weibull distributions, and 349 Burr distributions. These 1,143 distributions contain all of the chi-square distributions, and they cover the regions occupied by lognormals, normals, and Student *t* distributions. They include U-shaped, J-shaped, and mound-shaped distributions. Since each of these probability models will have a skewness parameter and a kurtosis parameter, we can use the values of these two shape parameters to place each of these 1,143 distributions in the shape characterization plane.

In figure 2 each of the 1,143 probability models is shown as a point in this shape characterization plane. For any given probability model the horizontal axis will show the value of the square of the skewness parameter, while the vertical axis will show the value for the kurtosis parameter. The lines shown divide the shape characterization plane into regions for mound-shaped probability models, J-shaped probability models, and U-shaped probability models.
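To make the coordinates concrete, the sketch below computes the point (skewness squared, kurtosis) for a handful of familiar models using their standard closed-form shape parameters. It assumes the kurtosis axis is the ordinary (non-excess) kurtosis, so that a normal distribution sits at kurtosis 3; the function names and the particular models are illustrative choices and do not reproduce the 1,143 models of figure 2.

```python
import math

def shape_point(skewness, kurtosis):
    """Return (skewness squared, kurtosis); kurtosis is assumed non-excess (normal = 3)."""
    return (skewness**2, kurtosis)

def gamma_shape(k):
    # Gamma with shape parameter k: skewness 2/sqrt(k), kurtosis 3 + 6/k.
    return shape_point(2.0 / math.sqrt(k), 3.0 + 6.0 / k)

def poisson_shape(lam):
    # Poisson with mean lam: skewness 1/sqrt(lam), kurtosis 3 + 1/lam.
    return shape_point(1.0 / math.sqrt(lam), 3.0 + 1.0 / lam)

points = {
    "normal": shape_point(0.0, 3.0),
    "uniform": shape_point(0.0, 1.8),
    "exponential": gamma_shape(1.0),   # the exponential is a gamma with k = 1
    "gamma(k=4)": gamma_shape(4.0),
    "Poisson(mean=2)": poisson_shape(2.0),
}
for name, (skew_sq, kurt) in points.items():
    print(f"{name:16s} skewness^2 = {skew_sq:.2f}, kurtosis = {kurt:.2f}")
```

Under these assumptions the normal lands at (0, 3), the uniform at (0, 1.8), and the exponential at (4, 9), which gives a feel for how far apart these models sit in the plane.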

By considering the coverage of the three-sigma intervals for these 1,143 models I was able to construct the contour map shown in figure 3.

Starting on the lower left, the bottom region is the region where the three-sigma interval gives 100-percent coverage. The slice above this region shown by the darker shading is the region where the three-sigma interval will give better than 99.5-percent coverage. Continuing in a clockwise direction, successive slices define regions where the three-sigma intervals will give better than 99-percent coverage, better than 98.5-percent coverage, better than 98-percent coverage, etc. The six probability models of figure 1 are shown in figure 3. This will allow you to compare the **P** values of figure 1 with the contours shown in figure 3.

Inspection of figure 3 will reveal that three-sigma intervals will provide better than 98-percent coverage for *all* mound-shaped probability models. Since most histograms are mound-shaped, this 98 percent or better coverage hints at why Shewhart found three-sigma limits to be satisfactory in practice.

Further inspection of figure 3 will show that most of the J-shaped probability models will have better than 97.5-percent coverage with three-sigma intervals. Since these models tend to cover the remaining histograms we find in practice, we see that three-sigma limits actually do much better than the Chebyshev inequality would lead us to believe.

In the U-shaped region most of the probability models will have 100-percent coverage. However, some very skewed and very heavy-tailed U-shaped probability models will fall into the Chebyshev minimum. Offsetting this one area of weakness is the fact that when your histogram has two humps it is almost certain that you are looking at the mixture of two different processes, making the use of three-sigma limits unnecessary.

Thus, when the process is being operated predictably, Shewhart’s generic three-sigma limits will result in values of **P** that are in the conservative zone of better than 97.5-percent coverage in virtually every case. When you find points outside these generic three-sigma limits, you can be sure that either a rare event has occurred or else the process is not being operated predictably. (It is exceedingly important to note that this is *more conservative* than the traditional 5-percent risk of a false alarm that is used in virtually every other statistical procedure.)

Why were figures 2 and 3 restricted to the values shown? Cannot both skewness and kurtosis get bigger than the values in figures 2 and 3? Yes, they can. The choice of the region used in figures 2 and 3 was based on two considerations. First, whenever you end up with very large skewness or kurtosis statistics it will almost always be due to an unpredictable process. When this happens the skewness and kurtosis statistics are not a characteristic of the process, but are describing the mixture of two or more conditions.

Second, predictable processes are the result of routine variation. As I showed in my May column (“Two Routes to Process Improvement”) routine variation can be thought of as the result of a large number of common causes where no one cause has a dominant effect. In other words, a predictable process will generally correspond to some high-entropy condition, and the three distributions of maximum entropy are the normal distribution, the uniform distribution, and the exponential distribution. Figures 2 and 3 contain these three distributions and effectively bracket the whole region of high-entropy distributions that surround them. When considering probability models that are appropriate for predictable processes we do not need to go further afield. The region shown in figures 2 and 3 is sufficient.

### The average run length argument

Any statistical technique that is used to detect signals can be characterized by a theoretical power function. With sequential procedures a characteristic number that is part of the power function is the average run length between false alarms (*ARL*_{0}). As we saw in figure 1, the *ARL*_{0} value will be the inverse of [1 – **P**].

Of course, when we invert a small value we get a big value. This means that two values for **P** that are close to 1.00, and which are equivalent in practice, will be transformed into *ARL*_{0} values that are substantially different. Interpreting the differences in the *ARL*_{0} values in figure 1 to mean that we need to transform our data prior to placing them on a process behavior chart is nothing more or less than a bit of statistical sleight of hand. There are three different problems with using the *ARL*_{0} values in figure 1 to argue that we need to have “normally distributed data.”

The first problem with figure 1 is that this is an inappropriate use of the *ARL*_{0} values. Average run length *curves* are used to compare different *techniques* using the *same* probability model. The *ARL*_{0} values are merely the end points of these *ARL* curves. As the end points they are not the basis of the comparison, but only one part of a larger comparison. It is this comparison under the same conditions that makes the *ARL* curves useful. Using the same probability model with different techniques creates a mathematical environment where the techniques may be compared in a fair and comprehensive manner. Large differences in the *ARL* curves will translate into differences between the techniques in practice, while small differences in the *ARL* curves are unlikely to be seen in practice. But this is not what is happening in figure 1. There we are changing the probability model while holding the technique constant. As we will see in the next paragraph, regardless of the technique considered, changing the probability model will always result in dramatic changes in the *ARL*_{0} values. *These changes are not a property of the way the technique works, but are merely a consequence of the differences between the probability models.* Thus, the first problem with figure 1 is that instead of holding the probability model constant and comparing techniques, it holds the technique constant and compares probability models (which we already know are different).

The second problem is that the differences in the *ARL*_{0} values in figure 1 do not represent differences in practice. By definition, the *ARL*_{0} values are the inverses of the areas under the extreme tails of the probability models. Since all histograms will have finite tails, there will always be discrepancies between our histograms and the extreme tails of our probability models. Given enough data, you will *always* find a lack of fit between your data and any probability model you might choose. Those who have studied lack-of-fit tests and are old enough to have computed the various lack-of-fit statistics by hand will understand this. Thus, the differences between the *ARL*_{0} values in figure 1 tell us more about the differences in the extreme tails of the probability models than they tell us about how a process behavior chart will operate when used with a finite data set. Thus, the second problem is that the most artificial and unrealistic part of any *ARL* curve is the *ARL*_{0} value.

The third problem with the *ARL*_{0} values in figure 1 is the fact that they are computed under the assumption that you have an infinite number of degrees of freedom. They are expected values. In practice the three-sigma limits will vary. If we take this variation into account and look at how this variation translates into variation in the value for **P** for different probability models, then we can discover how the *ARL*_{0} values vary. In figure 4, the limits are assumed to have been computed using 30 degrees of freedom. The *ARL*_{0} values shown correspond to the **P** values that correspond to the middle 95-percent of the distribution of computed limits. As may be seen the differences in the Mean *ARL*_{0} values disappear in the overwhelming uncertainties given in the last column. Moreover, there is no practical difference in the lower bounds on the *ARL*_{0} values regardless of the probability model used. This is why transforming the data will not improve your *ARL*_{0} to any appreciable degree. Thus, the third problem with comparing the Mean *ARL*_{0} values is that the routine variation in practice will obliterate differences seen in figure 1.
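The effect of computing limits from a finite amount of data can be sketched with a small simulation. Several things here are assumptions made purely for illustration and are not taken from figure 4: the process is taken to be standard normal, the limits are computed as the sample mean plus or minus three sample standard deviations from 31 values (which gives 30 degrees of freedom), and the function name, trial count, and seed are arbitrary.

```python
import random
import statistics
from statistics import NormalDist

def simulate_arl0(n_obs=31, n_trials=2000, seed=1):
    """Sorted ARL_0 values implied by three-sigma limits estimated from n_obs normal values."""
    rng = random.Random(seed)
    truth = NormalDist(0.0, 1.0)          # assumed true process model
    results = []
    for _ in range(n_trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        center = statistics.fmean(sample)
        sd = statistics.stdev(sample)     # n_obs - 1 = 30 degrees of freedom
        lower, upper = center - 3.0 * sd, center + 3.0 * sd
        p = truth.cdf(upper) - truth.cdf(lower)   # true coverage of the estimated limits
        results.append(1.0 / (1.0 - p))
    return sorted(results)

arl_values = simulate_arl0()
lo_idx = int(0.025 * len(arl_values))
hi_idx = int(0.975 * len(arl_values)) - 1
print(f"middle 95% of ARL_0: {arl_values[lo_idx]:.0f} to {arl_values[hi_idx]:.0f}")
print(f"median ARL_0: {statistics.median(arl_values):.0f}")
```

Even with a perfectly normal process, the middle 95 percent of the simulated *ARL*_{0} values spans a range far wider than the gaps between the theoretical values for different models, which is the sense in which the uncertainty in the limits swamps the differences between models.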

### Summary

Therefore, while the argument presented at the start of this column has been used to justify transforming data to make them “more normal” it does not, in the end, justify this practice. The whole argument, from start to finish, is based on a strong presumption that the data are homogeneous and the underlying process is predictable. Of course, according to the argument, the purpose of the transformation is to make the data “suitable” for use in a technique that is intended to examine the data for homogeneity. Thus, this argument could be paraphrased as: “Three-sigma limits won’t work properly unless you have good data.”

Based on my experience, when you start using process behavior charts, you will find at least 9 out of 10 processes to be operated unpredictably. Both **P** and *ARL*_{0} are concerned with false alarms. In the presence of real signals we do not need to worry about the *ARL*_{0} value, and trying to get a specific value for **P** is wasted effort. As long as **P** is reasonably close to 1.00, we will end up making the right decisions virtually every time. And as we have seen, three-sigma limits will yield values of **P** that are greater than 0.975 with any probability distribution that provides a reasonable model for a predictable process.

Fortunately, the process behavior chart has been completely tested and proven in over 80 years of practice. No transformations were needed then. No transformations are needed now. Process behavior charts work, and they work well, *even when you use bad data*.

From the very beginning statisticians, who by nature and training are wedded to the statistical approach, have been nervous with Shewhart’s approach. To them it sounds too simple. So they invariably want to jump into the gap and add things to fix what they perceive as problems. This started with E. S. Pearson’s book in 1935, and it has continued to this day. Transforming your data prior to placing them on a process behavior chart is simply the latest manifestation of this nervousness. However, as perhaps you can begin to see from this column, the simplicity of Shewhart’s process behavior chart is not the simplicity that comes from glossing over the statistical approach, but it is rather the simplicity that is found on the far side of the complexity of a careful and thoughtful analysis of the whole problem.

As Elton Trueblood said so many years ago, “There are people who are afraid of clarity because they fear that it may not seem profound.” Beware those who try to make process behavior charts more complex than they need to be.

## Comments

## Non-Normality in Individual Type Data

According to Dr. Borror, Dr. Douglas Montgomery, and Dr. George Runger, individuals type SPC charts are sensitive to non-normality and you will have false alarms if you assume normality. Tom Pyzdek's article gave an excellent example of that in comparing control limits computed using a normal and a lognormal distribution; the control limit differences are very significant.

For Phase II performance of the Shewhart chart, Dr. Montgomery states in his "Introduction to Statistical Quality Control" that in-control individual type control charts are sensitive to non-normal distributions. I am not stating anything about X-MR charts with multiple subgroups, only individual types. The suggestion is to do one of three things: (1) use percentiles from the best-fit distribution to determine control limits, (2) fit a distribution that best fits the data using the KS or AD statistic to compute control limits, or (3) transform the data to determine proper control limits. I think the Borror, Montgomery, and Runger studies show evidence that we must fit the best distribution to individual in-control data to compute control limits. I noticed Dr. Wheeler never mentioned anything about individual SPC charts and their sensitivity to non-normality. Why do you think there are statistical programs including these features in statistical software to fit the best distribution to the data, such as MINITAB and many others? If the data is not in control in Phase I, I think we all agree that SPECIAL or assignable causes should be investigated. Using statistical software that fits individual data that contains outliers does not hurt if you generate SPC charts and certainly is not a waste of time. Control limits may not mean much but you can always see suspect outliers skewing the data. Fitting data that contains outliers is not to be taken seriously; I previously said go back and eliminate special causes and redo your SPC chart. Out of control data can be easily seen if you fit the best distribution or assume normality. The object is to explore the data for stability and control before and during Phase I. And by the way, Rip, Cp and Cpk do not determine stability; Cp and Cpk only have to do with how well data is meeting spec and are meaningless if the process is NOT under control.
Both the YIELD and Cp, Cpk are meaningful only after the data is under control and stable, ready for phase II. If I mis-stated myself earlier, my apologies, I stand corrected.

Box-Cox transformations such as ones that MINITAB uses ARE appropriate and do not distort the data for individuals charts. In computing Yield, Dr. Jerry Alderman and Dr. Allan Mense created a program that fits a Johnson distribution to data to determine the Yield. The results were run on thousands of datasets and the results are excellent. Not all data can be fit by a Johnson distribution, but each indicates reasons why. Results were compared to Yields computed using MINITAB and JMP 7. Results are very good. However I prefer to fit the data by the proper distribution then compute the Yield without transforming the data in any way. To the person that said it is meaningless to fit data to telephone book numbers, you need to go back to school; we deal with fitting measured data, not a bunch of meaningless random numbers. Measured data are not random.

Why do I pick on individuals SPC charts? Because this is the only way data is measured in our factories. We may change that process for justifiable reasons in the future, but for now, all data comes as individuals over time. One comment here was made to assume normality for all data and keep it simple stupid. That is a narrow minded approach inviting risks of false alarms and unnecessary rework, including shutting down a line to investigate a false alarm. For Phase II SPC processes just use software that best-fits your data, computes proper control limits, yields, Cp, Cpk, and SPC charts. How do you handle data that is discrete, e.g., data that has a few or many step functions or constant values with a few outliers? Fitting any distribution including a normal distribution and assuming normality is improper. I welcome Dr. Wheeler to respond to the Borror, Montgomery, Runger studies on non-normality of individuals SPC charts. Dr. Montgomery and Dr. Runger each suggest that a univariate EWMA chart would be insensitive to non-normality and could be used if the right lambda and L values would be used. Those non-statisticians doing SPC have no business generating or evaluating SPC charts. Software makes computations and the statistics invisible to the SPC analyst; but the person analyzing them had better know basic statistics of SPC charts. This is the 21st Century, not the 20th Century. Also no one in their right mind computes SPC charts and Yields by hand. Yield = area under the curve between spec limits once a process is under control. Percent nonconformance = 100% - Yield.

## Lloyd Nelson's Article

Does anyone know where I might obtain a copy of Lloyd Nelson's article?

Steve Moore

smoore@wausaupaper.com

## My 2 Cents' Worth

In 1991, I attended Dr. Deming's 4 day seminar in Cincinnati. I distinctly heard him make the following statement: "Normal distribution? I never saw one." At the time, I was astounded and that's why I remember it so well. However, now I understand (at least partially) what he meant.

Dr. Wheeler has always provided elegant proofs (theoretical and practical) of his statements such as found in this article. In addition, his statements are founded in the work of Dr. Shewhart, who, of course, invented (or discovered) the process behavior chart. What more does anyone need to understand that the data do not have to be "normally distributed" for a process behavior chart to give useful insight into the process?

After over 25 years of learning and practising the use of process behavior charts in industrial settings, I can honestly say that I have never had to worry about how the data was distributed when using I-mR charts; and I have helped improve many processes this way. I have also learned that the simplest analysis which gives you the insight needed to solve problems is the best analysis. Start transforming data and trying to lead a team to a solution to a problem and all you will get is glazed-over eyeballs!!! Keep it simple, stupid and people will follow you down the road to continuous improvement. Start transforming data and using other confusing "statistics" and people will stay home!

Back to the normal distribution: Keep collecting data from a stable system after determining that the data follows the normal distribution....and eventually, the data will fail the so-called goodness of fit test. In addition, almost any set of data that fits one distribution model will fit others as well. Normal distribution? I never saw one either (though I used to think I had)!!!

## Thanks for the comments

My thanks to Rich DeRoch and Rip Stauffer for their insightful comments. I had forgotten Lloyd Nelson's comments on this topic. In answer to Steven Ouellette's question, there are many statisticians, and many engineers who take as an article of faith that the data have to be normally distributed or else they have to be transformed to make them approximately normally distributed prior to placing them on a process behavior chart. If you have not run into these, then count yourself fortunate.

Unlike politics, simply repeating an untruth over and over again does not make it true. I have explained my position very carefully here. If my argument is wrong, then point out where I have erred. If my argument is correct, then the result is correct, regardless of how unpalatable it may be for some. Rants are simply not acceptable as mathematical arguments.

Donald J. Wheeler, Ph.D.

Fellow American Statistical Association

Fellow American Society for Quality

## Interesting Discussion

This article is in large part an excellent summary of some of the concepts in Don's book "Normality and the Process Behavior Chart."

David, Don is not suggesting that “factory measured data will be neatly homogeneous and without outliers;” in fact, he stated that his experience is that 90% of processes will be out of control when you look at them. I’m not sure what the rest of your argument is, though. Are you really saying that the “Cp, Cpk determines if the process is in control?” Are you suggesting that the Cp or Cpk have any meaning if the control chart shows that the process is unstable? In one section of your response, it seems that you are claiming that.

Of course, you can’t rely on the control limits when there are out-of-control points or non-random patterns. You have to work to get the process stable before you can use those limits for prediction. You have to have the control limits, though, to know whether the process is stable or not. Trying to fit a distribution to a set of data that have been shown (in a control chart) to be out of control is like trying to calculate the standard deviation of a set of phone numbers…you can do it, but it doesn’t prove anything because it doesn’t mean anything. There is no meaningful distribution for a set of data that are out of control.

Frankly, I don’t know what to do with statements such as “The chart is meaningless except for the Cp, Cpk values and outliers detected. Go back and fix the common causes that are producing the outliers.” Which chart? Are you talking about a capability six-pack from Minitab? Most of the process improvement world is in agreement on one thing: common causes don’t signal as outliers. Usually we try to fix the special or assignable causes.

I agree with much of what’s in your last couple of paragraphs; I’m still not sure that there’s a need to fit a distribution. It is probably worthwhile to test for non-fit, if you’re going to use some test that requires some particular distributional assumption. It would also be useful if you plan to do some Monte Carlo simulations, and want to get parameters for a best-fit distribution in your simulations.

Steve, I don't think Don is responding to you in this article. He's more likely to be responding to Forrest Breyfogle and numerous other Six Sigma practitioners who love to test for normality and transform any non-normal data before they include them in any analysis (including control charts). While most of what Forrest has written has been aimed at transforming in the case of individuals charts, there can be little doubt that there is an over-emphasis on normality (even in individuals charts), and on transforming data generally. It's not at all difficult to find articles or presentations that will tell you to transform your data before placing it on a control chart. I have met people who have told me (and published in their courseware) that the transformation of data via the central limit theorem is "the reason that Xbar-R charts work."

This is hardly a new argument. In the Control Charts menu of Minitab, the first entry is for the Box-Cox Transformation. Why? In the help function for that transformation, it states “Performs a Box-Cox procedure for process data used in control charts.” It also then goes on to quote Wheeler and Chambers on why transformation is probably not needed, but the fact that it’s there, and stated that way, indicates that it’s not an uncommon argument.

Some people just love to transform. I recently saw (in a book purporting to be for preparation for the ASQ Six Sigma Black Belt certification exam) a logarithmic transformation, to transform a binomial proportion into a Poisson rate, so you could estimate a DPMO! So I transform a proportion non-conforming to an estimate of a Poisson rate, so I can use that to estimate a proportion non-conforming? Another clear victory for calculation over common sense!

It's not just confined to Six Sigma, though. I heard the same argument from statisticians years ago, well before Six Sigma became popular. Many statistics texts and courses don’t cover analytic studies at all; some may cover control charts at a superficial level as an afterthought; this includes texts for business and engineering stats at the undergrad and graduate levels. Most of the literature around ImR (XmR) charts and non-normality relies on the ARL argument; the ARL portion of this article is, I think, a counter to that argument. I heard Don make this argument in a presentation at the Fall Technical conference in 2000; it was well-received by a large audience of statisticians.

## Following comment is sincere!

Hi Rip,

No, I really don't think that Dr. Wheeler meant this as a response to my article. It is difficult to convey sincerity in text, I guess. But the first comment I had to my article DID assume that what I was saying was that you needed normality for using control charts, so I wanted to be clear that people did not misunderstand.

The Central Limit Theorem is quite different from a transform of the raw data, and it certainly comes to bear in SPC (the reason we don't put spec limits on an X-bar chart, for example, or how taking a bigger sample affects our ability to detect more subtle shifts, or calculating sample size all relate to the CLT). The concept of the Random Sampling Distribution (in this case of the ranges or standard deviations) is used to get the estimate of the variance of the underlying process from the within-sample dispersion, which is then used to generate the control limits for the location chart. That is the real genius of control charts - the within-variability should be the best you can hope for from the process (since the samples are sequential) and so should more closely represent the underlying process variability than, say, actually looking at the dispersion of all the data points.

I get such a kick out of walking my students through that...
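A rough sketch of that walk-through, assuming subgroups of size five and the usual tabled constants for that size (d2 = 2.326, A2 = 0.577). The subgroup values are invented, with the last two subgroups deliberately shifted upward, and the function name `xbar_r_summary` is hypothetical.

```python
from statistics import mean, stdev

D2_N5 = 2.326  # bias-correction constant d2 for subgroups of size 5
A2_N5 = 0.577  # X-bar chart scaling constant A2 for subgroups of size 5


def xbar_r_summary(subgroups):
    """Compare within-subgroup and global dispersion for size-5 subgroups."""
    xbars = [mean(g) for g in subgroups]
    ranges = [max(g) - min(g) for g in subgroups]
    grand_mean = mean(xbars)
    r_bar = mean(ranges)
    all_values = [x for g in subgroups for x in g]
    return {
        "xbars": xbars,
        # Estimate of sigma based only on variation inside the subgroups.
        "sigma_within": r_bar / D2_N5,
        # Ordinary standard deviation of all points lumped together.
        "sigma_global": stdev(all_values),
        "lcl": grand_mean - A2_N5 * r_bar,
        "ucl": grand_mean + A2_N5 * r_bar,
    }


# Hypothetical data: the last two subgroups are shifted upward by 5 units.
subgroups = [
    [10, 11, 9, 10, 12],
    [9, 10, 11, 10, 10],
    [11, 10, 9, 12, 10],
    [10, 9, 11, 10, 11],
    [15, 16, 14, 15, 17],
    [14, 15, 16, 15, 15],
]

summary = xbar_r_summary(subgroups)
flagged = [i for i, xb in enumerate(summary["xbars"])
           if xb < summary["lcl"] or xb > summary["ucl"]]
print(summary["sigma_within"], summary["sigma_global"], flagged)
```

Because the shift inflates the global standard deviation but not the subgroup ranges, the within-subgroup estimate comes out much smaller than the global one, and limits built from it flag the shifted subgroups. Limits built from the global standard deviation would be widened by the very shift they are supposed to detect.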

## Evil Statisticians

I have an article by Lloyd Nelson called "Notes on the Shewhart Control Chart". In the article he lists five statements/myths about the charts including:

c) Shewhart charts are based on probabilistic models

d) Normality is required for the correct application of an Xbar chart.

He then goes on and states, "Contrary to what is found in many articles and books, ......these statements are incorrect."

If someone as well known and respected as Dr. Nelson makes a statement like that, I take it to be true.

I wouldn't say these statisticians are evil, just misinformed. SPC was based on economic considerations, not just probability/mathematical theory.

Nelson concludes his article by stating, ".....wrongly assuming that Shewhart derived the control chart by mathematical reasoning based on statistical distribution theory. His writings clearly indicate that he did not do this."

Rich DeRoeck

## Non-Normality in Data

Surely Dr. Wheeler jests if he expects anyone to claim that factory measured data will be neatly homogeneous and without outliers and, worse, to expect a Shewhart chart with 3 sigma control limits to be sufficient and accurate enough to analyze data most of the time (97% to 98% +). Reality is, in Phase I of SPC analysis, you will get outliers and out of control points. The Cp, Cpk determines if the process is in control. If it is not, fitting any kind of distribution to the data is less meaningful because the data are unstable and the fit should be viewed only to see instability; we do not take the control limits seriously until the process is IN CONTROL. YOU CAN still fit a distribution JUST TO SEE HOW UNSTABLE IT IS IF YOU HAVE A PROGRAM TO DO IT. A statistical program WILL reveal outliers and Cp, Cpk values telling you the data is unstable. The chart is meaningless except for the Cp, Cpk values and outliers detected. Go back and fix the common causes that are producing the outliers. We are not trying to fit or establish a model here; we are merely checking for stability and control. Non-homogeneous data is only one cause of instability, not the sole cause of it. Fix the product and go back to see if the data is now stable and in control. If it is stable, NOW is the time to fit a distribution to the data. Assuming normality on all homogeneous data and "sufficing" to fit a NORMAL distribution to everything is a mistake. I disagree with Dr. Wheeler completely on this. Once outlier problems are fixed, the data is ready to be fit by THE BEST DISTRIBUTION THAT FITS IT and NOT by assuming normality on all data, even if it is homogeneous.

If you have a statistical program that fits a statistical distribution to the data, use it; it will not hurt, nor is it a waste of time. The KS and AD tests are examples of statistics that a program can use to find the best-fitting distribution. Then use the best-fit distribution to compute the two-sided or one-sided control limits. Fitting data with the best probability distribution will reduce false alarms compared with fitting normal distributions all the time. True, some data will have nearly the same control limits whichever of several distributions is fit. But in my case, having analyzed over 2000 sets of variable data, I have found that a non-normal distribution best fits the data 70% of the time. If you have an automated SPC program to compute SPC charts, Cp, Cpk, Yield (Performance), test for outliers, etc., you will be far ahead of the game. There are many examples showing that if, for example, you fit a normal distribution to lognormal data, your control limits, Cp, Cpk, and Yield will be significantly different. If any distribution fit, even a normal one, is applied to the data, outliers will skew the data significantly; the SPC may present a picture that will have some value (won't hurt), but the Cp, Cpk and outliers are the important factors to consider to explore a common cause.

We DO NOT EXPECT DATA TO BE NORMAL or homogeneous; we test data for stability and check for outliers, fix the problem(s) and regenerate the data for SPC analysis again. After no outliers are seen and the process is stable, we fit the BEST DISTRIBUTION to the data in case the data is significantly skewed and re-evaluate it for control. Once control has been established, we set those limits for Phase II SPC analysis. The Yield is simply the area under a best-fit distribution between the spec limits (assuming two-sided spec limits) and for acceptance testing it should be nearly 100%. Anything less is considered a probability of nonconformance (PNC = 100% - Yield). If PNC is significant, the product component needs rework before it goes to the customer.
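The Yield and PNC calculation described in this comment can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic lognormal quantiles rather than real measurements, the spec limits of 0.2 and 3.0 are invented, and the helper names `yield_normal`, `yield_lognormal`, and `pnc` are hypothetical. It shows the arithmetic of the claim, not whether the fitted model deserves to be trusted.

```python
from math import exp, log
from statistics import NormalDist, mean, stdev


def yield_normal(data, lsl, usl):
    """Area between the spec limits under a normal model fitted by moments."""
    model = NormalDist(mean(data), stdev(data))
    return model.cdf(usl) - model.cdf(lsl)


def yield_lognormal(data, lsl, usl):
    """Same area under a lognormal model: fit a normal to the logged data."""
    logs = [log(x) for x in data]
    model = NormalDist(mean(logs), stdev(logs))
    return model.cdf(log(usl)) - model.cdf(log(lsl))


def pnc(yield_fraction):
    """Probability of nonconformance, as a fraction: PNC = 1 - Yield."""
    return 1.0 - yield_fraction


# Synthetic, skewed data: 50 evenly spaced quantiles of a lognormal
# distribution (log-scale mean 0, log-scale sigma 0.5). Invented values.
data = [exp(NormalDist(0.0, 0.5).inv_cdf((i + 0.5) / 50)) for i in range(50)]
lsl, usl = 0.2, 3.0  # hypothetical two-sided spec limits

y_norm = yield_normal(data, lsl, usl)
y_logn = yield_lognormal(data, lsl, usl)
print(f"normal model:    yield={y_norm:.4f}  PNC={pnc(y_norm):.4f}")
print(f"lognormal model: yield={y_logn:.4f}  PNC={pnc(y_logn):.4f}")
```

For these skewed values the normal model places noticeably more area below the lower spec limit than the lognormal model does, so the two computed yields differ. Either number is only as good as the assumption that the process is stable enough for a single distribution to describe it.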

## Strawman Argument

Dr. Wheeler is, as he frequently does, attacking a strawman on this topic.

No one says data must be normal, or be transformed, before using a control chart. That would be silly, and for a very simple reason.

The purpose of a control chart is to identify a process that is out of control, and if it is, it is *by definition* not a single process and therefore by definition cannot be normal (even if it could "pass" a test for normality). As we all know, the purpose of a control chart is to identify an out-of-control process so that you can eliminate these special causes of variation, not include them in a statistical model in order to get a chart that looks like a process that is in control! Now, if your data are too chaotic you might filter out wild outliers or you might use limits calculated from the median of the dispersion metric in order to establish better control limits for the underlying process to make identifying those special causes easier, but that is another discussion.

So I'll arrogate to myself the right to state that we all agree: You do not need to have, and in fact do not expect to always get, normal data when using control charts. You can use, and expect to use, control charts in the presence of non-normality.

(There is one exception where you might correct control limits for non-normality: individuals charts. Due to their sensitivity to distribution shape, assuming normality when the distribution really isn't normal inflates the probability of making an incorrect decision by a significant but unknown amount of systematic error. Reasonable people will disagree on what to do, but when justified, I'll adjust control limits for non-normality on individuals charts using a very specific procedure to protect against systematic error inflation. Other people can live with an unknown and unquantified amount of risk, or think the adjustments add more systematic error than they remove.)

Of course, as implied by Dr. Wheeler above, you must test for normality before performing certain statistical tests (for the reasons I'll outline in this month's Six Sigma Heretic article) to make inferences, or when calculating process capability. Even then one doesn't "need" normality - there are many long-established ways to handle non-normality depending on the situation (the least attractive of which is a mathematical transformation). Just as one wouldn't use a t-test on ordinal data, there is a reason the rules are there and should be followed. I talk more about that in my column this month.

It would be unjustifiably self-centered of me to assume that Dr. Wheeler is posting this in response to my article last month ( http://www.qualitydigest.com/inside/six-sigma-column/making-decisions-no... ) and since that article was about making inferences using hypothesis testing and the lovely benefits of the Central Limit Theorem, this article really wouldn't be a response to my article anyway.

So I am assuming it is not.

For this article to be in response to it, one would have to have read the title of my article without actually reading the article, and my experience with Dr. Wheeler is that he would not do that. But I thought to clarify for others who might. Dr. Wheeler and I disagree on a few things, agree on far more, and when we disagree, he has always been courteous enough to spend the time to read the article.

I am still waiting to meet those evil statisticians who say you must have normal data before using a control chart, though. Haven't run into one yet. So I do wonder whom Dr. Wheeler is warning us against in these articles. Has anyone else run into one?