Author clarification--3/5/2015:
It appears that I somewhat overstated my case in this article. I had forgotten that there are some families of distributions where we can estimate the shape of a probability model using the statistics for location and/or dispersion. Because of this, these families get used a lot when fitting models to data, which is rather like the drunk looking for his car keys under the street light, not because that is where he lost them, but simply because that is where the light is.
Shewhart explored many ways of detecting process changes. Along the way he considered the analysis of variance, the use of bivariate plots of location and dispersion, and the idea of probability limits. In the end he settled on the use of a generic approach using symmetric three-sigma limits based on a within-subgroup measure of dispersion. This article updates and expands Shewhart’s arguments regarding probability limits.
Caveat lector
Now, the idea of using probability limits has been around since 1935, and it still has many proponents and adherents today. Here I explain why, after a lifetime of studying this topic, and with the advantages of having been mentored by both W. Edwards Deming and David S. Chambers, I do not teach people how to use probability limits.
It is my intention to persuade you to avoid using probability limits. If you are not open to being persuaded, this article is not for you. I do not wish to engage anyone in a debate, nor do I wish to raise anyone’s blood pressure. This paper explains why I teach what I teach. As always, it’s offered with the intent of helping my readers to see a better way to understand their data.
Probability limits
The problem to be addressed is the practical one of how to define limits for an observable variable, such as individual values, subgroup averages, or subgroup ranges, so that we will know when the underlying process that produced our data has changed. To do this we will need to separate any potential signals of a process change from the probable noise of routine process variation. Thus we shall need to be able to filter out the routine variation found in the data stream created by our process.
Figure 1: A probability model f(x), with points A and B bracketing the central proportion P
If we know the appropriate probability model, f(x), then we can use a straightforward approach to computing limits. We would begin by choosing some proportion, P, of the routine variation that we wish to filter out as probable noise. Our limits would then be those points A and B that define the central proportion P under the probability model, f(x), as shown in figure 1. Typically we would choose P to be close to 1.000 so that we would filter out virtually all of the routine variation. Of course, the usual way of expressing the relationships shown in figure 1 is by means of the integral equation:

∫_A^B f(x) dx = P
Thus, using the elements of this integral equation we can summarize the probability approach to process behavior charts in the following manner: For a given probability model f(x), and for a given proportion P, find the specific critical values A and B. Since such limits depend upon the probability P, values of A and B found in this way are known as probability limits.
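To make this concrete, if the model really were known, computing probability limits would take only a couple of lines of code. The sketch below is a minimal illustration assuming Python with scipy.stats; the gamma model and the value of P are my own arbitrary choices, not anything taken from the article.

```python
# A minimal sketch of computing probability limits when f(x) is known.
# The gamma model and the value of P are illustrative assumptions.
from scipy import stats

f = stats.gamma(a=4, scale=1.0)   # the "known" probability model f(x)
P = 0.9973                        # proportion of routine variation to filter out

A = f.ppf((1 - P) / 2)            # lower probability limit
B = f.ppf((1 + P) / 2)            # upper probability limit

print(f"A = {A:.3f},  B = {B:.3f}")
print(f"central area between A and B = {f.cdf(B) - f.cdf(A):.4f}")   # recovers P
```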
Shewhart identified this approach on page 275 of Economic Control of Quality of Manufactured Product (American Society for Quality Control, 1980). This approach is consistent with what is typically done in statistical inference, and is well understood by statisticians. Having thus defined probability limits, Shewhart continued: “For the most part, however, we never know [the probability model] in sufficient detail to set up such limits.”
At this point Shewhart leaves the probability approach behind and presents a different way of handling the integral equation above. Rather than fixing the value for P in advance, he uses the Chebyshev inequality as an existence theorem to argue that we can use fixed, generic values A and B that will automatically result in a value for P that is close to 1.00 regardless of what probability model f(x) is used.
Notice that Shewhart’s approach to the integral equation is the exact opposite of the probability approach. The probability approach fixes the value for P and finds critical values for a specific probability model. Shewhart’s approach fixes the critical values A and B and lets P vary so that the limits will work for any probability model. So, while the probability approach requires that you start out with a specific probability model, Shewhart’s approach does not.
Shewhart’s argument is that as long as P remains reasonably close to 1.00, we will end up making the right decision essentially every time. In general, whenever a procedure has a value for P that is larger than 0.98, that procedure is considered to be conservative. And as we shall see, Shewhart’s symmetric three-sigma limits are sufficiently conservative to be completely general and nonspecific.
“Well, that was in 1931,” you may be thinking. “Today we have computer software that will identify a probability model for us.”
Okay, let’s consider how that works.
From data to probability model
Since we have to identify a specific probability model in order to compute probability limits, we need to look at how software can “fit” a model to a data set. The software cannot look at the histogram and make a judgment, so it will have to use values computed from the data, i.e., statistics.
The average statistic will tell us everything there is to know about the location of the data set, so we use the average to characterize location. The standard deviation statistic will tell us everything there is to know about the dispersion of the data set, so we use the standard deviation statistic to characterize dispersion. But characterizing the location and dispersion is not enough to specify a particular probability model. Figure 2 shows six probability models that all have the same mean and standard deviation, and yet they have radically different shapes.
Figure 2: Six probability models having the same mean and standard deviation
So, the first lesson of figure 2 is that knowing the location and dispersion is not sufficient to determine the shape of the probability model. The second lesson is that generic, three-sigma limits will cover virtually all of a probability model regardless of its shape.
In order for any distribution to get even a tiny bit of area out beyond three sigma, it has to compensate for the increased rotational inertia by concentrating a much larger amount of area close to the mean. (The variance behaves like a rotational inertia about the mean: stretching a tail farther out requires the rest of the distribution to move in toward the mean if the standard deviation is to stay the same.) This can be seen in figure 2 by starting at the bottom. Figure 3 gives the areas beyond three sigma and the areas within one sigma for the six distributions of figure 2.
Figure 3: Areas beyond three sigma and within one sigma for the six distributions of figure 2
• For the lognormal (1, 0.25) to get an extra 5 parts per thousand (ppt) outside three sigma (beyond what the normal has) it has to compensate by increasing the area within one sigma by 17 parts per thousand.
• For the chi-square with 4 degrees of freedom to get an extra 6 ppt outside three sigma it has to compensate by increasing the area within one sigma by 43 ppt.
• For the Student's t-distribution with 6 degrees of freedom to get an extra 7 ppt outside three sigma it has to compensate by increasing the area within one sigma by 50 ppt.
• For the exponential to get an extra 15 ppt outside three sigma it has to compensate by increasing the area within one sigma by 182 ppt.
• Finally, the lognormal (1, 1) only has 15 parts per thousand more area outside three sigma than the normal, but it has 227 ppt more area within one sigma of the mean.
Thus, compared to the normal distribution, any increase in the infinitesimal areas out beyond three sigma will require a much larger compensating increase in the area within one sigma of the mean. This is an unavoidable consequence of using rotational inertia to characterize dispersion. There is much more to a skewed distribution than merely having an elongated tail. No matter how much you may stretch that tail, you are going to stretch sigma at essentially the same rate. In consequence, no mound-shaped distribution can ever have more than 1.9 percent outside the mean ± three sigma.
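These areas are easy to recompute. Here is a minimal sketch, assuming Python with scipy.stats; the parameterizations are my own guesses at the models behind figure 2, so the exact parts-per-thousand values may differ a little from figure 3, but the compensation pattern is the same.

```python
# A sketch of the figure 3 computation: the area beyond mean +/- 3 sigma
# and the area within mean +/- 1 sigma for several models.  The specific
# parameterizations are assumptions made for illustration.
import numpy as np
from scipy import stats

models = {
    "normal":              stats.norm(),
    "lognormal (1, 0.25)": stats.lognorm(s=0.25, scale=np.exp(1)),
    "chi-square (4 df)":   stats.chi2(df=4),
    "Student's t (6 df)":  stats.t(df=6),
    "exponential":         stats.expon(),
    "lognormal (1, 1)":    stats.lognorm(s=1.0, scale=np.exp(1)),
}

for name, d in models.items():
    mu, sigma = d.mean(), d.std()
    beyond_3 = d.sf(mu + 3 * sigma) + d.cdf(mu - 3 * sigma)
    within_1 = d.cdf(mu + sigma) - d.cdf(mu - sigma)
    print(f"{name:20s} beyond 3 sigma: {1000 * beyond_3:5.1f} ppt"
          f"   within 1 sigma: {1000 * within_1:5.1f} ppt")
```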
So, as may be seen from figure 3, the use of any mound-shaped or J-shaped model with greater kurtosis than the normal will impose a requirement that more of the observations fall within one standard deviation of the mean. Thus, if all you have are measures of location and dispersion, then the absolute worst case probability model that you can fit to your data is a normal distribution. For a given mean and standard deviation, the normal distribution is the distribution of maximum entropy. It spreads the middle 90 percent of the probability out to the maximum extent possible, so that the outer 10 percent of a normal distribution is as far away from the mean as, or farther than, the outer 10 percent of any other probability model.
Read the above paragraph again. It is completely contrary to what many students of statistics think, yet with the computing power we have today it is easy to verify. For more information see my articles “What They Forgot to Tell You About the Normal Distribution” (Quality Digest Daily, Sept. 4, 2012) and “The Heavy-Tailed Normal” (Quality Digest Daily, Oct. 1, 2012).
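One easy check on that claim is to measure how many sigma the middle 90 percent of each model spans. The sketch below, assuming Python with scipy.stats and a handful of models of my own choosing, shows the normal spanning about 3.29 sigma while the others come out narrower; it is a spot check, not a proof.

```python
# A spot check of the "normal spreads its middle 90 percent the furthest" claim.
# Model choices are my own; this is an illustration, not a proof.
from scipy import stats

models = {
    "normal":            stats.norm(),
    "uniform":           stats.uniform(),
    "exponential":       stats.expon(),
    "chi-square (4 df)": stats.chi2(df=4),
    "lognormal (0, 1)":  stats.lognorm(s=1.0),
}

for name, d in models.items():
    span = (d.ppf(0.95) - d.ppf(0.05)) / d.std()   # width of the central 90% in sigma units
    print(f"{name:18s} middle 90 percent spans {span:.2f} sigma")
```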
Finding a shape for your probability model
So how can your software determine a shape for your probability model? As we saw in figure 2, estimates of location and dispersion will not suffice. Therefore, absolutely the only way your software can fit a non-normal model to your data is to use the shape statistics of skewness and kurtosis. Whether you are aware of this or not, your software has no alternative.
In most cases your software will ask you to choose some family of probability models, and then, based on your data, the software will pick an appropriate member from that family of distributions. Three commonly used families of distributions are the lognormals, gammas, and Weibulls. Figure 4 shows each of these three families of distributions on the shape characterization plane in the broader context of all mound and J-shaped distributions.
Figure 4: The lognormal, gamma, and Weibull families shown on the shape characterization plane
The lognormal, gamma, and Weibull families are shown as lines in figure 4 because they each have only a single shape parameter. To fit these models your software will use some algorithm to estimate the single shape parameter for the model using the skewness and kurtosis statistics of your data. It may not tell you that it is doing this, but it is. It may use some fancy name for the algorithm such as “maximum likelihood,” or “least squares,” or “minimum variance unbiased,” but in the end it absolutely, positively has to make use of the shape statistics to choose between the various models. It cannot do otherwise.
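As an illustration of what such a fit looks like in practice, here is a minimal sketch assuming Python with scipy.stats and a Weibull family fitted by maximum likelihood; the data are simulated stand-ins. Watch how the single fitted shape parameter responds when the one largest value is made more extreme, which is the point developed in the next section.

```python
# A sketch of fitting a one-shape-parameter family (here a Weibull) to data
# by maximum likelihood, assuming Python with scipy.stats.  The data below
# are simulated stand-ins, not anything from the article.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.gamma(shape=3.0, scale=2.0, size=50)      # hypothetical positive-valued data

c, loc, scale = stats.weibull_min.fit(data, floc=0)  # MLE with the location fixed at zero
print(f"fitted Weibull shape = {c:.2f}, scale = {scale:.2f}")

# Stretch the single largest value and refit: the shape parameter responds.
data[np.argmax(data)] *= 2
c2, _, scale2 = stats.weibull_min.fit(data, floc=0)
print(f"after stretching the largest value: shape = {c2:.2f}, scale = {scale2:.2f}")
```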
In the following examples I will use Burr and Beta probability models to fit my data because these families each have two shape parameters. This will allow models from anywhere in the mound-shaped or J-shaped regions of figure 4 to be used. By fitting both skewness and kurtosis separately we can obtain very close fits between the data and the probability models without imposing any presuppositions as to which family of probability models is appropriate. We begin with a simple set of 25 values. These values were all generated from the same probability model using a random number generator in Excel.
Figure 5: Twenty-five values generated from a single probability model
The histogram for these 25 values is shown in figure 6. This histogram has an average of 0.10, a standard deviation statistic of 1.01, a skewness statistic (Excel formula) of 2.14, and a kurtosis statistic (Excel formula) of 6.78. (To see the formulas for these shape statistics, see my article “Problems with Skewness and Kurtosis, Part Two,” Quality Digest Daily, Aug. 2, 2011.)
Figure 6: Histogram of the 25 values of figure 5
Both skewness and kurtosis statistics are highly dependent upon the extreme values of the data set. This can be seen without having to get involved in the formulas: simply move the most extreme value and watch how the skewness and kurtosis change. Since we have a very large extreme value here, I will move it closer to the mean. As long as the value you are moving is the most extreme value, each change will have a pronounced effect upon both the skewness and the kurtosis. As soon as the value you are changing is no longer the most extreme value the skewness and kurtosis statistics will stabilize. Figure 7 shows several such modifications of the data from figure 5, along with the first four descriptive statistics for each set.
Figure 7: The data of figure 5 with successive modifications of the extreme value, along with their descriptive statistics
Notice how the skewness and kurtosis statistics barely change between the bottom two data sets, unlike all preceding changes.
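A few lines of code make this easy to see for yourself. The sketch below assumes Python with scipy.stats and uses 24 simulated routine values plus one deliberately extreme value as a stand-in for the data of figure 5; as the extreme value is pulled in toward the mean, the bias-corrected skewness and kurtosis statistics (essentially the adjusted formulas Excel uses) change sharply, and they settle down once that value is no longer the most extreme one.

```python
# A sketch of how the skewness and kurtosis statistics chase the most extreme value.
# These 25 values are simulated stand-ins, not the actual data of figure 5.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = np.append(rng.standard_normal(24), 3.8)      # 24 routine values plus one extreme value

for value in [3.8, 3.0, 2.2, 1.5]:
    y = x.copy()
    y[-1] = value                                # pull the extreme value toward the mean
    skew = stats.skew(y, bias=False)             # bias-corrected shape statistics
    kurt = stats.kurtosis(y, bias=False)
    print(f"extreme value = {value:3.1f}   skewness = {skew:5.2f}   kurtosis = {kurt:5.2f}")
```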
Figure 8 shows fitted probability models that match each of the histograms in figure 7. Without going into the details for each model fitted, the information in figure 8 identifies each fitted model and gives the skewness and kurtosis parameters for that model. Each model was then location- and scale-shifted to match the average and standard deviation for the fitted data.
Figure 8: Probability models fitted to the histograms of figure 7
Clearly, both the skewness and kurtosis statistics are heavily dependent upon the extreme value(s) in your data set. As a result, the shape of your fitted model will also be heavily dependent upon the extreme values rather than upon the overall “shape” of the histogram. The seven models shown in figure 8 do their best to accommodate the largest value, and they do this with very little regard for the other 24 values. (The other 24 values primarily determine the location and dispersion, but they have little effect upon the shape statistics.)
It is this heavy dependency of both skewness and kurtosis statistics upon the extreme values of your data that effectively undermines any attempt to obtain a meaningful fit between a probability model and a data set. Your algorithm may get a model that fits your data very nicely, as we did seven times in figure 8, but more than anything else, your model will almost certainly be fitting the extreme values in the tails of your data.
Remember, the first histogram is the actual data set in this case. So, is the model for the first histogram correct? The model has the right mean; it has the right standard deviation; it has the right skewness; and it has the right kurtosis; and yet the histogram has three values that would be impossible to observe if the fitted model were correct. Since your data always trump your model, we have to conclude that the J-shaped model is incorrect.
“So what can we do?” you ask. Whether you try to use the shape statistics directly, or indirectly through some algorithm in your software, you will end up fitting the most extreme values. If you restrict yourself to using only the location and dispersion statistics, then the generic, worst-case model is the normal distribution with that location and dispersion. Compare figure 9 with the models in figure 8. The normal distribution does a good job on the 24 observations, and it reveals the largest value to be an outlier.
Figure 9: A normal distribution fitted using only the location and dispersion statistics
While the 25 values here were all obtained from one and the same probability model (a standard normal distribution), this set of 25 values was the most extreme set out of 10,000 such sets generated by the random number generator. So figure 9 tells the correct story here. The value of 3.81 is simply one of those very rare values from a standard normal distribution that fall in the region beyond three sigma.
Summary
Thus, the notion of probability limits is based upon the assumption that we can fit a probability model to our data and then find the exact critical values A and B that will yield a predetermined value for P. However, as we have seen in figure 8, any model that we end up fitting to our data will be highly dependent upon the extreme value(s) in our data. This will, in turn, severely affect the critical values, A and B, which will affect both the results and the interpretation of those results. Ultimately, the problem here is that all of the models in figure 8 assume that the 25 values are homogeneous. The model in figure 9 shows that the extreme value is likely to be an outlier, and this outlier will always skew any probability model fitted to these data.
So why don’t we simply eliminate the outliers before fitting the model? Beautifully simple, yet as soon as we adopt this approach the question becomes: How do you identify an outlier? The process behavior chart, with its generic, three-sigma limits, is an operational definition of what constitutes an outlier! (It defines an outlier, it gives us a procedure for detecting outliers, and it allows us to judge whether a specific point is, or is not, likely to be an outlier.) All other definitions of outliers end up being more conservative than the process behavior chart simply because they are based on the total variation within the data set. So, if we have to delete the outliers in order to fit a model to our data before we can compute the “correct” probability limits for our process behavior chart, then we are indeed without hope.
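For a chart for individual values, this operational definition takes only a few lines to compute: the natural process limits are the average plus and minus 2.66 times the average moving range (2.66 being 3 divided by the bias-correction factor d2 = 1.128 for moving ranges of two values). The sketch below assumes Python and simulated data; any point outside the limits is flagged as a potential signal.

```python
# A sketch of the generic three-sigma limits of an XmR (individuals) chart,
# used here as an operational definition of an outlier.  The data are simulated.
import numpy as np

rng = np.random.default_rng(3)
x = np.append(rng.standard_normal(24), 3.8)   # stand-in data with one extreme value

average = x.mean()
mr_bar = np.abs(np.diff(x)).mean()            # average two-point moving range
upper = average + 2.66 * mr_bar               # 2.66 = 3 / d2, with d2 = 1.128
lower = average - 2.66 * mr_bar

flagged = x[(x > upper) | (x < lower)]
print(f"natural process limits: {lower:.2f} to {upper:.2f}")
print(f"values flagged as potential signals: {flagged}")
```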
Moreover, the outliers that get deleted are exactly those signals of changes in our process that we want to detect. Removing outliers from the analysis changes the focus of the analysis from finding and fixing problems to getting a pretty picture from our data. (For those who wish to fit a model to the data in order to estimate the fraction nonconforming, that topic is discussed in my article “Estimating the Fraction Nonconforming,” Quality Digest Daily, June 1, 2011.)
Thus, we return to Shewhart’s statement: “For the most part, however, we never know [the probability model] in sufficient detail to set up such [probability] limits.” While software may blind us to this insufficiency, it does not remove it. This lack of information is not a problem that can be cured by computations. Instead of trying to determine probability limits, why not use Shewhart’s proven approach? As we saw in figures 2 and 3, symmetric, three-sigma limits are sufficiently conservative to work with all types of probability models. Moreover, they are robust enough to work with data that are not homogeneous. They are not unduly disturbed by the extreme values in your data.
In the words of my friend Bill Scherkenbach, “The only reason to collect data is to take action.” You need to separate the probable noise from the potential signals, and the symmetric three-sigma limits of process behavior charts will do this with sufficient generality and robustness to let you take appropriate action. Computing probability limits is all about getting exactly the right false alarm rate. Using process behavior charts is all about detecting the signals of process changes. Since there are generally many more signals to be found than there are false alarms to be avoided, the use of probability limits is focused on the wrong aspect of the decision problem regarding when to take action.
So, if your data happen to come from a process that is being operated predictably (a rare thing), and if you have hundreds or thousands of data without any unusual values (another rare thing), then your probability limits might work as well as Shewhart’s generic, symmetric three-sigma limits. But if not, then your probability limits can result in your missing signals or reacting to noise, and thereby taking the wrong actions. Working harder to implement a more complex solution that will only occasionally work as well as a simpler one does not make sense.
Caveat computor.
Postscript
While there are many processes out there that are very nicely modeled using gamma and Weibull and lognormal distributions, etc., this is no excuse for using these models in analysis. The primary question of analysis is “How is your process behaving?” To answer this question you will need to actually examine your process behavior, rather than acting like the mother of the defendant and claiming that your process “wouldn’t dream of misbehaving.” When we model a process we may well use an appropriate probability distribution. But when we analyze data we need to listen and let the data speak for themselves. In this world, your data are never generated by a probability model; they are always generated by some process. And those processes, like everything in this world, are always subject to change. “Has a change occurred?” is a question that can never be answered by “Assume a model…”
Comments
Modelling "when to take action"
Dr. Wheeler,
Thanks for a very well-written and informative article. I have often wondered why only the normal distribution model is used in control charting (or process behavior charting). This indeed looks like a pragmatic approach embraced by engineers rather than statisticians. It seems that we often want to over-complicate things, but often, from a practical perspective, it's not necessary. BTW, I attended Bill Scherkenbach's workshop and he often says what you said in the article, "the only reason to collect data is to take action". I would contend that there are other reasons, such as determining when action is needed and what kind of action is needed. Charting can definitely help with that too.
Very clear article--for those with an open mind
If my memory serves me well, I remember a DEN (Deming Electronic Network) post by Don many years ago explaining how Shewhart had turned the traditional statistical model on its head to set the process limits. Like his latest article, it was very enlightening. Deming said in 1980 that it may well take 50 years for people (read statisticians) to "get" Shewhart. I think he was off by a lot!
Rich
Great point, Rich!
Your point is very well-taken, Rich. I think Deming was so optimistic because he expected that--with all the momentum at the time--universities would begin to teach courses that included analytic studies and SPC. It just hasn't happened. Statistics is still firmly entrenched in the school of mathematics, and business statistics courses are mostly about tests of hypotheses, how many transistors will be defective if I select three from a box containing 17 good ones and 8 bad ones, and (in the more advanced ones) risk analysis and correlation. How, then, do we advance these ideas?
I have often thought that--while there are still some alive--we should get some of the most-recognized people in this milieu to form an "Analytic Studies" association, with its own peer-reviewed journal.
When you MUST use probability limits
"But characterizing the location and dispersion is not enough to specify a particular probability model" is entirely correct, You can, for example, estimate the shape and scale parameters for a gamma distribution from its average and standard deviation, but it is better to use the maximum likelihood estimate, which is how StatGraphics and Minitab do it. It is also how I learned to do it in a graduate course on reliability statistics.
This is of course not a criticism of Deming or Shewhart; to say they were "wrong" for not doing this would be like asking why Joseph Lister (the father of antiseptic surgery) did not also use fiber optic surgery rather than open incisions, or asking why Edward Jenner did not develop a polio vaccine to go with his smallpox vaccine. The technology was simply not available at the time, and these two doctors (like Deming and Shewhart) did better than any of their contemporaries with the technology that was available. Now, however, we have the technology, so we should leverage it where appropriate, and Dr. Wheeler's article points out correctly that it is not universally appropriate.
If you haven't a clue as to what the real distribution is, you can't fit a valid model. I share Dr. Wheeler's dislike of just using the variance, skewness, and/or kurtosis to fit some kind of artificial model, and I don't really trust the Johnson distributions--the model may fit, but you don't really know what you are getting. Remember that tests for goodness of fit never PROVE you have a normal distribution, or whatever distribution you are trying to use; all they can do is prove beyond a reasonable doubt (the Type I risk) that the distribution is a poor fit.
Also, if you have a bimodal distribution, or another situation in which assignable causes are operating, you can't set up an SPC chart, or quote a process performance index, because you already know you do not have a stable process. That is, if I have a bimodal process, I don't need a point outside a control limit to tell me there is an assignable cause. If the data are riddled with outliers, some, if not all, of them will also be outside the control limits but, again, I already know there is a problem. Assignable causes are present that must be removed before I can say anything about the process parameters, or the capability.
If, on the other hand, you do know the underlying distribution, e.g., from experience--it is known in the electronics industry, for example, that failure times follow the exponential distribution when the hazard rate is constant, and there is good evidence that impurities follow the gamma distribution--you MUST use that distribution to estimate the process performance index, at least as far as the AIAG's SPC manual is concerned. The consequence of using the normality assumption for an SPC chart might be a false alarm risk 10 or 20 times the expected 0.00135 at one of the control limits, so the worst that will happen is that you chase false alarms. The consequence of using it to estimate the nonconforming fraction, however, is being off by orders of magnitude--as in, "Your centered 'Six Sigma' process is sending us 500 defects per million opportunities!" (versus one part per billion at each specification limit, if there are indeed two limits).
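To put rough numbers on that asymmetry, the sketch below (assuming Python with scipy.stats, and using a gamma distribution with shape parameter 2 as an arbitrary stand-in for a skewed process) compares the tail area beyond the mean plus three sigma with the normal-theory 0.00135, and then repeats the comparison at a limit six sigma above the mean. For this particular gamma the false-alarm risk comes out roughly ten times the normal value, while the area beyond the six-sigma point is larger by several orders of magnitude.

```python
# A rough sketch of the false-alarm versus nonconforming-fraction comparison.
# The gamma shape parameter of 2 is an arbitrary stand-in for a skewed process.
from scipy import stats

normal = stats.norm()
skewed = stats.gamma(a=2)
mu, sigma = skewed.mean(), skewed.std()

# Tail area beyond the upper three-sigma limit (the false-alarm risk)
print(f"beyond mean + 3 sigma:  normal {normal.sf(3):.5f}   gamma {skewed.sf(mu + 3 * sigma):.5f}")

# Tail area beyond a limit six sigma above the mean (a stand-in for a spec limit)
print(f"beyond mean + 6 sigma:  normal {normal.sf(6):.2e}   gamma {skewed.sf(mu + 6 * sigma):.2e}")
```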
The EPA also seems very focused on the use of the gamma distribution to model environmental data. http://www.epa.gov/osp/hstl/tsc/ProUCL_v3.0_fact.pdf
"ProUCL also computes the UCLs of the unknown population mean based upon the positively skewed gamma distribution, which is often better suited to model environmental data sets than the lognormal distribution. For positively skewed data sets, the default use of a lognormal distribution often results in impractically large UCLs, especially when the data sets are small." ProUCL may in fact be free, but I couldn't install it on my last computer due to an older operating system.
The AIAG manual also discusses transformations, but I do not trust them--especially when we talk about ppm or parts per billion (Six Sigma, no process shift) nonconforming fractions. The transformations actually work better when quality becomes worse, just like normal approximations for defects and nonconformances. I did a semilog plot of the nonconforming fraction versus the process performance index for the gamma distribution and the cube root transformation (which is recommended for the gamma distribution), and the two lines diverged visibly in the low ppm and high ppb regions. In the parts per thousand region, though, they were essentially indistinguishable. Parts per thousand nonconforming also means a non-capable process.
Since you MUST use the actual distribution (again stipulating that it can be identified from past experience and/or the nature of the process, and also that it is then not rejected by tests for goodness of fit) to estimate the nonconforming fraction, why not use it to set SPC control limits as well?
Really?
In the fourth paragraph of your commentary you argue that we cannot create a process behavior chart unless our process is already being operated predictably. If that were true, we would not be writing about this topic. (I call this Myth Four.) The objective of a process behavior chart is not estimation, but rather the characterization of the process behavior. We can make this characterization with imperfect limits, and the computations are sufficiently robust that we can get usable limits from imperfect data. In the words of Shewhart in his rebuttal of E. S. Pearson on exactly this point: "We are not concerned with the functional form of the universe [i.e. the probability model], but merely with the assumption that a universe exists [i.e. that the process is being operated predictably]."
I once thought I knew what SPC was all about. I had read Grant. I had taught the course. Then my mentor told me that I needed to read Shewhart. I did, and discovered there was more to it than I had thought. Then about a year later when I thought I really knew what SPC was all about, my mentor told me I needed to reread Shewhart all over again. I ended up rewriting my class notes five times in five years as I kept discovering that SPC was much more profound than it appears at first. This is why I am convinced the greatest obstacle to understanding SPC is an education in statistics. Other statisticians have written me that they have had the same experience.
My experience
Don,
I suppose I should describe my own experience, noting especially "the greatest obstacle to understanding SPC is an education in statistics." My original education was in chemical engineering, where I learned the obvious desirability of feedback process control. These controls are automated and, because they work on continuous processes, are generally better than SPC.
Then I worked for IBM, which made discrete computer parts rather than products that flow and pour. I got a night school MBA initially, and then also an M.S. in applied statistics. When I learned about SPC, I was delighted. "I can use this on discrete processes the way feedback process controls work for continuous processes!" I eagerly took what I had learned in the classroom to the factory, only to discover that the data were like nothing in the textbook examples. They didn't follow a bell curve, and too many points were outside the control limits. I did, however, have to take two courses in statistical theory--I never liked theory, but the presentation of distributions other than the normal caught my attention very quickly. It was clear that they could be used to model continuous distributions that were not bell-shaped. I also recall running up against the rational subgroup issue at IBM, and realizing that you have to sort the within-batch variation from the (usually larger) between-batch variation. When I went to Harris Semiconductor in 1993, I found that the people there were already addressing this issue.
My understanding is that the normality assumption is generally robust enough for practical purposes--and even a gamma distribution with a sufficiently large shape parameter will, I think, begin to look like a bell curve due to the central limit theorem. So 3-sigma limits are admittedly viable even for many non-normal situations as far as SPC is concerned, and even more so when a sample average is involved.
I don't see any way around the use of the actual distribution (again, if it can be identified) to get the process performance index, though. Remember that, for SPC, we are talking about a nominal false alarm risk of 0.00135. Nonnormality may increase this risk by a factor of 5, 10, or maybe 20. The worst that will happen is that we chase more false alarms than we expect. For the nonconforming fraction, on the other hand, we are dealing with ppm or, ideally, ppb, and this is the region in which the effects of non-normality really begin to make themselves felt--as in 1000, 10,000, or even 100,000 times as many nonconformances as the normal distribution predicts. That is, we can have a nominally Six Sigma process (Ppk = 2.0) that is in fact not even capable (Ppk < 1.33) in terms of the nonconforming fraction.
Distributions everywhere and not a drop....
If I recall my Stat 101 courses from 25 years ago, the two dozen or so most-used distributions ALL have certain assumptions which define the conditions of the distribution. I have yet to find the most-used distributions collected in one place with ALL their assumptions listed and explained. I accept that some distribution assumptions would change with each parameter change. That explanation would be interesting.
Statisticians always want to assume Normal. Why? Why not Gamma, as Levinson writes? Shewhart stuck in my mind because he did not assume normal. I have read that some control charts assume Binomial and some assume Exponential. What are the listed assumptions? What does their usage buy you as opposed to using no distribution at all?
Assumptions
The p, np, c, and u charts all assume that nonconformances and defects follow the binomial and Poisson distributions respectively. This is a scientifically justifiable assumption. E.g., the chance of one item being nonconforming is reflected by the Bernoulli distribution, as I recall, and the binomial is simply an expression for numerous Bernoulli trials. In addition, the hypergeometric reduces to the binomial as the population becomes infinite. That means that, if your process is stable, these distributions will indeed reflect common cause attribute data.
The normal approximations for these distributions (the basis for the traditional control chart), by the way, improve as quality gets worse: no fewer than four, and preferably five or six, expected nonconformances or defects in each sample.
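As a quick illustration of that rule of thumb, the sketch below (assuming Python with scipy.stats; the sample sizes and fraction nonconforming are arbitrary) compares the exact binomial tail area beyond the usual p-chart upper limit with the nominal 0.00135, once with about one expected nonconformance per sample and once with about six.

```python
# A sketch comparing the exact binomial tail beyond the normal-theory p-chart
# upper limit with the nominal 0.00135.  Sample sizes and p are arbitrary choices.
import math
from scipy import stats

p = 0.01
for n in [100, 600]:                            # about 1 and 6 expected nonconformances
    ucl = p + 3 * math.sqrt(p * (1 - p) / n)    # normal-theory upper control limit
    k = math.floor(n * ucl)                     # signals occur at counts above k
    tail = stats.binom.sf(k, n, p)              # exact false-alarm probability
    print(f"n = {n:3d}  expected count = {n * p:.0f}  P(count above UCL) = {tail:.4f}")
```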