The second principle for understanding data is that some data contain signals; however, all data contain noise. Therefore, before you can detect the signals you will have to filter out the noise. This act of filtration is the essence of all data analysis techniques. It is the foundation for our use of data and all the predictions we make based on those data. In this column we will look at the mechanism used by all modern data analysis techniques to filter out the noise.
Given a collection of data it is common to begin with the computation of some summary statistics for location and dispersion. Averages and medians are used to characterize location, while either the range statistic or the standard deviation statistic is used to characterize dispersion. This much is taught in every introductory class. However, what is usually not taught is that the structures within our data will often create alternate ways of computing these measures of dispersion. Understanding the roles of these different methods of computation is essential for anyone who wishes to analyze data.
Perhaps the most common type of structure for a data set is to have k subgroups of size n where the n values within each subgroup were collected under the same set of conditions. This structure is found in virtually all types of experimental data and in most types of data coming from a production process. To illustrate the alternate ways of computing measures of dispersion we shall use a simple data set consisting of k = 3 subgroups of size n = 8 as shown in figure 1.
Figure 1:
Data set one with method one
The first method of computing a measure of dispersion is the method taught in introductory classes in statistics. All of the data from the k subgroups of size n are collected into one large group of size nk and a single dispersion statistic is found using all nk values. This dispersion statistic is then used to estimate a dispersion parameter such as the standard deviation for the distribution of X, SD(X).
As shown in figure 3, the range of all 24 values is 6. The bias correction factor for ranges of 24 values is 3.895. Dividing 6 by 3.895 yields an unbiased estimate of the standard deviation of the distribution of X of 1.540.
The global standard deviation statistic is 1.551. The bias correction factor for this statistic when it is based on 24 values is 0.9892. Dividing 1.551 by 0.9892 yields an unbiased estimate of the standard deviation of the distribution of X of 1.568.
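The individual values are not reproduced in the figures here, so the sketch below uses a hypothetical data set constructed to match the summary statistics quoted in the text (subgroup averages of 5.0, 4.0, and 5.0; subgroup ranges of 5, 5, and 3; global range of 6; global standard deviation of 1.551). The bias correction factors of 3.895 and 0.9892 for n = 24 are those given in the text.

```python
import statistics

# Hypothetical data set one: k = 3 subgroups of size n = 8, chosen to
# reproduce the summary statistics quoted in the text (not the author's
# original data).
data_set_one = [
    [3, 3, 4, 5, 5, 6, 6, 8],   # subgroup 1: average 5.0, range 5
    [2, 2, 3, 4, 4, 5, 5, 7],   # subgroup 2: average 4.0, range 5
    [3, 4, 4, 5, 6, 6, 6, 6],   # subgroup 3: average 5.0, range 3
]

# Method one: pool all nk = 24 values and compute global dispersion statistics.
pooled = [x for subgroup in data_set_one for x in subgroup]

global_range = max(pooled) - min(pooled)    # 6
global_stdev = statistics.stdev(pooled)     # about 1.551

# Divide by the bias correction factors for n = 24 to obtain unbiased
# estimates of SD(X).
est_from_range = global_range / 3.895       # about 1.540
est_from_stdev = global_stdev / 0.9892      # about 1.568
```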
Figure 3:
Since the original data are given to the nearest whole number, there is no practical difference between the two estimates of SD(X) shown in figure 3. Whether we use the range or the standard deviation statistic will not substantially affect our analysis.
Data set one with method two
While method one ignores the subgroups, method two respects the subgroup structure within the data. Here we calculate a dispersion statistic for each subgroup. These separate dispersion statistics are then averaged, and the average dispersion statistic is used to form an unbiased estimate for the standard deviation parameter of the distribution of X.
Figure 4:
Using data set one, we compute a dispersion statistic for each of the three subgroups. Because the subgroups are all the same size we can average the statistics prior to dividing by the common bias correction factor.
As shown in figure 5, the subgroup ranges are respectively 5, 5, and 3. The average range is 4.333, and the bias correction factor for ranges of eight data is 2.847. Dividing 4.333 by 2.847 we estimate the standard deviation for the distribution of X to be 1.522.
The subgroup standard deviation statistics are respectively 1.690, 1.690, and 1.195. The average standard deviation statistic is 1.525, and the bias correction factor is 0.9650. Dividing 1.525 by 0.9650 we estimate the standard deviation for the distribution of X to be 1.580.
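Method two can be sketched the same way, again using a hypothetical data set constructed to match the subgroup statistics quoted in the text; the bias correction factors of 2.847 and 0.9650 for subgroups of size eight are those given above.

```python
import statistics

# Hypothetical data set one (matches the summary statistics in the text).
data_set_one = [
    [3, 3, 4, 5, 5, 6, 6, 8],   # subgroup 1: average 5.0, range 5
    [2, 2, 3, 4, 4, 5, 5, 7],   # subgroup 2: average 4.0, range 5
    [3, 4, 4, 5, 6, 6, 6, 6],   # subgroup 3: average 5.0, range 3
]

# Method two: compute a dispersion statistic within each subgroup,
# then average those k statistics.
subgroup_ranges = [max(sg) - min(sg) for sg in data_set_one]     # [5, 5, 3]
subgroup_stdevs = [statistics.stdev(sg) for sg in data_set_one]  # about [1.690, 1.690, 1.195]

avg_range = sum(subgroup_ranges) / 3    # about 4.333
avg_stdev = sum(subgroup_stdevs) / 3    # about 1.525

# Divide the averaged statistics by the bias correction factors for n = 8.
est_from_ranges = avg_range / 2.847     # about 1.522
est_from_stdevs = avg_stdev / 0.9650    # about 1.580
```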
Figure 5:
As before, there is no practical difference between the two estimates shown in figure 5. Neither is there any practical difference between the estimates in figure 3 and those in figure 5. The four estimates obtained using the two different measures of dispersion and the two different methods are all very similar.
Data set one with method three
The third method will probably seem rather strange. It is certainly indirect. Instead of working with the individual values as the first two methods do, the third method works with the subgroup averages. These subgroup averages are used to obtain a dispersion statistic, and this dispersion statistic is then used to estimate the standard deviation parameter of the distribution of X.
Figure 6:
For data set one the subgroup averages are respectively 5.0, 4.0, and 5.0. The range of these three averages is 1.00. The bias correction factor for the range of three values is 1.693. Since each of these averages represents eight original data, we will have to multiply by the square root of 8 and divide by the bias correction factor to estimate the standard deviation parameter for the distribution of X. When we do this with the values above we obtain an estimate of SD(X) of 1.671.
The standard deviation statistic for the three subgroup averages is 0.5774. Dividing by the bias correction factor of 0.8862 and multiplying by the square root of 8 we obtain an unbiased estimate of the standard deviation of the distribution of X of 1.843.
Figure 7:
Once again, there is no practical difference between using the range and using the standard deviation statistic. Here the two estimates are slightly larger than before, but not by any appreciable amount.
As summarized in figure 8, we have just obtained six unbiased estimates for the standard deviation parameter for the distribution of X using three different methods and two different statistics. These six values are listed along with their coefficients of variation. The first four unbiased estimates are all quite similar because they all have similar coefficients of variation. The last two unbiased estimates do not cluster as tightly as the first four because they have much larger coefficients of variation and therefore carry more uncertainty.
Figure 8:
Before we attempt to draw any lesson from this example we need to know that data set one has a very special property. When we place data set one on an average and range chart we end up with figure 9. There we see no evidence of any differences between the three subgroups. Data set one contains no signals. It is pure noise.
Figure 9:
Therefore, at this point we can reasonably conclude that when the data are homogeneous and contain no signals the three methods will yield similar values for unbiased estimates of SD(X) regardless of whether we use the range or the standard deviation statistic.
Data set two
But what happens in the presence of signals? After all, the objective is to filter out the noise so we can detect any signals that may be present. To see how signals affect our estimates of SD(X) we shall modify data set one by inserting two signals. Specifically we shall shift subgroup two down by two units while we shift subgroup three up by four units. This will result in data set two which is shown in figure 10. As may be seen in the average and range chart in figure 11, these changes have introduced two distinct signals.
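The construction of data set two amounts to two simple shifts. Using a hypothetical version of data set one constructed to match the summary statistics quoted in the text, it looks like this:

```python
# Hypothetical data set one (matches the summary statistics in the text).
data_set_one = [
    [3, 3, 4, 5, 5, 6, 6, 8],   # subgroup 1: average 5.0
    [2, 2, 3, 4, 4, 5, 5, 7],   # subgroup 2: average 4.0
    [3, 4, 4, 5, 6, 6, 6, 6],   # subgroup 3: average 5.0
]

# Shift subgroup two down by two units and subgroup three up by four units.
data_set_two = [
    data_set_one[0],
    [x - 2 for x in data_set_one[1]],
    [x + 4 for x in data_set_one[2]],
]

# Only the subgroup averages move; the dispersion within each subgroup
# is untouched.
new_averages = [sum(sg) / 8 for sg in data_set_two]    # [5.0, 2.0, 9.0]
```

Because the shifts add a constant to every value in a subgroup, the within-subgroup ranges and standard deviations are exactly the same as before.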
Figure 10:
Figure 11:
Method one with data set two
Method one uses all 24 values in data set two to compute global measures of dispersion. As shown in figure 12, the global range is 10.0, which results in an unbiased estimate of the standard deviation parameter of 2.567. The global standard deviation statistic is 3.279, which gives an unbiased estimate of the standard deviation parameter of 3.315.
Figure 12:
These estimates of SD(X) are roughly twice the size of those found in figure 3. Thus, the signals introduced by shifting the subgroup averages have inflated both of the method one estimates by an appreciable amount.
Method two with data set two
Using method two, we compute a dispersion statistic for each of the three subgroups. Since the subgroups are all the same size we can average the statistics prior to dividing by the common bias correction factor. As shown in figure 13, the average range is 4.333 and the bias correction factor for ranges of eight data is 2.847. Dividing 4.333 by 2.847 we estimate the standard deviation for the distribution of X to be 1.522.
The average standard deviation statistic is 1.525 and the bias correction factor is 0.9650. Dividing 1.525 by 0.9650 we estimate the standard deviation for the distribution of X to be 1.580.
Figure 13:
The method two estimates of SD(X) for data set two are exactly the same as those obtained for data set one in figure 5. Thus, the method two estimates are not affected by the signals introduced by shifting the subgroup averages.
Method three with data set two
For data set two the subgroup averages are respectively 5.0, 2.0, and 9.0. The range of these three averages is 7.00. The bias correction factor for the range of three values is 1.693. Since each of these averages represents eight original data, we will have to multiply by the square root of 8 and divide by the bias correction factor to estimate the standard deviation parameter for the distribution of X. When we do this with the values above we obtain an estimate of SD(X) of 11.693.
The standard deviation statistic for the three subgroup averages is 3.512. Dividing by the bias correction factor of 0.8862 and multiplying by the square root of 8 we obtain an unbiased estimate of the standard deviation of the distribution of X of 11.209.
Figure 14:
These method three estimates of SD(X) are roughly seven times larger than the values found in figure 7. Thus, the signals introduced by shifting the subgroup averages have severely inflated both of the estimates obtained using method three.
When we summarize the results of the three methods with data set two we get the table in figure 15. We have obtained six unbiased estimates of SD(X) using three different methods and two different statistics, yet these six values differ by almost an order of magnitude!
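The whole table in figure 15 can be sketched in a few lines. The data below are a hypothetical version of data set two constructed to match the summary statistics quoted in the text, and the bias correction factors are those given above.

```python
import math
import statistics

# Hypothetical data set two (matches the summary statistics in the text).
data_set_two = [
    [3, 3, 4, 5, 5, 6, 6, 8],      # subgroup 1: average 5.0
    [0, 0, 1, 2, 2, 3, 3, 5],      # subgroup 2, shifted down two: average 2.0
    [7, 8, 8, 9, 10, 10, 10, 10],  # subgroup 3, shifted up four: average 9.0
]
pooled = [x for sg in data_set_two for x in sg]
averages = [sum(sg) / 8 for sg in data_set_two]

# Method one: global statistics over all 24 values (factors 3.895 and 0.9892).
m1_range = (max(pooled) - min(pooled)) / 3.895                       # about 2.567
m1_stdev = statistics.stdev(pooled) / 0.9892                         # about 3.315

# Method two: averaged within-subgroup statistics (factors 2.847 and 0.9650).
m2_range = sum(max(sg) - min(sg) for sg in data_set_two) / 3 / 2.847      # about 1.522
m2_stdev = sum(statistics.stdev(sg) for sg in data_set_two) / 3 / 0.9650  # about 1.580

# Method three: between-subgroup statistics (factors 1.693 and 0.8862).
m3_range = (max(averages) - min(averages)) / 1.693 * math.sqrt(8)    # about 11.693
m3_stdev = statistics.stdev(averages) / 0.8862 * math.sqrt(8)        # about 11.209
```

The method two estimates are unchanged from data set one, while the method one and method three estimates are inflated by the two signals.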
Figure 15:
The differences left to right in figure 15 show the effects of using the different dispersion statistics. The differences top to bottom reveal the differences due to using the different methods. Clearly, the differences left to right pale in comparison with those top to bottom. The key to filtering out the noise so we can detect the signals does not depend upon whether we use the standard deviation statistic or the range, but rather upon which method we employ to compute that dispersion statistic.
Method one estimates of dispersion are commonly known as the total variation or the overall variation. Method one is used for description. It implicitly assumes that the data are globally homogeneous. When the data are not globally homogeneous this method will be inflated by the signals contained within the data and the value obtained will no longer estimate SD(X).
Figure 16:
Method two estimates of dispersion are commonly known as the within-subgroup variation. Method two is used for analysis. Whenever we seek to filter out the noise in order to detect signals we use method two to establish the filter. Method two implicitly assumes that the data are homogeneous within the subgroups, but it places no requirement of homogeneity upon the different subgroups. Thus, even when the subgroups differ, method two will provide a useful estimate of SD(X).
Figure 17:
Method three estimates of dispersion are commonly known as the between-subgroup variation. Method three is used for comparison purposes. It assumes that the subgroup averages are globally homogeneous. When method three is computed it is generally compared with method two; the idea being that any signals present in the data will affect method three more than they affect method two. When the subgroups differ, method three will not provide an estimate of SD(X).
Figure 18:
Separating the signals from the noise
The essence of every statistical analysis is the separation of the signals from the noise. We want to find the signals so that we can use this knowledge constructively. We want to ignore the noise where there is nothing to be learned. To this end we begin by filtering out the noise. And for the past 100 years the standard technique for filtering out the noise has been method two. To illustrate this point figure 19 shows the average chart for data set two with limits computed using each of the three methods. Only method two correctly identifies the two signals we deliberately buried in data set two.
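The effect shown in figure 19 can be sketched numerically with three-sigma limits for subgroup averages, computed as the grand average plus or minus three times the estimate of SD(X) divided by the square root of the subgroup size. The sketch below plugs in the standard-deviation-based estimates of SD(X) from the three methods for data set two and counts how many of the three subgroup averages fall outside the resulting limits.

```python
import math

averages = [5.0, 2.0, 9.0]            # subgroup averages for data set two
grand_average = sum(averages) / 3     # about 5.333
n = 8                                 # subgroup size

def signals_detected(sd_estimate):
    """Count the subgroup averages falling outside three-sigma limits
    computed from the given estimate of SD(X)."""
    half_width = 3 * sd_estimate / math.sqrt(n)
    lower, upper = grand_average - half_width, grand_average + half_width
    return sum(1 for a in averages if a < lower or a > upper)

signals_detected(3.315)     # method one estimate: misses one of the two signals
signals_detected(1.580)     # method two estimate: detects both signals
signals_detected(11.209)    # method three estimate: detects no signals at all
```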
So when it comes to filtering out the noise you have a choice between method two, method two, or method two. Any method is right as long as it is method two!
Method one is inappropriate for filtering out the noise because it gets inflated by the signals. Method one has always been wrong for analysis, and it will always be wrong. Trying to use method one for analysis is so wrong that it has a name. It is known as Quetelet's Fallacy and it is the reason there was so little progress in statistical analysis in the 19th century.
Figure 19:
Method three is completely inappropriate for filtering out the noise because it will be severely inflated in the presence of signals. If you use method three to filter out the noise you will have to wait a very long time before you detect a signal. Although there are analysis techniques that make use of the method three (between-subgroup) estimate of dispersion, they do so only in order to compare it with a method two (within-subgroup) estimate of dispersion.
Thus, the foundation of all modern data analysis techniques is the use of method two to filter out the noise. This is the foundation for the analysis of variance. This is the foundation for the analysis of means. And this is the foundation for Shewhart's process behavior charts. Ignore this foundation and you will undermine your whole analysis.
Many analysis techniques from the 19th century, such as Benjamin Peirce's test for outliers, are built on the use of method one to filter out the noise. As may be seen in figure 19, this approach will let you occasionally detect a signal, but it will cause you to miss other signals.
In fact, many techniques developed in the 20th century also suffer from Quetelet's Fallacy; among these are Grubbs' test for outliers, the Levey-Jennings control chart, and the Tukey control chart. Moreover, virtually every piece of statistical software available today allows the user to choose method one for creating control charts and performing various other statistical tests. Nevertheless, this error on the part of naïve programmers does not make it right or even acceptable to use method one for analysis.
There are proper uses of method one and method three, however, they are never appropriate for filtering out the noise. The only correct method for filtering out the noise is method two. Understanding this point is the beginning of competence for every data analyst.
You now know the difference between modern data analysis techniques and naïve analysis techniques. Naïve techniques use method one or method three to filter out the noise. Today all sorts of new naïve techniques are being created by those who know no better. Let the user beware.
To help with this problem of identifying naïve techniques, figure 20 contains a listing of 27 of the more commonly encountered within-subgroup estimators of both the standard deviation parameter and the variance parameter. There we see the hallmark of the within-subgroup approach: Each estimator is based on either the average or the median of a collection of k within-subgroup measures of dispersion. Method one and method three each use a single measure of dispersion. Now you know the importance of using the right method, and you know what the right method will look like in practice. This may be more than you ever wanted to know about statistics, but it is essential knowledge for all who seek to understand their data.
Figure 20:
This article is based on material found in my book, Advanced Topics in Statistical Process Control, Second Edition (SPC Press, 2004). Used with permission.