Donald J. Wheeler


A Better Way to Do R&R Studies

Evaluating the measurement process approach

Published: Tuesday, February 1, 2011 - 07:04

Last month’s column looked at how to fix some of the Problems with Gauge R&R Studies. This month I will show you how to learn more from your gauge repeatability and reproducibility (R&R) data with less effort. Rather than getting lost in a series of computations, the "evaluating the measurement process" (EMP) approach uses the power of the graph to reveal the interesting aspects of your data so that you can know how to ask the important questions.



An EMP study

The idea behind an EMP study is both simple and profound. As expressed by my friend and colleague, the late Richard Lyday, “Measurement is a process, and with rational subgrouping you can study any process.” An EMP study begins very much like a gauge R&R study, but rather than computing estimates of everything possible, it immediately places the data on an average and range chart in order to discover what is happening in the data.

When we use an average and range chart with experimental data, we are doing something completely different from what we usually do with this chart. When an average and range chart is used with data from a continuing process, it is properly called a “process behavior chart.” There the objective is to classify the process as either being predictable or unpredictable. In contrast to this, in an EMP study, we are looking at the results of a special type of experiment. Here we are trying to determine if we can detect part-to-part differences in spite of the uncertainty introduced by measurement error. This shift in both the nature of the data and the nature of our questions will change the way we interpret the average and range chart of an EMP study.

While the EMP approach can be adapted to many different data structures and data collection schemes, we will illustrate the basic EMP study using the same data collection strategy used in a gauge R&R study. A simple, fully crossed experiment is performed where two or more operators measure each of three to 10 parts two or three times apiece. For our example, we shall use an EMP study where six operators measured each of four parts three times apiece.

The measurement system consists of a manual test stand that measures an electromagnetic property of a particular product. Because this manual test stand is used in production for 100-percent inspection, it is crucial to the operation of the plant. Since six operators routinely perform this test, all six were included in the EMP study. The four parts used in the study were selected from the product stream on each of four different days.

With simple and objective measurement systems the EMP study may be performed in a fairly straightforward manner. Richard Lyday would usually collect his data in two or three rounds where each operator would measure each part once in each round. However, with subjective or complex measurements it may be necessary to “blind” the experiment so that the operators do not know when they are retesting a given item, and where the order of testing is shuffled or “randomized” in some manner.

Figure 1:  EMP study for the manual test stand

The key to understanding any average and range chart is to understand what sources of variation are found within the subgroups and what sources of variation are found between the subgroups. In figure 1, there are three distinct sources of variation: the operator-to-operator and part-to-part differences that show up between the subgroups, and the measurement-to-measurement differences that show up within the subgroups.

The test-retest variation found within the subgroups is commonly known as the repeatability. This isolation of the test-retest error within the subgroups, with all of the other sources of variation showing up between the subgroups, is the distinguishing characteristic of an EMP study. Because of this isolation of test-retest error, the limits shown on the average and range chart in figure 1 depend solely on the test-retest error. Therefore, the limits in figure 1 specifically show that amount of variation which can be attributed to measurement error alone.

As always, the average chart looks for differences between the subgroups, while the range chart checks for consistency within the subgroups. This characteristic of the charts means that the range chart in figure 1 checks these 24 subgroups to see if there is any inconsistency in the amount of test-retest error shown. The range values that fall above the upper range limit are signals of inconsistency in the test-retest error. Since such inconsistencies represent serious problems with the measurement procedure itself, the causes of these points deserve investigation.
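To make the construction of these limits concrete, here is a minimal Python sketch. It assumes the conventional scaling factors for average and range charts (A2 and D4 from standard SPC tables). The function name and the four small subgroups are hypothetical and are not the data from figure 1.

```python
# Conventional scaling factors for average and range charts, keyed by
# subgroup size n (assumed here from standard SPC tables).
A2 = {2: 1.880, 3: 1.023, 4: 0.729, 5: 0.577}
D4 = {2: 3.268, 3: 2.574, 4: 2.282, 5: 2.114}


def average_and_range_chart(subgroups):
    """Compute average and range chart limits from equal-sized subgroups.

    Each subgroup holds the repeated measurements of one part by one
    operator, so the within-subgroup variation is test-retest error only.
    """
    n = len(subgroups[0])
    if any(len(s) != n for s in subgroups):
        raise ValueError("all subgroups must have the same size")
    if n not in A2:
        raise ValueError("no scaling factors tabulated for this subgroup size")

    averages = [sum(s) / n for s in subgroups]
    ranges = [max(s) - min(s) for s in subgroups]
    grand_average = sum(averages) / len(averages)
    average_range = sum(ranges) / len(ranges)

    half_width = A2[n] * average_range
    lower = grand_average - half_width
    upper = grand_average + half_width
    range_upper = D4[n] * average_range

    return {
        "grand_average": grand_average,
        "average_range": average_range,
        "average_lower": lower,
        "average_upper": upper,
        "range_upper": range_upper,
        # Subgroup averages outside the limits: differences detected
        # in spite of measurement error.
        "averages_outside": [i for i, a in enumerate(averages)
                             if a < lower or a > upper],
        # Ranges above the limit: inconsistent test-retest error.
        "ranges_outside": [i for i, r in enumerate(ranges) if r > range_upper],
    }


# Hypothetical readings: one operator, four parts, three readings apiece.
example = [[32, 32, 33], [30, 31, 30], [34, 34, 34], [27, 28, 28]]
chart = average_and_range_chart(example)
print(chart)
```

Because the limits are driven only by the average range, they widen or narrow with the test-retest error alone, which is what lets the average chart compare part-to-part variation against measurement error.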

Because the limits for an average and range chart are robust, we can, in spite of the inconsistency on the range chart, also use the average chart to evaluate the part-to-part and operator-to-operator differences. We begin by discussing the part-to-part variation.

The differences between the parts will depend on how the parts were selected. Sometimes the parts may be selected at specific intervals from the product stream. At other times the parts may be a simple grab sample, or some other type of haphazard sample, selected from the product stream. And in some cases the parts may be deliberately selected to represent a range of product values. Regardless of how the parts are selected, you will want to detect the part-to-part differences in spite of the uncertainties introduced by measurement error. This means that you will want to find points outside the limits on the average chart. As long as you did not select the parts in such a way that they all end up being alike (such as what might happen if you use only rejected parts in the study), you will expect to find points outside the limits. The average chart allows us to make a visual comparison between the part-to-part variation and measurement error. The part-to-part variation is represented by the width of the band swept out by the running record. The measurement error is represented by the width of the average chart limits. Thus, the wider the band swept out by the running record is relative to the width of the limits, the easier it will be to detect product variation in spite of measurement error. 

Figure 2:  Relative utility of measurement system shown on average chart

At the same time that we want to detect the part-to-part differences, we prefer for there to be no differences between the operators. There are two ways to check for operator differences using the average chart. The first of these uses the shapes of the running records, and the second uses the positions of the running records. In order to facilitate both of these comparisons, an EMP study will omit the line segments that would connect dots from one operator to the next.

To see how to interpret the shape of the running record, it is helpful to begin by considering what the average chart would look like if there were no differences between operators and also no measurement error. Under these conditions the running records for each operator would be exactly the same.  Segment by segment they would be perfectly parallel to each other (rather like the curves for Operators A and B). However, as soon as we introduce measurement error into the picture, we will begin to see slight departures from perfect parallelism (rather like the curves for Operators D and F). As long as there is a reasonable degree of parallelism, we need not be concerned. Here Operators A, B, C, D, and F all show a reasonable degree of parallelism.  Operator E, on the other hand, displays a serious lack of parallelism.

Figure 3: Lack of parallelism for Operator E

So what does a lack of parallelism represent? Serious nonparallelism is indicative of an interaction effect between the operators and the parts. (Algebraically, interaction effects and nonparallelism are one and the same thing: You cannot have an interaction effect without a lack of parallelism, and vice versa.) Here we see that Operator E is measuring these four parts in a substantially different manner. Since there should be no interaction effects between the operators and the parts, this interaction represents a serious inconsistency in the measurement process that needs immediate attention. Such interaction effects might be due to operators using different techniques, or to some operators skipping a step in the measurement procedure, or simply due to the presence of one or more untrained operators. But whatever the cause, it is a problem with the measurement process that needs to be fixed.

In addition to checking for parallelism, we can also compare the positions of the running records.  When we do this we are essentially comparing the operator averages. In figure 4 we see that both Operator C and Operator E have averages that are substantially lower than those of the other four operators.  Such differences between operator averages are potential operator biases.

Figure 4:  Potential operator biases for Operators C and E 

So, what can we say is the overall message of the EMP chart in figure 4?  Operators A, B, and F show good parallelism, have similar averages for these four parts, and show consistently small amounts of test-retest error. Comparing the width of the limits with these three running records, we can see that the manual test stand can detect product variation.

Operator C shows good parallelism, and a small amount of test-retest error, but he is consistently low on all of his measurements. This is a potential operator bias. The reason for this bias needs to be identified so that the bias can be eliminated.

Operator D has reasonable parallelism and a good average, but she has a range point above the upper limit of the range chart. Obviously one of her readings for Part 1 is problematic. Although the other ranges and the reasonable parallelism show that she usually does a good job, the reason for this aberrant reading needs to be identified.

Operator E has large ranges, poor parallelism, and the wrong average for these four parts. Whatever else you might say about him, he clearly does not know how to use the manual test stand. While Operators C and D may need a refresher on using the test stand, Operator E needs to be moved to another job until he can be trained in how to use this device and can display a skill level comparable to that displayed by the other operators.

Of course, the first step in getting the operators to measure things alike is to convince them that they are not currently doing so. It is likely that Operators C, D, and E all think that they are measuring these parts in the same way as Operators A, B, and F. Creating figure 1 is the first step in convincing them that they are not.

So what have we learned?

An EMP study begins by placing the data from a gauge R&R study on an average and range chart. By doing this we can make several qualitative assessments even before we begin to make any specific computations:
1. The range chart will allow us to determine if the test-retest error is consistent throughout the study, and also to judge if it is consistent from operator to operator. When test-retest error is not consistent, we will need to find out why.
2. The average chart will allow us to assess the relative utility of the measurement system by showing whether the measurement process can detect product variation.
3. The average chart will allow us to spot nonparallelism between the operators. Since any appreciable nonparallelism will indicate an interaction effect between the operators and the parts, it will warn of serious inconsistencies in the measurement process.
4. The average chart will allow us to assess the likelihood of detectable operator biases. If such biases exist, they will need to be eliminated to get the most out of the measurement process.


By the time you have constructed your EMP chart, you will know what is going on in your data. You will know the interesting questions, and you will know if problems exist. One of the fundamentals of data analysis is to always begin with a graph of your data. Computations exist to complement graphs, but they can never replace them. When you depend upon the computed quantities alone, you are likely to miss many of the interesting aspects of your data.

The objective of analysis is insight, and the best analysis is the simplest analysis that provides the needed insight. Moreover, it does no good to discover something when you cannot communicate your discovery to others. EMP studies use the power of the graph to help with both the discovery and the communication.

But how can we be sure about the differences?

While the graph in figure 4 is fairly clear, not all EMP studies are so clear-cut. If we think we see an operator bias, or if we think the operators have different amounts of test-retest error, how can we be sure that we are not merely interpreting noise? To answer these questions we will need to rearrange the data for further analysis. The following charts will provide a powerful way to answer all of the questions of interest arising out of the EMP study.

Main effect charts

To compare the operator averages, we use an analysis of main effects (ANOME). This is a generalization of an average chart that is appropriate for experimental studies (which will have a fixed amount of data). The limits will be computed using the grand average and the average range from figure 1. In an ANOME the original k subgroups of size n are rearranged into m subgroups, and the idea is to see if any of these m subgroup averages are detectably different from the grand average. The limits for the main effect chart will be:

$$\text{Grand Average} \pm \text{ANOME}_{.05} \times \text{Average Range}$$

where ANOME.05 is the 5-percent critical value for an analysis of main effects. This critical value will depend upon n, k, and m, and may be found in the table in figure 5.

Figure 5: Scaling factors for main effect charts

For the data from figure 1, the six operator averages are 32.17, 31.83, 29.08, 31.83, 28.50, and 31.50. The grand average is 30.78 and the average range is 1.375. With n = 3, k = 24, and m = 6, our scaling factor from the table in figure 5 is ANOME.05 =  0.415, resulting in limits of:
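As a sketch of that arithmetic, assuming the main effect limits take the form grand average plus or minus ANOME.05 times the average range, the values quoted above can be plugged in directly. The function name is hypothetical; the three inputs are the figures given in the text.

```python
def main_effect_limits(grand_average, average_range, scaling_factor):
    """Return (lower, upper) limits for a main effect chart.

    Assumes limits of the form:
        grand average +/- scaling factor * average range
    """
    half_width = scaling_factor * average_range
    return grand_average - half_width, grand_average + half_width


# Values quoted in the text: grand average 30.78, average range 1.375,
# and ANOME.05 = 0.415 for n = 3, k = 24, m = 6.
anome_lower, anome_upper = main_effect_limits(30.78, 1.375, 0.415)
print(f"{anome_lower:.2f} to {anome_upper:.2f}")  # prints "30.21 to 31.35"
```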

The main effect chart in figure 6 shows Operators A, B, and D to have averages that are detectably greater than the grand average, while Operators C and E have averages that are detectably lower than the grand average. Since the grand average is a somewhat arbitrary point of comparison, we can look at the width of the limits and conclude that Operators A, B, D, and F have reasonably similar averages, while Operators C and E have averages that are substantially different from the other operators.

Figure 6:  Main effect chart for manual test stand data

Hence, figure 6 shows that the apparent operator bias seen in figure 4 is real. When you are presenting the results of an EMP study, the use of a main effect chart will make any operator biases easier to see.  Moreover, it will make it much harder to dismiss real operator biases as a fluke.

Mean range charts

To see if the operators display different levels of test-retest error, we can use an analysis of mean ranges (ANOMR). Beginning with the original k subgroups of size n, we use the k subgroup ranges to compute an average range for each operator. These m average ranges will be compared in a mean range chart. As before, the limits for an analysis of mean ranges will be based on the original average range for the k subgroups of size n. Since these limits can be nonsymmetric, we will need two scaling factors. The limits for a mean range chart will be:

$$\text{Upper limit} = \text{UMR}_{.05} \times \text{Average Range}, \qquad \text{Lower limit} = \text{LMR}_{.05} \times \text{Average Range}$$

where UMR.05 and LMR.05 are the 5-percent critical values for an analysis of mean ranges. These critical values will depend upon n, k, and m = number of mean ranges to be compared, and may be found in the table in figure 7.

Figure 7: Scaling factors for mean range charts

For the data from figure 1, the m = 6 operator average ranges are 0.50, 0.25, 0.75, 1.75, 5.00, and 0.00. The original average range is 1.375. With n = 3, k = 24, and m = 6, our scaling factors from the table in figure 7 are UMR.05 = 1.679, and LMR.05 = 0.438, resulting in limits of:
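A minimal sketch of this computation follows, assuming each limit is simply its scaling factor times the original average range, and assuming the six mean ranges are listed in operator order A through F. The function and variable names are hypothetical.

```python
def mean_range_limits(average_range, lower_factor, upper_factor):
    """Return (lower, upper) limits for a mean range chart.

    Assumes each limit is its scaling factor times the original
    average range, so the limits need not be symmetric.
    """
    return lower_factor * average_range, upper_factor * average_range


# Operator mean ranges quoted in the text, assumed to be in order A..F.
operator_mean_ranges = {
    "A": 0.50, "B": 0.25, "C": 0.75, "D": 1.75, "E": 5.00, "F": 0.00,
}

# Original average range 1.375; LMR.05 = 0.438 and UMR.05 = 1.679
# for n = 3, k = 24, m = 6.
mr_lower, mr_upper = mean_range_limits(1.375, 0.438, 1.679)

below = [op for op, r in operator_mean_ranges.items() if r < mr_lower]
above = [op for op, r in operator_mean_ranges.items() if r > mr_upper]

print(f"limits: {mr_lower:.2f} to {mr_upper:.2f}")
print("detectably smaller:", below)
print("detectably greater:", above)
```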

Figure 8: Mean range chart for manual test stand data

Figure 8 shows the mean range chart. With an overall risk of a false alarm of 5 percent, we can say that Operators A, B, and F have average ranges that are detectably smaller than the original average range, while Operator E has an average range that is detectably greater than the original average range. Once again, based on the width of the limits, it is reasonable to say that Operators A, B, C, and F show similar amounts of test-retest error, and that Operator E is in a class of his own. While we found both Operators D and E to have points outside the limits in figure 1, the mean range chart will make the case even clearer.

Probable error

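Using the test-retest error shown by Operators A, B, C, and F, we find a revised value of the overall average range to be 0.375. This gives us an estimate of repeatability of:

$$\text{Repeatability} = \frac{\text{Average Range}}{d_2} = \frac{0.375}{1.693} = 0.2215$$

where d2 = 1.693 is the bias correction factor for subgroups of size three. Since this average range is based on 16 subgroups of size three, it’s said to have 29 degrees of freedom. This means that when this manual test stand is operated consistently, the measurements will have a probable error of:

$$\text{Probable Error} = 0.675 \times 0.2215 = 0.15 \text{ units}$$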

An appropriate measurement increment will be no larger than twice the probable error, and no smaller than one-fifth of the probable error. Here we find that twice the probable error is 0.30 units. Inspection of figure 1 will show that they have been recording the values to the nearest whole number. This means that in rounding off their values they have been needlessly discarding useful information. (This also explains why 11 of the subgroups in figure 1 had a zero range.)  Instead of rounding everything off to the nearest whole number, they need to record these values to the nearest tenth of a unit.
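The increment guideline can be checked with a few lines of Python. This is a sketch under stated assumptions: the bias correction factor d2 = 1.693 for subgroups of size three comes from standard tables, 0.675 is the usual multiplier that converts a standard deviation into a probable error, and the function names are hypothetical.

```python
D2_FOR_N3 = 1.693            # bias correction factor for subgroups of size 3
PROBABLE_ERROR_FACTOR = 0.675


def probable_error(average_range, d2=D2_FOR_N3):
    """Estimate repeatability (sigma) and probable error from an average range."""
    sigma = average_range / d2
    return sigma, PROBABLE_ERROR_FACTOR * sigma


def increment_is_appropriate(increment, pe):
    """Apply the guideline: one-fifth of PE <= increment <= twice PE."""
    return 0.2 * pe <= increment <= 2.0 * pe


# Revised average range for Operators A, B, C, and F, as given in the text.
sigma_pe, pe = probable_error(0.375)
print(f"repeatability = {sigma_pe:.4f}, probable error = {pe:.2f}")

# Whole-number recording versus recording to the nearest tenth.
whole_number_ok = increment_is_appropriate(1.0, pe)
nearest_tenth_ok = increment_is_appropriate(0.1, pe)
print("whole numbers appropriate:", whole_number_ok)
print("nearest tenth appropriate:", nearest_tenth_ok)
```

Under these assumptions an increment of 1 unit is well above twice the probable error, while an increment of 0.1 unit falls inside the recommended band.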

Intraclass correlation

While the repeatability and the probable error describe the quality of the measurements in an absolute sense, there remains the question of whether or not these measurements can be used to detect product variation. To answer this question we will need an estimate of the product variation. Due to the small number of parts used, EMP studies (and gauge R&R studies) are not the best place to obtain such estimates. In general, estimates of the product variation should be obtained from a process behavior chart. However, failing this, we can still obtain some rough idea about the relative utility of the measurement system for a given application from the EMP study. Here we delete the values for Operator E and use the remaining values to find part averages of 32.33, 30.73, 34.2, and 27.67. The range for these four part averages is 6.53. The d2* bias correction factor for one group of four values is 2.237. Dividing and squaring we find:

$$\text{Product Variance} = \left( \frac{6.53}{2.237} \right)^2 = 8.52$$

This estimate of the variance will only have 2.9 degrees of freedom. Nevertheless, we can still estimate the intraclass correlation to be:

$$\text{Intraclass Correlation} = \frac{\text{Product Variance}}{\text{Product Variance} + \text{Measurement Error Variance}} = \frac{8.52}{8.52 + 0.049} = 0.994$$

Given the relative sizes of the two numbers involved (0.049 and 8.5), our uncertainty in the product variance is not going to have an appreciable impact upon the intraclass correlation statistic. So while this number may be soft due to the small number of degrees of freedom, we can still see that this measurement system will provide a first-class monitor for measuring this product.
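The following Python sketch reproduces this calculation from the quoted part averages. It assumes the intraclass correlation is the product variance divided by the sum of the product and measurement error variances. The monitor class cutoffs (0.80, 0.50, 0.20) are an assumption brought in for illustration and are not stated in this article; the function names are hypothetical.

```python
D2_FOR_N3 = 1.693         # bias correction factor, subgroups of size 3
D2_STAR_ONE_GROUP_OF_4 = 2.237   # d2* for one group of four values


def intraclass_correlation(product_variance, measurement_variance):
    """Fraction of total variance attributable to the product."""
    return product_variance / (product_variance + measurement_variance)


def monitor_class(icc):
    """Classify a measurement system by intraclass correlation.

    The cutoffs below are assumed for illustration only.
    """
    if icc >= 0.80:
        return "first class"
    if icc >= 0.50:
        return "second class"
    if icc >= 0.20:
        return "third class"
    return "fourth class"


# Part averages with Operator E removed, as quoted in the text.
part_averages = [32.33, 30.73, 34.2, 27.67]
part_range = max(part_averages) - min(part_averages)

product_variance = (part_range / D2_STAR_ONE_GROUP_OF_4) ** 2
measurement_variance = (0.375 / D2_FOR_N3) ** 2

icc = intraclass_correlation(product_variance, measurement_variance)
print(f"product variance     = {product_variance:.2f}")
print(f"measurement variance = {measurement_variance:.3f}")
print(f"intraclass correlation = {icc:.3f} ({monitor_class(icc)})")
```

Even if the rough product variance were off by a factor of two in either direction, the ratio would stay far above the assumed first-class cutoff, which is the point made in the paragraph above.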

This example comes from my book EMPIII: Evaluating the Measurement Process and Using Imperfect Data (SPC Press, 2006) where further examples and explanations may be found. The tables in figures 5 and 7 were excerpted from my book Range-Based Analysis of Means (SPC Press, 2003).


An EMP study uses the power of the graph to reveal the interesting aspects of our R&R study. Here we:
• Identified serious problems with one operator
• Found two more operators that need some retraining
• Identified three operators that are getting the most out of the measurement device

In addition, with a couple of basic computations, we discovered that we need to record one more digit in these data, and established that this measurement system is a first-class monitor for use with this product.

While it has been said that “the average and range chart technique will not allow you to estimate the interaction effects,” this statement conveys a false impression. When there is an operator-by-part interaction present, estimation is moot. The real question is, “Who is different?” The EMP approach shows any and all interaction effects that may be present and tells you who is different. Without the insight into the operator differences provided by the average and range chart, and confirmed by the ANOME and ANOMR analyses, we might not have removed Operator E’s values. This would have skewed both our computations and our interpretation of the data.

The use of an average and range chart to evaluate the measurement process has been around for quite some time. It was briefly described and illustrated in the second edition of the Western Electric Statistical Quality Control Handbook (Mack Printing Co., 1956). Primarily due to the efforts of Richard Lyday, it was also included in the AIAG Measurement Systems Analysis handbook, now in its fourth edition. However, in both instances, little guidance was provided on how to interpret the charts and make sense of the analysis. Since this use of the charts with experimental data is substantially different from the traditional use of the charts with observational data, this lack of guidance resulted in a lack of use.

Hopefully, this article has shown how the EMP approach can be used to make the qualitative assessments that are needed to make sense of many R&R studies, and how ANOME and ANOMR can be used to confirm the nuisance components of measurement error as a first step in getting rid of those nuisances.


About The Author

Donald J. Wheeler

Dr. Wheeler is a fellow of both the American Statistical Association and the American Society for Quality who has taught more than 1,000 seminars in 17 countries on six continents. He welcomes your questions; you can contact him at djwheeler@spcpress.com.



General Comment

Howdy, Don,

This comment is late in that I usually review these articles en masse, rather than weekly.  There are several issues that I want to raise:
1) References:  I know this is not a 'reviewed' journal and more of a forum for discussion, but a reference or two covering the basic source of the information would be helpful for those of us who like to investigate more.  The only reference is EMP III, which internally only references your other books.  Actually, these concepts on 'test error' are part of Classical Test Theory, dating back to Spearman and before 1910.  Even the notation used is taken from several books from the 1950s and 1960s.  For those of us who do extra work, it would be appreciated if some references were given connecting us with the excellent work done before.
2) Probable Error needed?  The standard deviation is nonparametric.  Once a value of 0.675 is applied, the metric now becomes related to a normal distribution -- parametric.  I looked at the prior articles and could not find a compelling reason to express all the charts as a function of PE over simply rescaling relative to the standard deviation.  It would be nice to see why the PE concept is even needed.  If you actually need the 25-75% range to be the basis, then perhaps use the Interquartile Range (IQR) which would also maintain the nonparametric basis. 
3) One way to Assess a Gauge?  This article does present alternate methods to improve on gauge capability assessments.  Thanks for this.  I also found problems with using the AIAG method specifically where the gauge is related to the specification.  An 'excellent' gauge is found using 6Sigma(grr) / Spec Range < 0.10.  This implies that the Spec range must be at least 60 Sigma(grr)  which is excessive as the author has pointed out previously.  Further, it cannot be really interpreted statistically - AND I can change my gauge capability by adjusting my specs or renegotiating the specs with the customer.


What I would like to add is that the method presented is more focused on internal process viewing -- detecting a process shift.  An alternate assessment (used by calibration personnel) provides methods by which a gauge is assessed based more on a customer view.  The methods describe guardbanding (and inferences about the gauge) with appropriate methods to quantify Consumer/Producer Risks.  A good review is provided by David Deavers (Fluke) found at http://assets.fluke.com/appnotes/Calibration/ddncsl94.pdf.  I have used those methods to set guardbands when we had gauges pushing the state-of-the-art technology, as well as calculating the impact of gauge improvement projects relating product costs, yields, etc. directly to measurement error improvements.