Published on *Quality Digest* (https://www.qualitydigest.com)

**Published:** 08/07/2023

A simple approach for quantifying measurement error that has been around for over 200 years has recently been packaged as a “Type 1 repeatability study.” This column considers various questions surrounding this technique.

A Type 1 repeatability study starts with a “standard” item. This standard may be a known standard with an accepted value determined by some master measurement method, or it may be an item designated for use with the study (an “unknown standard”). The standard is measured repeatedly, within a short period of time, by a single operator using a single instrument. Finally, these repeated measurements are used to compute the average and standard deviation statistics, and these are used to characterize the measurement process. This technique can be traced as far back as Friedrich Wilhelm Bessel and his *Fundamenta Astronomiae*, published in 1818.

The questions below pertain to how to use and interpret the results from a Type 1 repeatability study.

Do we need to be concerned about the predictability of the measurement process?

Yes. Sixty years ago, Churchill Eisenhart, senior research fellow and chief of the Statistical Engineering Laboratory at the National Bureau of Standards, wrote the following about Type 1 repeatability studies: “Until a measurement process has been ‘debugged’ to the extent that it has attained a state of statistical control, it cannot be regarded, in any logical sense, as measuring anything at all.”

Only a process behavior chart can answer the question raised by Eisenhart. We should always place our repeated measurements on an *XmR* chart to see if they have that degree of consistency that is essential in practice.

Richard Lyday wanted to evaluate a new vision system used to measure the diameters of steel inserts for steering wheels. He took a single insert and measured it 30 times over the course of one hour and got the diameters shown in Figure 1.

**Figure 1:** *XmR*

Here we see that this steel insert grew more than 1/4 in. in diameter over the course of one hour! The trend of these readings over time plus the three upsets are problems with the measurement system. These problems turn this new high-tech, gee-whiz vision system into a rubber ruler.

Figure 2 shows the *XmR* chart for Test Method 65. A known standard having an accepted value of 40 was measured 25 times using Test Method 65. The average is 39.68, the average moving range is 2.15, and the global standard deviation statistic is 1.684. Here we find no evidence of unpredictability.

**Figure 2:** *XmR*

When we do not place our repeated measures on an *XmR* chart, we are making a naive assumption that the measurement process is predictable. When we use the *XmR* chart, we are checking whether this assumption is reasonable. This is why any assessment of the measurement process that does not begin with a process behavior chart is inherently flawed.
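As a rough sketch of this check, the limits for an *XmR* chart can be computed directly from the time-ordered values and their two-point moving ranges. The scaling factors 2.660 and 3.268 are the conventional constants for an *XmR* chart; the function names and the sample readings are illustrative assumptions, not the data behind Figure 1 or Figure 2.

```python
from statistics import mean


def xmr_limits(values):
    """Return the central lines and limits for an XmR chart.

    Assumes `values` are listed in time order. Uses the conventional
    scaling factors for two-point moving ranges (2.660 and 3.268).
    """
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    x_bar = mean(values)
    mr_bar = mean(moving_ranges)
    return {
        "average": x_bar,
        "average_moving_range": mr_bar,
        "lower_limit": x_bar - 2.660 * mr_bar,
        "upper_limit": x_bar + 2.660 * mr_bar,
        "upper_range_limit": 3.268 * mr_bar,
    }


def points_outside_limits(values):
    """Indices of the values that fall outside the natural process limits."""
    limits = xmr_limits(values)
    return [
        i for i, x in enumerate(values)
        if x < limits["lower_limit"] or x > limits["upper_limit"]
    ]


# Hypothetical readings, for illustration only.
readings = [10, 12, 11, 13, 12]
print(xmr_limits(readings))
print(points_outside_limits(readings))
```

Applying the same arithmetic to the summary statistics quoted for Figure 2 (average 39.68, average moving range 2.15) would put the limits for individual values at roughly 33.96 to 45.40.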

When the measurement process appears to be predictable, what do the average and standard deviation statistics represent?

With a predictable measurement process, the average statistic is your measurement system’s best estimate of the value of the measured item. When a known standard is used, the average will let you test for bias.

With a predictable measurement process, the standard deviation statistic is an estimate of the repeatability of your measurement system. This is the state of affairs shown in Figure 3.

For the data of Figure 2, our estimate of the value for the known standard is 39.68 units, and our estimate for repeatability is *s* = 1.68 units.

**Figure 3:**

As above, we traditionally report repeatability using the standard deviation statistic from the repeated measurements. However, a simple multiple of this quantity offers an alternative that is easier to explain and use. This alternative is the probable error, a concept that dates back to Bessel.

*Estimated probable error = 0.675 × estimated SD(E)*

The probable error characterizes the median error of a measurement. A measurement will err by this amount or more at least half the time. As such, the probable error defines the essential resolution of a measurement and tells us how many digits should be recorded. (We will want our measurement increment to be about the same size as the probable error.)

For Figure 2, our estimate of probable error is 0.675 (1.68) = 1.1 units. This means that these measurements are good to the nearest whole number of units. They will err by 1.1 units or more at least half the time. Thus, the results for Test Method 65 should be rounded to the nearest whole number; recording fractions of a unit will be excessive.
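The arithmetic here is small enough to do by hand, but a minimal sketch makes the conversion explicit. The function name is an assumption chosen for illustration; the 0.675 multiplier and the 1.68 estimate come from the text.

```python
def probable_error(sd_e):
    """Estimated probable error: 0.675 times the estimated SD(E)."""
    return 0.675 * sd_e


# Repeatability estimate for Test Method 65 (Figure 2).
pe = probable_error(1.68)
print(round(pe, 1))
```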

When the measurement process appears to be unpredictable, what do the average and standard deviation statistics represent?

Nothing whatsoever.

Descriptive statistics can only describe the data. Meaning has to come from the context for the data. When the data are an incoherent collection of different values, the statistics will not represent any underlying properties or characteristics.

The average for the data of Figure 1 is 13.495 in. However, since these observed values drift by 3/8 in. in the course of an hour, this average does not provide a useful estimate of the value of the measured insert. A wood yardstick would give us a more reliable measurement of the diameter of one of these inserts than is provided by this electronic vision system.

The standard deviation statistic of 0.114 in. for Figure 1 does not tell us anything useful about the precision of this vision system, since it has been inflated by both the trend and the upsets.

When repeated measures of the same thing display a lack of predictability, the statistics tell us nothing about the random process that is producing the measurements. Here the measurement process simply does not exist as a well-defined entity, and there is no such thing as repeatability or bias.

Some software will plot a running record showing the repeated measurements of the standard with tolerance limits added. How should we interpret this graph?

Since this graph makes a comparison that can only be misleading, it should be ignored. The *only* appropriate limits for the running record of values from a Type 1 repeatability study are those of an *XmR* chart.

To illustrate how plotting the repeated measurements against the tolerance is misleading, I will use a synthetic example. Here we’ll assume that the specifications are 60.0 ± 4.5 units, we have a known standard with an accepted value of 60, and the measurement errors have a mean of zero and a standard deviation of 0.98. Fifty observations of this standard from a Type 1 repeatability study plotted against the specifications might look like Figure 4.

**Figure 4:**

If we subtract the value of the standard, we get a plot of the measurement errors as shown in Figure 5.

**Figure 5:**

Figure 5 is the graphic equivalent of the precision-to-tolerance ratio. Here the precision would be [6 *SD(E)*] = 5.88 units, so the *P/T* ratio is 0.6533. And Figure 5 makes it appear that measurement error consumes 65% of the tolerance, which leaves very little room for process variation.

Let us now assume we have a predictable production process operating with a mean of 60 and a standard deviation of 1.00. A set of 50 product values plotted against the specifications might look like Figure 6.

**Figure 6:**

Here the product variation appears to consume 66.7% of the tolerance, and there is very little room left for measurement error. So what happens when we combine the product values *Y* with the measurement errors *E* to get 50 product measurements, *X*?

**Figure 7:**

How can a production process that consumes 67% of the tolerance be combined with a measurement system that consumes 65% of the tolerance and end up with a stream of product measurements that only use 93% of the tolerance?

The answer lies in the fact that both the 67% and the 65% are bogus proportions. Of the four graphs in this example, only Figure 7 is correct. Figures 4, 5, and 6 misrepresent reality. So, even though your software may give you Figure 4 or 5, these figures are bogus. They have always been bogus, and they will continue to be bogus until the Pythagorean theorem is no longer true.

Why do the simple ratios of Figures 5 and 6 not work as proportions? It has to do with a fundamental property of random variables. Whenever we plot a histogram or a running record, the variation shown will always be a function of the standard deviation. This is not a problem when we are working with a single variable, but when we start combining variables the bogus comparisons creep in because the standard deviations do not add up.

In Figure 4, we see the repeated measurements, [60 + *E*], and the horizontal band shown is [6 *SD(E)*] wide. This part of Figure 4 is correct. It is the inclusion of the specifications on Figure 4 that creates a bogus comparison.

In Figure 5, we see the running record of the error terms, *E*, and the horizontal band shown is [6 *SD(E)*] wide. This part of Figure 5 is correct. It is the inclusion of the tolerances on Figure 5 that creates the bogus comparison.

In Figure 6, we have the running record of 50 product values, *Y*, and the horizontal band shown is [6 *SD(Y)*] wide. This part of Figure 6 is correct. It is the inclusion of the specifications on Figure 6 that creates the bogus comparison. (While it might seem that the specifications should apply to the values *Y*, in practice we cannot observe the *Y* values directly, and the specifications have to be applied to the product measurements, *X*.)

In Figure 7, we have the running record for the product measurements, *X*, and the horizontal band is [6 *SD(X)*] wide. However, since specifications apply to the product measurements, we can place the specifications on Figure 7 without creating a bogus comparison.

Thus, if we are not careful to compare apples with apples, bogus proportions can creep in when we work with multiple variables. While the graphs will show variation as a function of the standard deviations, these standard deviations will never be additive. This complicates comparisons among the multiple variables.

When the measurement errors *E* are independent of the product values *Y*, then the *variance* of the product measurements *X* will be the sum of the *variance* for *Y* plus the *variance* for *E*.

*Variance(Y) + Variance(E) = Variance(X)*

*1.00 + 0.96 = 1.96*

So the standard deviation of *X* will be the square root of 1.96, which is 1.40. Thus, the only correct way to show the relationship between the standard deviations of *X, Y,* and *E* is to use a right triangle as in Figure 8.
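The numbers in this example can be checked with a short sketch. It assumes, as the text does, that *E* is independent of *Y*, and it takes the tolerance to be 9.0 units (the width of the 60.0 ± 4.5 specifications). The function names are illustrative assumptions.

```python
import math


def combined_sd(sd_y, sd_e):
    """SD(X) when X = Y + E and E is independent of Y (variances add)."""
    return math.hypot(sd_y, sd_e)


def apparent_fraction_of_tolerance(sd, tolerance):
    """The naive ratio 6*SD / tolerance that the running records suggest."""
    return 6 * sd / tolerance


sd_y, sd_e, tolerance = 1.00, 0.98, 9.0
sd_x = combined_sd(sd_y, sd_e)
print(round(sd_x, 2))

# The apparent "shares" for E and Y sum to far more than the share for X.
for name, sd in (("E", sd_e), ("Y", sd_y), ("X", sd_x)):
    print(name, round(apparent_fraction_of_tolerance(sd, tolerance), 3))
```

The two apparent shares for *E* and *Y* total about 132%, while the product measurements *X* span only about 93% of the tolerance, which is the mismatch the right triangle explains.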

**Figure 8:**

Specifications always apply to the product measurements, *X*. Thus the specifications, and the specified tolerance, belong on the hypotenuse of the right triangle as shown in Figure 8.

At the same time, [6 *SD(E)*] defines one side of the right triangle. Thus, the ratio of [6 *SD(E)*] to [6 *SD(X)*] defines the cosine of the angle denoted by alpha in Figure 8.

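When we put these two results together, we find that the precision-to-tolerance ratio is, and always has been:

*P/T = [6 SD(E)] / Tolerance = ([6 SD(E)] / [6 SD(X)]) × ([6 SD(X)] / Tolerance) = cos(alpha) / Capability Ratio*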

Thus, the P/T ratio is *always* a trigonometric function divided by the capability ratio. And everyone who has had high school trigonometry knows that we cannot treat trigonometric functions like they are proportions. They simply do not add up. Never have, never will.
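A minimal sketch can confirm this decomposition for the synthetic example. It assumes the capability ratio is the tolerance divided by [6 *SD(X)*], and that *E* is independent of *Y*; the function name is hypothetical.

```python
import math


def pt_ratio_decomposition(sd_e, sd_y, tolerance):
    """Split P/T into cos(alpha) and the capability ratio.

    Assumes E is independent of Y, so SD(X) is the hypotenuse of a right
    triangle with sides SD(Y) and SD(E), and takes the capability ratio
    to be tolerance / (6 * SD(X)).
    """
    sd_x = math.hypot(sd_y, sd_e)
    cos_alpha = sd_e / sd_x
    capability_ratio = tolerance / (6 * sd_x)
    return cos_alpha, capability_ratio, cos_alpha / capability_ratio


cos_alpha, capability_ratio, pt = pt_ratio_decomposition(0.98, 1.00, 9.0)
print(round(cos_alpha, 2), round(capability_ratio, 2), round(pt, 4))

# The direct computation of P/T gives the same value.
print(round(6 * 0.98 / 9.0, 4))
```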

Thus, the fallacy behind drawing Figure 4 or Figure 5 is that the tolerance lines encourage us to interpret a trigonometric function as a proportion. Whenever we do this we will always be wrong.

The only correct graph to use with the results of a Type 1 repeatability study is an *XmR* chart such as those in Figures 1 and 2.

So, can we use the P/T ratio to condemn a measurement process?

No, we cannot.

When we understand that the P/T is a trigonometric function times a constant, we discover why this ratio is so hard to interpret.

Consider the previous example. There we had a P/T ratio of 0.65. Based on this value, most arbitrary guidelines would condemn this measurement process. Yet here we have a process that, when operated predictably and on-target, is capable of producing essentially 100%-conforming product. Moreover, the current measurement system is adequate to allow this process to be operated up to its full potential. Here, there is no need to upgrade or change this measurement process.

So while small P/T ratios are good, large P/T ratios are not necessarily bad. This is why it is a mistake to use a P/T ratio to condemn a measurement process.
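To put a rough number on "essentially 100%-conforming" for the synthetic example, the sketch below estimates the fraction of product measurements inside the 60.0 ± 4.5 specifications. It assumes the measurements *X* are normally distributed and centered at 60, which is an added assumption made only for this illustration; the function name is hypothetical.

```python
import math


def fraction_conforming(mean, sd, lower_spec, upper_spec):
    """Fraction of values inside the specs, ASSUMING a normal distribution."""
    def cdf(x):
        return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))
    return cdf(upper_spec) - cdf(lower_spec)


# SD(X) combines product variation (1.00) and measurement error (0.98).
sd_x = math.hypot(1.00, 0.98)
print(fraction_conforming(60.0, sd_x, 55.5, 64.5))
```

Under these assumptions the specifications sit a little more than three standard deviations of *X* from the target on each side.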

Can we test for bias?

Before we can talk about bias, we have to have a predictable measurement process.

All bias is relative. If we perform our Type 1 repeatability study using a known standard that has an accepted value based on some master measurement method, and if our measurements do not display a lack of predictability, then we can compare our average value with the accepted value for the standard.

Our predictable measurement process may be said to be biased relative to the master measurement method if and only if we find a detectable difference between the average and the accepted value for the standard. Here, we typically use the traditional t-test for a mean value. With a detectable bias, the best estimate for that bias will be the difference between the average and the accepted value.

A failure to detect a bias means that any bias present is too small to be detected with the amount of data available. In this case, the t-test will place an upper bound on the size of any bias present.

For Figure 2, the average is 0.32 units smaller than the accepted value of 40. With 24 *d.f.*, the Student's *t* critical value is 1.711, which gives a 90% interval estimate for the difference of:

[39.68 – 40] ± 0.58 = –0.90 to 0.26

Since this interval includes zero, we have no detectable bias, and any bias present is likely to be less than 0.58 units. Since this is less than the probable error of 1.1 units, we can say that this test is unbiased in the neighborhood of 40.
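The interval can be reproduced with a few lines of arithmetic. This sketch takes the critical value 1.711 as an input rather than computing it, and uses the global standard deviation statistic of 1.684 reported for Figure 2; the function name is an assumption for illustration.

```python
import math


def bias_interval(average, accepted_value, sd, n, t_critical):
    """Interval estimate for bias: (average - accepted) +/- t * s / sqrt(n)."""
    bias = average - accepted_value
    margin = t_critical * sd / math.sqrt(n)
    return bias - margin, bias + margin, margin


low, high, margin = bias_interval(39.68, 40.0, 1.684, 25, 1.711)
print(round(margin, 2), round(low, 2), round(high, 2))

# Zero inside the interval means no detectable bias.
print(low < 0 < high)
```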

Do we always need to use 50 measurements?

Did Richard Lyday need additional data in Figure 1 to know that he had a rubber ruler? Without a predictable measurement process there is no magic number of readings. Without predictability, there is no repeatability and no bias. So we cannot estimate these quantities regardless of how many data we collect.

With a predictable measurement process, the relationship between the uncertainty in an estimate of dispersion and the amount of data is shown in Figure 9.

**Figure 9:**

In Figure 9 the vertical axis shows the coefficient of variation (*CV*), which is the traditional measure of uncertainty for an estimator. The coefficient of variation is the ratio of the standard deviation of an estimator to the mean of that estimator.

The horizontal axis shows the degrees of freedom (*d.f.*) for the estimator. For the standard deviation statistic, the degrees of freedom is the number of repeated measurements minus one, (*n*–1). The relationship between *CV* and *d.f.* is given by:

*CV ≈ 1 / √(2 d.f.)*

Thus, to cut the uncertainty in half you will have to collect four times as many observations.

Once you pass 10 degrees of freedom, you are in the region of diminishing returns. Between 10 and 30 degrees of freedom, your estimate of repeatability will congeal and solidify. The 25 data of Figure 2 give an estimate of repeatability having a coefficient of variation of 14%. Using twice as many data would have reduced the *CV* to 10%. This is why, historically, Type 1 repeatability studies have been based on 25–30 data.
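The percentages quoted here can be checked with a short sketch. It assumes the commonly used approximation *CV* ≈ 1/√(2 *d.f.*), which reproduces the 14% and 10% figures in the text; the function name is hypothetical.

```python
import math


def cv_of_sd_estimate(degrees_of_freedom):
    """Approximate coefficient of variation for a standard deviation
    statistic, assuming CV = 1 / sqrt(2 * d.f.)."""
    return 1 / math.sqrt(2 * degrees_of_freedom)


# n repeated measurements give n - 1 degrees of freedom.
for n in (11, 25, 50, 97):
    cv = cv_of_sd_estimate(n - 1)
    print(n, f"{100 * cv:.1f}%")
```

Going from 24 to 96 degrees of freedom (four times as many) halves the *CV*, which is the diminishing return described above.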

Should we reload the part between measurements?

How can you measure a part without loading it? Preparing an item for testing is part of the measurement process. Variation in preparation can contribute to measurement error. Even if parts are loaded automatically, loading the item is part of obtaining the measurement.

If an item is prepared (loaded) once and then measured multiple times, you will have “multiple determinations.” Multiple determinations do not reflect the repeatability of obtaining a single measurement. For this reason, we need to be careful to distinguish between repeated measurements and multiple determinations.

When a measurement process calls for multiple determinations and reports their average as the observed value, a repeatability study will require multiple determinations for each observation, with reloading between each set of multiple determinations.

The graphs we draw determine the way we think. The way we think determines the words we use. The words we use determine the actions we take. If we start with the wrong graph, our reality becomes distorted and what we do may be skewed or even incorrect.

The repeated measurements of a Type 1 repeatability study belong on an *XmR* chart. The notions of repeatability and bias are predicated upon having a predictable measurement process. This is why the analysis of data from a Type 1 repeatability study should always start with an *XmR* chart.

In reporting repeatability, it is helpful to use the probable error. The probable error is the median amount by which a measurement will err, and as such it defines the effective resolution of a measurement.

The specified tolerance does not apply to the measurement system. Thus, any graph like Figures 4 or 5 is fundamentally flawed. These graphs encourage a bogus comparison between precision and tolerance, which can be very misleading.

This is why the precision-to-tolerance ratio should not be used to condemn a measurement process. It does not tell the whole story. Since money spent on measurement systems will always be overhead, we should be careful about condemning a measurement system based on a trigonometric function masquerading as a proportion.

Whenever your Type 1 repeatability study results in a set of values that display a lack of predictability, then, regardless of the technology involved, your measurement system is nothing more than a rubber ruler.