## The Mystery Measurement Theatre

### Determining sources of variations

Published: Thursday, January 14, 2010 - 07:40

If you can’t trust your measurement system, you can’t do anything with the data it generates. Last month, in “Letting You In On a Little Secret,” we talked about the purpose of measurement system analysis (MSA) and I gave you a neat spreadsheet that will do MSA for you, as well as some data (repeated after the jump) from the gauge you want to buy, the Hard-A-Tron. I also left you with a mysterious statement that this study was trickier than it appeared. This month I’ll start off by answering a question I received, and then we will see how well the Hard-A-Tron did—and what mysterious thing was going on in the data. After that, if you are good, I’ll give you another set of data to further test a measurement device.

The question I received was, “Why don’t you use a standard hardness coupon instead of production material?” It boils down to two words: external validity. Let’s say I'm testing my Hard-A-Tron on these standard, NIST-traceable coupons, and the gauge passes. What does this tell me about my ability to measure the cylinders I fabricate?

Well, not much if anything, actually. The coupons are flat—I just drop them onto the measuring stage. The cylinders are, well, cylindrical, so I might have some sort of a fixture to place them in to hold them while I test with the device. Does the fixture have some springiness to it that might affect the reading? How about the extra height above the stage—could that change the reading variability? And the fact that I'm measuring a curved surface rather than a flat one—how might that change the as-measured hardness? Well, we don’t know—that was not part of the study using coupons. Using similar reasoning, I want to perform the MSA in the environment in which the device is actually used (perhaps dirty and poorly lit) and by the people who actually use it, instead of me alone in a nice clean, bright lab. My research question is not, “How well does my device measure coupons?” but rather, “How well does my device measure my cylinders in the production environment?” And to do that, I want to replicate as much of the actual process as I can.

OK, back to our Hard-A-Tron device. Recall that one of the main purposes of MSA is to determine if a measurement system will correctly classify something as conforming or nonconforming to your specification. If you used my spreadsheet, it does all the calculations for you so that you can answer this question. Recall that we have two operators (Jack and Jill) measuring the hardness on the same bars. First Jack measures all 10 in a random order. Then he measures all 10 in a different random order. Then Jill measures all in a third random order, and finally measures them again in a different random order. (Phew!)

Here were the results I showed you last time:

So let’s think through our sources of measurement variability here.

• Each of the samples we are measuring probably has some different “real” hardness, so we have part-to-part variability. We don’t want to count this as part of our gauge variability though—it is inherent in the parts.

• The same operator measuring the same part does not get exactly the same number, so we track that with the range within part within operator. We will calculate an average range within operator for an overall indicator of this. This range will allow us to estimate the variance due to repeatability. (We want to keep an eye on the ranges within part though—that might indicate a problem with the part—perhaps it got damaged during testing or there is something making the measurement difficult to do on that part.)

• Each operator might have a different bias off of the “real” number. Regardless of the fact that the different parts could be totally different hardnesses, the *average* of all the parts should be a particular number, right? So we will take an average across all the readings for each operator. If those numbers are statistically different, then the operators are biased compared to each other. (A repeated measures t-test would be the way to statistically test that, but that is a little beyond the scope of this article.) The difference between the two averages will give us an estimate of the reproducibility variability.

• … and there is one more “mystery component” that I am not going to tell you yet.

You might have memorized the formula to estimate the standard deviation from the average range when you learned about statistical process control. Here it is again:

$$\hat{\sigma} = \frac{\bar{R}}{d_{2}}$$

Because we have two operators as well as two measurements, we use an expanded table to look up d_{2}. Thankfully, that table is built right into the spreadsheet. We have two different ranges—the average of the two operators’ ranges, and the range between the averages for each operator. Dividing by d_{2} gives us the estimate for the within-operator and between-operator variability, repeatability (σ_{RPT}) and reproducibility (σ_{RPD}) respectively.

To estimate the total variability (σ_{e}) including repeatability and reproducibility, we need to add together the variances (you know we aren’t allowed to add standard deviations) and take the square root:

$$\sigma_{e} = \sqrt{\sigma_{RPT}^{2} + \sigma_{RPD}^{2}}$$
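For readers who want to see the arithmetic outside the spreadsheet, here is a minimal Python sketch of the range-based estimates described above. The readings are made up purely for illustration (they are not the Hard-A-Tron data), and the two d_{2} constants are assumptions taken from standard tables: 1.128 for ranges of two readings and 1.41 for a single range of two operator averages. The spreadsheet's expanded table may use slightly different values.

```python
import math

# Assumed d2 constants; the article reads these from the spreadsheet's expanded table.
D2_REPEATABILITY = 1.128    # ranges of 2 readings, averaged over many part/operator cells
D2_REPRODUCIBILITY = 1.41   # a single range taken across 2 operator averages


def gauge_rr(readings, d2_rpt=D2_REPEATABILITY, d2_rpd=D2_REPRODUCIBILITY):
    """Estimate (sigma_rpt, sigma_rpd, sigma_e) from repeated readings.

    readings maps operator name -> list of (trial 1, trial 2) pairs, one per part.
    """
    # Repeatability: average range within part within operator, divided by d2.
    ranges = [abs(a - b) for pairs in readings.values() for a, b in pairs]
    avg_range = sum(ranges) / len(ranges)
    sigma_rpt = avg_range / d2_rpt

    # Reproducibility: range between the operators' overall averages, divided by d2.
    operator_means = [
        sum(a + b for a, b in pairs) / (2 * len(pairs))
        for pairs in readings.values()
    ]
    sigma_rpd = (max(operator_means) - min(operator_means)) / d2_rpd

    # Total gauge variability: add the variances, then take the square root.
    sigma_e = math.sqrt(sigma_rpt ** 2 + sigma_rpd ** 2)
    return sigma_rpt, sigma_rpd, sigma_e


# Hypothetical readings for three parts, two trials per operator.
example = {
    "Jack": [(68, 69), (70, 72), (66, 67)],
    "Jill": [(71, 72), (73, 74), (69, 71)],
}
sigma_rpt, sigma_rpd, sigma_e = gauge_rr(example)
print(f"repeatability   sigma = {sigma_rpt:.3f}")
print(f"reproducibility sigma = {sigma_rpd:.3f}")
print(f"total gauge     sigma = {sigma_e:.3f}")
```

Keeping the d_{2} values as parameters makes it easy to swap in whatever your own table gives for your number of parts, operators, and trials.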

Umm, why do we care about σ_{e} again? This is the overall variability that this measurement system will produce, across operator, for the same part measured again and again. This in turn will be essential in answering the question, “Can I use this measurement system to determine if the product is conforming?”

Here is the output from the spreadsheet:

So Jack averages a little lower than Jill (without a known “real” value for the parts, we don’t know if either Jack or Jill has the right hardness). Jack has a little bit more variability on average within part than Jill. (Note again that for our simple MSA we have not shown that these differences are statistically significant.) We have used those ranges to estimate the variability due to reproducibility and repeatability, and added the variances and taken the square root to get the overall variability of the Hard-A-Tron as used by Jack and Jill.

The reproducibility variability is higher, indicating that the major source of the variability is from operator to operator. This points us to how we can improve our measurement system—we need to figure out how to get Jack and Jill to agree with each other better. Maybe we look at the standard operating procedure used with the measurement device, maybe we watch how they set up the measurement. Or maybe… but that would be giving away the mystery; and I haven’t even assembled everyone in the drawing room.

I know what you are thinking, “That is all well and good, but I still need to answer my question—can I use the dang thing to determine conformance to spec?”

Now for a simple question, there is a lot of confusion about how to answer it, so I am going to walk through this to make sure you understand why we use the metric we use. I need you to keep focused on the question we are trying to answer, though.

Determining if something is in or out of spec is an action near and dear to many of our hearts, I am sure. So let’s make a drawing.

Let’s assume for this discussion that the average between Jack and Jill is the real hardness. If Jack and Jill measured the part a whole lot of times, and the part’s real hardness was 70, we would see a distribution of measurements that looks something like this:

Figure 1: Hard-A-Tron measurement error on a part that really measures 70

Now, this is the *same part* with the real average right in the center of the spec. With this much variability, a part that is smack in the center of the spec would be scrapped or reworked about 12.754 percent of the time due only to our measurement variability. That is not so good.

Consider a part that is still really in spec, but not right in the center:

Figure 2: Hard-A-Tron measurement error on a part that really measures 67

This part really has a hardness of 67, so it is in spec, but 31.128 percent of the time we would say it was out of spec. Even more amusing, we would actually have said it is out of spec on the high side 0.5546 percent of the time.

Last one—what would happen if it were actually out of spec at 65?

Figure 3: Hard-A-Tron measurement error on a part that really measures 65

The good news is that we would actually recognize that it is out of spec about 56.7875 percent of the time. (Remarkably, 0.0648 percent of that is shown as over the upper spec.) The bad news is that 43.2125 percent of the time, we would think it was in spec, and send it off to our customer. Oh no! Can you spell “termination with cause”?
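The three cases above can be reproduced with a normal model of the gauge error. In this sketch the sigma of 2.953 is the total gauge sigma quoted later in the article. The spec limits of 65.5 and 74.5 are an inference, not something the article states: they come from the spec width of 9 used in the back-calculation further down, centered on 70.

```python
from statistics import NormalDist

SIGMA_E = 2.953        # total gauge sigma quoted in the article
LSL, USL = 65.5, 74.5  # inferred limits (width 9, centered on 70); not stated in the article


def measurement_split(true_hardness, sigma=SIGMA_E, lsl=LSL, usl=USL):
    """Return P(reads below LSL), P(reads above USL), P(reads in spec) for one part."""
    gauge = NormalDist(mu=true_hardness, sigma=sigma)
    below = gauge.cdf(lsl)
    above = 1.0 - gauge.cdf(usl)
    inside = 1.0 - below - above
    return below, above, inside


for hardness in (70, 67, 65):
    below, above, inside = measurement_split(hardness)
    print(
        f"true hardness {hardness}: "
        f"reads low {100 * below:.4f}%, "
        f"reads high {100 * above:.4f}%, "
        f"reads in spec {100 * inside:.4f}%"
    )
```

With these assumed limits, the out-of-spec totals should come out close to the article's 12.754, 31.128, and 56.7875 percent, which is some reassurance that the inferred limits match the ones behind the figures.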

As a side note, we know what will happen if something is measured and is categorized as out of spec—we remeasure it, right? Maybe it will “go back in.” But as you can see in this case, due to the high measurement error, we should be remeasuring even if it is inside of the spec. “Quick—we are in spec—remeasure it.” How many times do you think *that* happens?

Anyway, let’s sum up and show a graph of the probability of incorrectly classifying a part of a given hardness:

Figure 4: Probability of the Hard-A-Tron incorrectly classifying a part as conforming or nonconforming

A crazy graph, I know, but for any given hardness, we can read off how likely it is we will come to the wrong conclusion about its conformance to spec. The worst case is when the part is really on the spec, which is (for a symmetric distribution anyway) always going to be 50–50. The further from the spec, the easier it is to make the right decision. But with this system, we are always running a pretty big chance of mischaracterizing our part.

So clearly, this is a measurement system that has real trouble accomplishing what we need it to do—correctly determine if a part is in or out of spec. Whatever metric we end up using has to reflect this inability.

There are two measurements that are commonly used for determining gauge capability, and they are both pretty similar. The one that has come into vogue is the %R&R metric:

$$\%R\&R = \frac{5.15\,\sigma_{e}}{\text{spec width}} \times 100\%$$

That 5.15 gives you the width of 99 percent of the measurement error.

The one I prefer is the P/T ratio:

$$P/T = \frac{6\,\sigma_{e}}{\text{spec width}}$$

Here you are taking the natural tolerance of the measurement error instead of 99 percent of it. The difference is small, but this one makes more sense to me. Plus 6 is easier to remember than 5.15, but maybe that is just me. On the other hand, everyone uses 5.15 so I have pretty much given up that fight.

Here are the spreadsheet’s calculations for %R&R.

What does this number tell us? Dividing by the width of the spec gives you the proportion of the spec eaten up by measurement error alone. The smaller this number, the less likely you are to categorize something that is in spec as out, or out of spec as in. Or, as in our case here, if your measurement error itself takes up more than the whole width of the spec, the %R&R clearly tells you so by being stupidly big.
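As a quick check on the scale of the problem, here is a small sketch that computes both metrics from the numbers that appear in this article: a total gauge sigma of 2.953 and a spec width of 9. The function names are just illustrative.

```python
def percent_rr(sigma_e, spec_width):
    """%R&R: 5.15 gauge sigmas as a percentage of the spec width."""
    return 100.0 * 5.15 * sigma_e / spec_width


def pt_ratio(sigma_e, spec_width):
    """P/T ratio: 6 gauge sigmas as a fraction of the spec width."""
    return 6.0 * sigma_e / spec_width


SIGMA_E = 2.953   # total gauge sigma from the article
SPEC_WIDTH = 9.0  # spec width used in the article's back-calculation

print(f"%R&R = {percent_rr(SIGMA_E, SPEC_WIDTH):.1f} percent")
print(f"P/T  = {pt_ratio(SIGMA_E, SPEC_WIDTH):.2f}")
```

Both numbers land well above 100 percent of the spec, which is the "stupidly big" result described above.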

What would you want this number to be? A really nice %R&R would be 10 percent. Here is that goofy Twin Peaks graph with a %R&R of 10 percent:

Figure 5: Probability of a measurement system with a %R&R = 10 percent incorrectly classifying a part as conforming or nonconforming

Well, this is about as silly as before, but in a good way. Unless I am reeeeeeealy close to the spec limit, I am almost certainly going to make the right decision about the conformance to spec for that part.
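To compare the two "Twin Peaks" curves numerically, the sketch below computes the chance of a wrong conformance call for a few true hardness values, once with the Hard-A-Tron's sigma and once with the sigma that would give a %R&R of 10 percent. As before, the spec limits of 65.5 and 74.5 are inferred from the article's numbers rather than stated in it, and the gauge error is assumed to be normal.

```python
from statistics import NormalDist

LSL, USL = 65.5, 74.5  # inferred limits (width 9, centered on 70); not stated in the article


def p_wrong_call(true_hardness, sigma, lsl=LSL, usl=USL):
    """Probability that a single reading puts the part on the wrong side of the spec."""
    gauge = NormalDist(mu=true_hardness, sigma=sigma)
    p_reads_in_spec = gauge.cdf(usl) - gauge.cdf(lsl)
    truly_in_spec = lsl <= true_hardness <= usl
    # In-spec parts are misjudged when they read out; out-of-spec parts when they read in.
    return 1.0 - p_reads_in_spec if truly_in_spec else p_reads_in_spec


sigma_now = 2.953                         # Hard-A-Tron total gauge sigma
sigma_10pct = 0.10 * (USL - LSL) / 5.15   # sigma that would give %R&R = 10 percent

for hardness in (64.0, 65.0, 65.4, 65.5, 66.0, 67.0, 70.0):
    print(
        f"true hardness {hardness:5.1f}: "
        f"wrong call {100 * p_wrong_call(hardness, sigma_now):6.2f}% now, "
        f"{100 * p_wrong_call(hardness, sigma_10pct):6.2f}% at %R&R = 10"
    )
```

The point to look for in the output is how quickly the second column drops toward zero once the part is even a little way from a spec limit, while the first column stays large everywhere.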

The %R&R of 10 percent should not be an acceptance criterion, though. Whether a measurement system is acceptable or not is a function of how costly it is to mischaracterize a part and of how many times you are willing to remeasure. All the above graphs are showing you what you would get if you measured each part *one time* in order to make a conformance decision. If your %R&R is a little too high, you could decrease the effective gauge error by measuring more than once and taking an average, and then base your decision on that. How many do you need to measure? Remember this equation from basic stats when learning about the random sampling distribution of the mean? Of course you do:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

You can back-calculate the standard deviation you need in order to get a %R&R of 10 percent. For our Hard-A-Tron, we would need a standard deviation of about (0.1 ÷ 5.15) * 9 = 0.1748. Putting that in the equation above on the left, using the 2.953 as the sigma on the right, we solve for *n* and get 286.
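The back-calculation is short enough to script. This sketch uses only numbers from the paragraph above (sigma of 2.953, spec width of 9, the 5.15 multiplier, and a 10 percent target); the function name and its parameters are illustrative.

```python
import math


def readings_needed(sigma_single, spec_width, target_rr=0.10, multiplier=5.15):
    """Return (target sigma, number of readings to average) for a target %R&R."""
    # Sigma of the averaged reading that would give the target %R&R.
    sigma_target = target_rr * spec_width / multiplier
    # sigma_target = sigma_single / sqrt(n), so n = (sigma_single / sigma_target)^2, rounded up.
    n = math.ceil((sigma_single / sigma_target) ** 2)
    return sigma_target, n


sigma_target, n = readings_needed(sigma_single=2.953, spec_width=9.0)
print(f"target sigma = {sigma_target:.4f}, readings to average = {n}")
```

Because n grows with the square of the ratio between the two sigmas, averaging only helps when the gauge is already close to good enough.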

Umm, I don’t think we will be measuring our little bars that many times. Looks like we need to reduce that measurement error some other way.

Here is where I finally solve the mystery. Did I distract you with enough stuff to make you forget there even was one?

Go back and look at the data and follow the readings for a single part through Jack 1, Jack 2, Jill 1, and Jill 2. With a few exceptions, the numbers *increase through time*. The potential study we did gives us no direct information about stability through time, but this is a pretty gross change that is visible across each part. Could it be that the real values of hardness were changing with time and the operator-to-operator difference we saw was actually due to that? Well, as the metallurgists in the audience can tell you, that certainly can happen. Either the metal itself gets harder with time (it ages) or the force exerted on the sample for each of the hardness measurements actually hardens the material nearby for the next test.

That is what happened here, and that is the last source of variability—the hardness of the material itself was changing with time. In this case, it is impossible to isolate the change in hardness from the change in operator. (By the way, this is exactly why we do Jack 1, Jack 2, Jill 1, and Jill 2 in that order. If we had randomized across operator, we probably would not have noticed the time effect. As it was, this effect was confounded with operator, so while we might have spent time trying to figure out why Jill was different than Jack, we had a better chance of eventually catching the time effect.) It turns out that in this case, hardness measurement is a Class III destructive test—the true value changes with time.
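A crude way to look for this kind of drift is to count how often a part's reading goes up from one measurement pass to the next. The sketch below does that on made-up readings (not the Hard-A-Tron data), listed per part in time order. It is only a sign check under the assumption that, with no time effect, ups and downs should be roughly balanced; it is not a formal test.

```python
def rising_fraction(readings_by_part):
    """Fraction of consecutive-pass steps where the later reading is higher.

    readings_by_part: one list per part, in time order
    (for example Jack 1, Jack 2, Jill 1, Jill 2).
    """
    steps = [
        later > earlier
        for readings in readings_by_part
        for earlier, later in zip(readings, readings[1:])
    ]
    return sum(steps) / len(steps)


# Hypothetical readings for three parts across four passes.
example_passes = [
    [66.0, 67.5, 69.0, 70.5],
    [70.0, 71.0, 73.5, 74.0],
    [68.0, 67.5, 70.0, 71.5],
]
print(f"{100 * rising_fraction(example_passes):.0f}% of steps went up")
```

A fraction far above one half across most parts is a hint to go and ask whether the parts, rather than the operators, are changing.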

And that is the end of the Mystery of the Maltese Hard-A-Tron.

Now if I get a %R&R of 10 percent in the real world for a potential study like this, I would be happy—but not “I’ll buy your gauge” happy, since I have yet to really put it through its paces. To do that, I go and gather 25 parts that span the expected range of measurement, and have one or more operators measure each part five times or more. This will give us a little bit of information on the stability of the gauge, at least for the short-term, and so this study is cleverly called a short-term study.

Here are some data from a short-term test. In this example, we are weighing a plastic preform before placing it into a compression mold. The weight specification is 465 ±50 grams. We are assuming that the measurement is independent of operator, so we only have one operator do the test. Each row is one of the 25 samples, measured in random order. Each column is the repeated measures, which were done in different random orders.

We have been using this scale for years—and it has a digital readout. On the other hand, we have had a lot of defective parts for years too, and the area the scale is in is pretty contaminated with phenolic dust.

What do you think? Is the mass balance capable of measuring that specification?

Tune in next month to Mystery Measurement Theatre to find out.

## Comments

## MSA and sampling plans

Great article, Steven. While I strive to promote these concepts and tools at my facility, I question the adequacy of our sampling plans. Take moisture content in coated rolled paper for example. Economics lead us to minimize the number of samples we take to minimize equipment downtime and associated material loss. Measuring moisture content is a destructive testing process as it entails completely drying the sample to measure weight loss. Therefore, conducting an MSA on this process requires an assumption that multiple samples taken from the same area on the roll of paper have the same moisture content, and any variation we see in measurements of these samples is attributed to measurement error. Truthfully, we're skeptical about this assumption, so we feel true measurement system analysis is out of our grasp. How do we compensate for this dilemma with adequate (but economically sound) sample size adjustment?

## Destructive Testing

Steven, I am attempting to measure several properties from a continuous web based coating process where most all measurements are destructive: coating weight, peel force, release force, and shear force. We also know that our process can inherently produce material whose characteristics are dependent on location vs. independent and random. I would be very interested in learning about how to perform appropriate measurement system analyses with destructive testing. Can you offer suggestions on how I can learn about MSA with destructive testing?

## Thanks for reading,

Thanks for reading, Fred!

What you have is a Class II destructive test. As it is now, you are confounding product and measurement variation. Worst case you can quantify that total variation and use a control chart to alert you when "something" goes wrong, but you won't initially be able to tell if it is process or measurement and you will need to investigate. But there are some tricks we might be able to do, so here are a couple of ideas. Keep in mind it is the measurement system we want to test, and to do so we want to test the exact same thing over and over if we can. But we do have options if we can't. I am going to have to guess at what makes sense for your process, but hopefully either what I am going to say will make sense, or you can make sense of what I say for your process! :-) Ask me if you have questions.

Let's assume you are going to do a potential study.


First, if you can, get a big sample of 10 runs of coated paper (to whatever different nominal moisture losses you make) and put each in a moisture-neutral environment. Not sure how this would look since I don't know what your process or product looks like, but let's say it is a high-humidity box for sake of discussion, maybe each sample has a different humidity setting on each box.


Next, dice up each sample into test sheets. Mix them all together within sample, but keep the different samples separate in their little boxes.


Maintain these in the controlled atmosphere. You now have 10 homogeneous samples from which to draw to perform your MSA - each operator/gauge draws a sample at random from each box (in different random orders!) and measures it. We know we are confounding product with measurement, but by randomizing the test sheets within sample, we hope to have homogenized across that, so anything that stands out on top of that inherent variation (like operator-to-operator or gauge to gauge) is probably gauge related. Basically, we are trying to take it from Class II to closer to a Class I destructive, and we can do the standard MSA on Class I.

Now this is so much trouble that, unless you have doubts about the system you have been using for years (and you might!), it probably makes more sense to do this with 8 bigger samples that you use during the course of your long-term MSA, which you will get in the article following the next one!

The other option is instead to test the non-destructive components of the system. We start running into concerns about external validity, though, so be cautious. So by that I mean, maybe a mass balance and a dehumidifier are parts of the process. Well, we can test to see that the balance is in control over time by performing an MSA on that, same with the dehumidifier. If both of those exhibit control and acceptability through time, and we see a problem on a measurement, it is a good first assumption that the problem is in the product.


Hmm, maybe I should do an article on destructive gauges?

## Steven: Very helpful set of

Steven: Very helpful set of articles re GR&R, thank you. About "...hardness measurement is a Class III destructive test..." Where are such testing classifications described? We are discussing qualification of NDT measurement systems. Are there classifications for nondestructive inspection techniques, for example, eddy current? Are there strategies for calibrating NDT for which there is no NIST traceability? I look forward to your reply and the next edition of Quality Digest. BR, Robert Fix

## Hi Robert, and thanks for

Hi Robert, and thanks for reading!


From my Black Belt training...


A destructive gauge is one that possesses an inability to repeatedly assess the true value.

-Class I: The specimen is destroyed, but we are capable of sampling from homogeneous lots or subgroups

-Class II: The specimen is destroyed, and only heterogeneous lots or batches are available

-Class III: Are capable of repeated measuring, but the true value of the specimen is changing


Each can be handled for MSA.


Some apparently NDT are actually destructive. My first MSA was using a micrometer to measure 1-2" thick aluminum plate. Gotta be non-destructive, right? Wrong - the right angle on the carbide faces on the mic were shaving off little curls of aluminum, which when I remeasured the area got smashed, but the measurement was detectably thicker. So a mic on a huge slab of metal was a destructive measure!

I don't know of a classification system for truly ND tests with regard to MSA. In such cases you have a (theoretically) infinite ability to remeasure the same thing, so you should be able to quantify measurement error completely.

"Calibration" is not what we are talking about here, and for gauges used for process variables, calibration is frequently irrelevant as long as you have gauge control and acceptability (see my article here: http://www.qualitydigest.com/inside/six-sigma-column/don-t-judge-gauge-i... about that - in the section on Long-Term Studies). But, having said that, what you are asking about will be covered in the article on Long-Term Studies, which ought to be the one after next.