Steven Ouellette

Six Sigma

Performing a Long-Term MSA Study

Testing through time stability

Published: Tuesday, March 23, 2010 - 10:40


Ahh, measurement system analysis: the basis for all our jobs because, as Lord Kelvin said, “… When you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind.” How interesting it is, then, that we who thrive on data so frequently don't have any proof that the numbers we're using relate to the event we are measuring. Hence my past few articles: the basics of measurement system analysis in “Letting You In on a Little Secret,” how to do a potential study in “The Mystery Measurement Theatre,” and how to do a short-term study in “Performing a Short-Term MSA Study.” The only (and most important) topic remaining is how to perform a long-term study, which is the problem I left with you last month.

So read on to see how.

The potential study told us if we even have a hope of being able to use a measurement system to generate data (numbers related to an event) as opposed to numbers (symbols manipulated by mathematicians for no apparent reason other than entertainment). The short-term study allowed us to test the system performance a little more rigorously, perhaps in preparation for using it in our process. A measurement system’s performance is quantified by:

  • Repeatability—the amount of variability the same system (operator, device) exhibits when measuring exactly the same thing multiple times
  • Reproducibility—the amount of variability due to different operators using the same device, or maybe the same operators using different devices
  • %R&R—a combination of the repeatability and reproducibility that tells us how easy it is for this measurement system to correctly classify product as conforming or nonconforming to our specification
  • Bias—the amount that the average measurement is off of the “true” value

However, none of these things have any meaning if the measurement system changes through time, and neither the potential nor short-term study really tests for through-time stability. With an unstable gauge, I might convene a Six Sigma team to work on a problem that doesn’t exist, put in systems to control a process based on a random number generator, or scrap product that is conforming. Simply put, my ability to understand and control my process is completely compromised.  A measurement system that is out of control is even worse than a process that is out of control, since we don’t even have a hint of what is really going on.

Thus the need for the long-term study, which allows us to assess in detail exactly how our measurement system is performing through time. The long-term study is the Holy Grail (not the Monty Python kind) of measurement system analysis, and with it we can state with confidence that the system is producing data, and not just numbers.

As before, I gave you a link to a totally free and awesome spreadsheet that will help you with your MSA work, and I gave you some data for the following scenario:

A statistical facilitator and an engineer wish to conduct a gauge capability analysis (long-term) for a particular ignition signal processing test on engine control modules. The test selected for study measures voltage which has the following specifications:

IGGND = 1.4100 ± 0.0984 Volts (Specification)

Eight control modules are randomly selected from the production line at the plant, and run (in random order, of course) through the tester at one-hour intervals (but randomly within each hour). This sequence is repeated until 25 sample measures (j = 25) of size eight (n = 8) have been collected. The assumption is that these voltages are constant and the only variation we see in remeasuring a module is gauge variation.

Right off the bat, we know that we only measure this engine control module; if we measured others or to different target voltages, we would include them in the study as well. We select these eight parts and keep remeasuring the same ones each hour. We also are assuming that different operators have no effect, since we are only using one.

To start off, the spreadsheet calculates the mean and standard deviation across each hour’s measurements. Regardless of the actual voltages the eight modules have, the average of those voltages must be the same from hour to hour, right? One way we will eventually look at the measurement error over time will be by looking at how that average moves around. Because we have eight modules, we can also calculate a standard deviation across these eight. But be careful that you understand what this is. There are two components of variability in this standard deviation, only one of which relates to gauge variability. There is some measurement error as I take a reading for each one, but there is also the fact that the eight modules are producing somewhat different voltages from each other. Even if there were no measurement error, we still would calculate a nonzero standard deviation due to the part differences. This second variance component is of no interest to us for the MSA, but we had better not forget about it as we go forward.

Figure 1: Worksheet calculations
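The worksheet step can be sketched in a few lines of Python. This is only an illustration: the readings are simulated stand-ins rather than the study data, the gauge and part-to-part standard deviations are assumed values, and the names `simulate_study` and `hourly_summary` are hypothetical, not taken from the spreadsheet.

```python
import random
import statistics

def simulate_study(hours=25, parts=8, gauge_sd=0.05, seed=1):
    """Simulated stand-in data: rows are hours, columns are the same 8 modules.
    gauge_sd and the part-to-part spread are assumptions for illustration."""
    rng = random.Random(seed)
    # Each module has its own fixed "true" voltage (part-to-part differences).
    true_volts = [rng.gauss(1.41, 0.02) for _ in range(parts)]
    # Every hourly reading is the true voltage plus gauge error.
    return [[v + rng.gauss(0.0, gauge_sd) for v in true_volts]
            for _ in range(hours)]

def hourly_summary(readings):
    """Mean and sample standard deviation across the parts, one pair per hour."""
    return [(statistics.mean(row), statistics.stdev(row)) for row in readings]

readings = simulate_study()
summary = hourly_summary(readings)
for hour, (mean, sd) in enumerate(summary[:3], start=1):
    print(f"hour {hour}: mean = {mean:.4f} V, s = {sd:.4f} V")
```

Note that each hourly `s` here mixes gauge error with the fixed differences between modules, which is exactly the inflation the text warns about.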

First, we need some validation that we can use the numbers. If the measurement process is in control, then the measurements for each part over time should be in control. So we take a look at each part on an individuals chart (the limits come from the moving ranges, which I leave out of sight for this example). In order for the moving range to relate to the dispersion, we want to check normality for all the parts:

Figure 2: Normality tests for the eight modules—output from MVP stats

With a sample size of 25 we probably are going to rely on the skewness and kurtosis indices, and they allow us to assume the variability is distributed normally. So let’s take a look at those individual charts on all the parts.

Figure 3: Individuals charts for each part through time—limits from the moving range

We do see a point outside the lower control limit on part 4 and a larger-than-expected moving range on part 7 (both of which we would have investigated once the long-term study was in place). But two out of 200 observations is well within what we would expect with a Type I error rate of 0.0027 (the rate at which you get a point outside the ±3σ control limits due to chance and chance alone), so I am comfortable saying that, so far, the gauge looks stable with respect to location. If a particular part became damaged at some point, or was a lot more difficult to read than another part, it should show up on these charts.
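For readers who want to see where such limits come from, here is a minimal sketch of the usual individuals-chart calculation: the center line is the mean, and the limits sit 2.66 average moving ranges (3/d2, with d2 = 1.128 for ranges of two points) on either side of it. The function names and the sample readings are hypothetical; this is not the spreadsheet's code.

```python
import statistics

def individuals_limits(values):
    """Center line and 3-sigma limits for an individuals (X) chart,
    with sigma estimated from the average moving range of span 2."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = statistics.mean(moving_ranges)
    center = statistics.mean(values)
    half_width = 2.66 * mr_bar          # 3 / d2, d2 = 1.128 for n = 2
    return {
        "center": center,
        "lcl": center - half_width,
        "ucl": center + half_width,
        "mr_ucl": 3.267 * mr_bar,       # D4 for moving ranges of span 2
    }

def out_of_limits(values):
    """Indexes of points falling outside the individuals-chart limits."""
    lim = individuals_limits(values)
    return [i for i, v in enumerate(values)
            if v < lim["lcl"] or v > lim["ucl"]]

# Hypothetical hourly readings for one module (volts), not the study data.
part = [1.42, 1.38, 1.45, 1.40, 1.36, 1.44, 1.41, 1.39, 1.43, 1.37]
print(individuals_limits(part))
print(out_of_limits(part))
```

The same function can be pointed at the 25 hourly means or the 25 hourly standard deviations, since any statistic can be plotted as an individual value.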

While individuals charts are really useful, they have their weaknesses, one of which is a lack of sensitivity to pretty big shifts in the mean. Thankfully, we have those means we calculated back in figure 1 to increase the sensitivity to a global shift in the average. Remember how the standard deviation across the parts has two components of variability? That is why we can’t just do an X-bar and an s chart using the usual limits: that “s” is inflated due to the part-to-part differences and will give us limits on the mean and the s chart that are too wide. We get around this by recalling that we can plot any statistic on an individuals chart, though we may have to adjust the limits for the shape of the distribution. We are in good shape (pun intended) to use the individuals chart for our means because, due to the central limit theorem, those 25 averages of eight modules will tend to be distributed normally, and therefore the moving range of the means will relate to the dispersion of the means.

Figure 4: Means as individuals control chart—limits from the moving range

Figure 4 further supports our notion that the location measurements are stable through time. If there were some sort of a shift across many or all the parts, it would show up here, so actions such as a recalibration, retaring, or dropping the gauge on the floor would show up on this chart. Due to the central limit theorem, this chart will be far more sensitive to shifts in the average than the individuals charts on each part.

We also know that control for continuous data isn't assessed by just looking at the average—we need to look at that dispersion as well. Using the same trick as with the means, we will plot the standard deviations (extra component of variability and all) on an individuals chart. If the measurement error is normal and the same for every part, then regardless of the actual voltages the standard deviations across the parts ought to be distributed close to (though not exactly) normal. (Again, this is different than the random sampling distribution of the standard deviation, which would be very clearly positively skewed.)

The upshot is that we plot the standard deviations on an individuals chart with limits from the moving ranges as well.

Figure 5: Standard deviations across eight parts plotted as individuals—limits from the moving range

Here we have one point outside of the limits, which if we were up and running with the chart, we would have reacted to by investigating. Because the measurements are already done, I am going to continue watching that control chart very closely and any statement about control is going to be provisional. The types of events that would show up here would be if there was a global change in the dispersion of the measurements—perhaps a control circuit going bad or a change in the standard operating procedure.

As with the short-term study, we also want to keep an eye on the relationship between the average measurement and the variation of the readings. We want to see no correlation between them—if we do, then the error of the measurement changes with the magnitude of what you are measuring, and so your ability to correctly classify as conforming or not changes too. We will check that with a correlation between the mean and standard deviation.

Figure 6: Correlation between magnitude (mean) and variation (standard deviation) 

We only have eight points because we only have eight different parts. Normally, we wouldn’t do correlations on so few points, but we would have already tested this with the short-term study. (Which you DID do, right?) This is just to make sure nothing really big changed. This correlation is not significant, so we can check that off our list.
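A check of this kind can be sketched as a Pearson correlation with the usual t test on n - 2 degrees of freedom. The per-part means and standard deviations below are made-up placeholders, not the values behind figure 6, and the function names are hypothetical.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Hypothetical per-part means and standard deviations (volts), 8 parts.
part_means = [1.395, 1.402, 1.408, 1.411, 1.415, 1.419, 1.424, 1.431]
part_sds = [0.052, 0.049, 0.055, 0.047, 0.053, 0.050, 0.048, 0.054]

r = pearson_r(part_means, part_sds)
t = correlation_t(r, len(part_means))
T_CRIT = 2.447  # two-sided 5% critical value of t with 6 degrees of freedom
print(f"r = {r:.3f}, t = {t:.3f}, significant: {abs(t) > T_CRIT}")
```

With only eight points the test has little power, which is why the text treats it as a sanity check on top of the short-term study.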

At this point we are (provisionally) saying that there is nothing terribly strange going on in our measurement system through time—it seems to be stable sliced a couple of ways, and the magnitude and dispersion are independent. We can now, finally, begin to answer our question about the repeatability and reproducibility. In our case, we only have one operator and system, so measurement error due to reproducibility is assumed to be zero. If we had tested multiple operators, the estimate of the variability due to operator would be: 
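Going by the description that follows, the estimate has the form below. The symbols are assumed notation: R_o is the range across the operator averages and d2* is the value from the expanded d2 table.

```latex
\hat{\sigma}_{\text{reproducibility}} = \frac{R_o}{d_2^{*}},
\qquad R_o = \bar{x}_{\max} - \bar{x}_{\min}
```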

Which is pretty similar to that formula you remember from SPC: the range across the operator averages divided by our old friend d2, read from the expanded d2 table with g = 1 and m = the number of appraisers. The spreadsheet is set up to handle up to two operators.

All we are looking at (in this case, to estimate) is the variability due to repeatability, which is:
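In symbols (assumed notation, reconstructed from the description rather than copied from the spreadsheet), with s_i the standard deviation of the repeated readings on module i:

```latex
\hat{\sigma}_{\text{repeatability}} = \frac{\bar{s}}{c_4},
\qquad \bar{s} = \frac{1}{8}\sum_{i=1}^{8} s_i
```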

You recognize that from SPC as well, I bet. The only difference is that we are taking the average of the standard deviations for each of the eight modules and dividing by c4 for the number of measurements on each (25 here). The spreadsheet cleverly does all this for you.
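As a rough illustration of that calculation, the sketch below computes c4 from its gamma-function definition and applies it to a table of readings laid out with one row per time point and one column per part. The function names are hypothetical and this is not the spreadsheet's code.

```python
import math
import statistics

def c4(n):
    """Bias-correction constant for the sample standard deviation of n values."""
    return math.sqrt(2.0 / (n - 1)) * math.exp(
        math.lgamma(n / 2.0) - math.lgamma((n - 1) / 2.0))

def repeatability_sigma(readings):
    """readings: rows are time points, columns are parts.
    Averages the per-part standard deviations and divides by c4."""
    per_part = list(zip(*readings))            # one tuple of readings per part
    s_bar = statistics.mean(statistics.stdev(p) for p in per_part)
    return s_bar / c4(len(readings))

print(round(c4(25), 4))
```

Because each standard deviation is taken within a single part, the part-to-part differences drop out and only the gauge variation is left.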

Figure 7: Spreadsheet output

Again, we are interested in the ability of this device to correctly categorize whether a given module is in or out of spec. Our spec width was 0.1968V, and so we put that into the %R&R formula:
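Assuming the common 5.15-sigma convention (a spread covering about 99 percent of a normal distribution), the formula has the form below. The multiplier is an assumption here, though it matches the 5.15 figure the author quotes in the discussion under this column, and USL and LSL are assumed names for the upper and lower specification limits.

```latex
\%R\&R = \frac{5.15\,\hat{\sigma}_{R\&R}}{\mathrm{USL} - \mathrm{LSL}} \times 100,
\qquad
\hat{\sigma}_{R\&R} = \sqrt{\hat{\sigma}_{\text{repeatability}}^{2} + \hat{\sigma}_{\text{reproducibility}}^{2}}
```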

We find that the measurement error alone takes up 134.74 percent of the spec.

Uh-oh. What is it with these measurement devices?

If I measure a part that is smack dab in the middle of the spec over time, I would see this distribution:

Figure 8: Measurement error 

That means that on any individual measurement (how we have been using this gauge up to this point) of a module that is exactly on target at 1.41, I could reasonably see a reading somewhere between 1.26 and 1.56—in or out of spec at the whim of the gauge variability.

Do you remember that crazy graph that showed the probability of incorrectly classifying a part as conforming or nonconforming on a single measure? Here it is for our voltage measurement system:

Figure 9: Probability of incorrectly classifying module conformance on a single measurement

Once again, we have a measurement system that is probably no good for making conformance decisions on a single measurement. It is stable through time, yes, but so highly variable that we stand a pretty good chance of calling good stuff bad and bad stuff good.

Because it's stable, we could conceive of taking multiple measurements to use the average of those readings to determine conformance:
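The reasoning is the standard error of a mean: averaging n independent readings shrinks the measurement standard deviation by the square root of n, and the %R&R shrinks with it. In assumed notation, with the subscript giving the number of readings averaged:

```latex
\hat{\sigma}_{\bar{x}} = \frac{\hat{\sigma}_{R\&R}}{\sqrt{n}}
\quad\Longrightarrow\quad
\%R\&R_{n} = \frac{\%R\&R_{1}}{\sqrt{n}}
```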

Say we are looking for a %R&R of 10 percent:

Leaving us measuring the voltage 182 times to get that %R&R.
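The arithmetic behind that count can be checked in a couple of lines, assuming the %R&R of an average falls off as one over the square root of the number of readings. The function name is hypothetical.

```python
import math

def readings_needed(current_pct_rr, target_pct_rr):
    """Smallest n with current_pct_rr / sqrt(n) <= target_pct_rr."""
    return math.ceil((current_pct_rr / target_pct_rr) ** 2)

print(readings_needed(134.74, 10.0))  # 182, matching the figure in the text
```

The count grows with the square of the ratio, so a gauge that misses the target by a factor of 13 needs well over a hundred readings per part.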

Ahh, that’s not gonna happen.

We need to figure out what is causing all the variability in this measurement device, or replace it with something that is capable of determining if a module is in conformance with the spec. It is also possible that the modules themselves are contributing noise to the measurements—maybe our assumption that they give the same voltage time after time is wrong. That would be a good thing to find out, too.

We have some more work to do, it seems.

Note that we did not assess bias. With this gauge, it would be a waste of time since it is unusable for this specification. If we wanted to, all we would have to do is get the “true” voltages of those eight modules, get the average of the true values, and see how far off that average is from our measured average. If they are statistically different, we just add or subtract the amount of bias to get an accurate (but not precise with this system) measurement. You would want to track this bias with time on a control chart as well, to make sure it was stable too.

If you have been reading these mini-dissertations on MSA, you will know that a common assumption of them all is that the measurement is not destructive—that what you are measuring remains constant with time. What do you do if that is not the case?

You read next month’s column.


About The Author

Steven Ouellette

Steve Ouellette, ME, CMC started his career as a metallurgical engineer. It was during this time that his eyes were opened to find that business was not a random series of happenings, but that it could be a knowable enterprise. This continues to fascinate him to this day.

He started consulting in 1996 and is a Certified Management Consultant through the Institute for Management Consulting. He has worked in heavy and light industry, service, aerospace, higher education, government, and non-profits. He is the President of The ROI Alliance, LLC. His website can be found at steveouellette.com.

Steve has a black belt in aikido, a non-violent martial art, and spent a year in Europe on a Thomas J. Watson Fellowship studying the “Evolution, Fabrication, and Social Impact of the European Sword.”




Part-to-part variation question

Very interesting article. I'm not sure that I would completely agree that part-to-part variation is of no interest to us in MSA, though...don't we desire a high value of the ratio of part-to-part variation to measurement-to-measurement variation? Aren't we testing to find out whether our instruments can detect part-to-part variation? True, in this example we are told to assume that each part produces exactly the same voltage in every reading (I'm glad you revisited that later...it's a great question to try to answer, before you throw out that gage!).

Hi Rip, and thanks for commenting! This leads to an (I hope) interesting topic....
Hmm, let me rephrase this way and see if it makes sense: the component of the variation due to part-to-part variability is not needed to determine gauge capability (a.k.a. %R&R), except that I need to account for it so that I don't over-estimate the gauge variability.
So we don't need a "high value of the ratio of part-to-part variation to measurement-to-measurement variation" in order to determine if our gauge is capable of correctly categorizing parts as conforming or not. Let's take the extreme: the parts are identical in every possible way and we are only making one part to one spec, so we only measure these identical parts as part of our measurement process. Even though the parts are identical, there still is gauge variability, which we can still quantify using this approach - it is wholly independent of the real part-to-part variability. We just need to know if the gauge is too variable to do that conform/nonconform classification correctly. (This right here is, I believe, the source of MANY misunderstandings about gauge capability - people don't understand that gauge capability tests the capability to make the right decision (usually conformance) - it was that way from the beginning when my colleague Dr. Luftig worked on this at Ford with Dr. Deming.)
That said, if I measure multiple parts to multiple nominals, I want to include that entire span in my gauge study ("exercise the gauge"), otherwise I have no external validity beyond that one voltage or whatever. In this case, we only have one nominal voltage we are measuring and one part that makes it, so a random sample of the product is germane to our research question. If we add another product/nominal voltage to what we measure, we *must* re-do the MSA before qualifying it for production since the gauge might not be at all capable of measuring that. Also, be sure to note that, while we assume each part produces the same actual voltage each time, each different part produces a slightly different real voltage than the other parts, and that is the component we need to take out for our purpose of understanding the gauge variability.
"Aren't we testing to find out whether our instruments can detect part-to-part variation?" Actually, not really. Or rather, it is not necessary. We are testing to see if the gauge is capable of making the conform/nonconform decision based on the customer specification. Of course, if I choose as my spec something on the order of what I want to detect in part-to-part differences, then that is what I am testing, but it is not necessary for gauge capability *to the customer spec*. Consider parts that differ from each other by a nanometer, but are made to a customer spec that is +/- 1 inch - you really going to pay for a gauge to detect those nanometer differences, or just stick with a tape measure? :)
On the (third?) hand, if your work in the process requires you to be able to detect smaller differences than the spec to the customer, you would have two different capability numbers for the same gauge variability, right? A %R&R of 10% maybe for the customer, but a %R&R of 50% for process improvement work (dividing the same standard deviation by a smaller number). Still, one might argue that while you might need a gauge that measures more *precisely* to work on the process, it probably is not economically justifiable that it goes much beyond the equivalent of a 10% R&R of the customer spec - why are you working at that fine a detail of the spec in the first place - no one is paying you for it! (Unless, like what happened to me once, you knew the spec was going to be greatly decreased soon....)


Yes! I see...I just never get this kind of a situation. I get parts that do vary, and an operating range, and then the concern is, how much of the variation comes from the parts and how much from the gage? Can we measure parts all the way across the range? Can I tell when something is near the edge of the spec or out of spec? Can I measure the results of a designed experiment, that usually tests a range outside the spec on either end, and several increments in between?
Of course, if it were, say a measurement of length, and I were to do the entire study using a standard (or to be more consistent with your example, eight copies of a standard), I would expect NO part-to-part variation, but I would still be able to characterize the measurement variation. Maybe it's just the voltage example that's throwing me...what if the voltage is varying from measurement to measurement? How could we know? In this case, it's something that produces voltage, and we are assuming that "these voltages are constant and the only variation we see in remeasuring a module is gauge variation." If I were being asked to justify scrapping this gage, what evidence do I have that the output voltage of the piece is not creating this variation?
As to the third hand, I almost always want to be able to detect smaller differences than the spec; Taguchi pretty much established that back in 1960. Most of the people I work with don't live in a "go/no-go" world anymore. The spec is the voice of the customer; for most uses I see, I need gages with the ability to measure the voice of the process.

Back to ya Rip!

Right - if you are using a gauge to measure a bunch of stuff, then you had better use the full range you expect to measure in your MSA. This is where a correlation between the magnitude and dispersion analysis I showed in the short- and long-term can protect you from making a very expensive bad decision. Additionally, I would test to make sure that the *bias* is constant with magnitude as well (a.k.a. linearity) - I have been bitten by that one before too, specifically on thermocouples.
We can answer all the questions in your first paragraph performing the ongoing long-term MSA I describe - all those sources are identified. Every day you re-measure those same 8 parts (kept in a vault no doubt), assess the state of the gauge, and if it is still in control and acceptable, then you can trust the measurements. So if I get a nonconforming part, I am willing to bet that it is the part and not the gauge. Note again that a gauge calibration sticker is *wholly inadequate* to answer that question!
I would NOT do a study on a standard, unless you manufacture standards! That would have no external validity as I talked about in the second MSA article: http://www.qualitydigest.com/inside/quality-insider-column/mystery-measu... Note again that we are not counting on the parts to be identical with each other, we are counting on each one to produce the same voltage indication each time. But, as you say, what if the voltage of each unit changes with time? "How," you ask, "could we know?" Well the one thing we do know is that our process *as-measured* can't meet the customer spec. Even with perfect zero part-to-part differences right in the middle of the spec, we cannot reliably determine if a part is conforming or not! So it is not a question of justifying scrapping a gauge - you would have no way of determining if it was in or out of specification - the real question is justifying further investigation (does the voltage vary over time and we need to redesign the module) and/or justifying a new voltmeter. Whatever it is, it is happening across the board, so we are in no position to be guaranteeing anything to our customer. This is BAD NEWS! Even scarier if we just did the MSA and have been using it this way for years...we have probably scrapped a lot of perfectly fine modules and sent on some horribly out of spec modules, and the database of our measurements would show not one part shipped out of spec, and only out of spec parts scrapped!
Careful what I mean here: I am not proposing going back to an "in-spec good, out of spec bad" go/no go world - quite the contrary, as my earlier articles show (if you like Taguchi you might enjoy http://www.qualitydigest.com/inside/six-sigma-column/show-me-money). Rather, one purpose of MSA is to determine if we can make a good "conformance decision" for that individual part, as I have been careful to say. That is what the %R&R tells us - what proportion of the spec width is taken up by gauge variation. That "twin peaks" graph above and in the other MSA articles shows you why %R&R (or P/T) needs to be relatively small. If you have a %R&R of 10%, you can divide up the spec into ten real measurement divisions (far better than just determining go/no go). 5.15 times the measurement standard deviation is the "actual" resolution of the gauge. For most purposes, a %R&R of 10% is entirely sufficient to be able to detect changes of a practical magnitude as part of your improvement efforts (and remember the power of the Central Limit Theorem gives you further power with repeated measurements - *if* the gauge is in control). If you think you need 20 or 50 or 100 divisions, you might be right, but I am going to push back and say, "OK, so what magnitude of an effect do you want to detect?" and if you answer 1/100 of the spec, I am going to ask, "OK, so who cares? Not your customer or they would not have such a proportionally wide spec." And you are going to have to have a good answer to convince me to shell out $500,000 for that new nano-gauge! :) Or another way of asking it is, "Is your customer going to pay you more to improve your conformance *to target* by 1/100 of the spec?" If not, you probably don't need to measure at that resolution anyway and should be spending your time and money on that OTHER process over there.

Another question by e-mail...name redacted until he comes along to claim it! :)
Hi Steve,
I've been following your discussions on the subject of MSA and now have a question regarding the most recent article that appeared in this month's Quality Digest - Performing a Long-Term MSA Study. My question is connected with the following paragraph extracted from your article:
"But two out of 200 observations is well within what we would expect with Type I error rate of 0.0027 (the rate you get a point out of the ±3σ control limits due to chance and chance alone)..."
Not sure if I follow your logic here and would appreciate if you could provide me with more explanation on this part of the statement, please.
Best regards,
Thanks for taking the time to read and ask the question!
OK, let's say that we are only using the "outside the +/-3 sigma limits" rule. (In MSA I would use others, particularly the run rule, but it only makes a false signal more likely, so let's keep it simple.) The probability of an individual occurrence of that due to chance and chance alone is 0.0027. That is our Type I error rate, or the rate of a "false positive" signal.
What is the likelihood of getting 2 signals in 200 due to chance and chance alone at that rate, then? Using MVPstats to calculate a one-sample exact proportion test (little dots are there to line things up):

One-Sample Proportion Test
....p = 0.0100............Po = 0.0027
..np = 2.0...............nPo = 0.5400
...n = 200
..95.00% Exact CI for P: 0.0012 to 0.0357
Exact Test for: P = 0.0027
Exact Binomial p-value(two-tailed) = 0.205

So a result at least this extreme has a p-value of around 0.20; seeing 2 false signals in 200 points is entirely plausible due to chance and chance alone. So I would conclude that the gauge is exhibiting reasonable stability - no reason to think we are seeing anything other than random variation.
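The tail probability can be reproduced with nothing more than the binomial formula. In the sketch below (hypothetical function names, and not how MVPstats computes it), the chance of two or more false signals in 200 points comes out near 0.10, and doubling that upper tail gives roughly the 0.205 reported above, which suggests the two-tailed value is twice the one-sided tail.

```python
import math

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(binom_pmf(i, n, p) for i in range(k))

upper_tail = prob_at_least(2, 200, 0.0027)
print(f"P(2 or more false signals in 200) = {upper_tail:.3f}")
print(f"doubled for a two-tailed value    = {2 * upper_tail:.3f}")
```

Either way the value is far above the usual 0.05 cutoff, so the conclusion of reasonable stability does not depend on which tail convention is used.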
Now, that is if I am looking at the 200 points all at once, right? But what I am really aiming for is an ongoing test. If instead of setting this up I was using it, I am only getting one point a day, let's say. Now if I see one of those charts go out of control, I must react - I don't have the context of 200 points, all I have is a point that is out RIGHT NOW!!! So in that case, I stop, check the part, check the gauge, and then do another measurement to see if I still have a problem before going ahead and approving the gauge for use today.
Make sense?