James Bossert


MSA: Back to the Basics

A review of the fundamentals of a gauge R&R study to remind us why it’s so critical

Published: Thursday, July 1, 2021 - 12:03

When we talk about measurement system analysis (MSA), people tend to focus on attribute agreement analysis because it is usually quicker and easier to do than a gauge repeatability and reproducibility (gauge R&R) study. This article is a review of the fundamentals for gauge R&R to remind us why it is so critical. We will review the basic definitions, go through a process for preparing a study, and then review the output in Minitab to make sure we understand what is going on in the analysis.

Why do we do a gauge R&R study in the first place? We do it for two reasons. One is to validate that the measurement process is acceptable. The second is to feel comfortable about the data. We want to avoid the embarrassment of presenting data in a meeting and having it challenged. We want to be able to show what we did to validate the data and to have confidence that the data are good. Doing MSA also helps us understand what is in the collection process and helps the people collecting the data see why it’s important to collect it correctly. If the data collectors understand why the study is being done, they will be more likely to identify problems when they occur.

I had a circumstance early in my career where I noticed some consistent data on an extrusion-molding process. This consistency was only on one shift of a three-shift operation. I went to the operator who was collecting the data and asked him to show me how he collected it. He shared that he did one reading each hour and then added in the other data points between them. He felt there would be no change because he had not adjusted anything, and so it would be OK to do it that way. I had failed to communicate why we were doing the analysis, and as a result my data were biased.

I stopped the data collection until I could meet with everyone who was doing it before restarting the study. The young statmagician had learned a lesson: Spend the time with all operators who may be collecting the data to explain what you are trying to do, so they collect the information correctly every time.

Gauge R&R is an analytic study that provides us with information about the data collection process. It gives us the information to determine if the gauges we are using are clouding the process information. An analytic study will give us an objective perspective of the causes of variation in the process. The benefit of doing this type of study is that it allows us to better understand how much the gauge affects the measurement process. If the gauge impact is low, we have good data about our process.

Bias and variance

That brings us to a question: How do we measure the quality of our data? We use statistics based on multiple readings of the samples that we are using. We use the same gauge, the same procedure, and the same operator, and test against a known standard. If the readings are all “close” to that standard, we say that the quality of our data is high.

But is it that simple? Basically, yes, but it requires the discipline to run the study exactly as it is set up. That means randomizing the samples so that we minimize any effect of bias and lurking variables. A lurking variable is an unknown variable that affects the measurement being taken.

One study undertaken in an industrial lab set out to understand why results were so varied. In this study, there were 10 people who could run the analysis, and each had a different way to prepare the sample and then run the test. To do the MSA study, we had to first identify one person as the standard for doing sample prep. Then we had to select who was going to participate in the study. I was looking for three people at a minimum. I asked who was the newest person on the team. Who was the person who had been there the longest? Then I picked one whose time on the team was in the median range. Why did I do that? I wanted to see the historical data from them. It would give me some clues about what bias and variation they might have. They would be the ones performing the MSA study once it was set up. I will come back to this later.

Bias is one of the two statistical properties that we look for to characterize the quality of data in a measurement system. The other property is variance. Bias is simply the location of the data relative to the reference or standard. We take our repeat samples, calculate an average, and compare it to the reference. How close it is defines our bias. This is shown in figure 1.

Variance is the spread of the data. In general, most measurement systems fail because of their variance.

Figure 1: Bias

Variance is influenced heavily by the other factors in an MSA: people, preparation, environment, equipment, and more. Some of these can be controlled, and some cannot. An MSA will help us identify what are the critical ones that we need to consider.
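As a minimal sketch of the two properties just described, the following Python snippet computes bias as the average of repeat readings minus the reference, and spread as the sample standard deviation. The function name and the readings are hypothetical and are not taken from the studies in this article.

```python
from statistics import mean, stdev

def bias_and_spread(readings, reference):
    """Return (bias, standard deviation) for repeat readings of one standard."""
    bias = mean(readings) - reference   # location of the data relative to the reference
    spread = stdev(readings)            # sample standard deviation (n - 1 divisor)
    return bias, spread

# Hypothetical repeat readings of a standard with a reference value of 4.50
readings = [4.52, 4.49, 4.51, 4.53, 4.50]
bias, spread = bias_and_spread(readings, reference=4.50)
print(f"bias = {bias:+.3f}, standard deviation = {spread:.3f}")
```

A bias near zero with a large standard deviation points to a variance problem rather than a location problem, which is the more common failure described above.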

Basic terms

Before we get too far in this discussion, let’s consider some definitions.

First is measurement. This is the “assignment of numbers [or values] to material things to represent the relations among them with respect to particular properties.”1 The process of assigning the numbers is defined as the measurement process, and the value assigned is defined as the measurement value.

The measurement system is the collection of instruments or gauges, standards, operations, methods, fixtures, software, personnel, environment, and assumptions used to quantify a unit of measure or fix assessment to the feature characteristic being measured; it’s the complete process used to obtain measurements.2

A gauge is any device used to obtain measurements.

A standard is a reference value. It is used as the criterion for acceptance. It must have an operational definition that provides the same results whenever it is used. If you purchase a testing standard from NIST, it comes with stated limits of uncertainty; for example, a solution with a pH of 4.5 +/– .05.

Discrimination is the smallest unit that can be read or reported. This may not always be the smallest unit that the gauge is able to measure, but rather what is reported. An example is a digital thermometer that can read to three decimal places (thousandths) but reports only to one decimal place (tenths). This is also where the 10-to-1 rule comes from. It is a metrology guideline stating that the gauge should be sensitive enough to measure to 1/10th of the tolerance that you are interested in. If you measure to 0.1, your gauge should be able to measure 0.01.
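The 10-to-1 rule is simple arithmetic, and a small check like the one below can make it explicit. This is only an illustrative sketch: the function name and parameters are hypothetical, and the small relative tolerance is there only to guard against floating-point round-off.

```python
def meets_ten_to_one(gauge_resolution, reporting_unit, rel_tol=1e-9):
    """True if the gauge resolves at least 10 times finer than the unit of interest."""
    # Ten gauge increments must fit inside one reporting unit.
    return gauge_resolution * 10 <= reporting_unit * (1 + rel_tol)

# The example from the text: measuring to 0.1 calls for a gauge that reads 0.01.
print(meets_ten_to_one(gauge_resolution=0.01, reporting_unit=0.1))
# A gauge that only reads to 0.05 would fall short of the guideline.
print(meets_ten_to_one(gauge_resolution=0.05, reporting_unit=0.1))
```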

Stability is the change in bias over time. This is also known as “drift.” Stability is a key assumption in MSA studies. Without stability, we run the risk of increasing error. What we want is a stable process where the error is consistent across the range of the data (figure 2).


Figure 2: Stability

Accuracy is the closeness to the given value or the true value. If our standard is a standard reference material, then we know it is a true value. If it is a working standard, then we consider it a given value.

Linearity is the change in bias over the normal operating range of the study. Traditionally, we want this to be consistent across the range we usually work in. However, when setting up an MSA study, we want to go beyond this to see how well the gauge works outside of that normal range.

Precision is the closeness of the repeated measures. This is the random error of the measurement system. Ideally, this is consistent across the range of the MSA study.

Repeatability is the measure of one instrument, with one person doing repeated measures on the same sample. It defines how good or consistent the instrument is when the same person does repeated measures.

Reproducibility is the measure of one instrument with multiple people doing repeated measures on the same sample. It gives us an idea on other sources of error, which could be people, processes, or environment.

When people talk about doing a gauge R&R study, they mean a study of the combination of repeatability and reproducibility (R&R). Another thing to consider when doing this kind of study is the sensitivity of the instrument. Sensitivity is defined as the smallest input that results in a detectable output. It is determined by the discrimination of the instrument, the inherent quality as designed in by the equipment manufacturer, the maintenance of the instrument, the operating condition of the equipment, and the standard being used for calibration.

Calibration is the process to establish, under specified conditions, the relationship between a measuring device and a traceable standard of known reference value and uncertainty. This is done to identify if any adjustments are needed to the instrument to account for any bias. Many organizations do this daily because it has an effect on the results reported. The results of the calibration are recorded to establish some traceability.

Setting up the study

Now let’s talk about setting up the study. This is a process that, if done correctly, will help give the designer a good idea of what possible sources of error may be encountered. When setting up the most robust study, it will help to take the following steps into account so you can have confidence in the results.

1. Map the measurement process. This is an important step. You must clearly identify all the steps in the process, including when the test sample is received, what preparation is completed, how the test is run, and how the results are recorded and sent to the requestor. It is important to take the time to document everything. This is where you find out how many people may run the experiment; what type of sample prep is done and how consistently; what is involved with the instrument setup; whether any refresh time is needed; and whether any special environmental conditions must be met, like temperature or humidity.

2. Walk the process. After the map has been created, walk through the process to ensure that the map is complete, and note any deviations and all environmental conditions.

3. Decide what standard will be used. This should be material similar to what is tested routinely, but certified to a given value and error. NIST has standard reference materials that can be purchased to perform studies. After the initial study has been completed, many companies create working standards that are made to NIST standards to use for routine calibration.

4. Find out how much time the test will take. This should be on a normal sequence, from prep to result. Depending on the test, this could range from two minutes to 30. This is necessary for the planning of the study. It will affect the number of samples used as well as the number of replicated analyses.

5. What degree of precision is acceptable? How many decimal places are needed? You should run the study to one more decimal place than is normally reported. This helps define the uncertainty in the study.

6. How is the process currently running? This gives you the baseline you are starting from. You can use traditional control charts for this.

7. What kind of sensitivity do you want to have in the study? This is based on the baseline data. Do you want to detect the same amount of variation (i.e., the same standard deviation) or less than that (smaller than the current standard deviation)? This is dependent on the results you have been seeing as well as what is desired. If the current amount of variation is acceptable, that makes the study much easier to do. It will require fewer samples and replications. If you want greater sensitivity (i.e., the ability to detect less than the current standard deviation), then you will need more samples and replications. We will discuss this after we finish completing the process steps.

8. Design the study. Now you have all the information that you need to create the study. Once you have determined how many operators, samples, and replications, then it is a “simple” matter of randomizing it so that the error is minimized, and you can start scheduling it.

9. Run the analysis and report on the results.
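The randomization called for in step 8 can be sketched in a few lines of Python. This is one possible approach, not a prescribed one: it lists every operator, sample, and replicate combination and shuffles the full list. The function name, operator labels, and seed are hypothetical, and some studies instead randomize the sample order separately within each operator.

```python
import random
from itertools import product

def randomized_run_order(operators, samples, replications, seed=None):
    """Return every (operator, sample, replicate) combination in random order."""
    runs = list(product(operators, samples, range(1, replications + 1)))
    random.Random(seed).shuffle(runs)   # a fixed seed makes the schedule reproducible
    return runs

# Three operators, six samples, four replications
runs = randomized_run_order(["A", "B", "C"], range(1, 7), replications=4, seed=1)
print(len(runs), "runs scheduled; first five:")
for operator, sample, replicate in runs[:5]:
    print(operator, sample, replicate)
```

Recording the seed along with the schedule lets you regenerate the same run order later if the study has to be audited or repeated.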


Let’s talk about sensitivity. As mentioned, sensitivity in an MSA is talked about in terms of standard deviation. This is simply because it is a neutral measure of variation. When talking to people about MSA, it is important to have the baseline data so you can talk in terms of the standard deviation. That way, people have some idea of what the variation is and what they really want. Many people want to be able to detect a very small difference until they find out how many samples and replications they would need.

The table in figure 3 shows how you can talk about sensitivity. For example, if you have six samples and four replications, you see the numbers 2.04 and .96. The 2.04 is the ratio of test-error standard deviations that remains undetected 10 percent of the time when a test is made at the 5-percent significance level. What this means is that if I have the test errors and divide the bigger by the smaller (similar to an F test), the difference will not be noticeable unless the ratio is more than 2.04. The larger standard deviation must be more than about two times the smaller before a difference is noticed. If I wanted a standard deviation of 5 and my test error was 10, the ratio would be only 2, and I would not see a difference unless it was greater than that.
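The arithmetic in this example can be written out as a short Python sketch. The 2.04 is the value quoted above for six samples and four replications. The function name is hypothetical, and the comparison is only the bigger-over-smaller ratio check described in the text, not a full F test.

```python
def ratio_detectable(sd_a, sd_b, critical_ratio):
    """Return (ratio, detectable) for the larger-over-smaller standard deviation ratio."""
    ratio = max(sd_a, sd_b) / min(sd_a, sd_b)
    return ratio, ratio > critical_ratio

# Desired standard deviation of 5, current test error of 10, tabled ratio of 2.04
ratio, detectable = ratio_detectable(10, 5, critical_ratio=2.04)
print(f"ratio = {ratio:.2f}, detectable = {detectable}")
```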

The .96 is the bias as a ratio of test-error standard deviations that remain undetected 10 percent of the time when a test is made at the 5-percent significance level. If we calculate an average, and it is greater than .96 of the ratio of the standard deviations, then I would say the bias is significant. I would need the error standard deviation to be equal before I would detect a difference. If my ratio of test deviations was 10/10, the difference must be greater than 10 for me to detect a difference.

Figure 3: Sensitivity chart for MSA

Why is all this important? Understanding historical variation and how much of a difference we want to see plays a big role in the studies. Similar to what we see in determining the sample sizes for hypothesis tests, the larger the variation in the current process, the more replications are needed to see a difference. This can cause problems when trying to design studies.

If you have too much variation to be able to detect any difference, the logical path to follow is to reduce the variation. The first recommendation to do so is to have everyone follow the exact same procedure for sample preparation. Although this may cause some grumbling from the people doing the prep, it does make a difference. Earlier, I mentioned the industrial lab study where we had three people participating. One was the newest team member, one had been on the team the longest, and one was from the median group. They were following the sample prep procedure that was performed by the one team member whom everyone on the team felt was the most consistent preparer. I did this simply because I wanted to minimize the resistance to the new sample-preparation process. The team was more likely to follow it if they agreed on who was the most consistent person doing the prep. This decision was based on the baseline study, which showed that there were problems in the measurement process.


Here are the results of the baseline study. There is no significant difference in parts, operators, or the interaction of the parts and operators. Minitab also runs the analysis without the interaction and, no surprise here, there is no change. When we look at the variance components, we see that the %Contribution for the Total Gauge R&R is 97.65, which tells us that the measurement process takes up almost all the variation in the study. Looking closer, we see that it is all in the repeatability component. What this tells us is that there is inconsistency in the measurement process and the operators.
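For readers who want to see where a %Contribution figure comes from, here is a minimal Python sketch of the usual ANOVA variance-component calculation for a crossed study (every operator measures every part the same number of times). It is an illustration under stated assumptions, not a reproduction of Minitab’s output: it always keeps the part-by-operator interaction term, truncates negative variance estimates at zero, and uses made-up data rather than the data from this study. The function name and data layout are hypothetical.

```python
from statistics import mean

def gauge_rr_contribution(data):
    """Percent contribution of each variance component for a crossed study.

    data[i][j] is the list of replicate readings for part i and operator j.
    """
    p, o, r = len(data), len(data[0]), len(data[0][0])
    grand = mean(y for part in data for cell in part for y in cell)
    part_means = [mean(y for cell in part for y in cell) for part in data]
    op_means = [mean(y for part in data for y in part[j]) for j in range(o)]
    cell_means = [[mean(cell) for cell in part] for part in data]

    # Sums of squares for the two-way layout with interaction
    ss_part = o * r * sum((m - grand) ** 2 for m in part_means)
    ss_op = p * r * sum((m - grand) ** 2 for m in op_means)
    ss_cells = r * sum((m - grand) ** 2 for row in cell_means for m in row)
    ss_inter = ss_cells - ss_part - ss_op
    ss_error = sum((y - cell_means[i][j]) ** 2
                   for i in range(p) for j in range(o) for y in data[i][j])

    ms_part = ss_part / (p - 1)
    ms_op = ss_op / (o - 1)
    ms_inter = ss_inter / ((p - 1) * (o - 1))
    ms_error = ss_error / (p * o * (r - 1))

    # Variance components from expected mean squares; negatives are set to zero
    repeatability = ms_error
    interaction = max(0.0, (ms_inter - ms_error) / r)
    operator = max(0.0, (ms_op - ms_inter) / (p * r))
    part_to_part = max(0.0, (ms_part - ms_inter) / (o * r))
    reproducibility = operator + interaction
    total = repeatability + reproducibility + part_to_part

    components = {
        "Total Gauge R&R": repeatability + reproducibility,
        "Repeatability": repeatability,
        "Reproducibility": reproducibility,
        "Part-to-Part": part_to_part,
    }
    return {name: 100 * value / total for name, value in components.items()}

# Hypothetical data: 3 parts x 2 operators x 2 replicates
data = [
    [[10.1, 10.3], [10.2, 10.0]],
    [[12.0, 12.2], [12.1, 12.3]],
    [[14.1, 13.9], [14.0, 14.2]],
]
result = gauge_rr_contribution(data)
for name, pct in result.items():
    print(f"{name}: {pct:.2f}%")
```

When the parts in a study are nearly identical, the part-to-part component shrinks toward zero and almost all of the variation is attributed to the gauge, which is the pattern reported for the baseline study.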

This analysis showed some of the difficulties of performing this study using historical data and not setting up the study properly. There is little difference between the samples but wide variation within some of them. There is also some difference in the operators. This results in not much difference seen across the board, but when looking at the R chart and X-bar chart, we see a difference in the pattern of operator 3. This is even more noticeable in the parts and operator interaction graph. Operator 3 is different from the other two operators.




The next analysis shows what happens when the same preparation was followed by all operators. We also looked at getting a wider range of samples so that we could better test the ability of the instrument to detect differences. This analysis shows the differences between the samples as well as the effect of everyone doing the same sample preparation.

The %Contribution now shows the total gauge R&R is 3.15 percent, which is in the acceptable range for a gauge R&R. The repeatability is much improved, as is the reproducibility. Now we see that the part-to-part %Contribution is at 96.85 percent, which is more of what would be expected. When we look at the graphs, we now see that the R chart shows some differences, but the rest show repeatable patterns as well as differentiation in the samples.



Looking at the traditional total gauge R&R guidelines, we see that the measurement system is now acceptable. Looking at the %Study Variation and the number of distinct categories, we are in the marginal zone. So while improvement has been made, there is still more to work on.
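The link between %Contribution and %Study Variation is a square root: %Contribution is a ratio of variances, while %Study Variation is a ratio of standard deviations. The sketch below applies that relationship to the 3.15 percent reported above, and also computes the number of distinct categories with the commonly used formula of 1.41 times the ratio of part-to-part to gauge standard deviations, truncated to a whole number. The function name is hypothetical, and the results are derived from the rounded 3.15 figure, so they may differ slightly from the software output.

```python
from math import sqrt

def study_variation_from_contribution(pct_contribution_grr):
    """Convert Total Gauge R&R %Contribution to (%Study Variation, distinct categories)."""
    grr_fraction = pct_contribution_grr / 100      # share of total variance
    part_fraction = 1 - grr_fraction               # remainder is part-to-part
    pct_study_var = 100 * sqrt(grr_fraction)       # ratio of standard deviations
    ndc = int(1.41 * sqrt(part_fraction / grr_fraction))  # truncated to an integer
    return pct_study_var, ndc

pct_study_var, ndc = study_variation_from_contribution(3.15)
print(f"%Study Variation = {pct_study_var:.1f}, distinct categories = {ndc}")
```

A %Contribution that looks small can still correspond to a %Study Variation in the 10-to-30 percent band that the commonly cited AIAG guidelines treat as marginal, which is consistent with the point that there is still more to work on.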

Here are some general guidelines for working on a measurement system.

First, if the main source of variation is repeatability, then you need to change the measurement system. If the main source of variation is the people (reproducibility) then you need to look at the parts of the measurement system where people interact with it, as in the example of the sample preparation. It could also be training or lack of a standard procedure.

You must walk the process to see where the differences occur to uncover the possible root cause. You may find that it is the instrument, and that changing it may or may not be an option, so you have to look at how often the calibration is performed, or do some analysis to see what the limit of detection or the limit of quantification is. These may help identify if the instrument is capable of the desired analysis, or if you need to purchase new equipment. Sometimes this is a result of changing customer requirements that are now in place, but the equipment is not capable of consistently reporting it.

This article showed the criticality of properly setting up a gauge R&R study and the impact it can have. It is a key tool for the quality practitioner to have and to use when looking at measurement systems.

1. Eisenhart, Churchill. “Realistic evaluation of the precision and accuracy of instrument calibration systems.” Journal of Research of the National Bureau of Standards (U.S.), vol. 67C, no. 2, pp. 161–187, 1963.

2. Ku H. H., editor. Precision Measurement and Calibration: Statistical Concepts and Procedure, vol. 1. National Bureau of Standards, U.S. Dept. of Commerce, 1969.


About The Author


James Bossert

Jim Bossert (PhD) is a Lean Six Sigma Master Black Belt and a Certified Change Agent for High Reliability Initiatives at the Joint Commission Center for Transforming Healthcare. In this role, he is responsible for leading and coordinating activities supporting the adoption of high reliability practices within Healthcare organizations. He has been actively involved in quality and process improvement in all levels of management for over 30 years. He is the author of the Supplier Management Handbook (6th Edition) and Supplier Certification among others. He is an ASQ Fellow and received the Distinguished Service Medal from ASQ in 2012. Jim received his MBB from GE. He served seven years as the editor of the Six Sigma Forum magazine. He holds the following ASQ Certifications: CQE, CQA, CQM/OE, CSSBB, CSSMBB.



Comments on your study examples


Thanks for the article, a few observations on the examples.

This may be a terminology aspect, but I think it's very important. You mention in your wording for example 1 that the issue relates to the consistency of the measurement process, yet the range chart indicates that the measurement process is consistent; if nothing changed, it could go on measuring at this level.

The fact that the measurement system, based upon this study cannot distinguish between your chosen parts is evident from the X Bar Chart, in fact there's no point in looking at any other numbers if this is important because all the part values are plotted within the control limits which contain measurement variation, in other words we are measuring in the noise.

In example 2 you change the method to standardise the part preparation, which gives the measurement process more of a chance to distinguish between parts (you may have changed your part selection too) and has also decreased the measurement variation, as depicted by the lowered standard deviation.

By looking at the X Bar chart associated with example 2 now we can see that the observations are all outside the control limits therefore based on this selection of parts we can be comfortable that we can distinguish between them. Note that the R Chart still shows operator consistency, we may not be comfortable with the measurement error but we can be confident that it can be replicated.

If we need to further improve we can always look at systematic detectable differences between the operators by using other methods.

Out of interest, what is the number of replicates per part here in order for the software to calculate sd's of 2.31 and 0.889 for the two examples?