Published: Wednesday, March 24, 2021  12:03
A quick Google search returns many instances of the saying, “A man with a watch knows what time it is. A man with two watches is never sure.” The doubt implied by this saying extends to manufacturing plants: If you measure a product on two (supposedly identical) devices, and one measurement is in specification and the other out of specification, which is right?
The aforementioned doubt also extends to healthcare, where measurement data abound. As part of the management of asthma, I measure my peak expiratory flow rate (discussed below), and I now have two handheld peak flow meter devices. Are the two devices similar or dissimilar? How would I know? To see how I investigated this, and to see the outcome, read on. A postscript is included for those wanting to dig a bit deeper.
In 2015, I was diagnosed with asthma, a chronic condition where the airways in the lungs can narrow and swell, making breathing more difficult. The worst of it occurred at my in-laws' home, where I experienced wheezing and had difficulty breathing. The cause? The family cat!
After diagnosis I got not only medication—an inhaler—but also a peak flow meter. This device allowed me, as the patient, to measure and monitor my progress due to the prescribed treatment. The peak flow test is used to measure how fast one can breathe out after having taken in a full breath. My doctor gave a target of 460 L/min but, initially, I was way short with values in the vicinity of 400 L/min.
By 2016, half a year after the diagnosis, my average daily peak flow had increased by more than 10 percent to around 450 L/min. Moreover, the occasional low (and more concerning) peak flow values below 350 L/min had effectively stopped. Things had improved.
A few weeks ago I got the second peak flow meter. Both are shown in figure 1. I used the new one a few times but then decided I should more carefully check out the similarity, or dissimilarity, of the two devices. I didn’t want to be misled by some dissimilarity in this new device.

In my day job I learned a long time ago that success with data relies on three key ingredients:
• Context: What do the data represent? (Who got the data? From where? When? How? etc.)
• Use an analysis that separates the signals from the noise: Most often a control chart does this job for me.
• Keep things as simple as possible: Use graphs where possible (time series plots, histograms, control charts, etc.), not least for communicating the results, and the proposed actions, to others.
Herein, the context comes largely from the test design. In most cases we get our data as a byproduct of an existing process, or system, in which case the person collecting the data may well have to record not just the measurement values obtained but also some context, such as the sampling point and the time of sampling, the processing conditions at the time, the time the measurement of the sample was carried out, and so forth.
I decided to collect data from the two peak flow meters in the morning and evening over four days, giving what you might call “eight rounds” of data collection. Figure 2 provides the measurement values and the needed context to guide the analysis.

Figure 3, using day one’s morning measurements as the example, shows the sequence in which the data were collected. The six measurements were as follows:
1. Taken with the original peak flow meter
2. Taken with the new peak flow meter and within a few seconds of the first measurement
3. Taken 10 minutes later and with the original peak flow meter
4. Taken with the new peak flow meter and within a few seconds of measurement three
5. Taken another 10 minutes later and with the original peak flow meter
6. Taken with the new peak flow meter and within a few seconds of measurement five
Referring back to figure 2, note that on days two and three, there was a switch to use the new peak flow meter first. This was done to reduce the risk of any systematic effect creeping in, in case the first (or “freshest”) of the peak flow measurements was better than the rest.
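The collection sequence described above can be sketched in a few lines of Python. This is only an illustration of the design, not the author's own tooling: the names `build_schedule` and `NEW_FIRST_DAYS` are hypothetical, and the sketch assumes the meter order was switched for both the morning and the evening rounds on days two and three.

```python
from itertools import product

DAYS = [1, 2, 3, 4]
SESSIONS = ["morning", "evening"]
REPEATS = [1, 2, 3]        # three paired readings per round, about 10 minutes apart
NEW_FIRST_DAYS = {2, 3}    # assumption: the switch applied to both rounds on these days


def build_schedule():
    """Return the planned measurements as a list of dicts, in collection order."""
    schedule = []
    for day, session, repeat in product(DAYS, SESSIONS, REPEATS):
        # Each repeat is a back-to-back pair: one reading per meter.
        if day in NEW_FIRST_DAYS:
            order = ["new", "original"]
        else:
            order = ["original", "new"]
        for meter in order:
            schedule.append(
                {"day": day, "session": session, "repeat": repeat, "meter": meter}
            )
    return schedule


schedule = build_schedule()
rounds = {(row["day"], row["session"]) for row in schedule}
print(len(rounds), "rounds,", len(schedule), "measurements")
```

Listing the plan this way makes it easy to confirm that every round contains three readings from each meter before any data are collected.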
In order to separate the signals from the noise, we have to carefully organize the data so the noise level can be estimated. (For those who use control charts, especially the average and range chart—also known as the Xbar R chart—it is probably apparent how this is done, based on figure 2.)
The test design was based on the principal question this test is designed to answer: Are the two peak flow meters similar, or dissimilar? If the two peak flow meters are effectively:
• Similar, the analysis will be void of signals (see the postscript for examples)
• Dissimilar, the analysis will give me a “signal”

It was important, therefore, to avoid grouping together the data from the original and the new peak flow meters—the two sources under study—when estimating the background noise. To achieve this, the data were grouped as shown in figure 4 through the use of rational subgrouping in SPC.
A cursory glance at figure 2 (or figure 5, below) shows that the three values in each subgroup are nearly always numerically different: In subgroup 1 we have 525, 470, and 445 L/min. The range of the subgroups (80 L/min for subgroup 1) is used to estimate the routine level of variation to expect in the measurements, i.e., the noise level, when my capability to breathe out remains essentially unchanged. Why? Because, in a short time frame of 20 minutes, it is unlikely that the state of the airways in my lungs changes when no stimulus is present to provoke this.
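As a small worked example of the subgroup statistics, the sketch below computes the average and the range for subgroup 1, using only the three values quoted in the text. The helper name `subgroup_summary` is hypothetical.

```python
# Subgroup 1: three readings (L/min) quoted in the text.
subgroup_1 = [525, 470, 445]


def subgroup_summary(values):
    """Return (average, range) for one subgroup of repeat measurements."""
    average = sum(values) / len(values)
    value_range = max(values) - min(values)
    return average, value_range


avg_1, range_1 = subgroup_summary(subgroup_1)
print(f"average = {avg_1:.1f} L/min, range = {range_1} L/min")
```

Applying the same helper to each of the subgroups and averaging the resulting ranges gives the average range, which is the noise estimate the charts are built on.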
My ability to breathe out could, however, have changed over the course of this test, either from morning to evening, or from day to day. The test design handles this because measurements on the two peak flow meters are always carried out in parallel. If the design had been different, such as all measurements with the original peak flow meter on days one and two, and all measurements with the new peak flow meter on days three and four, it could have been impossible to know if a signal was due to a dissimilarity in measurement device—the peak flow meters—or me if, for example, my airways were more open, or blocked, on the different days.
The job at hand is to separate the signals from the noise in a simple and understandable way, and the family of analysis of means techniques is great for this. These techniques are best suited to finite, or complete, data sets. In figure 2 we find 48 data values; we are “done and dusted” because there will be no 49th value.
An example of a continuing data stream, where we are not “done and dusted,” is a factory production line, because each new batch or production run brings us more data. Although it is technically possible to apply the analysis of means to continuing data streams, they are better served by SPC control charts.
Even though control charts were developed for continuing data streams, some analysts might choose to place the data from this study on an average and range chart. Others, who work with designed experiments, might use the analysis of variance (ANOVA).
Why the analysis of means? Because we can combine the simple picture of the average and range chart with some of the versatility inherent to ANOVA. One element of this versatility is the choice of alpha level, which is briefly discussed in postscript 2. Here I use the standard alpha level of 5 percent.
Perhaps the key feature of the analysis of means techniques is the detection limits. These limits separate the signals from the noise, just as control limits seek to do on a control chart. Hence, any point falling outside these detection limits is a signal (technically, one speaks of potential signals, since false signals can occur, even if only rarely).
Figure 5 presents the data for analysis. There are 16 subgroups (k = 16), with each subgroup consisting of three measurement values (n = 3). For each subgroup the calculated average and range are shown.
Figure 6 shows the charts used for the first analysis. The upper chart is an ANOM chart (ANalysis Of Means) and the lower chart an ANOR chart (ANalysis Of Ranges). The ordering of the subgroups on the x-axis is the time sequence of the test. (How to calculate the limits on the ANOM and ANOR charts is shown in postscript 1.)
Starting with figure 6’s ANOR chart, none of the subgroup ranges fall beyond the upper detection limit, meaning there is no signal in this chart to indicate a lack of consistency in the measurements. (A lack of consistency would call the measurements into question, putting in doubt whether a subgroup average should, or should not, be included on the ANOM plot. As with regular control chart usage, a signal on the ANOR chart should be investigated with a view to identifying and then eliminating the cause of the inconsistency.)
Figure 6’s ANOM chart contains signals: There are two subgroup averages above the upper detection limit (the circled points). A signal represents a detectable change, or difference, as per the data. With a signal, or signals, at hand, the question is, “What caused the change?” In other words, signals grant the license to go hunting.
The x-axis labels tell us that the two detectably higher averages were obtained with the original peak flow meter. As we seek to further decipher these signals, there is another, albeit more subtle, clue on the ANOM chart: All of the subgroup averages from the new peak flow meter are below the central line, where the central line is the average of all 48 measurements. (In presenting this observation more clearly to others, one could choose to give a different symbol, or color, to the points from the respective peak flow meters.)
These two observations lead to the hypothesis that the measurements on the original peak flow meter tend to be higher in value than those on the new peak flow meter. This hypothesis is one of measurement bias and can be clearly examined with an ANOME chart, which is an ANalysis Of Main Effects chart. This chart, moreover, is unbeatable when it comes to communicating the results to others. In this study of two peak flow meters, the ANOME chart plots two points which are the average of:
• All 24 measurements on the original peak flow meter: 458.125 L/min
• All 24 measurements on the new peak flow meter: 423.75 L/min
Figure 7 shows the ANOME chart. Once again, points outside the detection limits are signals. (How to calculate the limits on the ANOME chart is shown in postscript 1.)
Figure 7 shows very clearly that the two peak flow meters measure differently. Measurements on the original meter are, on the average, of greater value than those on the new meter, meaning there is a demonstrable bias between the two devices. The estimated magnitude of this bias, 34.375 L/min, is the difference between the two averages on figure 7, as illustrated in figure 8.
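The arithmetic behind the bias estimate can be checked with a short sketch. It uses only the two meter averages and the count of 24 measurements per meter given in the text; the variable names are illustrative.

```python
# Averages (L/min) of the 24 measurements on each meter, as quoted in the text.
original_avg = 458.125
new_avg = 423.75
n_original = 24
n_new = 24

# Estimated relative bias: how much higher the original meter reads on average.
bias = original_avg - new_avg

# Grand average of all 48 values, i.e., the central line of the charts.
grand_average = (n_original * original_avg + n_new * new_avg) / (n_original + n_new)

print(f"bias = {bias} L/min, grand average = {grand_average} L/min")
```

Because both meters contributed the same number of measurements, the grand average sits exactly midway between the two meter averages.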
The one remaining member of the analysis of means family is the ANOMR chart, or the ANalysis Of Mean Ranges chart. What to look for in the ANOMR chart can be anticipated from the ANOR chart. In figure 6’s ANOR chart, the three biggest subgroup ranges come from measurements on the original peak flow meter. Hence, might there be more variation in the original peak flow meter’s measurements?
The ANOMR chart, shown in figure 9, detects a difference in measurement variation between the two peak flow meters. The signals in figure 9 tell us that there is a greater level of variation in the measurements from the original peak flow meter. This effect can be estimated using the standard deviations as shown:
Hence, there is estimated to be around double the variation in the measurements from the original peak flow meter. In many cases, such as for measurement methods used on the factory floor or in the laboratory, this would be an important discovery requiring further investigation and appropriate action.
What did I do and, by extension, what could you do? I:
• Defined the question of interest, which was to compare the two peak flow meters
• Designed a simple test to be able to answer the question of interest
• Executed the test, as per the test design, to obtain the data
• Organized the data so as to estimate the background noise in the data
• Used the analysis of means family of techniques to detect the signals in the data and learn what was there to be learned by interpreting the charts in context
Below is a summary of the analysis of means techniques:
With help from the family of analysis of means techniques, we’ve learned that the two peak flow meters measure differently. There is:
• A bias between the two peak flow meters, with the original peak flow meter estimated, on average, to measure approx. 34 L/min higher than the new peak flow meter
• More routine measurement variation in the original peak flow meter, with its standard deviation being close to double that of the new peak flow meter
The detection of a bias did not come as a surprise. I carried out this test because I suspected the new peak flow meter tended to measure a bit lower than the original peak flow meter.
The bias introduces a problem that I cannot resolve: I have no clue which of the two peak flow meters is “right.” Why? Because I don’t have a “master measurement method” to compare against. I have discovered a relative bias between the two peak flow meters. Consequently, I do not know which of these three possibilities applies:
• Does the original peak flow meter give the truest indication of my peak flow?
• Does the new peak flow meter give the truest indication of my peak flow?
• Does neither of the two peak flow meters represent my actual peak flow?
The safest approach is to assume the third option, meaning that the best I can actually do with the peak flow measurements is to take the values I get as some kind of approximation of where my peak flow is actually at. Periods of relative consistency will indicate that my peak flow is effectively unchanged. Inconsistencies in the data, such as sudden changes, or sustained shifts up or down, will serve to tell me if my peak flow is getting better or getting worse.
The extra measurement variation in the original peak flow measurements did come as a surprise and isn’t easy to explain. Nonetheless, no matter how hard I tried, I could not get above 450 L/min with the new peak flow meter, and the results were less spread out than those from the original peak flow meter.
In business, one way of getting around the problem of greater measurement variation is to carry out more repeat measurements to reduce the variation around the average of the measurements. In a factory this might be too time-consuming or expensive to do in practice, but not at home. However, for peak flow monitoring, the medical recommendation is to take the maximum of three measurements, so this wouldn’t be as straightforward as the case where the average of repeat measurements is used, and the variation decreases in proportion to 1 divided by the square root of the number of repeat measures.
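The square-root rule for averages can be illustrated with a minimal sketch. It assumes independent repeat measurements with the same standard deviation, and it applies only to averages, not to the maximum-of-three rule used for peak flow. The function name is hypothetical.

```python
import math


def relative_sd_of_average(n_repeats):
    """Standard deviation of an average of n repeats, relative to a single reading.

    Assumes independent repeats that share the same standard deviation.
    """
    return 1.0 / math.sqrt(n_repeats)


for n in (1, 2, 3, 4):
    print(n, "repeat(s):", round(relative_sd_of_average(n), 3))
```

Under these assumptions, an average of four readings has half the standard deviation of a single reading, which is roughly what it would take to offset a meter with about double the measurement variation.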
Finally, having learned that two supposedly similar peak flow meters measure differently, what about other asthma patients who monitor their peak flow? I haven’t changed my peak flow meter, but others will. In such cases, it can only be hoped that an apparent improvement, or deterioration, in the patient’s condition as per his/her peak flow data will not lead to a misdiagnosis, resulting in, for example, an unwarranted change of medication.
1. Calculations
The calculation of the detection limits seen on figure 6’s ANOM and ANOR charts is below. The scaling factors, of value 1.045 and 2.67, are found in reference 1, below.
The calculation of the detection limits seen on the ANOME and ANOMR charts is below (figure 7 and figure 9, respectively). The scaling factors, of value 0.172, 1.260, and 0.740, are found in reference 1, below.
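For readers who want to reproduce this kind of calculation, here is a hedged sketch. It assumes the detection limits take the form described in reference 1: grand average plus or minus a scaling factor times the average range for the ANOM and ANOME charts, and a scaling factor times the average range for the ANOR and ANOMR charts. The function names are hypothetical, and the average range used below is a placeholder rather than the value from this study, which appears only in the figures.

```python
def mean_chart_limits(grand_average, average_range, factor):
    """Lower and upper detection limits for an ANOM or ANOME chart (assumed form)."""
    half_width = factor * average_range
    return grand_average - half_width, grand_average + half_width


def anor_upper_limit(average_range, factor):
    """Upper detection limit for an ANOR chart (assumed form)."""
    return factor * average_range


def anomr_limits(average_range, lower_factor, upper_factor):
    """Lower and upper detection limits for an ANOMR chart (assumed form)."""
    return lower_factor * average_range, upper_factor * average_range


# Grand average of all 48 values, derived from the two meter averages in the text.
GRAND_AVERAGE = 440.9375
# HYPOTHETICAL placeholder: substitute the average of the 16 subgroup ranges.
AVERAGE_RANGE = 40.0

# Scaling factors quoted in the text for k = 16, n = 3, m = 2 at the 5-percent level.
print("ANOM :", mean_chart_limits(GRAND_AVERAGE, AVERAGE_RANGE, 1.045))
print("ANOR :", anor_upper_limit(AVERAGE_RANGE, 2.67))
print("ANOME:", mean_chart_limits(GRAND_AVERAGE, AVERAGE_RANGE, 0.172))
print("ANOMR:", anomr_limits(AVERAGE_RANGE, 0.740, 1.260))
```

The printed limits are meaningful only once the placeholder average range is replaced with the value computed from the actual subgroup ranges.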
2. Alpha level
One choice to make with the analysis of means techniques that could be new to some people is that of the alpha level. One can make this choice from the descriptions below without worrying unduly about the underlying theory:
• 10-percent alpha: You believe your data are likely to contain signals (such as the effect of an inhaler on peak flow or a change in temperature on product viscosity) and you do not want to miss them, so you err in the direction of risking a false signal.
• 5-percent alpha: The default choice. You might not be sure and opt for a traditional, or standard, approach to the analysis.
• 1-percent alpha: You want to exercise caution in your search for signals, wanting to keep to a minimum the risk of false signals (also called “false alarms”).
In going from 10 percent to 5 percent to 1 percent, the analyst is playing things “safer,” which means the detection limits get further apart, or wider, as the alpha level gets smaller.
Once the alpha level has been chosen, and with the values for k and n known, one looks up the scaling factors to use (see reference 1). With the ANOME and ANOMR charts, one also has a value for m. (In this test, we had k = 16, n = 3, and m = 2.) With the chart’s scaling factor(s) at hand, one proceeds to compute the upper and lower detection limits, as shown above in postscript 1.
3. Time-of-day effect
In the main text, ANOME and ANOMR charts were used to compare the two peak flow meters. Perhaps you wondered whether the time of day—morning or evening—causes a change in my peak flow? The charts below answer this question.
In both charts the points are inside the detection limits. There is no signal. No effect, or change, is detected. This makes sense to me because I have no feeling that my peak flow is better in the morning.
Yet how many would take the data and say that my peak flow is better in the morning? The numbers give a morning average that is 10.2 L/min higher than the evening average. (The morning average is 446.0 L/min and the evening average is 435.8 L/min.)
The ANOME chart puts the brake on this interpretation because there is no signal in the data, and without a signal, there is no effect to estimate. This result is important: Being misled by the noise in your data is a bad idea. If you intervene in your process when the difference between two or more numbers is nothing but noise, you run the risk of making things worse, not better.
4. Four thermometers
Reference 2 below, Quality Control and Industrial Statistics, Third Edition, contains an example on page 549 in which three determinations of the melting point of hydroquinone were carried out on four different thermometers (k = 4, n = 3). The author, A. J. Duncan, asked: “Do the thermometers read differently, or can the variations between thermometers be attributed primarily to chance (routine or random) errors?”
The data are shown above. The author used the analysis of variance, also shown above, to conclude, “...we do not reject the null hypothesis and tentatively conclude that the thermometers really read alike.” (Even if you are comfortable with all the terms in the ANOVA table, are your colleagues?)
Using the analysis of means (ANOM and ANOR plots), we see in picture form that there are no signals in the thermometer data. Hence, without a signal in either chart, we can reasonably say that:
• [ANOR chart] There is no lack of consistency in the data, neither is there any indication that the thermometers display differing levels of measurement variation.
• [ANOM chart] There is no evidence that any of the thermometers measure detectably higher or lower than the other thermometers.
The analysis of means allowed us to reach the same conclusion as the one-way ANOVA, but in a simpler manner and without loss of effectiveness. Moreover, with the ANOR plot available, we could also examine each thermometer for measurement consistency.
This result is also important: Based on the temperature data collected, ANOM tells you, for example, that it would be a mistake to adjust the process based on differences in single temperature readings from the different thermometers.
5. Four suppliers
The question is, “Do we get similar quality from four suppliers?” This last example uses data on a key characteristic in an ingredient from four different suppliers.
The chart below is a 5-percent ANOM chart. The width of the detection limits varies from supplier to supplier because the amount of data provided by the suppliers was different, resulting in subgroups of unequal size. (How to work with unequal subgroup sizes is in reference 1.) With two points outside the detection limits, dissimilarities between the suppliers are detected. Suppliers 3 and 4 are high in concentration, and suppliers 1 and 2 low in concentration.
As mentioned above, some might place these data on an average and range control chart. The average chart, shown below, has no signals, which is consistent with the conclusion that all four suppliers are similar.
Why the different conclusion? In this example we have:
• Four suppliers, so k = 4
• Between six and 12 values per supplier, so 6 ≤ n ≤ 12
When k < n, the average and range chart has a very low theoretical alpha level. In other words, the chart has wider limits than the ANOM chart and lacks sensitivity to detect signals; it errs in the direction of missing signals to avoid false signals. In this case, the ANOM chart is better than the average chart. A “play it safe” analysis, which the average chart gives, is too cautious.
Hence, for finite data sets, an advantage of the analysis of means is this ability to specify the alpha level to meet the needs of the situation.
References
1. Wheeler, D. J. Analyzing Experimental Data, SPC Press, Knoxville, Tennessee, 2013.
Source of the background material and the scaling factors used to compute the detection limits. A Quality Digest article by Donald J. Wheeler, “The Analysis of Experimental Data,” is a good starting point and contains some of the more commonly used scaling factors.
2. Duncan, A. J. Quality Control and Industrial Statistics, Third Edition, Irwin-Dorsey International, London, 1965.
Source of the four thermometers example in postscript 4.