Published: Monday, July 10, 2023 - 11:03

Chunky data can distort your computations and result in an erroneous interpretation of your data. This column explains the signs of chunky data, outlines the nature of the problem that causes it, and suggests what to do when it occurs.

When the measurement increments used are too large for the job, the limits on a process behavior chart, as well as other statistical techniques, can be distorted. This distortion can lead to spurious results. Fortunately, this problem is easily detected by ordinary, production-line process behavior charts. No special studies are necessary; no standard parts or batches are needed. You simply need to recognize the telltale signs.

Most problems with process behavior charts are fail-safe. That is, the charts will err in the direction of hiding a signal rather than causing a false alarm. Because of this feature, when you get a signal, you can trust the chart to guide you in the right direction. Chunky data are one of two exceptions to this fail-safe feature of the process behavior chart. (The other exception was the topic of last month’s column.)

Data are said to be chunky when the distance between the possible values becomes too large. For example, what would happen if measurements of the heights of different individuals were made to the nearest yard? Clearly, the variation from person to person would be lost in the round-off, and any attempt to characterize the variation in heights would be flawed. When the round-off of the measurements begins to obliterate the variation within the data, you will have chunky data. The effect that chunky data have on process behavior charts is illustrated by the following example.

The data in figure 1 are the measurements of a physical dimension on a plastic knob. These data were recorded to the nearest one-thousandth of an inch (0.001 in.). There are no signals of exceptional variation on either the average chart or the range chart. Since these data show no evidence of a lack of homogeneity, we would conclude that the process producing these rheostat knobs is being operated predictably.

To illustrate the effect of using a measurement increment that is too large, we shall round off the measurements in figure 1 to the nearest hundredth of an inch. While we would never do this in practice, we do it here to simulate what would happen if the measurements had only been recorded to two decimal places. After rounding these data, the averages and ranges were recomputed and a new average and range chart was obtained. In figure 2 we find four averages and two ranges outside the limits. The usual interpretation of the chart in figure 2 would be that these data show a lack of homogeneity, and that the underlying process is changing in some manner.

However, we know that the charts in figure 1 and those in figure 2 both represent the same process. The only difference between the two charts is the measurement increment used. Based on figure 1, we have to conclude that the “signals” in figure 2 are actually false alarms created by the round-off operation.

By comparing figures 1 and 2, it should be apparent that *chunky data can make a predictable process appear to be unpredictable*.

Fortunately, it’s easy to tell when the data have become chunky. The key is in understanding the differences between figure 1 and figure 2.

As may be seen in figure 3, the running records of the averages are trying to tell us the same story. However, the fewer possible values in figure 2 make the running record look more “chunky” than that in figure 1. This chunkiness also results in highs and lows that are more extreme in figure 2. At the same time, the similarity of rounded values within each of the subgroups results in many zero ranges in figure 2. These deflate the average range, which in turn deflates the limits.

So while the highs and lows get emphasized by the larger measurement increments, the limits get squeezed. When this happens it’s inevitable that the running record and limits will eventually collide and produce false alarms.
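The squeeze described above can be reproduced with a small simulation. What follows is only a sketch on synthetic data, not the knob data from figure 1: the process mean (2.500 in.), the standard deviation (0.002 in.), the 200 subgroups of size five, and the random seed are all assumptions chosen to make the effect easy to see. In particular, the mean is placed exactly on a recordable hundredth. `A2` and `D4` are the standard chart constants for *n* = 5.

```python
import random

# Standard chart constants for subgroups of size n = 5.
A2, D4 = 0.577, 2.114


def chart_summary(subgroups):
    """Return (average range, number of zero ranges, averages outside limits)."""
    averages = [sum(s) / len(s) for s in subgroups]
    ranges = [max(s) - min(s) for s in subgroups]
    grand_average = sum(averages) / len(averages)
    average_range = sum(ranges) / len(ranges)
    upper = grand_average + A2 * average_range
    lower = grand_average - A2 * average_range
    outside = sum(1 for a in averages if a > upper or a < lower)
    return average_range, ranges.count(0), outside


rng = random.Random(1)  # fixed seed (assumption) so the run is repeatable

# Synthetic measurements stored as integer thousandths of an inch.
# Hypothetical process: mean 2.500 in., standard deviation 0.002 in.
fine = [[round(rng.gauss(2500, 2)) for _ in range(5)] for _ in range(200)]

# Re-record every value to the nearest hundredth (10 thousandths), half up.
coarse = [[((x + 5) // 10) * 10 for x in s] for s in fine]

fine_rbar, fine_zeros, fine_out = chart_summary(fine)
coarse_rbar, coarse_zeros, coarse_out = chart_summary(coarse)

print("increment 0.001:", fine_rbar, fine_zeros, fine_out)
print("increment 0.01: ", coarse_rbar, coarse_zeros, coarse_out)
print("upper range limits (thousandths):", D4 * fine_rbar, D4 * coarse_rbar)
```

Under these assumptions the coarse version should show far more zero ranges, a much smaller average range, and more averages outside the limits, even though both versions describe the same simulated process.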

So how can we spot this problem? The very look of the running records is one clue; the abundance of zero ranges is another. However, the clear-cut, unequivocal indicator of chunky data is the number of possible values within the limits on the range chart.

The ranges will always have the same increments as the original data. Thus, the ranges in figure 1 are all multiples of one-thousandth of an inch. The range chart from figure 1 is reproduced in figure 4. The upper range limit is 0.0181. Dividing by 0.001, we find 19 possible values (from 0 to 18 thousandths) within the limits on the range chart.

The range chart from figure 2 is given in figure 5. There, the upper range limit is 0.0102. Dividing by 0.01, we find only two possible values (0 and 1 hundredth) within the limits on the range chart.

So, how many values inside the limits do you need to be safe from the effects of chunky data?

For average charts with subgroups of size *n* = 3 or more, you need to have at least five possible values for the range within the limits on the range chart to be safe from the effects of chunky data.

For *XmR* charts and average charts with subgroups of size *n* = 2, you need to have at least four possible values for the range below the upper range limit to be safe from the effects of chunky data.

When you fail to meet one of these minimums, your data are chunky, and the limits may be deflated by the round-off inherent in the measurement increment you are using.

Figure 1, with 19 possible values, shows no problem due to chunky data. Figure 2, with only two possible values, shows data that are definitely chunky—the measurement increment is too large for the purposes of creating a useful and meaningful process behavior chart. Since chunky data can create false alarms, you can’t safely interpret the “signals” of figure 2 as evidence of exceptional process variation. Thus, while chunky data may still be used for inspection, they can’t be used to characterize process behavior, and neither should they be used with any other statistical inference technique.

The problem seen in figure 2 is due to the inability of the measurement increments to properly reflect the process variation. When these measurements are rounded to the nearest hundredth of an inch, most of the information about variation is lost in the round-off. As a result, the rounded data have many zero ranges, even though the original data have no zero ranges. These zero ranges deflate the average range and tighten the computed limits. At the same time, the greater discreteness for both the averages and the ranges will prevent the running records from shrinking with the limits. Eventually it becomes inevitable that some points will fall outside the artificially tightened limits even though the process itself is predictable.

Therefore, the procedure to check for chunky data consists of three steps:

1. Determine the measurement increment used. This is done by inspecting either the ranges or the original data.

2. Determine the upper and lower limits for the range chart. This is done in the usual manner.

3. Determine how many possible values for the range fall within the range limits and apply the rules given above.
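The three steps can be sketched in a few lines of Python. The function names here are illustrative, not from any library. The sketch assumes that a value falling exactly on a limit counts as being within the limits, and it takes an *XmR* chart to be the *n* = 2 case because its ranges are two-point moving ranges. Limits are passed as strings so that `Decimal` avoids floating-point surprises in the division.

```python
from decimal import Decimal
import math


def count_possible_ranges(upper_limit, increment, lower_limit="0"):
    """Count the multiples of the increment that lie within the range limits."""
    inc = Decimal(str(increment))
    first = math.ceil(Decimal(str(lower_limit)) / inc)
    last = math.floor(Decimal(str(upper_limit)) / inc)
    return max(0, last - first + 1)


def is_chunky(upper_limit, increment, n, lower_limit="0"):
    """Apply the rules above: at least 4 values for n = 2 (or XmR), else 5."""
    minimum = 4 if n <= 2 else 5
    return count_possible_ranges(upper_limit, increment, lower_limit) < minimum


# Range limits quoted for figures 4 and 5. The subgroup size n = 5 is an
# assumption; the verdicts are the same under either rule.
print(count_possible_ranges("0.0181", "0.001"))  # figure 4
print(count_possible_ranges("0.0102", "0.01"))   # figure 5
print(is_chunky("0.0181", "0.001", n=5), is_chunky("0.0102", "0.01", n=5))
```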

Since the problem with chunky data comes from the inability to detect variation within the subgroups, the solution consists of increasing the ability of the measurements to detect that variation.

One solution is to use smaller measurement increments. If you have been rounding your measurements too aggressively, you can solve the problem of chunky data by simply recording an additional digit for each measurement. Even if there is some uncertainty in that extra digit, its inclusion can actually improve the quality of your data. So, regardless of tradition, if your data are chunky because you have been rounding off your measurements, you need to start recording extra digits.

If the current measurement system will not provide you with additional digits for your observations, then you may need to consider changing the measurement system.

Another solution to the problem of chunky data is to increase the variation within the subgroups. This will increase the ability of your current measurement system to detect variation within the subgroups. With an average chart, this will usually involve a change in what a subgroup represents. When a single subgroup represents several successive parts coming off a line, you can usually increase the variation sufficiently by waiting between sampled parts, so that each subgroup represents a longer period of time. When the within-subgroup variation becomes detectable, the visible effects of chunky data on the average and range chart will disappear.

What if we increased the subgroup size? This will not reliably remedy the basic problem of chunky data, which comes from the round-off deflating the estimate of dispersion. However, as will be seen below, it can delay the onset of the deflation, thereby extending the use of the current measurement system.

What happens with count data? We begin with the fact that all count-based data consist of individual values. If your observations consist of counts, and those counts display the effects of chunky data when placed on an *XmR* chart, then your data are irredeemably chunky (and you’re probably counting rare events). Such data may be used to create running records, but they will not support the computation of meaningful limits.

Data analysis consists of filtering out the noise so we can detect any signals that may be present. Process behavior charts use three-sigma limits to filter out the “noise” of routine variation. The formulas for these limits are based on “unbiased estimators.” The mathematical support for these unbiased estimators rests upon an assumption that the data come from a continuum of values. As long as the standard deviation for the data is larger than the measurement increment, this assumption is reasonable and the formulas will work as advertised. But as the data become more discrete, the formulas will become increasingly biased.

Figure 6 shows the average bias introduced by round-off into the formulas based on the average range. The average bias is shown on the vertical scale, while the horizontal scale shows the size of the standard deviation of the product measurements, *SD(X)*, relative to the measurement increment.

On the right, all the curves are on the line marked 0% bias, and the formulas will have an average bias of zero. As we move to the left, the standard deviation shrinks and the curves deviate from the unbiased line.

The bottom curve in figure 6 is for the average of two-point moving ranges. The remaining curves, in ascending order according to their left-hand end points, are for average ranges based on subgroups of size 2, 3, 4, 5, 6, 8, and 10.

On the right side of figure 6, we see that the formulas based on the average range will remain unbiased as long as *SD(X)* is larger than two-thirds of the measurement increment.

When *SD(X)* gets smaller than this, the round-off will begin to introduce biases into the formulas. While the top four curves show a region with some positive biases, ultimately all the curves plunge on the left, and the limits will shrink to oblivion as *SD(X)* gets smaller relative to the measurement increment.

If we define the borderline safe condition to be that point at which the standard deviation of the product measurements, *SD(X)*, is equal to the measurement increment, then the limits for the range chart will have the following form:

*Upper range limit* = *D4 MEAN(R)* = *D4 d2 SD(X)* = *D4 d2* measurement increments

*Lower range limit* = *D3 MEAN(R)* = *D3 d2 SD(X)* = *D3 d2* measurement increments

These values are tabled for subgroup sizes of *n* = 2 to *n* = 10 in figure 7. Consideration of these limits reveals the number of possible values within the limits on a range chart at this borderline safe condition.

Thus, for subgroups of size *n* = 3 or larger, the measurement increment borders on being too large when there are only five possible values within the limits on the range chart. Fewer than five values within the limits indicate chunky data.

For subgroups of size *n* = 2, the measurement increment borders on being too large when there are only four possible values within the limits on the range chart. Fewer than four values within the limits indicate chunky data.
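The borderline limits can be recomputed from the usual chart constants. This is a sketch rather than a copy of figure 7: the values of *d2*, *D3*, and *D4* are the commonly tabled ones, limits are expressed in measurement increments with *SD(X)* set to one increment, and `values_within` is an illustrative helper name.

```python
import math

# Commonly tabled chart constants, indexed by subgroup size n.
d2 = {2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326, 6: 2.534,
      7: 2.704, 8: 2.847, 9: 2.970, 10: 3.078}
D3 = {2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0,
      7: 0.076, 8: 0.136, 9: 0.184, 10: 0.223}
D4 = {2: 3.268, 3: 2.574, 4: 2.282, 5: 2.114, 6: 2.004,
      7: 1.924, 8: 1.864, 9: 1.816, 10: 1.777}


def borderline_limits(n):
    """Range limits, in measurement increments, when SD(X) = one increment."""
    return D3[n] * d2[n], D4[n] * d2[n]


def values_within(n):
    """Count the whole numbers of increments lying between the two limits."""
    lower, upper = borderline_limits(n)
    return math.floor(upper) - math.ceil(lower) + 1


for n in range(2, 11):
    lower, upper = borderline_limits(n)
    print(f"n={n:2d}  lower={lower:.2f}  upper={upper:.2f}  "
          f"values={values_within(n)}")
```

With these constants the count is four for *n* = 2 and never drops below five for *n* = 3 through 10, which is where the two detection rules come from.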

While these detection rules will work with range charts and moving range charts, they will not work with other charts for dispersion. This is because the range is the only measure of dispersion that preserves the discreteness of the original measurements.

Thus, there need be no confusion about whether the measurement increment being used is sufficiently small for the application at hand. The range chart clearly shows when it’s not. Fortunately, when this problem exists, the solutions are straightforward: Either smaller measurement increments must be used, or the variation within the subgroups must be increased. You must implement one of these solutions before your process behavior charts will be of any real use. If neither of these solutions can be applied, then the data may still be plotted in a running record and used in a descriptive sense, but they shouldn’t be used to compute limits for a process behavior chart.

Can we remedy the problem of chunky data by using subgroup standard deviations in place of the ranges? No, we can’t. Figure 8 shows the average and standard deviation chart for the data of figure 2.

While the limits on the average chart change slightly, the false alarms persist. Using a standard deviation chart won’t remedy the problem of chunky data.

Figure 9 shows how round-off introduces biases into the formulas based on the average standard deviation statistic. From the bottom, according to their left-hand end points, the curves are for subgroups of size 2, 3, 4, 5, 6, 8, and 10.

Here, except for *n* = 2, round-off begins to introduce bias as soon as *SD(X)* gets smaller than twice the measurement increment. As before, these curves all plunge on the left, and the limits shrink to oblivion as *SD(X)* gets smaller and smaller relative to the measurement increment.

Before we can say that our average and standard deviation chart is likely to be free from the biases introduced by chunky data, we will want to have an average standard deviation statistic that is greater than twice the measurement increment.
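For a standard deviation chart, the guideline above reduces to a single comparison. The sketch below uses made-up subgroups recorded in thousandths of an inch; the helper names are illustrative, and the second data set is simply a hypothetical version of the first re-recorded to the nearest ten thousandths.

```python
import statistics


def average_subgroup_sd(subgroups):
    """Average of the subgroup standard deviation statistics."""
    return statistics.mean(statistics.stdev(s) for s in subgroups)


def sd_chart_looks_safe(subgroups, increment):
    """True when the average standard deviation exceeds twice the increment."""
    return average_subgroup_sd(subgroups) > 2 * increment


# Hypothetical subgroups, in thousandths of an inch.
fine = [[2498, 2503, 2501], [2505, 2499, 2502]]    # increment = 1
coarse = [[2500, 2500, 2500], [2510, 2500, 2500]]  # increment = 10

print(sd_chart_looks_safe(fine, increment=1))
print(sd_chart_looks_safe(coarse, increment=10))
```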

Figures 6 and 9 show that the effects of chunky data are eventually the same regardless of whether we’re using subgroup ranges or subgroup standard deviations. Once *SD(X)* is less than one-half of the measurement increment, the limits will plummet toward zero regardless of subgroup size and regardless of which measure of dispersion we use.

Chunky data are a problem with the measurement system that can be detected on an ordinary process behavior chart. They’re easy to spot, and they’re important to know about because they represent one of the two failure modes in which a process behavior chart is not fail-safe. Chunky data will eventually create an excess number of false alarms, which will undermine the credibility of a process behavior chart.

The solution to chunky data requires less round-off in the measurements or an increase in the variation within the subgroups. Otherwise, your process behavior charts are likely to be misleading.

## Comments

## Nice refinement

This is one of my favorite topics. I'm glad to see the refinement (the power curves) to the original texts.

The problem I have often observed, especially with people who are new to SPC, is that they don't notice the chunky data patterns, and will thus find plenty of opportunities to tamper.