Most data in business and industry belong to the category known as observational data. These data are the voice of your processes because they are the result of ordinary operations rather than an experiment.
Because the purpose of analysis is insight, the question is how to analyze observational data to gain insight into your processes. Here, we’ll illustrate three approaches to answering this question. Everyone who uses observational data needs to understand the differences between these approaches.
Description
The first approach is description. This uses the descriptive statistics and histograms taught in introductory classes in statistics. For our example, we’ll use the 200 observations listed in Figure 1.

Figure 1: 200 original observations X
The basic descriptive technique is to collect all the data into a histogram and compute the basic properties of the data set as a whole. The minimum value is 4, the median is 16, and the maximum value is 48. The average for the histogram in Figure 2 is 18.24, and the standard deviation statistic is 8.77.

Figure 2: Histogram of original observations X
The standard deviation characterizes how the data cluster around the average. This clustering is described by the empirical rule, which tells us to expect 60% to 75% of the data within one standard deviation of the average. For these data, this interval is 9.47 to 27.01, and we find 147 (73.5%) of the observations in this interval.
The empirical rule also tells us to expect 90% to 98% within two standard deviations of the average. This interval is 0.70 to 35.78, where we find 189 observations (94.5%).
Finally, the empirical rule tells us to expect 99% to 100% within three standard deviations of the average. This interval is –8.07 to 44.55, where we find 198 observations (99%). So even though this histogram is skewed, it fits in with the rather broad expectations for a typical, unimodal histogram.
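The empirical-rule intervals above can be reproduced in a few lines of Python using only the summary statistics quoted in the text. (The 200 raw values of Figure 1 are not reproduced here, so the observed counts in each interval cannot be recomputed from this sketch.)

```python
# Empirical-rule intervals from the summary statistics quoted above.
# Only the intervals are computed here; checking the observed counts
# would require the 200 raw values of Figure 1.
average = 18.24
sd = 8.77

for k in (1, 2, 3):
    lower, upper = average - k * sd, average + k * sd
    print(f"within {k} sd: {lower:.2f} to {upper:.2f}")
```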
If the 200 values represent all of the product produced during this production run, then our description is complete. The 200 values are fully summarized by the descriptive statistics. But if the 200 values represent only a portion of the product stream, we may want to extend our description to cover the product not measured.
To extrapolate from the 200 observed values to a description of the 20,000 parts produced during this production run, we use interval estimates (also known as confidence intervals).
A 95% interval estimate for the average of all the product produced during this production run would be 17.01 to 19.46.
With specifications of 0 to 30, we find 9.0% nonconforming in the 200 parts measured. Based on finding 18 nonconforming out of 200, our 95% interval estimate for the fraction nonconforming among the 20,000 parts of this production run is 5.7% to 13.9%.
The performance index, Pp, computed from these 200 data is 59%. For the 20,000 parts as a whole, our 90% interval estimate for the performance index is 54% to 64%. The centered performance index, Ppk, for these 200 data is 47%. For the run as a whole, our 90% interval estimate for Ppk is 42% to 51%.
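The article does not state which interval-estimate method produced the bounds for the fraction nonconforming. As one illustration, a Wilson score interval for 18 nonconforming out of 200 lands within a tenth of a percent of the quoted 5.7% to 13.9%:

```python
import math

# Wilson score interval for the fraction nonconforming (18 out of 200).
# This is one common method, assumed here for illustration; the article
# does not say which method it used.
n, nonconforming = 200, 18
z = 1.96                      # two-sided 95% normal quantile
p = nonconforming / n

center = p + z * z / (2 * n)
margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
denom = 1 + z * z / n
lower, upper = (center - margin) / denom, (center + margin) / denom
print(f"{lower:.1%} to {upper:.1%}")  # close to the 5.7% to 13.9% quoted above
```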
So, after describing these data, what do you tell the managers? Just that they will need to continue as in the past: In the words of Joe Druecker, they get to “Run... sort... pray!”
A critique of description
The descriptive approach is built on the assumption that the data are homogeneous. It is this homogeneity that makes the descriptive statistics meaningful. When the data are a mixture of apples and oranges, the average doesn’t summarize either group. Likewise, a histogram is intended to show how a set of homogeneous data varies. If a histogram is a collection of data representing unlike things, it tells us nothing about any one group of things.
Moreover, the extrapolation from the data to the product not measured is built on the assumption of homogeneity for both the data and the concurrent product. This is the basis for using interval estimates. If the assumptions of homogeneity are not correct, the descriptive approach becomes deceptive and misleading. It may describe some aspects of the past, but it’s unlikely to represent the future.
Modeling
Some data analysts seek to generalize observational data by finding a probability model that fits the histogram. Since our histogram is skewed, we’ll want to use a skewed probability model to characterize the process. And a family of skewed distributions preferred by many statisticians is the family of lognormal distributions. If these data might be reasonably modeled by a lognormal distribution, we can use the descriptive statistics from the histogram to estimate the two parameters for the lognormal model.
The scale parameter, alpha, may be estimated by the median of the original data. Here the median is 16, and so we estimate alpha = 16.
The shape parameter, beta, may be estimated according to the formula:

beta = sqrt[ 2 ln( average / median ) ]
So we divide the average for the histogram by the median for the histogram, find the natural logarithm, multiply by 2, and take the square root to find an estimate of beta. For our data, this operation gives an estimate of beta of 0.511. So here we might use a lognormal (16, 0.5) model. This model is plotted on top of the histogram in Figure 3.
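That arithmetic can be written out directly with the standard library and the summary statistics quoted above. (The small difference from the article's 0.511 comes from using the rounded average of 18.24.)

```python
import math

# Estimating the lognormal parameters from the summary statistics:
# alpha from the median, beta from sqrt(2 ln(average / median)),
# following the recipe described in the text.
median = 16.0
average = 18.24

alpha = median
beta = math.sqrt(2.0 * math.log(average / median))
print(alpha, round(beta, 2))  # roughly 16.0 and 0.51
```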

Figure 3: Original observations X with lognormal (16, 0.5) model
The advantage of having a lognormal model for our data is that we can compute probabilities using a standard normal distribution. Specifically, the probability that the lognormal variable X is less than some value X = x is the same as the probability that a standard normal variable, Z, is less than the value Z = z where:

z = ln( x / alpha ) / beta
So that

P( X ≤ x ) = P( Z ≤ ln( x / alpha ) / beta )
For example, using the upper watershed specification limit of 30.5, we might estimate that this process is likely to produce 9.85% nonconforming:

z = ln( 30.5 / 16 ) / 0.5 = 1.29, and P( Z > 1.29 ) = 0.0985
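That tail-probability calculation can be checked with the standard library's `NormalDist`, using the fitted lognormal (16, 0.5) model:

```python
import math
from statistics import NormalDist

# Tail probability beyond the upper watershed limit of 30.5 under the
# fitted lognormal (16, 0.5) model described in the text.
alpha, beta = 16.0, 0.5
x = 30.5

z = math.log(x / alpha) / beta
tail = 1.0 - NormalDist().cdf(z)
print(f"z = {z:.2f}, estimated fraction above {x}: {tail:.2%}")
```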
In the same way, we could compute a customized upper “control” limit for this process that would cover 99.7% of these observed values:

Upper limit = alpha × exp( beta × z ) = 16 × exp( 0.5 × 2.75 ) ≈ 63.3

where 2.75 is approximately the 99.7th percentile of the standard normal distribution.
So we might predict that virtually all of the future values should fall between zero and 63.3 unless something changes in this process.
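The 63.3 figure can be reproduced by inverting the standard normal at the 99.7th percentile and mapping back through the model. (This is a sketch; the article does not show the exact z value it used, so small rounding differences are possible.)

```python
import math
from statistics import NormalDist

# Customized upper limit covering 99.7% of the fitted lognormal (16, 0.5):
# invert the standard normal at 0.997, then map back through the model.
alpha, beta = 16.0, 0.5

z = NormalDist().inv_cdf(0.997)        # about 2.75
upper_limit = alpha * math.exp(beta * z)
print(round(upper_limit, 1))           # close to the 63.3 quoted above
```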
Thus, our model allows us to compute more precise values than we can compute based on the histogram alone. So now that we have a model for our process, what do we tell the managers? With 9% nonconforming, they will need to continue to run... sort... pray!
A critique of modeling
When we wrap up our observations inside a probability model, we no longer have to work directly with the raw data. As a result, the estimates based on the model will seem more precise and more robust than those obtained from the raw data. However, nothing in the process of fitting a model to the data does anything to reduce the uncertainties of the extrapolation from the data to the product not measured. Whether we report 9% or 9.85% nonconforming, the uncertainty is still 5.7% to 13.9%. The appearance of greater precision for the model-based estimates is an illusion created by writing extra decimal places.
Moreover, the upper bound on routine variation of 63.3 is an extrapolation beyond the data. Consequently, it’s a function of the model used rather than being a characteristic of the data themselves. Change the model, and you will change the estimated upper bound.
When we use a model to describe the process, we’re making two extrapolations. First, we’re assuming the observations represent the product that was not measured, and then we’re assuming that the model represents the product not yet produced. These extrapolations assume that the data are homogeneous internally, that they are homogeneous with the data stream, and that the process itself is homogeneous (i.e., unchanging) over time.
So the modeling approach seeks to generalize the observed data using a mathematical model. But this generalization is built on assumptions that are never explicitly stated nor examined for validity. While complexity may seem more profound than simplicity, it may more easily go astray and get the wrong answer.
Characterization
Observational data occur in time, and so they’ll almost always have a time-order sequence. Both the description approach and the modeling approach ignore this time order.
When we take the time-order sequence into account, we can characterize the process behavior. The data of Figure 1 are written in time order in rows. If we use this time order to organize these data into 40 subgroups of size five and create an average and range chart, we get Figure 4.

Figure 4: Average and range chart for original data
Clearly, this process was subject to oscillations, and the oscillations seem to have been getting more extreme. The short-term variation represented by the limits is dramatically less than the overall variation seen in the running record of averages. At this point, it doesn’t matter what the descriptive statistics happen to be. It doesn’t matter what the shape of the histogram may be. And it doesn’t matter what probability model we might use. The only question that matters here is, “What is causing the process to oscillate?”
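For readers who want to reproduce this kind of chart, the limit calculations can be sketched as follows. The chart constants for subgroups of size five are standard; the data values below are made-up placeholders, since the 200 observations of Figure 1 are not listed here.

```python
# Average and range chart limits for subgroups of size five.
# A2 and D4 are the standard chart constants for n = 5; the data are
# placeholder values, not the observations of Figure 1.
A2, D4 = 0.577, 2.114

data = [18, 14, 20, 16, 22, 15, 19, 17, 21, 13]   # placeholder values
subgroups = [data[i:i + 5] for i in range(0, len(data), 5)]

averages = [sum(s) / len(s) for s in subgroups]
ranges = [max(s) - min(s) for s in subgroups]
grand_average = sum(averages) / len(averages)
average_range = sum(ranges) / len(ranges)

x_upper = grand_average + A2 * average_range   # upper limit for averages
x_lower = grand_average - A2 * average_range   # lower limit for averages
r_upper = D4 * average_range                   # upper limit for ranges
print(x_lower, x_upper, r_upper)
```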
The lack of homogeneity makes computations and models irrelevant. Effects have causes, and action is required to discover those causes. Unless and until the assignable cause of the process oscillation is discovered and controlled, this process will not operate anywhere near its full potential.
But how do we do this? Aristotle told us that the way to discover the causes that affect a system is to look at those points where the system changes. And a process behavior chart facilitates this search by highlighting process changes.
Moreover, a process behavior chart will also allow us to approximate what our process might look like when operated at full potential. We begin by estimating the process standard deviation using the short-term variation shown on the range chart. The average range of 5.40 results in an estimated process standard deviation of 2.32 units (rather than the 8.77 units found earlier). If we get the process average near the target value of 15, the empirical rule tells us that we could expect virtually all of the product to fall between 8 and 22.
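The sigma and natural-process-limit arithmetic above can be sketched directly:

```python
# Estimating the process standard deviation from the average range
# (d2 = 2.326 is the bias-correction factor for subgroups of size five),
# then the natural process limits around the target of 15 quoted above.
d2 = 2.326
average_range = 5.40
target = 15.0

sigma_hat = average_range / d2
lower, upper = target - 3 * sigma_hat, target + 3 * sigma_hat
print(round(sigma_hat, 2), round(lower), round(upper))  # about 2.32, 8, 22
```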
So, after characterizing these data, what do you tell the boss? Find what’s causing the oscillation, and we can run this process so that it will produce 100% conforming product. No sorting needed. No prayers required. Just make it and ship it.
Figure 5 shows what this manufacturer was able to accomplish by identifying and controlling the previously unknown cause of the process oscillations.

Figure 5: Predictable operation vs. the original data
Summary
All data are historical. All analyses of data are historical. Yet management requires prediction. Prediction requires extrapolation. Extrapolation requires homogeneity. And the most useful information regarding homogeneity is contained in the time-order sequence of the data.
The descriptive approach ignores the time-order sequence and simply seeks to summarize the past. It can do no other. The modeling approach ignores the time-order sequence and provides a complex way to describe the past. It can do no other.
If the data happen to be homogeneous, the past may describe the future, and the descriptive and modeling approaches may work. However, in my 56 years of experience, I have found observational data to be homogeneous only about 1 time in 10.
The only approach that examines the data for the homogeneity needed for the extrapolations to be reasonable is the process behavior chart approach. When the chart shows points outside the limits, the requisite homogeneity is missing, and prediction is just wishful thinking. Actions are required to get the process to operate predictably and, by highlighting the points where the process changes, the process behavior chart gives you clues as to what actions to take. The inevitable results of finding and controlling assignable causes of exceptional variation are increased quality, increased productivity, and improved competitive position.
The alternative is: Run... sort... pray!
Donald J. Wheeler’s complete “Understanding SPC” seminar may be streamed for free; for details, see spcpress.com.
