Having an effective model for the nature of data will inevitably identify two different paths to process improvement. One path seeks to operate a process up to its full potential while the other path seeks to operate to meet requirements. This article explains how these two paths differ and how they can be used together to successfully improve any process.
In the classroom, two plus two is always equal to exactly four. Yet, when a manufacturer tries to lay down a two-micron film on a two-micron substrate, the result is seldom equal to exactly four microns. About the best that we can hope for is that the result will be four microns thick on average. And this is the basic difference between numbers and data.
Numbers inhabit the mathematical plane—where 1 is always 1, and 2 is always 2—and there is no uncertainty. Here, it can truly be said that two numbers that are not the same are different.
On the other hand, data belong to the real world. They are generated by some underlying process. As the process varies, the data will also vary. This variation introduces events that we never dreamed of in arithmetic. For example, when we use numbers to express our data, we suddenly find that two numbers that are different may, in fact, represent the same thing. So when it comes to using data, we have to understand how to deal with variation, or we are liable to end up being misled or confused.
Because it is the underlying process that produces the variation, we will need a model for the structure of an underlying process to be able to gain some insight into the origins of variation within our data. This article will provide such a model.
No matter what your process, no matter what your data, all data display variation. Any measure you can think of that will be of interest to your business will vary over time. The reasons for this variation are many. There are all sorts of causes that have an effect on your process and its outcomes. It is not unrealistic to think that your processes and systems will be subject to dozens, or even hundreds, of cause-and-effect relationships. This multiplicity of causes has two consequences: it makes it easy for you to pick out an explanation for why the current value is so high, or so low; and it makes it very hard for you to know whether your explanation is even close to being right. To make any headway in understanding data, we need to start with a discussion of these cause-and-effect relationships.
First, we need to perform a thought experiment. Begin by thinking about any product characteristic that you are familiar with. We will call this Characteristic No. 1. Next, list all of the cause-and-effect relationships that you can think of that will have an effect upon your Characteristic No. 1. In most cases, you will have dozens of cause-and-effect relationships. For purposes of this illustration, we will assume that you have named 21 cause-and-effect relationships. Because of these cause-and-effect relationships, we would expect the values for Characteristic No. 1 to vary about some average value in the manner shown in figure 1.
Some of these causes will have large effects upon our characteristic, while others will have smaller effects. In figure 2 we use bars to show the effects of each of the 21 causes upon Characteristic No. 1. As each cause varies within some range of interest, the bar shows the amount by which Characteristic No. 1 varies. Thus, the height of each bar shows how much of the variation in Characteristic No. 1 can be attributed to each individual cause. With perfect knowledge about the effects of each cause upon our product characteristic, we could arrange these causes in order of descending impact, which is how the Pareto diagram in figure 2 is drawn.
In figure 2, the causes with the large bars (the dominant causes) will provide the greatest leverage for changing the average value of the product characteristic. Causes with small bars will provide little opportunity to change the average product characteristic.
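The ordering behind a Pareto diagram can be sketched in a few lines of Python. This is only an illustration: the function name `pareto_table` and the five effect sizes are hypothetical and are not the values plotted in figure 2.

```python
def pareto_table(effects):
    """Sort causes by descending effect.

    Returns a list of (cause, share_pct, cumulative_pct) rows, where
    share_pct is each cause's percentage of the summed effects.
    """
    total = sum(effects.values())
    rows = []
    cumulative = 0.0
    ranked = sorted(effects.items(), key=lambda item: item[1], reverse=True)
    for cause, effect in ranked:
        share = 100.0 * effect / total
        cumulative += share
        rows.append((cause, share, cumulative))
    return rows


# HYPOTHETICAL contributions to variation, chosen only for illustration.
example_effects = {
    "cause A": 5.0,
    "cause B": 40.0,
    "cause C": 2.0,
    "cause D": 25.0,
    "cause E": 8.0,
}

if __name__ == "__main__":
    for cause, share, cumulative in pareto_table(example_effects):
        print(f"{cause}: {share:5.2f}% of variation, {cumulative:6.2f}% cumulative")
```

With these made-up inputs, the two largest causes already account for most of the cumulative total, which is the pattern that makes a small set of dominant causes worth controlling.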
Therefore, the producer will typically select the levels of the dominant causes to obtain the desired average level for the product characteristic. Moreover, when the levels for the dominant causes are held constant, these causes will no longer contribute to the variation in the product characteristic. In our example, causes 1, 2, 3, and 4 account for 79 percent of the total variation in the product characteristic, while causes 5 through 21 account for the remaining 21 percent. By controlling the levels of causes 1, 2, 3, and 4, the producer determines the average value for Characteristic No. 1 and eliminates 79 percent of the variation in the product characteristic. This can be seen in figures 3 and 4.
Because of the way variation works, the 21 percent of the variance that remains in figure 3 shows up in figure 4 as a distribution with a spread that is √0.21 = 0.458, or 46 percent of the spread of the original distribution.
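The arithmetic behind the 46 percent figure can be written out as follows. The derivation assumes, as an idealization, that the 21 causes act independently so that their contributions to the variance simply add; the symbols are introduced here for illustration only.

```latex
\sigma_{\text{total}}^{2} = \sum_{i=1}^{21} \sigma_{i}^{2},
\qquad
\sigma_{\text{remaining}}^{2} = \sum_{i=5}^{21} \sigma_{i}^{2}
  = 0.21\,\sigma_{\text{total}}^{2}
\quad\Longrightarrow\quad
\frac{\sigma_{\text{remaining}}}{\sigma_{\text{total}}}
  = \sqrt{0.21} \approx 0.458
```

Spread is measured in the units of the standard deviation, not the variance, which is why removing 79 percent of the variance narrows the distribution by only about 54 percent.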
Finally, once the first four causes have been controlled, it is probably not economical to attempt to control any further causes. For example, causes 5, 6, 7, and 8 only contribute 9 percent of the total variation in Characteristic No. 1. While it will probably cost about the same to control the second set of four causes as it does to control the first set of four causes, the payback for controlling these additional four causes will only be about one-ninth as big as it was for the first four causes. This diminishing return may be seen in figure 5.
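The diminishing return can be checked with a short calculation. The sketch below uses the two variance shares quoted above (79 percent and 9 percent) and assumes that the contributions of independent causes add on the variance scale; the names `GROUP_SHARES` and `spread_after_controlling` are hypothetical.

```python
import math

# Variance shares quoted in the text for Characteristic No. 1,
# as percentages of the total variation.
GROUP_SHARES = [
    ("causes 1-4", 79.0),
    ("causes 5-8", 9.0),
]


def spread_after_controlling(group_shares):
    """Control each group in turn and track what is left.

    Returns (group, variance_removed_pct, remaining_spread_fraction) rows.
    The remaining spread is the square root of the remaining variance
    fraction, assuming independent causes whose variances add.
    """
    remaining_pct = 100.0
    steps = []
    for name, share in group_shares:
        remaining_pct -= share
        steps.append((name, share, math.sqrt(remaining_pct / 100.0)))
    return steps


if __name__ == "__main__":
    for name, share, spread in spread_after_controlling(GROUP_SHARES):
        print(f"after controlling {name}: removed {share:.0f}% of variance, "
              f"spread is {spread:.3f} of the original")
```

Under these assumptions the spread falls from 100 percent to about 46 percent of its original width when the first four causes are controlled, but only from about 46 percent to about 35 percent when the next four are added.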
Economic production requires that we distinguish between those causes with dominant effects and the many causes with small effects. Controlling the levels of dominant cause-and-effect relationships is economical. Controlling the levels of other cause-and-effect relationships will not be economical.
The diminishing returns shown in figure 5 make it unreasonable to devote resources to controlling any cause-and-effect relationships for Characteristic No. 1 beyond the first four. Thus, causes 5 through 21 will make up the set of uncontrolled factors. Once the dominant causes are held constant, it is this group that creates virtually all of the variation in the product characteristic.
Thus, the first benefit of having a conceptual model for the nature of data is that it provides a reasonable explanation for why it is not economical to try to control all of the inputs to a production process. The dominant causes are the only ones that we want to include in the set of control factors.
Of course, before we can draw the Pareto diagram of figure 2 we will need perfect knowledge about all of the cause-and-effect relationships for Characteristic No. 1. In the absence of perfect knowledge, what can we do? We generally start off as shown in figure 6 for Characteristic No. 2 and ask research and development (R&D) to sort out the gaps in our knowledge. To this end, R&D will need to find the amount of variation attributable to each cause. Next they will need to identify the control factors, and then establish the appropriate level for each control factor to be used in production.
Unfortunately, due to pressures of time and money, it is a rare thing for R&D to ever study all of the causes on the list. Generally they will use both theory and experience to redraw figure 6 to look like figure 7, and then they will tackle the cause-and-effect relationships that are thought to be the more critical (causes 1 through 10). The remaining causes will not be studied—they will either be ignored, or randomized, or held constant during the course of the experiments performed by R&D.
Figure 8 shows the results reported by R&D for Characteristic No. 2. Among the 10 causes studied, they found causes 5, 1, and 7 to be the dominant factors, collectively accounting for 77 percent of the variation observed by R&D. Thus, they told the production department that they needed to control causes 5, 1, and 7; and they also defined the appropriate levels to use with each of these control factors.
So the production process was set up using causes 5, 1, and 7 as control factors for Characteristic No. 2. However, at the start of production, they immediately had problems with too much variation in product Characteristic No. 2. This resulted in a high scrap rate. So they decided to also control cause 4. While this cost extra money, it was expected to remove an additional 6 percent of the variation found by R&D. Unfortunately, as seen in figure 9, controlling cause 4 had virtually no effect upon the process outcomes.
As they fell further and further behind the production schedule, and as the mountain of scrap and rework increased, they began to talk about the “skill” that it took to make this product. Words such as “art” and “magic” were used. Inspection and rework facilities were expanded, and soon the production department had settled down to a routine that W. Edwards Deming called the Western approach to production—“burn the toast and scrape it.”
Of course, the problem was not with the set of control factors—by virtue of their being controlled, these causes ceased to be sources of variation for the product stream. Likewise, the problem was not with the other six causes studied by R&D—these causes all had small effects upon Characteristic No. 2. Nothing was wrong with the information provided by R&D, except that it was incomplete—the problem was with the factors not studied by R&D.
As seen in figure 10, causes 14 and 16 are dominant cause-and-effect relationships for Characteristic No. 2. They are as big as, or bigger than, causes 5, 1, and 7. But they were not studied. Although the manufacturer was not aware of the effect of causes 14 and 16, the process continued to be under their influence. Because causes 14 and 16 had not been studied, and were thought to be part of the lesser causes back in figure 7, the manufacturer was not exerting any control over the levels of these two factors. As a result, whenever the level of either one of these causes changed, the product characteristic changed with it. While the manufacturer remained unaware of the impact of causes 14 and 16, he still suffered the consequences of their effects.
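A small calculation shows why controlling cause 4 made so little difference. In the sketch below, the 77 percent for causes 5, 1, and 7 combined, the 6 percent for cause 4, and the implied 17 percent for the other six studied causes come from the text. Everything else is hypothetical: the split of the 77 percent among the three causes, the sizes given to causes 14 and 16, and the assumption that independent causes add on the variance scale.

```python
import math


def remaining_spread(variance_shares, controlled):
    """Fraction of the original spread left after holding `controlled` constant."""
    total = sum(variance_shares.values())
    left = sum(v for cause, v in variance_shares.items() if cause not in controlled)
    return math.sqrt(left / total)


# What R&D saw: the ten studied causes, as percentages of the variation observed.
# The group totals match the text; the individual split is HYPOTHETICAL.
studied = {
    "cause 5": 30.0,
    "cause 1": 25.0,
    "cause 7": 22.0,
    "cause 4": 6.0,
    "other six studied": 17.0,
}

# What R&D did not see: HYPOTHETICAL sizes on the same scale.
unstudied = {
    "cause 14": 35.0,
    "cause 16": 30.0,
    "other unstudied": 10.0,
}

everything = {**studied, **unstudied}
original_controls = {"cause 5", "cause 1", "cause 7"}
with_cause_4 = original_controls | {"cause 4"}

if __name__ == "__main__":
    print("R&D view, three controls:  ", round(remaining_spread(studied, original_controls), 3))
    print("R&D view, plus cause 4:    ", round(remaining_spread(studied, with_cause_4), 3))
    print("Production, three controls:", round(remaining_spread(everything, original_controls), 3))
    print("Production, plus cause 4:  ", round(remaining_spread(everything, with_cause_4), 3))
```

With these illustrative numbers, R&D's picture predicts the spread dropping to roughly 48 percent of its original width, while production is left with roughly 75 percent, and adding cause 4 moves that figure by only a couple of points because the unstudied dominant causes are still free to vary.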
So while causes 5, 1, and 7 need to be in the set of control factors, the producer also needs to have causes 14 and 16 in the set of control factors. Having cause 4 in the set of control factors is a mistake caused by the incomplete information provided by R&D.
Tomorrow, in part two of this series, we will look at assignable causes and common causes.