Malcolm Chisholm
Published: Tuesday, November 24, 2009 - 05:00
I have just finished rereading Walter A. Shewhart's 1939 book Statistical Method from the Viewpoint of Quality Control (Dover Publications, 1986). Mine is the 1986 edition, which has a foreword by W. Edwards Deming. Shewhart, a Bell Labs man, pioneered quality control and was a major inspiration to Deming (who met him at Bell Labs). Deming is well known in his own right for his contributions to such issues as manufacturing quality in post-war Japan and for laying the foundations of Six Sigma. While these and other accomplishments have made Deming justly famous, he was never shy about recognizing the contribution of Shewhart to science and industry, and the influence of Shewhart on his own career. Indeed, Deming may to some extent have communicated some of Shewhart's ideas in ways that were easier to understand and apply than Shewhart's own presentation of them.
Of course, the main interest that data management has in Shewhart and Deming is to learn from them how we can improve data quality. And indeed, it is not difficult to find mention of these names by authors such as Larry English, Danette McGilvray, and others who specialize in data quality. Yet, a brief reprise of some of the ideas in Shewhart's little 1939 book is in order, along with Deming's comments, because they seem to relate not just to data quality, but to data itself.
One of the most astonishing passages in Deming's foreword is the following:
"There is no true value of anything. There is instead a figure that is produced by application of a master or ideal method of counting or measurement .... There is no true value of the speed of light; no true value of the number of inhabitants within the boundaries of (e.g.) Detroit. A count of the number of inhabitants of Detroit is dependent upon the arbitrary rules for carrying out the count. Repetition of an experiment or of a count will exhibit variation. Change in the method of measuring the speed of light produces a new result."
This is a good deal to think about, but one of the most valuable lessons that can be learned is that we need to set expectations about data. Too often this is not done. When a data warehouse is built, the business sponsors may not even stop to think about what its limitations may be. In my experience, they simply seem to think that a data warehouse must be 100 percent accurate and complete, and that any deviation from this is due to technical problems caused by someone or something in IT. Too often, the team building the warehouse fails to set any kind of expectation about the accuracy of what the data warehouse contains. If we think in terms of just the technology involved, then we have no reason to believe that it will not work perfectly. But if we think about the data itself, we know in our hearts that Deming's admonition is true. It might seem strange that we have to tell our users that a technically perfect environment will only be, say, 92 percent accurate on average, but that is the reality.
In case anyone should think that Deming didn't know what he was talking about because he was too far removed from data, it should be remembered that he was involved in the 1940 U.S. Census and the 1951 Japanese Census. In the former, he introduced sampling techniques that greatly reduced error rates.
There is one caveat about Deming's comments, however. They seem to echo Heisenberg's Uncertainty Principle, and what they have in common with that famous precept is the involvement of measurement. Measurement is itself a very complex subject, and we don't have space to deal with it in this article. However, although measurement predominates in scientific data, it does not always do so in the enterprises that data management professionals work in. We deal with insurance policies, trades, employee actions, orders, and so on. These are not material things, as are cars, telephones, and people. Deming and Shewhart focused on manufactured products, where measurement is vital. Much of the data we deal with seems to belong to a different realm, and while the spirit of Deming's comment seems to apply to it, the detail of what he is telling us may be only partially relevant.
Shewhart's book, just 150 or so pages long, is packed with all kinds of implications for data management. One concept that Shewhart introduces early on is the order of production of data. This is of vital importance for Shewhart. It can be illustrated as follows. Suppose we are testing a printer cartridge to see how many pages it will print. Suppose we run 2,000 pages through it and find that at page 1,975 the print quality has deteriorated to the point where it is unacceptable. Now, if we look at the results as a whole we are forced to say that 1,975 pages out of 2,000, i.e., 98.75 percent of pages, are printed perfectly. This implies that our printer cartridge will continue to print for all eternity with a 1.25 percent error rate. Of course this is nonsense. After page 1,975, every page that comes out is unacceptable.
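The cartridge example can be worked through in a few lines of Python. This is only a sketch using the figures from the paragraph above; the function name and the representation of each page as a True/False value are assumptions made for illustration.

```python
def pooled_and_ordered(results):
    """results: list of booleans in production order, True = acceptable page.

    Returns the pooled acceptance rate, the 1-based position of the last
    acceptable page, and the number of failures that follow it.
    """
    total = len(results)
    pooled_rate = sum(results) / total
    # Position of the last acceptable page; every page after it failed.
    last_good = max((i for i, ok in enumerate(results, start=1) if ok), default=0)
    trailing_failures = total - last_good
    return pooled_rate, last_good, trailing_failures


# 2,000 pages: the first 1,975 acceptable, the rest unacceptable.
pages = [True] * 1975 + [False] * 25
rate, last_good, tail = pooled_and_ordered(pages)

print(f"pooled view:  {rate:.2%} of pages acceptable")
print(f"ordered view: last acceptable page is {last_good}, "
      f"followed by {tail} consecutive failures")
```

The pooled figure and the ordered figures come from exactly the same 2,000 results; only the second view keeps the order in which they were produced.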
This may seem blindingly obvious, but I have rarely seen it applied to data. Shewhart is saying that the order in which data is produced carries extremely valuable information, and that this information is lost when the results are pooled and treated as a single population. We may find that a medical billing clerk has miscoded only 0.5 percent of his or her work in the past five years. That may seem acceptable, but if all those errors have occurred in the last 10 days, then we have a serious problem.
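One way to keep the order of production visible is to compute the error rate per time period rather than once for the whole history. The sketch below assumes each coded record carries a work date and an error flag; the record layout, the 100-day period, and the synthetic data (one record a day for 2,000 days, roughly five and a half years) are all invented for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def error_rate_by_period(records, period_days=100):
    """records: iterable of (work_date, is_error) tuples.

    Returns {period_start_date: error_rate}, with periods in date order.
    """
    records = sorted(records)
    if not records:
        return {}
    origin = records[0][0]
    totals, errors = Counter(), Counter()
    for work_date, is_error in records:
        bucket = (work_date - origin).days // period_days
        totals[bucket] += 1
        errors[bucket] += int(is_error)
    return {
        origin + timedelta(days=b * period_days): errors[b] / totals[b]
        for b in sorted(totals)
    }


# Synthetic history: 2,000 daily records, with errors only on the last 10 days.
start = date(2004, 1, 1)
history = [(start + timedelta(days=d), d >= 1990) for d in range(2000)]

pooled = sum(is_error for _, is_error in history) / len(history)
by_period = error_rate_by_period(history)

print(f"pooled error rate: {pooled:.1%}")
for period_start, rate in by_period.items():
    print(period_start, f"{rate:.1%}")
```

The pooled rate is 0.5 percent, yet every period except the last shows no errors at all and the last shows 10 percent, which is the pattern the pooled figure hides.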
If we take Shewhart seriously on this point, then we may need to rethink the way in which we design databases and the way in which we test for data quality. For instance, in a Type 2 Slowly Changing Dimension table, do we need to know what columns have changed from one record version to the next and when they changed? Of course we can infer this by scanning the entire table, but if we had metadata flags to represent such changes, then we would be better positioned to run this kind of query. Perhaps variance in the relative rates of changes between columns could alert us to underlying problems.
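As a rough sketch of what such change metadata might look like, the Python below compares successive versions of one dimension record and notes which tracked columns changed at each version. The column names (effective_from, city, segment), the sample rows, and the function name are hypothetical; a real warehouse would more likely derive these flags during the load process.

```python
from collections import Counter


def changed_columns(versions, tracked):
    """versions: list of dict rows for one business key, ordered by effective date.

    Returns a list of (effective_from, [changed column names]), one entry
    for each version after the first.
    """
    changes = []
    for prev, curr in zip(versions, versions[1:]):
        diff = [col for col in tracked if prev.get(col) != curr.get(col)]
        changes.append((curr["effective_from"], diff))
    return changes


# Hypothetical Type 2 history for a single customer.
versions = [
    {"customer_id": 42, "effective_from": "2009-01-01", "city": "Boston", "segment": "retail"},
    {"customer_id": 42, "effective_from": "2009-03-15", "city": "Chicago", "segment": "retail"},
    {"customer_id": 42, "effective_from": "2009-06-01", "city": "Chicago", "segment": "wholesale"},
    {"customer_id": 42, "effective_from": "2009-06-02", "city": "Denver", "segment": "wholesale"},
]

changes = changed_columns(versions, tracked=["city", "segment"])
change_counts = Counter(col for _, cols in changes for col in cols)

for effective_from, cols in changes:
    print(effective_from, cols)
print(dict(change_counts))
```

Because each entry keeps its effective date, the per-column counts can be broken down by period in the same way as any other ordered measure.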
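A more difficult concept that Shewhart introduces (and he is not always an easy read) is that of statistical control. Shewhart divides sources of variation into two categories: chance causes, which are inherent in the system itself, and assignable causes, which can be traced to something specific and identifiable.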
Shewhart was greatly concerned with detecting and identifying the assignable causes. These can then be overcome and removed from the system. Once this is done, only the variation caused by the system itself remains. The system is then in statistical control. That is, its performance is statistically predictable.
If we don't measure the performance of a system, we have no way of knowing if it is in statistical control. Something has to measure the system to monitor that it really is in statistical control and that its performance is as predicted. The things that do the measuring—instruments in manufacturing processes—themselves have to be in statistical control. Here we have a "who guards the guardians" kind of problem, and it is ultimately necessary for people to judge whether the entire complex is under statistical control.
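For data, one conventional way to monitor this is a proportion control chart (a p-chart) with three-sigma limits, a standard descendant of Shewhart's charts. The sketch below is an assumption about how that might be applied to daily load batches, not something taken from Shewhart's book; the batch sizes and error counts are invented.

```python
import math


def p_chart_limits(error_counts, sample_sizes):
    """Return the overall error proportion and per-batch 3-sigma limits."""
    p_bar = sum(error_counts) / sum(sample_sizes)
    limits = []
    for n in sample_sizes:
        sigma = math.sqrt(p_bar * (1 - p_bar) / n)
        lower = max(0.0, p_bar - 3 * sigma)
        upper = min(1.0, p_bar + 3 * sigma)
        limits.append((lower, upper))
    return p_bar, limits


def out_of_control(error_counts, sample_sizes):
    """Return the indexes of batches whose error proportion falls outside the limits."""
    _, limits = p_chart_limits(error_counts, sample_sizes)
    flagged = []
    for i, (errors, n, (lower, upper)) in enumerate(
        zip(error_counts, sample_sizes, limits)
    ):
        p = errors / n
        if p < lower or p > upper:
            flagged.append(i)
    return flagged


# Invented data: eight daily batches of 1,000 records with their error counts.
sizes = [1000] * 8
errors = [20, 18, 22, 19, 21, 20, 45, 20]

flagged = out_of_control(errors, sizes)
print("batches outside the control limits:", flagged)
```

Here the limits are computed from the same batches that are being judged, which keeps the sketch short. In practice the limits would normally be set from a period already believed to be stable, and a person would still have to decide whether a flagged batch points to an assignable cause.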
Just about everyone in data management hates production environments, but Shewhart's thoughts on statistical control lead us inevitably in that direction. Most of our emphasis tends to be on design. We tend to wait for processing jobs to blow up before we (or hopefully someone else) are called in to correct the error. But surely process control for data should involve something deeper, something that approaches the monitoring of statistical control.
This article was originally published by the BeyeNETWORK on November 4, 2009.
Malcolm Chisholm is an internationally recognized thought leader in Data Management with over 25 years of experience in a variety of sectors. He is the author of the books Managing Reference Data in Enterprise Databases (Morgan Kaufmann, 2000) and How to Build a Business Rules Engine (Morgan Kaufmann, 2003). Chisholm is an independent consultant and a frequent speaker on data topics at conferences. He runs the website www.refdataportal.com and can be reached at MasterDataConsulting@gmail.com.
Shewhart, Deming, and Data
Going deeper on process control for data.
There is no true value of anything
The order of production of data
Statistical control
Alas, space permits sketching only a few of the ideas of Shewhart and Deming, and really nothing about how they can be practically implemented for data. Writing in 1986, Deming lamented that it would take at least another 50 years before we would fully comprehend Shewhart's contribution and benefit from it. Perhaps it will take longer.
© 2023 Quality Digest. Copyright on content held by Quality Digest or by individual authors. Contact Quality Digest for reprint information.
"Quality Digest" is a trademark owned by Quality Circle Institute, Inc.
Comments
Shewhart, Deming and Knowledge
Malcolm:
Nice article. I would mention that Shewhart's 1931 "Economic Control of Manufactured Product" was as revealing.
The problem we seem to fight with data is that we have too much of it and most of it is junk. Technology folks seem to equate information with knowledge . . . it is not. Dr. Deming encouraged us to "get knowledge" in the systems in which we operate. This is a far cry from today, where decision-making has been separated from the work, command-and-control style, and decisions are instead made from IT reports. This is not getting knowledge; we lose too much context, and poor decision making follows.
In the information age we lack this context to the point that we would be better off rolling dice or flipping a coin than relying on such things as business analytics or intelligence. Most of these IT solutions add to costs and do little in the pursuit of knowledge, or profound knowledge in Deming's terms.
If we are to use IT, we are best using it when we understand the work. For our biggest opportunity for change lies there.
Tripp Babbitt
www.newsystemsthinking.com
Data Affected by Measurement
I enjoyed the article. I wanted to add a comment about considering the quality of the data. In my university days we were taught, in a class on design of experiments, that any measurement affects what you are measuring to some extent. For example, if you want to measure the temperature of a small part, you can do so by attaching a thermocouple, but the mass of the thermocouple will affect the rate of temperature change, and the contact area changes the heat loss in that area. Similarly, making measurements of any sort of data may impact the data. Classic cases are (1) the measurer rounds the measurements or reports them on Mondays even though they occurred on the weekend, and (2) the workers pay more attention to the process being measured, thereby modifying the error rate. Consideration of these aspects may affect the way we collect data or interpret the data collected.