Davis Balestracci

Quality Insider

Big Data Have Arrived

But do you know their quality and origin?

Published: Monday, October 13, 2014 - 09:19

There has been an explosion in new technology for acquiring, storing, and processing data. The “big data” movement (and its resulting sub-industry, data mining) is becoming more prevalent and having major effects on how quality professionals and statisticians do their jobs.

Big data are a collection of data sets that are too large and complex to be processed using traditional database and data-processing tools. Any change this big will require new thinking. However, one thing won’t change, and it now becomes more important: knowing where your data came from. My respected colleagues Ron Snee and Roger Hoerl call this an “inquiry on pedigree,” which asks whether you know the quality and origin of your data well enough to answer the following questions:
• What was the original objective of these data, if any?
• How were these data defined and collected?
• What was the state of the processes that produced these data—both the data process itself and the process by which data were collected?

Guilty until proven innocent

At Duke University, what people naively termed “clerical errors” led to the retraction of four cutting-edge research papers. Three trials using the results of these papers were subsequently shut down, and several high-level resignations followed.

The analysis wasn’t the issue; data quality was.

It is poor practice to rely on whatever data happen to be available or to assume that sophisticated analytics can overcome poor data quality. Statistical textbooks rarely address this issue; they almost universally assume “random samples,” which in practice are the exception rather than the rule. The fact that data reside in electronic files says nothing about their quality. Observational data are teeming with reproducibility issues, especially if they result from merging many different sources.

Diverse data sets collected by multiple sources are rife with opportunity for missing values, missing variables, measurement variation, and definition conflicts.
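As a rough illustration of what such an audit might look like before any merging is attempted, here is a minimal Python sketch. The function name `audit_sources`, the two plants, and every field name are hypothetical, invented for this example only. It reports, for each source, which variables are absent altogether and how many values are blank.

```python
from collections import defaultdict


def audit_sources(sources):
    """Summarize missing variables and missing values per source.

    `sources` maps a source name to a list of record dicts.
    All names here are hypothetical, for illustration only.
    """
    # Union of every field name seen in any source.
    all_fields = set()
    for records in sources.values():
        for rec in records:
            all_fields.update(rec)

    report = {}
    for name, records in sources.items():
        seen = set()
        missing_values = defaultdict(int)
        for rec in records:
            seen.update(rec)
            for field, value in rec.items():
                # Treat None and empty strings as missing values.
                if value is None or value == "":
                    missing_values[field] += 1
        report[name] = {
            # Fields other sources record but this one never does.
            "missing_variables": sorted(all_fields - seen),
            "missing_values": dict(missing_values),
        }
    return report


# Made-up records: the two plants disagree on how length is defined.
plant_a = [
    {"part_id": "A1", "length_mm": 25.4, "operator": "kim"},
    {"part_id": "A2", "length_mm": None, "operator": "lee"},
]
plant_b = [
    {"part_id": "B1", "length_in": 1.0},
    {"part_id": "B2", "length_in": 1.01},
]

report = audit_sources({"plant_a": plant_a, "plant_b": plant_b})
for source, summary in report.items():
    print(source, summary)
```

In this made-up case the definition conflict surfaces as two "missing variables": one plant records length in millimeters, the other in inches. A tool can flag the mismatch, but deciding how to reconcile it still takes someone who knows how each plant measures.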

Knowing how the data were collected is critical to performing the correct analysis because the computer will do anything you want. In published studies based on observational data, practical accumulated records compilations (PARC) are commonly analyzed using passive analysis by regressions and correlations (also PARC). The results are generally PARC spelled backwards. Correlations are often mistaken for cause and effect. Models obtained through inappropriate use of regression analysis have poor predictive capability, and other investigators can’t reproduce their results.
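To make the correlation trap concrete, here is a small simulated sketch in Python. Everything in it is an assumption made for illustration: the data are synthetic, and the variable names (`z` as an unrecorded common driver, `x` and `y` as the two measured quantities) are hypothetical. Neither `x` nor `y` influences the other, yet they correlate strongly because both follow `z`.

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


def residuals_after(target, driver):
    """Residuals of a simple least-squares fit of target on driver."""
    n = len(driver)
    mean_d, mean_t = sum(driver) / n, sum(target) / n
    slope = sum(
        (d - mean_d) * (t - mean_t) for d, t in zip(driver, target)
    ) / sum((d - mean_d) ** 2 for d in driver)
    return [t - (mean_t + slope * (d - mean_d)) for d, t in zip(driver, target)]


random.seed(2014)  # fixed seed so the simulation is repeatable

# Synthetic data: z drives both x and y; x and y never affect each other.
z = [random.gauss(0, 1) for _ in range(2000)]
x = [v + random.gauss(0, 0.5) for v in z]
y = [v + random.gauss(0, 0.5) for v in z]

r_raw = pearson(x, y)
# Remove the part of x and y explained by z, then correlate what is left.
r_adj = pearson(residuals_after(x, z), residuals_after(y, z))

print(f"correlation of x and y:            {r_raw:.2f}")
print(f"after accounting for the driver z: {r_adj:.2f}")
```

The adjustment is possible here only because `z` was recorded. In a passive compilation of records, the common driver is usually missing from the file, which is why the strong correlation invites a causal story that the process never supported.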

Subject matter knowledge is crucial. One must constantly check the data and results with the “does this make sense” test, aided by extensive use of appropriate graphical displays from the beginning to the end of any project. Look for:
• Data that are clearly wrong, such as grossly inappropriate values or impossibilities (“pregnant males,” for example)
• Results and trends that don’t make sense given the technical background of the problem
• Missing information and data critical to a useful analysis and making sound conclusions
• In surveys, blank entries that might be truly missing data or might represent zero values
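Some of these checks can be partly automated as a first pass. The following is a minimal Python sketch under stated assumptions: the field names (`sex`, `pregnant`, `age`, `visits`), the age limits, and the function `sanity_check` are all hypothetical, chosen only to mirror the list above. It flags candidates for review and cannot replace the judgment of someone who knows the process.

```python
def sanity_check(records):
    """Return (row index, message) pairs for records that fail basic checks.

    Field names and limits are hypothetical, for illustration only.
    """
    issues = []
    for i, rec in enumerate(records):
        # Impossible combination, as in the "pregnant males" example.
        if rec.get("sex") == "M" and rec.get("pregnant") is True:
            issues.append((i, "impossible: pregnant male"))

        # Missing or grossly inappropriate values.
        age = rec.get("age")
        if age is None:
            issues.append((i, "missing: age"))
        elif not 0 <= age <= 120:
            issues.append((i, f"out of range: age={age}"))

        # A blank count could mean "zero" or "not recorded".
        if rec.get("visits") is None:
            issues.append((i, "ambiguous: visits blank (zero or not recorded?)"))
    return issues


# Made-up survey records.
records = [
    {"sex": "F", "pregnant": True, "age": 31, "visits": 2},
    {"sex": "M", "pregnant": True, "age": 45, "visits": 0},
    {"sex": "F", "pregnant": False, "age": 450, "visits": None},
    {"sex": "M", "pregnant": False, "age": None, "visits": 1},
]

issues = sanity_check(records)
for row, message in issues:
    print(row, message)
```

Note that the recorded zero in the second record passes the `visits` check while the blank in the third does not. Whether that blank should be read as zero is a question for whoever collected the data, not for the code.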

Especially if significant conclusions are made, do you really understand how the data were collected? If you needed to, could you trace back and identify the origin of each data point?


Hand-in-hand with the explosion of big data will be an industry trying to sell you solutions for analyzing them. In fact, a client of mine sent me the figure below:

Figure 1: Performance management “eye chart” (not to be confused with an I-chart, i.e., a control chart for individuals).

The vendor promises highly visual, easy-to-understand information that will allow insight into not only the level of employee engagement, but also the level of effective leadership within each department. The vendor claims its analyses are statistically robust and repeatable over many years. Note how it “delights its customers” by also throwing in red, yellow, and green displays.

This vendor makes another claim that is possibly good news, but for a different reason: This one tool is a fraction of the cost of a typical survey. Well, at least you could save money on all those ongoing silly customer-satisfaction surveys.

As W. Edwards Deming replied to an executive who bragged that he’d just bought a $3 million computer, “Too bad. What you need is $300,000 worth of brains.”


About The Author


Davis Balestracci

Davis Balestracci is a past chair of ASQ’s statistics division. He has synthesized W. Edwards Deming’s philosophy as Deming intended—as an approach to leadership—in the second edition of Data Sanity (Medical Group Management Association, 2015), with a foreword by Donald Berwick, M.D. Shipped free or as an ebook, Data Sanity offers a new way of thinking using a common organizational language based in process and understanding variation (data sanity), applied to everyday data and management. It also integrates Balestracci’s 20 years of studying organizational psychology into an “improvement as built in” approach as opposed to most current “quality as bolt-on” programs. Balestracci would love to wake up your conferences with his dynamic style and entertaining insights into the places where process, statistics, organizational culture, and quality meet.


Problems with getting Accurate and Complete Data

Mr. Balestracci has hit the nail on the head - no data analytics solution will be able to yield valid results unless diligence has been done on the data. Is the data complete, self-consistent, and accurate? Many of the test and quality datasets that our team reviews have vast disparities in data formats and schemas before being synchronized. A lot of companies try to implement a system themselves, only to find that the data is not properly formatted and scrubbed to provide the analytics that they need. Even some commercial companies are more "glitz" than substance.

IntraStage's customers are incorporating manufacturing data from multiple sources and formats - manufacturing floors, contract manufacturers, suppliers, field data, failure analysis, and engineering - to provide a full view of quality across the entire product lifecycle. Check out www.intrastage.com to learn more about how test and quality analytics are being used to improve yield and product quality by our customers today.