Davis Balestracci


Can You Prove Anything With Statistics?

Maybe... using PARC

Published: Monday, December 14, 2015 - 09:13

“It is impossible to tell how widespread data torturing is. Like other forms of torture, it leaves no incriminating marks when done skillfully. And like other forms of torture, it may be difficult to prove even when there is incriminating evidence.”
—J. L. Mills

When will academics, Six Sigma belts, and consultants wake up and realize that, despite their best efforts, most people in their audiences will not correctly use the statistics they’ve been taught—including many of the teachers themselves?

Sometimes I wonder if they are exacting revenge on their captive audiences for being beaten up on the playground 25 years ago.

The world of clinical publications is a particular hotbed of inappropriate statistics. Many people are guilty of looking for the most dramatic, positive findings in their data, and who can blame them? If study data are manipulated enough, they can be made to appear to prove whatever the investigator wants to prove. When this process goes beyond reasonable interpretation of the facts, it becomes data torturing.

Two types of torture

1. Opportunistic. This involves a) poring over data not collected specifically for the current purpose until an alleged statistically significant association is found between variables, and then b) devising a plausible hypothesis to fit the association.

One can easily find significant results where none exist simply by making multiple comparisons. Using the widely accepted p-value of 0.05 (i.e., a willingness to take a 5-percent risk of declaring something significant when it isn’t), more comparisons mean more opportunities for purely random events to be declared significant. For two tests, the probability that at least one “significant” difference is declared by chance alone is about 10 percent (1 − (0.95 × 0.95)). For 20 tests, it is about 64 percent (1 − 0.95^20).
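The arithmetic can be checked with a few lines of code (a minimal sketch; the 5-percent per-test risk is the significance level assumed above):

```python
def familywise_error(k, alpha=0.05):
    """Probability that at least one of k independent tests, each run at
    per-test significance level alpha, produces a false positive."""
    return 1 - (1 - alpha) ** k

# For two tests the risk is about 10 percent; for 20 tests, about 64 percent.
print(f"{familywise_error(2):.4f}")   # ≈ 0.0975
print(f"{familywise_error(20):.4f}")  # ≈ 0.6415
```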

If one is on a fishing expedition with such a data set—once again, I emphasize that this term applies only to data (usually a tabulation) that weren’t collected specifically for the current purpose—one should at least adjust decision criteria to make the overall risk 5 percent. This significance value is dependent on the number of possible comparisons.

There are several ways to do this, but the simplest says that the threshold to declare significance for two potential individual comparisons should each be p < 0.025. Similarly, for 20 comparisons, this would be p < 0.0025.
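Two standard forms of this adjustment can be sketched as follows (the Bonferroni correction simply divides the overall risk by the number of comparisons; the Šidák correction solves 1 − (1 − p)^k = 0.05 exactly for independent comparisons):

```python
def bonferroni(k, overall=0.05):
    # Conservative: per-comparison threshold = overall risk / number of comparisons.
    return overall / k

def sidak(k, overall=0.05):
    # Exact for k independent comparisons: solves 1 - (1 - p)**k = overall for p.
    return 1 - (1 - overall) ** (1 / k)

print(f"{bonferroni(2):.4f}  {sidak(2):.4f}")    # 0.0250  0.0253
print(f"{bonferroni(20):.4f}  {sidak(20):.4f}")  # 0.0025  0.0026
```

The two agree to within rounding for any realistic number of comparisons, which is why the simpler Bonferroni division is the one usually quoted.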

Further, if the fishing expedition catches a boot, the fishermen should throw it back and not claim that they were fishing for boots. The honest investigator will limit the study to focused questions, all of which make sense in the given context—which can then be subsequently tested with an appropriately designed study. The data torturer will act as if every positive result confirmed a major hypothesis.

Unfortunately, when this type of data torturing is done well, it may be impossible for readers to tell that the positive association did not spring from an a priori hypothesis.

2. Procrustean. You decide on the hypothesis to be proved, then make the data fit the hypothesis.

This requires selective reporting; one of the most common forms is the intentional suppression of contradictory data. It is more difficult to carry out than opportunistic data torturing, but its results are often more believable if one starts with a popular hypothesis that appears to have been “proven.”

One should suspect data torturing whenever subjects are dropped without a clear reason, or when a large proportion of subjects are excluded for any reason. One should also ask, “Is the rationale for the subgroup analyses convincing?”

In the case of medicine, remember that two sexes, multiple age groups, and different clinical features such as stages of disease make it possible for the investigators to examine the data in many different ways.

If a drug is reported as working only in women older than 60 years, the savvy reader should at least suspect a chance finding.

Do a PARC analysis, and you get...

The delightful applied-science statistician J. Stuart Hunter invented the term PARC to characterize a lot of what is being taught and practiced: “practical accumulated records compilation” on which one does a “passive analysis by regressions and correlations.” Then, to get it published, one must do the “planning after the research is already completed.”

With the current plethora of friendly computer packages that are designed to “delight” their customers, I have also coined the characterization “profound analysis relying on computers.”

Here is an enlightening quote by Walter A. Shewhart from his classic book, Economic Control of Quality of Manufactured Product (Martino Fine Books, 2015 reprint):

“You go to your tailor for a suit of clothes, and the first thing that he does is to make some measurements; you go to your physician because you are ill, and the first thing he does is to make some measurements. The objects of making measurements in these two cases are different. They typify the two general objects of making measurements. They are:
• To obtain quantitative information [only]
• To obtain a causal explanation of observed phenomena.”

These are two entirely different purposes. For example, when I’m being fitted for a suit, I don’t expect my tailor to take my waist measurement and then ask, “Does your mother have, or did she ever have, Type II diabetes?” The tailor doesn’t care about the genetic process that produced my body; he or she just measures it (once), then makes my suit.

I vividly remember a newspaper article that appeared when I lived in Minnesota more than 20 years ago, titled, “Whites May Sway TV Ratings.” It read:

“...[An] associate professor and Chicago-based economist reviewed TV ratings of 259 basketball games.... They attempted to factor out all other variables such as the win-loss records of teams and the times games were aired [my emphasis].... The economists concluded that every additional 10 minutes of playing time by a white player increases a team’s local ratings by, on average, 5,800 homes.”

Hence, Minnesotans are bigots! What do you think?

Isn’t the objective of TV ratings solely to find out how many people watched a particular show (i.e., “making a suit”), period? Is the data-collecting agency trying to determine racial viewing patterns during basketball games (i.e., causal explanation)? Hardly.

When “data for a suit” (i.e., most tabulated statistics) are used to make a causal inference, that’s asking for trouble. This is why a lot of published research is, in essence, PARC spelled backwards—which was Hunter’s ultimate point. People are doing PARC analyses on data that are the “continuous recording of administrative procedures.”
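Hunter’s point is easy to demonstrate with a simulation (a hypothetical sketch; the data here are pure random noise, not any real record). Correlate a noise “response” with 20 noise “predictors” and count how many clear p < 0.05 — on average, about one of the 20 will:

```python
import random
from math import sqrt
from statistics import NormalDist

def pearson(x, y):
    """Pearson correlation coefficient, computed directly."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def fishing_expedition(n_predictors=20, n_obs=500, alpha=0.05, rng=None):
    """Correlate a noise response with noise predictors; count 'significant' hits."""
    rng = rng or random.Random()
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_predictors):
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        r = pearson(x, y)
        # t statistic for H0: no correlation; with hundreds of observations
        # the normal approximation to the t distribution is adequate.
        t = r * sqrt((n_obs - 2) / (1 - r * r))
        p = 2 * (1 - NormalDist().cdf(abs(t)))
        hits += p < alpha
    return hits

print("'Significant' predictors found in pure noise:",
      fishing_expedition(rng=random.Random(2015)), "of 20")
```

Run it a few times without the fixed seed and the count bounces around its expected value of one — and every hit would be a publishable “finding” if the fisherman keeps the boot.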

Speaking of data torturing, when are teachers of statistics going to stop torturing their students as well?


About The Author


Davis Balestracci

Davis Balestracci is a past chair of ASQ’s statistics division. He has synthesized W. Edwards Deming’s philosophy as Deming intended—as an approach to leadership—in the second edition of Data Sanity (Medical Group Management Association, 2015), with a foreword by Donald Berwick, M.D. Shipped free or as an ebook, Data Sanity offers a new way of thinking using a common organizational language based in process and understanding variation (data sanity), applied to everyday data and management. It also integrates Balestracci’s 20 years of studying organizational psychology into an “improvement as built in” approach as opposed to most current “quality as bolt-on” programs. Balestracci would love to wake up your conferences with his dynamic style and entertaining insights into the places where process, statistics, organizational culture, and quality meet.