© 2024 Quality Digest. Copyright on content held by Quality Digest or by individual authors. Contact Quality Digest for reprint information.

“Quality Digest” is a trademark owned by Quality Circle Institute, Inc.

Published on *Quality Digest* (https://www.qualitydigest.com)

**Published:** 03/05/2012

For those of us practicing improvement in a medical culture, presenting this “funny new statistical way” of doing things to a physician audience triggers a predictable objection: “This isn’t in line with rigorous, double-blind clinical trial research.” Your response should be, “True! Nor could it be, nor should it be.”

The statistical methods of clinical trial research make assumptions and control variation in ways that can’t be replicated in the unstable environment of the real world, which makes them less suitable for improvement work. The same is true in any other work environment.

Most basic academic statistics requirements are taught in a context of “estimation” and cover methods appropriate for research. Unfortunately, these methods have limited applicability in everyday work, which is based on process-oriented thinking (a concept foreign to most academics) and whose need is “prediction.” This difference affects data collection, the use of statistical tools, and the validity of analyses.

Ultimately, a disarmingly and elegantly simple analysis will yield far more profound and productive questions than a typical, overly complicated (alleged) statistical analysis.

Suppose you have been getting an increasing number of vague anecdotes that it “feels” like the cardiac surgery mortality rate has been increasing of late. Not only that, but the organization is not making progress toward attaining a published national benchmark of 3.5 percent mortality rate. There are three hospitals in your system doing this type of surgery, and the tabular summary performance data for the last 30 months is shown below:

You ask your eager local statistical “guru” (LSG) to analyze the data, and you receive a report that states:

1. “Pictures are very important. A comparative histogram was done to compare the distributions of the mortality rates. At a first glance, there seem to be no differences.” (figure 1)

2. “The three data sets were then statistically tested for the assumption of normality. The resulting analysis showed that we can assume each to be normally distributed (p-values of 0.502, 0.372, and 0.234, respectively, all of which are > 0.05); however, we have to be cautious. Just because the data pass the test for normality does not necessarily mean that the data are normally distributed; only that, under the null hypothesis, the data cannot be proven to be non-normal.”

3. “Since the data can be assumed to be normally distributed, I proceeded with the analysis of variance (ANOVA) and generated the 95-percent confidence intervals” (figure 2):

4. “The p-value of 0.850 is greater than 0.05. Therefore, we can reasonably conclude that there are no statistically significant differences among these hospitals’ cardiac mortality rates as further confirmed by the overlapping 95-percent confidence intervals.”

5. “Regarding comparison to the national benchmark of 3.5 percent, none of the hospitals are close to meeting it. There will need to be a systemwide intervention at all three hospitals. I recommend that we benchmark an established hospital and copy their best practices systemwide.”

Has all the potential jargon been utilized? This includes: mean, median, standard deviation, normal distribution, histogram, p-value, analysis of variance (ANOVA), 95-percent confidence interval, null hypothesis, statistical significance, F-test, degrees of freedom, and benchmark.

Do you realize that this LSG’s analysis is *totally* worthless?

Here are three questions that should become a part of every improvement professional’s vocabulary whenever faced with a set of data for the first time:

1. How were these data defined and collected, and were they collected specifically for the current purpose?

2. Were the processes that produced these data stable?

3. After considering No. 1 and No. 2, were any analyses appropriate?

• How were these data collected?

The table was a descriptive statistical summary of the 30 previous months of cardiac mortality rates for three hospitals. These hospitals all subscribed and fed into the same computerized data collection process, so at least the definitions are consistent.

• Were the systems that produced these data stable?

This might be a new question for you. There are two key concepts to any robust improvement process:

1. *Everything* is a process.

2. *All* processes occur over time.

Hence, all data have an implicit “time order” element that allows a necessary assessment of the stability of the process or system producing the data.

It is always a good idea, as an initial analysis, to plot any data in their naturally occurring time order to assess the stability of the process formally. This was not done for this set of data. When this step is skipped, as you will see, many common statistical techniques can be rendered invalid, which puts one at risk of taking inappropriate actions.

• Were the analyses appropriate, given the way the data were collected and the stability state of the systems?

“But the data passed the Normal distribution test. Isn’t that all you need to know to proceed with the standard statistical analysis?” you ask. Early in my career, I believed this.

And your LSG also concluded that there were no statistically significant differences amongst the hospitals’ mortality rates.
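One way to see why such a conclusion says nothing about stability is that the ANOVA F statistic depends only on which values fall in which group, not on the order in which they occurred. The following is a minimal sketch under stated assumptions: the monthly rates are invented for illustration (they are not the hospitals’ data), and `one_way_anova_f` is a hypothetical helper written here, not a library function.

```python
import math
import random
from statistics import fmean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean([x for g in groups for x in g])
    group_means = [fmean(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = math.fsum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual points around their own group mean.
    ss_within = math.fsum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical monthly mortality rates (percent), invented for illustration.
hospital_a = [4.1, 3.8, 4.5, 4.0, 3.6, 6.1, 5.8, 6.4, 5.9, 6.3]  # jumps after month 5
hospital_b = [5.6, 5.2, 5.9, 5.0, 5.5, 5.7, 5.3, 5.4, 5.8, 5.1]  # no obvious jump

# The same values for hospital A, placed in a scrambled time order.
a_scrambled = random.Random(1).sample(hospital_a, len(hospital_a))

f_original = one_way_anova_f([hospital_a, hospital_b])
f_scrambled = one_way_anova_f([a_scrambled, hospital_b])

print(f"F with the real time order:      {f_original:.6f}")
print(f"F with a scrambled time order:   {f_scrambled:.6f}")
```

The two F values agree, so a jump halfway through the series is invisible to this analysis. The same holds for a histogram or any other summary that discards time order.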

Here are the three simple time plots for the individual hospitals. The individual median of each hospital’s 30 data points has been added as a reference line, making them run charts (figure 3):

Note that just by “plotting the dots,” you have far more insight.
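What “plotting the dots” against a median adds can be sketched without any plotting library by measuring the longest run of consecutive points on one side of the median. This is only an illustration under stated assumptions: the 30 monthly values are invented, `longest_run_about_median` is a hypothetical helper, and the threshold of eight points is one commonly cited rule of thumb that varies by author.

```python
from statistics import median


def longest_run_about_median(values):
    """Return (median, longest run), where a run is a streak of consecutive
    points on the same side of the median. Points exactly on the median are
    skipped: they neither extend nor break a run."""
    center = median(values)
    longest = current = 0
    previous_side = 0
    for v in values:
        side = (v > center) - (v < center)  # +1 above, -1 below, 0 on the median
        if side == 0:
            continue
        current = current + 1 if side == previous_side else 1
        previous_side = side
        longest = max(longest, current)
    return center, longest


# Hypothetical 30 months of mortality rates (percent) with a jump after month 15.
low = [4.1, 3.8, 4.5, 4.0, 3.6, 4.3, 4.2, 3.9, 4.4, 3.7, 4.0, 4.6, 3.5, 4.1, 4.2]
high = [6.1, 5.8, 6.4, 5.9, 6.3, 6.0, 5.7, 6.2, 6.5, 5.6, 6.1, 5.9, 6.3, 6.0, 5.8]
shifted = low + high

# The same 30 values, rearranged so low and high months are intermixed.
pattern = "LLHLHHHLLHLHHLLLHHLHLLHHLHHLLH"
low_iter, high_iter = iter(low), iter(high)
intermixed = [next(low_iter) if s == "L" else next(high_iter) for s in pattern]

SHIFT_RULE = 8  # assumed rule of thumb; some authors use a different count

for name, series in [("shifted", shifted), ("intermixed", intermixed)]:
    center, run = longest_run_about_median(series)
    flag = "possible shift" if run >= SHIFT_RULE else "no shift signal from this rule"
    print(f"{name}: median={center:.2f}, longest run={run} ({flag})")
```

Both series have identical values, and therefore identical medians, histograms, and summary tables. Only the time-ordered view separates the series with a sustained jump from the one without.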

Won’t this result in the ability to ask more incisive questions whose answers will lead to more productive system improvements?

Compare this to outputs typically encountered, such as bar graphs, pages of summary tables, and the “sophisticated” statistical analyses full of jargon. From your experience, what questions do people ask from those? Are they generally even helpful? Does anything change as a result?

Health care workers are very smart people. Unfortunately, they will, with the best of intentions, come up with theories and actions that could unwittingly harm a system. Or worse yet, they might do nothing because “there are no statistical differences” among the systems. Or they might decide, “We need more data.” Without common theory, there *will* be variation in how a roomful of people perceive and want to act on variation.

A potential new conversation will be shared in my next column that will once again shed light on my favorite answer to: “What should we do?”—“It depends.”
