Davis Balestracci

Health Care

The Famous DOE Question

How many experiments should I run?

Published: Monday, August 15, 2016 - 00:00

I hope this little diversion into design of experiments (DOE) over my last few columns has helped clarify some things that may have been confusing. Even if you don’t use DOE, there are still some good lessons here about understanding the ever-present, insidious, lurking cloud of variation.

Building on my June column, consider another of C. M. Hendrix’s “ways to mess up an experiment”: insufficient data to average out random errors (i.e., a failure to appreciate the toxic effect of variation).

This is where the issue of sample size comes in, and it’s by no means trivial.

How many experiments should I run? It depends.

The ability to detect effects depends on your process’s standard deviation, which in the tar simulation from my May column was ±4 (the real process was actually ±8).

Here’s a surprising reality for many: The number of variables doesn’t necessarily determine the number of experiments. But let’s continue the tar scenario:

“Three variables? It’s obvious: Let’s run a 2 x 2 x 2 factorial.” (Eight experiments.)
Most people might not realize that this design would allow detection of an approximate 8 percent to 9 percent difference between the high and low levels of a variable, e.g., the average difference in tar if one goes from 55° to 65°, or 26 percent to 31 percent copper sulfate, or 0 percent to 12 percent excess nitrite.

“I want to do only four experiments, so I’ll do a 2 x 2 factorial and study the other variables later.”
There are consequences! Running an unreplicated 2 x 2 (four experiments) on two of the variables (e.g., omitting excess nitrite) would allow detection of an 11 percent to 13 percent difference between the high and low variable settings. Interactions between excess nitrite and each of the other two variables would be unknown.

“What do you mean, replicate it?”
Replication of your 2 x 2 x 2 factorial (16 experiments) would then allow detection of an approximate 5.5 percent to 6.5 percent difference. To get this same accuracy with two variables, the 2 x 2 factorial would have to be repeated three more times (16 total experiments)—a wasted opportunity.

If you knew up front that 16 experiments would be needed for your objective, you could easily include excess nitrite. You could even add two more variables (five total; perhaps those two you were planning to “study later”?) with no serious consequences.
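If you’d like to see where those detectable-difference numbers come from, here is a minimal sketch in Python (my own illustration, not part of the original columns) of the standard sample-size approximation. It assumes the simulation’s standard deviation of 4 percent tar, a two-sided 5 percent significance level, and statistical power between 80 and 90 percent; the “effect” of a variable in a two-level design is the difference between the average of its high-level runs and the average of its low-level runs.

from scipy.stats import norm

sigma = 4.0   # standard deviation of % tar used in the May column simulation
alpha = 0.05  # two-sided significance level (an assumption, not stated in the column)

def detectable_difference(n_runs, power):
    # Smallest high-vs-low difference a two-level design with n_runs total runs
    # can reliably detect: (z_alpha/2 + z_power) * 2 * sigma / sqrt(n_runs)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * 2 * sigma / n_runs ** 0.5

for n in (4, 8, 16):
    low = detectable_difference(n, power=0.80)
    high = detectable_difference(n, power=0.90)
    print(f"{n:2d} runs: ~{low:.1f}% to {high:.1f}%")
# 4 runs: ~11.2% to 13.0%   (unreplicated 2 x 2)
# 8 runs: ~7.9% to 9.2%     (unreplicated 2 x 2 x 2)
# 16 runs: ~5.6% to 6.5%    (replicated 2 x 2 x 2)

With those assumptions, the results land on the ranges quoted above: roughly 11 to 13 percent for four runs, 8 to 9 percent for eight, and 5.5 to 6.5 percent for sixteen.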

Wouldn’t it be nice to discover up front that the excess nitrite could subsequently be set to zero? It’s your decision, and it depends on the answer to this question: What size effect must you detect to take a desired action?

I’ve often had this conversation in various guises:

Client: I have three variables I can test. Given the potential cost savings for each percent of tar reduction, I need to detect a 1 percent difference.

Davis: Sit down. I’m afraid I have some bad news. That would require 500 to 680 experiments, depending on how badly you want to detect that effect.

Client: Ohhh... what if I cut it down to two variables?

Davis: Sorry. Still 500 to 680.

Client: Really? OK, I’ll settle for detecting 2 percent.

Davis: You’d better stay sitting. That would now require 130 to 170 experiments. But wait, let’s chat some more. Under the right circumstances, I might be able to recommend a 15-run design (a three-variable Box-Behnken) that will map the region; or, should you wish to study two additional variables, there is a five-variable design that would let you study those two variables and map the region in 33 experiments (believe it or not, four variables would also take about 30 experiments).
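Those run counts come from the same approximation, turned around to solve for the number of runs. Again, this is my own illustrative sketch, assuming a standard deviation of 4 percent tar, a two-sided 5 percent significance level, and 80 to 90 percent power (the range reflecting “how badly you want to detect that effect”):

from scipy.stats import norm

sigma = 4.0   # standard deviation of % tar (assumed, from the simulation)
alpha = 0.05  # two-sided significance level (assumed)

def runs_needed(delta, power):
    # Total runs for a two-level design to detect a high-vs-low difference
    # of delta: (2 * sigma * (z_alpha/2 + z_power) / delta) ** 2
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return (2 * sigma * z / delta) ** 2

for delta in (1.0, 2.0):
    low, high = runs_needed(delta, power=0.80), runs_needed(delta, power=0.90)
    print(f"detect {delta:.0f}%: about {low:.0f} to {high:.0f} runs")
# detect 1%: about 502 to 672 runs
# detect 2%: about 126 to 168 runs

Notice that the number of variables never appears in the formula; only the size of the effect you must detect, the process noise, and how sure you want to be drive the answer, which is why dropping from three variables to two didn’t help the client at all.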

Why the dramatic difference in numbers? It depends on your objective, which brings me to another Hendrix “mess up”: Establishing effects (factorial) when the objective was to optimize (response surface), or vice versa.

People think it’s as simple as running a factorial design based on the number of variables, and then performing statistical t-tests. It’s not.

A healthcare example—for everyone

Suppose you’re interested in examining three components of a weight-loss intervention:
• Keeping a food diary (yes or no)
• Increasing activity (yes or no)
• Home visit (yes or no)

You plan on randomly assigning individuals to one of the eight experimental conditions, each representing a different treatment protocol. For example, the individuals randomly assigned to Condition 2 would receive a home visit, but neither of the other two intervention components. Those randomly assigned to Condition 7 would receive the “keeping a food diary” and “increasing physical activity” components, but wouldn’t receive a home visit. People assigned to Condition 1 would have to rely on sheer willpower.
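To make the eight conditions concrete, here is a small sketch (again my own illustration) that enumerates the full 2 x 2 x 2 design matrix; with the home-visit component changing fastest, the numbering matches the conditions described above.

from itertools import product

components = ("Food diary", "Increase activity", "Home visit")

# All yes/no combinations of the three components: 2 x 2 x 2 = 8 conditions
for number, settings in enumerate(product(("no", "yes"), repeat=3), start=1):
    protocol = ", ".join(f"{name}: {level}" for name, level in zip(components, settings))
    print(f"Condition {number}: {protocol}")
# Condition 1: Food diary: no, Increase activity: no, Home visit: no   (sheer willpower)
# Condition 2: Food diary: no, Increase activity: no, Home visit: yes
# ...
# Condition 7: Food diary: yes, Increase activity: yes, Home visit: no
# Condition 8: Food diary: yes, Increase activity: yes, Home visit: yes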

Sounds simple enough.

I happen to be visiting your facility, and you ask me for a sample-size recommendation: How many people will be needed?

I smile and say, “Please sit down.”

To be continued next time.

About The Author

Davis Balestracci

Davis Balestracci is a past chair of ASQ’s statistics division. He has synthesized W. Edwards Deming’s philosophy as Deming intended—as an approach to leadership—in the second edition of Data Sanity (Medical Group Management Association, 2015), with a foreword by Donald Berwick, M.D. Shipped free or as an ebook, Data Sanity offers a new way of thinking using a common organizational language based in process and understanding variation (data sanity), applied to everyday data and management. It also integrates Balestracci’s 20 years of studying organizational psychology into an “improvement as built in” approach as opposed to most current “quality as bolt-on” programs. Balestracci would love to wake up your conferences with his dynamic style and entertaining insights into the places where process, statistics, organizational culture, and quality meet.