Davis Balestracci


Some Final Thoughts on DOE—for Everyone

It’s all in the planning

Published: Monday, October 17, 2016 - 13:21

Client A came to me for a consultation and told me upfront his manager would allow him to run only 12 experiments. I asked for his objective. When I informed him that it would take more than 300 experiments to test his objective, he replied, “All right, I’ll run 20.”

Sigh. No, he needed either to redefine his objectives or not run the experiment at all.

I never saw him again.

Client B came to me with what he felt was a clearly defined objective. He thought he just needed a 10-minute consult for a design template recommendation. It actually took three consults with me, totaling 2 1/2 hours, because I asked questions similar to those required for planning the experiment I wrote about in my column from September 2016.

During the first two consults, Client B would often say, “Oh... I didn’t think of that. I’ll need to check it out.” He eventually ran the experiment, came to me with the data, and asked, “Could you have the analysis next week?” I asked him to sit down and was able to finish the analysis (including contour plots) in about 20 minutes.

It’s all in the planning
To review, if your objective is to establish effects, any good design needs answers to three questions as part of the planning:
• What risk are you willing to take in declaring an effect significant when it isn’t? (Usually 5%.)
• What is the threshold minimum difference you must detect to take the action you want?
• If this difference exists, how badly do you want to detect it?

If you don’t formally consider these questions, your design will answer them by default. As you’ve seen in the past few columns, this can result in some eye-popping sample sizes.

Power: the probability of successfully detecting your desired difference, if it exists

For the previously calculated sample sizes, I assumed the usual desired significance level of 0.05 for testing the effects. Many people blindly use this as their only statistical criterion without formally asking the other questions. They naively run a design and t-test the results to declare them significant or not, but they have no idea what minimum effect their design was implicitly constructed to detect.

So how did I obtain those sample sizes? (See the sample-size calculation below.) Once again using the hypothetical tar-scenario conversation from my May 2016 column, “90 Percent of DOE [design of experiments] Is Half Planning,” I made an assumption I didn’t tell you about that answers the question, “How badly do I want to detect my desired effect?” If you want to detect a 1-percent difference, it will take 680 experiments to have a 90-percent chance of detecting it—if it exists. The more relaxed sample size of 500 gives you an 80-percent chance of detecting your desired difference. The same is true for the 170 and 130 experiments required, respectively, to detect a 2-percent difference.

This concept is called “power,” a design’s ability to detect your desired threshold difference if it exists. Or, as I heard a statistics professor say tongue-in-cheek to a Ph.D. student planning an overly ambitious experiment using an inappropriately small sample size, “Power is the probability you will get a thesis.” Things subsequently simplified quite a bit!

Let’s look at some designs people might use, naively expecting to detect a 1-percent difference in the tar scenario:
• Running four experiments would have a 5.5-percent chance of detecting the 1-percent difference if it existed
• Running eight experiments would have a 6.5-percent chance
• Running 16 experiments would have a 7.5-percent chance

As you see, one can work backward to obtain the power of a design. It’s not unusual for clients to be surprised (and disappointed) at the answers.
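The column doesn’t show the arithmetic behind these power figures, but a normal-approximation sketch reproduces them closely. In the sketch below, the effect is a difference between two means of n/2 runs each, the 1-percent tar difference is taken as roughly R = 0.25 standard deviations (consistent with the 680-run figure above), and the function name is mine, not the author’s:

```python
import math

def power_normal_approx(n_total, r):
    """Approximate power of a two-sided 5% test of one effect in an
    unreplicated two-level design with n_total runs (half at each
    level), where r = desired effect / SDprocess.
    """
    z_crit = 1.96  # two-sided 5% significance
    # The effect estimate is a difference of two means of n_total/2
    # runs each, so its standard error is 2*SD/sqrt(n_total) and the
    # effect expressed in z-units is (r/2)*sqrt(n_total).
    z_effect = (r / 2.0) * math.sqrt(n_total)
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return phi(z_effect - z_crit) + phi(-z_effect - z_crit)

# R = 0.25 corresponds to the 1-percent tar difference
for n in (4, 8, 16):
    print(f"{n} runs: {100 * power_normal_approx(n, 0.25):.1f}% power")
```

This gives values in the same ballpark as the 5.5, 6.5, and 7.5 percent quoted above; the small gaps come from using the normal rather than the t distribution for these very small designs.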

When consulting with the client who initially could afford only 12 experiments but agreed to run 20, my sample size of 300 experiments resulted from his original objective of wanting to detect a minimum effect of approximately (0.325 x SDprocess) with 90-percent power. (SDprocess = standard deviation of process being studied, of which he had a historical estimate.)

With his proposed 20 experiments, he would be able to detect a minimum effect of approximately (1.6 x SDprocess); or, to turn things around, the power to detect his desired threshold difference of (0.325 x SDprocess) would be approximately 9.3 percent.

If you have a relatively good estimate of your process standard deviation (i.e., SDprocess):
• A 2 x 2 unreplicated factorial design can detect a ((2.8 to 3.3) x SDprocess) difference (80% and 90% power, respectively)
• A 2 x 2 x 2 unreplicated factorial can detect a ((2.0 to 2.3) x SDprocess) difference
• 16 experiments can detect a ((1.4 to 1.6) x SDprocess) difference

For those of you interested, here is the sample-size calculation.

Let R = ratio of (desired effect / SDprocess):

For 80-percent power: N total = (5.6 / R)**2 [5.6 divided by your ratio R, with the result squared]

For 90-percent power: N total = (6.5 / R)**2
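These two rules of thumb are easy to put into code, and inverting them recovers the detectable-effect ranges in the bullets above. A minimal sketch (the function names are mine, not the author’s):

```python
import math

def n_total(r, power=90):
    """Total runs needed to detect an effect of r x SDprocess at a
    5% significance level, using the rule-of-thumb constants from
    the text: 5.6 for 80-percent power, 6.5 for 90-percent power."""
    k = {80: 5.6, 90: 6.5}[power]
    return (k / r) ** 2

def detectable_effect(n, power=90):
    """The inverse: smallest effect (in SDprocess units) an n-run
    design can detect at the given power."""
    k = {80: 5.6, 90: 6.5}[power]
    return k / math.sqrt(n)

# Tar scenario (R = 0.25 for a 1-percent difference):
print(round(n_total(0.25, 90)))  # 676, quoted in the column as ~680
print(round(n_total(0.25, 80)))  # 502, quoted in the column as ~500
# Unreplicated factorials, matching the bulleted ranges:
print(detectable_effect(4, 80), detectable_effect(4, 90))    # 2.8 and 3.25
print(detectable_effect(16, 80), detectable_effect(16, 90))  # 1.4 and 1.625
```

The exact values (676, 502) land close to the rounded figures quoted earlier, which suggests the column simply rounded for readability.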

Some pretty good DOE rules

From my experience with factorial designs:
• 16 experiments is a “pretty good” (and relatively affordable) number. I rarely ran 32.
• If you’re going to run 16 experiments, you may as well study five variables, if you can.
• If you have only three variables, think about a Box-Behnken design, if appropriate (it wouldn’t be appropriate for the factorial example in my column, “Two More Lurking Mess-Ups for Any Experiment, Designed or Not” in September 2016, because all three variables were “yes or no.”)
• If you have more than five variables and can afford only 16 experiments, then consider this: If you have six variables, you may as well study eight. This strategy also allows you to screen out variables that don’t seem to be important, which often gets the reaction, “Wait a minute—I know that variable is important!” A variable may indeed be important, but in this case, all “insignificant” means is that it exerts no effect or interaction in the specific range studied—which was chosen for a reason (i.e., the objective). Set the insignificant variable outside the studied range, and don’t be surprised at what happens. When variables are screened out, the design can usually be augmented easily to estimate the remaining variables’ interactions and, if a contour plot is desired, augmented even further to yield the quadratic equation to plot.
• Make important decisions regarding discrete (noncontinuous) variables, e.g., catalyst A vs. catalyst B, before continuing on to a contour plot. Since the design in last month’s column used all discrete variables (e.g., yes or no: Did the patient keep a food diary?), this contour-plot option isn’t applicable there.

Note: This strategy is sequential and many times builds upon experiments already run. No data are wasted.

For more information, see R. D. Moen, T. W. Nolan, and L. P. Provost’s book, Quality Improvement Through Planned Experimentation (McGraw-Hill, 2012).

To summarize, the act of designing an experiment is composed of four parts:
1. Deciding what you need to find out or demonstrate
2. Estimating the amount of data required
3. Anticipating what the resulting data will be like
4. Anticipating what you will actually do with the finished data

Thanks for indulging me in this design of experiments (DOE) tangent over the last six columns. I hope you learned a few things, especially the statistical trainers and belts among you. Back to improvement next time.


About The Author


Davis Balestracci

Davis Balestracci is a past chair of ASQ’s statistics division. He has synthesized W. Edwards Deming’s philosophy as Deming intended—as an approach to leadership—in the second edition of Data Sanity (Medical Group Management Association, 2015), with a foreword by Donald Berwick, M.D. Shipped free or as an ebook, Data Sanity offers a new way of thinking using a common organizational language based in process and understanding variation (data sanity), applied to everyday data and management. It also integrates Balestracci’s 20 years of studying organizational psychology into an “improvement as built in” approach as opposed to most current “quality as bolt-on” programs. Balestracci would love to wake up your conferences with his dynamic style and entertaining insights into the places where process, statistics, organizational culture, and quality meet.


Using Standard Deviation of Process to estimate sample size

I believe the relevant standard deviation to be concerned with (when calculating the number of replicates) is that due to experimental error, i.e., the standard deviation among replicate measurements. This is often much less than the typical process variation we might see, assuming we are following good experimental practices (one operator, a single lot of material, or using blocking and covariates to manage variation due to these nuisance sources in the experiment). Of course, it's possible that experimental error could exceed normal process variation if going through the various setups introduces unintended variation.

Which variation to use?

Thanks for commenting, Steven.  I see what you mean, BUT...what happens when you try to take your results from a tightly controlled experiment into the real world environment, i.e., multiple operators and lots of material that combine at random? That's the REALITY of implementation and cannot be controlled.

A more robust approach might be judicious blocking of these nuisance (random, but very real) factors. The resulting variation would be larger than with your approach, but more realistic, and not as bad as failing to leverage the power of blocking; control charts managing the process would detect special causes among those factors with the appropriate variation "yardstick." This variation is also not as naively low as that obtained by controlling factors that are realistically uncontrollable.

The result you get from such a design is good only for your specific designed conditions, i.e., you have the result for THIS specific operator for THIS specific lot (enumerative).  How does that help you?  What is your theory about putting this result in the real world and not a lab?

As W. Edwards Deming always asked, "What can you predict?" How robust is your result? And this gets into the question: How is variation going to manifest in your results (analytic), i.e., multiple operators, multiple lots, and factors you didn't even envision affecting your result? Control charts are the way to shed light on these factors and increase your degree of belief in your study's validity.


What does DOE stand for in this instance?

Nice article, but not sure what DOE stands for in this instance

I've found it's a good practice to include what the acronym means the first time it's used and then use the acronym later, especially for any professional document.

DOE does not appear to mean Department of Energy or Education... maybe "Depending on Experience"... but not real sure. Maybe something related to operations.

Thanks again for the nice article, but I spent too much time trying to figure out what the acronym might stand for.


Good catch

Nice catch. We have now spelled out "design of experiments" on first reference in the article. We don't do that in headlines, but we normally catch it in the article itself. Thanks