## Some Final Thoughts on DOE—for Everyone

### It’s all in the planning

Published: Monday, October 17, 2016 - 14:21

Client A came to me for a consultation and told me upfront his manager would allow him to run only 12 experiments. I asked for his objective. When I informed him that it would take more than 300 experiments to test his objective, he replied, “All right, I’ll run 20.”

Sigh. No, he needed either to redefine his objectives or *not run the experiment at all*.

I never saw him again.

Client B came to me with what he felt was a clearly defined objective. He thought he just needed a 10-minute consult for a design template recommendation. It actually took three consults with me totaling 2 1/2 hours because I asked questions similar to those required for planning the experiment I wrote about in my column from September 2016.

During the first two consults, Client B would often say, “Oh... I didn’t think of that. I’ll need to check it out.” He eventually ran the experiment, came to me with the data, and asked, “Could you have the analysis next week?” I asked him to sit down and was able to finish the analysis (including contour plots) in about 20 minutes.

**It’s all in the planning**

To review, if your objective is to establish effects, any good design needs answers to three questions *as part of the planning*:

• What risk are you willing to take in declaring an effect significant when it isn’t? (Usually 5%.)

• What is the threshold minimum difference you *must* detect to take the action you want?

• If this difference exists, how badly do you want to detect it?

If you don’t formally consider these questions, your design *will* answer them by default. As you’ve seen in the past few columns, this can result in some eye-popping sample sizes.

### Power: the probability of successfully detecting your desired difference, if it exists

For any previously calculated sample sizes, I assumed the usual desired significance level of 0.05 for testing the effects. Many people blindly use this as their only statistical criterion without formally asking the other questions. They naively run a design and t-test the results to declare them significant or not, but they have no idea what minimum effect their design was implicitly *designed* to detect.

So how did I obtain those sample sizes? (See the sample-size calculation below.) Once again using the hypothetical tar-scenario conversation from my column, “90 Percent of DOE [design of experiments] Is Half Planning,” in May 2016, I made an assumption I didn’t tell you about that answers the question, “How badly do I want to detect my desired effect?” If you want to detect a 1-percent difference, it will take 680 experiments to have a 90-percent chance of detecting it, if it exists. The more relaxed sample size of 500 gives you an 80-percent chance of detecting your desired difference. The same is true for the 170 and 130 experiments required, respectively, to detect a 2-percent difference.

This concept is called “power,” a design’s ability to detect your desired threshold difference if it exists. Or, as I heard a statistics professor say tongue-in-cheek to a Ph.D. student planning an overly ambitious experiment using an inappropriately small sample size, “Power is the probability you will get a thesis.” Things subsequently simplified quite a bit!

Let’s look at some designs people might use, naively expecting to detect a 1-percent difference in the tar scenario:

• Running four experiments would have a 5.5-percent chance of detecting the 1-percent difference if it existed

• Running eight experiments would have a 6.5-percent chance

• Running 16 experiments would have a 7.5-percent chance

As you see, one can work backward to obtain the power of a design. It’s not unusual for clients to be surprised (and disappointed) at the answers.
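Working backward to a design's power can be sketched in a few lines of Python. This is a rough normal approximation, not the calculation used for the figures above: it treats the effect as the difference between the averages of two halves of the runs and ignores the small-sample t-distribution. The function name `approx_power` is hypothetical, and the ratio of 0.25 (a 1-percent difference against a process standard deviation of about 4 percent) is an assumption back-calculated from the sample sizes quoted earlier, not a number stated in the column.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_total, ratio, alpha=0.05):
    """Approximate two-sided power of a two-level design.

    n_total: total number of experiments, half at each level of a factor.
    ratio:   (difference to detect) / (process standard deviation).
    Normal approximation only; exact t-based power will differ for small n_total.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # The effect's standard error is 2 * SD / sqrt(n_total),
    # so the standardized shift is ratio * sqrt(n_total) / 2.
    shift = ratio * sqrt(n_total) / 2
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Assumed ratio of 0.25 for the tar scenario (see the note above).
for n in (4, 8, 16, 500, 680):
    print(f"{n:>3} experiments: {100 * approx_power(n, 0.25):.1f}% power")
```

Under these assumptions the small designs land in roughly the same 5 to 8 percent range as the figures above, while 500 and 680 experiments come out near 80 and 90 percent. The values will not match exactly because of the approximation.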

When consulting with the client who initially could afford only 12 experiments but agreed to run 20, my sample size of 300 experiments resulted from his original objective of wanting to detect a minimum effect of approximately (0.325 x SDprocess) with 90-percent power. (SDprocess = standard deviation of process being studied, of which he had a historical estimate.)

With his proposed 20 experiments, he would be able to detect a minimum effect of approximately (1.6 x SDprocess); or, to turn things around, the power to detect his desired threshold difference of (0.325 x SDprocess) would be approximately 9.3 percent.

If you have a relatively good estimate of your process standard deviation (i.e., SDprocess):

• A 2 x 2 unreplicated factorial design can detect a ((2.8 to 3.3) x SDprocess) difference (80% and 90% power, respectively)

• A 2 x 2 x 2 unreplicated factorial can detect a ((2.0 to 2.3) x SDprocess) difference

• 16 experiments can detect a ((1.4 to 1.6) x SDprocess) difference

For those of you interested in the **sample size calculation**:

Let R = ratio of (desired effect / SDprocess):

For 80-percent power: N total = (5.6 / R)² [(5.6 divided by your ratio R), and this result is squared]

For 90-percent power: N total = (6.5 / R)²
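As a minimal sketch, the rule of thumb above (N total equals a constant divided by R, squared) and its inverse can be written in Python. The constants 5.6 and 6.5 appear to correspond to 2 × (z for two-sided 5-percent significance + z for the desired power), which is an inference from the numbers rather than something stated in this column. The function names `power_constant`, `total_runs`, and `detectable_ratio` are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def power_constant(power, alpha=0.05):
    """Assumed origin of the rule-of-thumb constants: 2 * (z_alpha/2 + z_power)."""
    z = NormalDist()
    return 2 * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power))


def total_runs(ratio, power=0.90, alpha=0.05):
    """Total experiments needed, where ratio = desired effect / SDprocess."""
    return ceil((power_constant(power, alpha) / ratio) ** 2)


def detectable_ratio(n_total, power=0.90, alpha=0.05):
    """Smallest effect, in units of SDprocess, that n_total experiments can detect."""
    return power_constant(power, alpha) / sqrt(n_total)


print(round(power_constant(0.80), 2), round(power_constant(0.90), 2))
print(total_runs(0.5, power=0.80), total_runs(0.5, power=0.90))
for n in (4, 8, 16):
    print(n, round(detectable_ratio(n, 0.80), 2), round(detectable_ratio(n, 0.90), 2))
```

Because the sketch uses unrounded constants, its results differ slightly from the rounded figures in the text. For example, the 90-percent constant comes out a little under 6.5.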

### Some pretty good DOE rules

From my experience with factorial designs:

• 16 experiments is a “pretty good” (and relatively affordable) number. I rarely ran 32.

• If you’re going to run 16 experiments, you may as well study five variables, if you can.

• If you have *only* three variables, think about a Box-Behnken design, if appropriate (it wouldn’t be appropriate for the factorial example in my column, “Two More Lurking Mess-Ups for *Any* Experiment, Designed or Not” in September 2016, because all three variables were “yes or no.”)

• If you have more than five variables and can afford only 16 experiments, consider this: If you have six variables, you may as well study eight. This strategy also allows you to screen out variables that don’t seem to be important, which often gets the reaction, “Wait a minute, I *know* that variable is important!” A variable may indeed be important, but in this case, all “insignificant” means is that it doesn’t exert an effect or interaction *in the specific range studied*, which was chosen for a reason (i.e., objective). Set the insignificant variable outside the studied range, and don’t be surprised at what happens. When variables are screened out, a design can usually be easily augmented to get the remaining variables’ interactions, and if a contour plot is desired, augmented even further to yield the quadratic equation to plot.

• Make important decisions regarding noncontinuous discrete variables before continuing to get a contour plot, e.g., catalyst A vs. catalyst B. Since the design in last month’s column was all discrete variables (e.g., yes or no: Did the patient keep a food diary?), this contour plot option isn’t applicable.

*Note:* This strategy is sequential and many times *builds upon experiments already run*. No data are wasted.

For more information, see the book by R. M. Moen, T. W. Nolan, and L. P. Provost, *Quality Improvement Through Planned Experimentation* (McGraw-Hill, 2012).

To summarize, the act of designing an experiment is composed of four parts:

1. Deciding what you need to find out or demonstrate

2. Estimating the amount of data required

3. Anticipating what the resulting data will be like

4. Anticipating what you will actually do with the finished data

Thanks for indulging me in this design of experiments (DOE) tangent of the last six columns. I hope you learned a few things, especially statistical trainers and belts. Back to improvement next time.

## Comments

## Reply to Steven Wachs's Comment

See below...

## Using Standard Deviation of Process to estimate sample size

I believe the relevant standard deviation to be concerned with (when calculating the number of replicates) is that due to experimental error. That is, the standard deviation among replicate measurements. This is often much less than the typical process variation we might see, assuming we are following good experimental practices (one operator, single lot of material, or using blocking and covariates to manage variation due to these nuisance sources in the experiment). Of course it's possible that experimental error could be more than normal process variation if going through the various setups results in unintended variation.

## Which variation to use?

Thanks for commenting, Steven. I see what you mean, BUT...what happens when you try to take your results from a tightly controlled experiment into the real world environment, i.e., multiple operators and lots of material that combine at random? That's the REALITY of implementation and cannot be controlled.

A more robust approach might be judicious blocking of these nuisance (random, but very real) factors. The resulting variation would be more than your approach but more realistic and not as bad as not leveraging the power to block them -- control charts managing the process would detect special causes among those factors with the appropriate variation "yard stick." This variation is also not as naively low as controlling factors that are realistically uncontrollable.

The result you get from such a design is good only for your specific designed conditions, i.e., you have the result for THIS specific operator for THIS specific lot (enumerative). How does that help you? What is your theory about putting this result in the real world and not a lab?

As W. Edwards Deming always asked, "What can you predict?" How robust is your result? And this gets into the question, How is variation going to manifest in your results (analytic), i.e., multiple operators and multiple lots and factors you didn't even envision affecting your result? Control charts are the ways to shed insight on these factors and increase your degree of belief in your study's validity.

Davis

## What does DOE stand for in this instance?

Nice article, but not sure what DOE stands for in this instance

I've found it's a good practice to include what the acronym means the first time it's used and then use the acronym later, especially for any professional document.

DOE does not appear to mean Department of Energy or Education... maybe "Depending on Experience"... but not real sure. Maybe something related to operations.

Thanks again for the nice article, but I spent too much time trying to figure out what the acronym might stand for.

Darrel

## Good catch