## Reverse-Engineer Your Experiments

### Be deliberate about the sample size you use

Published: Thursday, December 6, 2018 - 12:03

Everybody wants to design and conduct a great experiment, to find enlightenment in the discovery of the big red X and perhaps a few smaller pink x’s along the way. Thoughtful selection of the best experiment factors, the right levels, the most efficient design, the best plan for randomization, and creative ways to quantify the response variable consume our thoughts and imagination. The list of considerations and trade-offs is quite impressive. Then, finally, after optimizing all these considerations, successfully running the experiment, and performing the analysis, there is the question of “statistical significance.” Can we claim victory and success?

The answer depends in part on the critical value provided by a table of critical values or by a computer program. If our calculated test statistic exceeds the critical value, we reject the null hypothesis and claim there is a difference among the treatment averages. If it does not, we fail to reject the null hypothesis. This is the moment of truth at the end of all our hard work, a moment of anticipation and excitement.

What? This sounds too crazy to be true. After all this planning, engineering judgment, tender care, and procedural exactness, our experiment relies upon the magic of a computer program that, at the last minute, “spits” out a critical value as a surprise measuring stick during the final analysis? There must be a better approach!

Thankfully, there is a better approach. But before moving forward with the discussion, it is useful to remind ourselves of some of the fundamentals. Two notional relationships are illustrated in figure 1. Within the figure, experiment 1 shows a large difference in treatment averages, while experiment 2 shows a more subtle difference. The difference between treatment averages in experiment 1 is easily detected with small sample sizes, but the difference in experiment 2 is not. Larger sample sizes are required to detect small differences in treatment averages.

But let’s not get too carried away. “How large” should the sample size be?

### Tests to compare treatment means (factorial experiment)

Suppose we are designing a factorial experiment with replication of one treatment combination, or perhaps a center point to form a central composite design. It is no secret that the analysis will require the calculation of an F-statistic with one degree of freedom in the numerator and n-1 degrees of freedom in the denominator, where n is the number of replicates. Figure 2 illustrates critical values of F with α = 0.05, numerator df = 1, and denominator df = n-1. It is the shape of this function that is most revealing. This function is not a simple first-degree polynomial, and this is good news for the experimenter. The function converges on an F-critical of 3.84 as the denominator degrees of freedom approach infinity, but most of the convergence occurs before the 10th degree of freedom. Not shown in figure 2 is the critical value of F with one denominator degree of freedom, because its value of 161.45 would distort the graph. In practical terms, the experimenter can set the height of the bar he wants to jump over during the design of an experiment. How large would you like F-critical to be?

There is actually a “sweet spot” to be found. We can eliminate 1, 2, or 3 degrees of freedom because of their relatively high critical values, which will detect only large differences in treatment averages. We can also eliminate 7, 8, 9, and more degrees of freedom, because these make for a more expensive experiment with very little reduction in the bar we must jump over. It seems that the sweet spot is a fairly narrow window of 4 ≤ df ≤ 6 (5 ≤ n ≤ 7). Of course, there may be other practical considerations when choosing the final number of replicates, but the experimenter should plant the critical value firmly in her mind before collecting any data.
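
The critical values discussed in this section can be approximated with a short script. The sketch below is an illustration under stated assumptions, not the method used to draw figure 2: it uses only the Python standard library, relies on the identity that an F-statistic with one numerator degree of freedom is the square of a t-statistic, and finds the t critical value by numerical integration and bisection. The function names `t_crit_two_sided` and `f_crit` are hypothetical; a statistics package would normally supply these quantiles directly.

```python
import math

def t_crit_two_sided(df, alpha=0.05, steps=400):
    """Two-sided critical value of Student's t, found numerically.

    With t = sqrt(df) * tan(phi), P(|T| <= t) becomes
    2 * k * integral from 0 to phi of cos(u)**(df - 1) du,
    a smooth integral on a bounded interval (Simpson's rule below).
    """
    k = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(math.pi)

    def central_prob(phi):
        h = phi / steps
        total = 1.0 + math.cos(phi) ** (df - 1)  # the two endpoints
        for i in range(1, steps):
            total += (4 if i % 2 else 2) * math.cos(i * h) ** (df - 1)
        return 2 * k * total * h / 3

    lo, hi = 0.0, math.pi / 2
    for _ in range(60):  # bisection on the angle phi
        mid = (lo + hi) / 2
        if central_prob(mid) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return math.sqrt(df) * math.tan((lo + hi) / 2)

def f_crit(denominator_df, alpha=0.05):
    """Critical F for one numerator df: F(1, v) is the square of t(v)."""
    return t_crit_two_sided(denominator_df, alpha) ** 2

if __name__ == "__main__":
    previous = None
    for df in range(1, 13):
        value = f_crit(df)
        drop = "" if previous is None else f"  (drop of {previous - value:.2f})"
        print(f"df = {df:2d}  F-critical = {value:7.2f}{drop}")
        previous = value
```

Run as a script, this prints the critical value for each denominator degree of freedom together with the reduction gained by adding one more. On this basis the critical value falls from about 10.13 at 3 degrees of freedom to about 7.71 at 4, but only from about 5.99 to about 5.59 between 6 and 7.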

The experimenter should note that a one-way analysis of variance for k treatments will follow roughly the same function as the one illustrated in figure 2.

### Tests for paired observations on the same specimen

Figure 3 illustrates the critical values for the paired t-test, that is, for paired observations on the same specimen. In figure 3, the actual critical value has been divided by √n to yield a “critical difference” for detection. For instance, the experimenter may suspect his parts are growing in heat treatment, and therefore has determined that “before and after” measurements should be taken on each part. But how many units? Should the experimenter grab a handful, perhaps three or four units? Or perhaps the experimenter should select 30 units for good measure.

With the help of figure 3, the experimenter can decide in advance how large a difference needs to be detected. The y-axis is therefore expressed in σ units. A 1σ growth may be detectable with a sample size of n = 7 units, and a growth of 0.6σ may require 13 units. The experimenter should not leave this important decision to a roll of the dice. Rather, she should be deliberate and decide in advance what the sample size needs to be to accomplish the task at hand.

Certainly the cost of sample sizes larger than n = 10 must be weighed against the incremental advantage of a lower critical value.
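
The critical difference for the paired t-test can be tabulated in the same spirit. The sketch below assumes a two-sided test at α = 0.05 and expresses the result in units of the standard deviation of the paired differences. Both are assumptions, chosen because they come close to the values quoted in this section. The function names are hypothetical, and the quantile is again found numerically with the standard library rather than taken from a statistics package.

```python
import math

def t_crit_two_sided(df, alpha=0.05, steps=400):
    """Two-sided critical value of Student's t (Simpson's rule plus bisection).

    Uses t = sqrt(df) * tan(phi), so that P(|T| <= t) equals
    2 * k * integral from 0 to phi of cos(u)**(df - 1) du.
    """
    k = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(math.pi)

    def central_prob(phi):
        h = phi / steps
        total = 1.0 + math.cos(phi) ** (df - 1)
        for i in range(1, steps):
            total += (4 if i % 2 else 2) * math.cos(i * h) ** (df - 1)
        return 2 * k * total * h / 3

    lo, hi = 0.0, math.pi / 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if central_prob(mid) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return math.sqrt(df) * math.tan((lo + hi) / 2)

def critical_difference(n, alpha=0.05):
    """Smallest mean paired difference, in standard deviations of the
    differences, that reaches significance with n pairs (assumed two-sided)."""
    return t_crit_two_sided(n - 1, alpha) / math.sqrt(n)

if __name__ == "__main__":
    for n in (3, 4, 5, 7, 10, 13, 20, 30):
        print(f"n = {n:2d}  critical difference = {critical_difference(n):.2f} sigma")
```

Under these assumptions, seven pairs give a critical difference of about 0.92σ and 13 pairs give about 0.60σ, and the gain from each additional pair shrinks quickly beyond n = 10.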

### Tests to compare sample variance to a known standard (lower tail)

The graph in figure 4 is taken from a left-tailed χ² test and has been modified as shown. Of course, this function converges on 1.00 as df approaches infinity. Again, note the rapid convergence; after df = 10, the slope levels out and becomes very flat. Such a function can be used directly in capability studies to answer the question: What is the minimum Cp value that must be calculated from a sample in order to be 90-percent sure the true process capability is at least 1.0? Perhaps in this case we can conclude that there is little statistical justification for sample sizes greater than n = 15, but once again, there are other practical considerations, such as the longer-term stability of the process.
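
The capability question can be worked through numerically. The sketch below is one possible reading, not necessarily the exact transformation behind figure 4: it assumes normally distributed data and uses the standard one-sided lower confidence bound for Cp, under which the observed Cp must be at least √(df / χ²) for the lower-tail χ² quantile at 1 minus the confidence level. The function names are hypothetical, and the χ² quantile is computed with the standard library (series for the CDF, then bisection).

```python
import math

def chi2_cdf(x, df):
    """Chi-square CDF via the power series for the incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = df / 2, x / 2
    term = total = 1 / a
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi2_quantile(p, df):
    """Value x at which chi2_cdf(x, df) reaches p, found by bisection."""
    lo, hi = 0.0, df + 10 * math.sqrt(2 * df) + 20
    for _ in range(80):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def min_observed_cp(n, confidence=0.90, target_cp=1.0):
    """Observed Cp needed so the one-sided lower bound on Cp reaches target_cp."""
    df = n - 1
    return target_cp * math.sqrt(df / chi2_quantile(1 - confidence, df))

if __name__ == "__main__":
    for n in (5, 10, 15, 20, 30, 50):
        print(f"n = {n:2d}  minimum observed Cp = {min_observed_cp(n):.2f}")
```

Under these assumptions, a sample of 15 needs an observed Cp of about 1.34 to support a true Cp of at least 1.0 with 90-percent confidence, and doubling the sample to 30 lowers the requirement only to about 1.21.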

### Tests to compare sample variance to a known standard (upper tail)

The graph in figure 5 is taken from a right-tailed χ² test and has been modified by dividing the actual χ² critical values by (n-1), which then yields a critical variance ratio s²/σ². Of course, this function converges on 1.00 as df approaches infinity. Note the rapid convergence; after df = 10, the slope levels out and becomes very flat. Again, the experimenter must decide the critical value before data collection.

From a practical perspective, this function could be used to determine how many points to scan when measuring a hole with a coordinate measuring machine (CMM). Scanning and collecting eight points (7 degrees of freedom) provides a critical variance ratio of 2.00. On many CMMs, the fit of scanned points is indicated using the standard deviation, so a critical value of 1.41 (the square root of 2.00) could be programmed into the measuring routine. If it is known historically that “good” measurements have a fit parameter of s = 0.0001 when collecting eight points during the scan, then an “if” command can be inserted into the program to remeasure if s > 0.00014. The shape of the function illustrates that collecting more than, say, 10 points during a scan has little statistical benefit.
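
The remeasure rule described above can be sketched as follows. This is a minimal illustration assuming α = 0.05 (an assumption that comes close to the quoted ratio for eight points) and a historical fit value supplied by the user. The function names `critical_variance_ratio` and `remeasure_threshold` are hypothetical and do not correspond to any CMM software command.

```python
import math

def chi2_cdf(x, df):
    """Chi-square CDF via the power series for the incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = df / 2, x / 2
    term = total = 1 / a
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi2_quantile(p, df):
    """Value x at which chi2_cdf(x, df) reaches p, found by bisection."""
    lo, hi = 0.0, df + 10 * math.sqrt(2 * df) + 20
    for _ in range(80):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def critical_variance_ratio(points, alpha=0.05):
    """Right-tail chi-square critical value divided by its degrees of freedom."""
    df = points - 1
    return chi2_quantile(1 - alpha, df) / df

def remeasure_threshold(baseline_s, points, alpha=0.05):
    """Largest acceptable fit standard deviation before a remeasure is triggered."""
    return baseline_s * math.sqrt(critical_variance_ratio(points, alpha))

if __name__ == "__main__":
    for points in (4, 6, 8, 10, 15, 20):
        print(f"points = {points:2d}  variance ratio = {critical_variance_ratio(points):.2f}")
    print(f"remeasure if s > {remeasure_threshold(0.0001, 8):.6f}")
```

With eight points and a historical fit of 0.0001, this gives a variance ratio of about 2.01 and a threshold of about 0.000142, in line with the rounded 2.00 and 0.00014 used in the text.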

### Summary

Let’s be deliberate about the sample size for our experiments, keeping in mind the critical value that will be used in the analysis. The relationship between critical value and sample size is never a simple first-degree polynomial. A decision of three vs. four replicates in a factorial experiment or ANOVA can make all the difference in the experiment’s ability to detect differences.

On the other hand, there may be no advantage in obtaining eight replicates. Design for success, and choose the critical value first. Don’t wait for the software to surprise you and indicate that you have failed to reject the null hypothesis. This could simply mean the critical value is too high because you had too few replicates.

It is the novice who collects some data and then performs a statistical calculation. A good lawyer never asks a question she doesn’t already know the answer to.