Fred Schenkelberg


Statistical Significance

Beware the type III error

Published: Monday, February 22, 2021 - 13:03

There is a type of error in statistical testing that comes from working very hard to correctly answer the wrong question. This error occurs when the experiment is being formulated.

Despite creating perfect null and alternative hypotheses, sometimes we are simply investigating the wrong question.

Example of a type III error

Let’s say we really want to select the best vendor for a critical component of our design. We define the best vendor as one whose solution or component is the most durable. OK, we can set up an experiment to determine which vendor provides a solution that is the most durable.

We set up and conduct a flawless hypothesis test to compare the two leading solutions. We can see very clear results. Vendor A’s solution is, statistically, significantly more durable than Vendor B’s solution.

Yet neither solution is durable enough. We should instead have been evaluating whether either solution could meet our reliability requirements.


Even if we perfectly answer a question in our work, if it’s not the right question, then the work is for naught.
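
The vendor example can be made concrete with a small sketch. Everything in it is an assumption for illustration: the lifetimes, the 5,000-hour requirement, and the choice of an exact permutation test as the "flawless" comparison. It shows the first question (is A better than B?) getting a clear answer while the second question (does either vendor meet the requirement?) gets a very different one.

```python
from itertools import combinations
from statistics import mean

# Hypothetical hours-to-failure for eight units per vendor (illustrative only).
vendor_a = [3100, 2950, 3230, 3010, 3180, 2890, 3060, 3140]
vendor_b = [2480, 2610, 2390, 2550, 2440, 2520, 2360, 2590]
required_hours = 5000  # assumed reliability requirement, not from the article


def permutation_p_value(a, b):
    """Exact two-sided permutation test on the difference in sample means."""
    pooled = a + b
    total = sum(pooled)
    n_a, n_b = len(a), len(b)
    observed = abs(mean(a) - mean(b))
    extreme = 0
    splits = 0
    # Enumerate every way to relabel n_a of the pooled units as "vendor A".
    for chosen in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in chosen)
        diff = sum_a / n_a - (total - sum_a) / n_b
        splits += 1
        # Small tolerance guards against floating-point ties.
        if abs(diff) >= observed - 1e-9:
            extreme += 1
    return extreme / splits


def share_meeting_requirement(lifetimes, requirement):
    """Fraction of tested units that survived at least `requirement` hours."""
    return sum(t >= requirement for t in lifetimes) / len(lifetimes)


# The question we asked: is vendor A more durable than vendor B?
p_value = permutation_p_value(vendor_a, vendor_b)
a_beats_b = mean(vendor_a) > mean(vendor_b) and p_value < 0.05

# The question we should have asked: does either vendor meet the requirement?
share_a = share_meeting_requirement(vendor_a, required_hours)
share_b = share_meeting_requirement(vendor_b, required_hours)

print(f"A vs. B: p = {p_value:.5f}, A significantly better: {a_beats_b}")
print(f"Share of units reaching {required_hours} h: A = {share_a:.0%}, B = {share_b:.0%}")
```

With these made-up numbers, the comparison is statistically clear-cut, yet no unit from either vendor reaches the required life. A real reliability assessment would likely use a life-data model rather than a simple share of survivors; the point here is only that the two questions are different.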

Short history of type III errors

Jerzy Neyman and Egon Pearson referred to what we now call type I and type II errors as “error of the first kind” and “errors of the second kind,” respectively. This led others to consider other types of errors, naming them “errors of the third kind,” and so forth.

In a paper published in 1947, Florence N. David, an occasional colleague of Neyman and Pearson, suggested that she might need to extend Neyman and Pearson’s sources of error to a third source: possibly “choosing the test falsely to suit the significance of the sample.”

Frederick Mosteller, in 1948, defined type III error as “correctly rejecting the null hypothesis for the wrong reason.”

Extending Mosteller’s definition, Henry Kaiser in 1966 defined such a type III error as coming to an “incorrect decision of direction following a rejected two-tailed test of hypothesis.”

Allyn Kimball, in 1957, suggested a definition close to how I consider a type III error, as “the error committed by giving the right answer to the wrong problem.”

And so on.... There is no one widely accepted definition for an error of the third kind or for type III errors. Yet for any of the above definitions, the error is one to guard against by careful consideration when designing, conducting, and analyzing statistical tests.

Situations that lead to type III errors

An obvious situation, in hindsight, is the experimenter solving the wrong problem or asking the wrong question. One cause is simply lacking the information needed to recognize the error. Another is focusing on the first or most interesting question to come up.

Another set of situations involves a deliberate or unconscious effort to connect the experimental results to an expected outcome. This sometimes takes the form of reinterpreting the results when they don’t agree with the desired outcome.

Another set includes the process of just doing what we always have done. In this case, the experimenter may not even have a connection between the experiment and a suitable hypothesis that would enable analysis. We can do a test or experiment perfectly well, yet it has no meaningful result or influence on any future work.

Other situations exist. If you spot one or more that I missed, please add your thoughts in the comment section below.

First published on the Accendo Reliability blog.


About The Author


Fred Schenkelberg

Fred Schenkelberg is an experienced reliability engineering and management consultant with his firm FMS Reliability. His passion is working with teams to create cost-effective reliability programs that solve problems, create durable and reliable products, increase customer satisfaction, and reduce warranty costs. Schenkelberg is developing the site Accendo Reliability, which provides you access to materials that focus on improving your ability to be an effective and influential reliability professional.


Type III and Higher Errors

In my personal experience, many type III errors (my version of type III) originate in observational studies rather than designed experiments. Modern data analytics often runs computationally massive observational fishing expeditions for statistical significance but does not account for the total number of implicit hypotheses being tested when making the significance judgment, the implications of the data structure, or the implications of failing to represent, even approximately, an appropriate distributional error model. Designed experiments choose the focus of the study and give some structure to the data collection, which cuts down on spurious results or correlations. Enjoyable discussion, thanks for publishing it.
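
The point about implicit hypotheses can be sketched with a small simulation. The setup is assumed for illustration: 1,000 comparisons in which no real effect exists, 50 observations per group, and a known standard deviation of 1. The Bonferroni adjustment is used here only as one simple way to account for the number of tests; the comment above does not name a specific method.

```python
import random
from statistics import NormalDist

rng = random.Random(0)   # fixed seed so the run is repeatable
n_hypotheses = 1000      # assumed number of comparisons in the "fishing expedition"
n_per_group = 50         # assumed observations per group
alpha = 0.05
std_normal = NormalDist()


def null_p_value():
    """Two-sided z-test p-value for two groups drawn from the same N(0, 1)."""
    group_1 = [rng.gauss(0, 1) for _ in range(n_per_group)]
    group_2 = [rng.gauss(0, 1) for _ in range(n_per_group)]
    mean_diff = sum(group_1) / n_per_group - sum(group_2) / n_per_group
    standard_error = (2 / n_per_group) ** 0.5  # known sigma = 1 in both groups
    z = mean_diff / standard_error
    return 2 * (1 - std_normal.cdf(abs(z)))


# Every null hypothesis is true by construction, so every "hit" is spurious.
p_values = [null_p_value() for _ in range(n_hypotheses)]

naive_hits = sum(p < alpha for p in p_values)
bonferroni_hits = sum(p < alpha / n_hypotheses for p in p_values)

# Chance of at least one false positive if the tests are independent.
family_wise_error = 1 - (1 - alpha) ** n_hypotheses

print(f"Spurious 'significant' results at alpha = {alpha}: {naive_hits}")
print(f"Remaining after Bonferroni adjustment: {bonferroni_hits}")
print(f"Family-wise error rate without adjustment: {family_wise_error:.4f}")
```

Roughly 5 percent of the unadjusted tests should come out "significant" even though nothing is there, which is the kind of spurious result a designed experiment with a focused question helps avoid.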

The Right Question

The quote from Einstein comes to mind: "If I had an hour to solve a problem and my life depended on it, I would use the first 55 minutes determining the proper question to ask, for once I know the proper question, I could solve the problem in less than five minutes."

Asking questions is difficult, especially inquiry-based questions that produce useful predictions (theory). This is a great challenge in the use of PDSA. When I saw the word "significance," Deming's foreword to Quality Improvement Through Planned Experimentation came to mind:

This book by Ronald D. Moen, Thomas W. Nolan, and Lloyd Provost breaks new ground in the problem of prediction based on data from comparisons of two or more methods or treatments, tests of materials, and experiments.

Why does anyone make a comparison of two methods, two treatments, two processes, or two materials? Why does anyone carry out a test or an experiment? The answer is to predict—to predict whether one of the methods or materials tested will in the future, under a specified range of conditions, perform better than the other one.

Prediction is the problem, whether we are talking about applied science, research and development, engineering, or management in industry, education, or government. The question is, What do the data tell us? How do they help us to predict?

Unfortunately, the statistical methods in textbooks and in the classroom do not tell the student that the problem in the use of data is prediction. What the student learns is how to calculate a variety of tests (t-test, F-test, chi-square, goodness of fit, etc.) in order to announce that the difference between the two methods or treatments is either significant or not significant. Unfortunately, such calculations are a mere formality. Significance or the lack of it provides no degree of belief—high, moderate, or low—about prediction of performance in the future, which is the only reason to carry out the comparison, test, or experiment in the first place.

Any symmetric function of a set of numbers almost always throws away a large portion of the information in the data. Thus, interchange of any two numbers in the calculation of the mean of a set of numbers, their variance, or their fourth moment does not change the mean, variance, or fourth moment. A statistical test is a symmetric function of the data.

In contrast, interchange of two points in a plot of points may make a big difference in the message that the data are trying to convey for prediction.

The plot of points conserves the information derived from the comparison or experiment. It is for this reason that the methods taught in this book are a major contribution to statistical methods as an aid to engineers, as well as to those in industry, education, or government who are trying to understand the meaning of figures derived from comparisons or experiments. The authors are to be commended for their contributions to statistical methods.

W. Edwards Deming

Washington, July 14, 1990