Donald J. Wheeler  |  05/03/2009

All Outliers Are Evidence

Removing the extreme values is always a serious mistake.

Many have been taught that they must remove outliers prior to analysis. This is because much of modern statistics is concerned with creating a mathematical model for the data. Because all these models are created using algorithms, they tend to be severely affected by any unusual or extreme values.

Therefore, to use these mathematical techniques to obtain useful and appropriate models, it’s often necessary to polish up the data by removing the outliers. However, the act of building a model implicitly assumes that the data are homogeneous enough to justify the use of a model.

For example, the histogram in figure 1 has a bell-shaped curve superimposed. This curve is based on the average and standard deviation statistic for all 100 values in the histogram. It’s neither wide enough nor tall enough to provide a good fit to the data. The histogram in figure 2 contains the 93 values left after the seven extreme values (the four lowest and three highest) were deleted. Now the curve based on the average and the standard deviation statistic does a much better job of fitting the data. Thus, it’s true that outliers can undermine our efforts to create a model for our data.
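
The effect described above is easy to reproduce. The sketch below is only an illustration: the data are hypothetical (they are not the 100 values behind figure 1), and the helper names `mean_and_sd` and `trim_extremes` are invented for this example. It computes the average and the standard deviation statistic before and after the extreme values are deleted.

```python
import statistics


def mean_and_sd(values):
    """Return the average and the sample standard deviation statistic."""
    return statistics.mean(values), statistics.stdev(values)


def trim_extremes(values, n_low, n_high):
    """Drop the n_low smallest and n_high largest values."""
    ordered = sorted(values)
    return ordered[n_low:len(ordered) - n_high]


# Hypothetical data, NOT the values plotted in figure 1.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 2, 3, 19, 20]

full_mean, full_sd = mean_and_sd(data)
trimmed = trim_extremes(data, n_low=2, n_high=2)
trim_mean, trim_sd = mean_and_sd(trimmed)

print(f"all values:     n={len(data)}, mean={full_mean:.2f}, sd={full_sd:.2f}")
print(f"after trimming: n={len(trimmed)}, mean={trim_mean:.2f}, sd={trim_sd:.2f}")
```

The standard deviation statistic shrinks once the extremes are gone, so a bell-shaped curve built from it hugs the remaining values more closely. The fit improves only because the awkward values are no longer there to contradict it.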

But what about the seven values we simply deleted to obtain this better fit between our assumed model and our revised data set? What were these seven values trying to tell us about the underlying process that generated these data?

The whole operation of deleting outliers to obtain a better fit between our model and the data is based on computations that implicitly assume the data are homogeneous. When you have outliers, this assumption becomes questionable.

The X chart in figure 3 shows clear evidence of six upsets or changes in the underlying process. The seven “outliers” from figure 1 are part of these signals. The outliers that we dismissed in figure 2 are signals that the values are not homogeneous, and that the models fitted in both figures 1 and 2 are wrong. From the perspective of data analysis, the outliers are the most important values in the data set. We must understand these values rather than dismiss them. Finding a model is premature. There isn’t one underlying process here, but many.

“Are these data homogeneous?” must be the first question of any analysis. Process behavior charts provide the easiest way to address this question. Hence, any analysis that doesn’t begin by organizing the data in some rational manner and placing them on a process behavior chart is inherently flawed.
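
As a rough illustration of how such a chart asks the homogeneity question, here is a minimal sketch of the limits for an X chart (the individuals half of an XmR chart). The column itself gives no formulas or data, so everything here is an assumption: the limits use the conventional scaling factor of 2.66 times the average moving range, the function names `xmr_limits` and `signals` are invented for this example, and the values are hypothetical. Only points outside the limits are flagged; other detection rules are not applied.

```python
def xmr_limits(values):
    """Return (lower limit, central line, upper limit) for an X chart.

    Limits are the average plus or minus 2.66 times the average moving
    range, where moving ranges are the absolute differences between
    successive values in time order.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to form a moving range")
    center = sum(values) / len(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    avg_mr = sum(moving_ranges) / len(moving_ranges)
    return center - 2.66 * avg_mr, center, center + 2.66 * avg_mr


def signals(values):
    """Return (index, value) pairs that fall outside the limits."""
    lower, _, upper = xmr_limits(values)
    return [(i, x) for i, x in enumerate(values) if x < lower or x > upper]


# Hypothetical values in time order, NOT the data behind figure 3.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 9, 11, 10, 19, 10, 11, 9]

lower, center, upper = xmr_limits(data)
print(f"limits: {lower:.2f} to {upper:.2f}, central line {center:.2f}")
print("points outside the limits:", signals(data))
```

Because the limits come from successive differences rather than from one global standard deviation, the values must stay in their original time order. A point outside the limits is a prompt to find out what changed in the process, not a candidate for deletion.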

About The Author

Donald J. Wheeler

Dr. Donald J. Wheeler is a Fellow of both the American Statistical Association and the American Society for Quality, and is the recipient of the 2010 Deming Medal. As the author of 25 books and hundreds of articles, he is one of the leading authorities on statistical process control and applied data analysis. Find out more about Dr. Wheeler’s books at www.spcpress.com.

Dr. Wheeler welcomes your questions. You can contact him at djwheeler@spcpress.com. 

Comments

Outliers!!!

Observations (or recorded data) are the only truth available to a researcher. They need to be treated with respect, for they are real. Models are a different species altogether. A researcher does not create or generate data to fit a model; rather, the data represent the phenomenon under study as affected by the treatments and by all other influencing factors, controlled or uncontrolled. If experimental data do not fit a model generally considered appropriate for the phenomenon, do not panic! The experimenter might be at the threshold of discovering something new, something that was probably not found earlier, so it is important. Or, simply, the 'general model' was a result of 'censored data'. One experiments not to prove or disprove a model, but to study honestly a phenomenon the researcher has chosen as significant. Do not worry if one is not able to give cogent explanations for the observations. Just state that. This will be much better than 'just censoring' the data.

The writeup by Donald Wheeler needs more material/explanation/further treatment. I do not know why he stopped at this brief explanation.

Nagin Chand
Scientist, CSIR and
former Adviser, AICTE

Mistake? What mistake?

Yeah. We planned it that way. You see, they should be in a series... and uhhh... the outlier isn't in the series... so... it's a series mistake. See? Yeah, that's it.

Serious Series

or, as the hardcopy magazine says on page 18, "Removing the extreme values is always a series mistake."