By Donald J. Wheeler

Many have been taught that they must remove outliers prior to analysis. This is because much of modern statistics is concerned with creating a mathematical model for the data. Because all these models are created using algorithms, they tend to be severely affected by any unusual or extreme values.

Therefore, to use these mathematical techniques to obtain useful and appropriate models, it’s often necessary to polish up the data by removing the outliers. However, the act of building a model implicitly assumes that the data are homogeneous enough to justify the use of a model.

For example, the histogram in figure 1 has a bell-shaped curve superimposed. This curve is based on the average and standard deviation statistic for all 100 values in the histogram. It’s neither wide enough nor tall enough to provide a good fit to the data. The histogram in figure 2 contains the 93 values left after the seven extreme values (the four lowest and three highest) were deleted. Now the curve based on the average and the standard deviation statistic does a much better job of fitting the data. Thus, it’s true that outliers can undermine our efforts to create a model for our data.
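The effect described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 100 values below are synthetic (93 draws from a single normal distribution plus seven hand-picked extreme values), not the data behind figures 1 and 2, and the helper names `trim_extremes` and `summarize` are hypothetical.

```python
import random
import statistics


def trim_extremes(values, n_low, n_high):
    """Return the sorted values with the n_low smallest and n_high largest removed."""
    ordered = sorted(values)
    return ordered[n_low:len(ordered) - n_high]


def summarize(values):
    """Return the average and the standard deviation statistic for the values."""
    return statistics.mean(values), statistics.stdev(values)


# Synthetic stand-in data (an assumption, not the article's data set):
# 93 values from one process plus 4 low and 3 high extreme values.
random.seed(0)
core = [random.gauss(50, 2) for _ in range(93)]
extremes = [30, 32, 33, 35, 66, 68, 70]
data = core + extremes

# Delete the four lowest and three highest values, as in the text.
trimmed = trim_extremes(data, n_low=4, n_high=3)

full_mean, full_sd = summarize(data)
trim_mean, trim_sd = summarize(trimmed)

print(f"all {len(data)} values: average={full_mean:.2f}, std dev={full_sd:.2f}")
print(f"trimmed {len(trimmed)} values: average={trim_mean:.2f}, std dev={trim_sd:.2f}")
```

With data of this kind, the standard deviation statistic computed from all 100 values is inflated by the seven extremes, so a normal curve built from it describes the bulk of the histogram poorly. The curve built from the 93 remaining values tracks them much more closely.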
