Don’t We Need to Remove the Outliers?

Characterization and estimation are different

Much of modern statistics is concerned with creating models which contain parameters that need to be estimated. In many cases these estimates can be severely affected by unusual or extreme values in the data. For this reason students are often taught to polish up the data by removing the outliers. Last month we looked at a popular test for outliers. In this column we shall look at the difference between estimating parameters and characterizing process behavior.

Estimation

To illustrate how polishing the data can improve our estimates, we will use the data in figure 1. These values are 100 determinations of the weight of a 10-gram chrome steel standard known as NB10. These values were obtained once each week at the Bureau of Standards, by one of two individuals, using the same instrument each time. The weights were recorded to the nearest microgram. Because each value has the form of 9,999,xxx micrograms, the four nines at the start of each value are not shown in the table; only the last three digits, in the xxx positions, are recorded. The values are in time order by column.

…


Comments

An alternative path that might be as bad as deleting the outlier

Dear readers,

For the 100 values in this article, a method I have seen described as a robust GLOBAL SD gives a value of 4.77 with no data deleted. (My point is not to dwell on the formula but to focus on the logic, or prevalence, of using methods of computing an SD that pay no attention to the time order of the data.)

Here is my question: In other organisations/industries, is it common, when dealing with process data, to find summary statistics of dispersion computed using such GLOBAL measures that are described as robust?

As already mentioned, by being GLOBAL, such measures pay no attention to the observational order of the data (unlike average/median measures of dispersion used for the process behaviour chart).
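To make the distinction concrete, here is a minimal Python sketch. It is not the formula behind the 4.77 figure, which is not given above; the MAD-based estimate stands in as one common example of a robust global measure, and the ten values are invented for illustration (they are not the NB10 data). The scaling constants are the usual ones: 1.4826 for the MAD, 1.128 for the average moving range and 0.954 for the median moving range.

```python
import statistics

def global_sd(values):
    # Ordinary sample standard deviation: ignores time order.
    return statistics.stdev(values)

def mad_sigma(values):
    # Assumed stand-in for a "robust global SD": scaled median absolute
    # deviation. It also ignores time order.
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return 1.4826 * mad

def average_mr_sigma(values):
    # Average moving range of successive values divided by 1.128,
    # as used for an individuals (XmR) process behaviour chart.
    mrs = [abs(b - a) for a, b in zip(values, values[1:])]
    return statistics.mean(mrs) / 1.128

def median_mr_sigma(values):
    # Median moving range of successive values divided by 0.954.
    mrs = [abs(b - a) for a, b in zip(values, values[1:])]
    return statistics.median(mrs) / 0.954

# Hypothetical values in time order (illustration only).
data = [402, 405, 398, 401, 404, 399, 403, 400, 397, 406]
reordered = sorted(data)  # same values, different observational order

for name, f in [("global SD", global_sd), ("MAD sigma", mad_sigma),
                ("average MR sigma", average_mr_sigma),
                ("median MR sigma", median_mr_sigma)]:
    print(f"{name:18s} original={f(data):.3f} reordered={f(reordered):.3f}")
```

Reordering the same ten values leaves both global measures unchanged, while the two moving-range measures change, because only they are built from successive differences.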

What is interesting here is that:

- using the so-called robust (global) SD, the value of 4.77 takes us closer to the "good model", and this computational method giving us 4.77 is demonstrably less affected by the outliers (so one could proceed mathematically without deleting them)

- but, and probably most importantly, since we haven't addressed "why" the outliers occurred, we leave ourselves open to suffering the consequences of their effect in the future (e.g. rework, scrap, unhappy customers...)

So, I propose that, with process data, a robust GLOBAL SD is just another way of going down the wrong path. This path may help us to get closer to a "good model" (without deleting the outliers) but, by failing to address and deal with the cause/s behind the outliers, we leave ourselves open to a greater risk of less consistent quality in the future which could end up being scrap or rework ...

Scott.

Fantastic as always ... not

Fantastic as always ... not that most folk will pay any attention ...