© 2023 Quality Digest. Copyright on content held by Quality Digest or by individual authors. Contact Quality Digest for reprint information.

“Quality Digest” is a trademark owned by Quality Circle Institute, Inc.

Published on *Quality Digest* (https://www.qualitydigest.com)

**Published:** 04/23/2015

Welcome to baseball season! I always do a baseball-themed article around this time, and I found my topic after recently stumbling on this article [1]: How accurate are umpires when calling balls and strikes?

From what I understand, since 2008, home plate umpires have been electronically monitored every game and given immediate feedback on their accuracy—i.e., the number of actual balls they called as strikes, and vice versa.

Using the aggregated data from the 2008–2013 seasons, the author observed that wrong calls were made 15 percent of the time (average of both rates combined), which, according to him, “is just too high.” He provided a table of umpires whose inaccuracy rate was 15 percent or higher: 38 umpires out of approximately 80. (Tsk, tsk... that’s close to half of them.)

He also listed the top 10 most accurate umpires. Oops—I mean the 10 umpires who happened to have the lowest rates of wrong calls.

I was curious and found the data source. This site was *unbelievable*: you can slice and dice the data *any* way you want. I obtained the 2014 data for all umpires. It was displayed as the two figures below using a common presentation, with left-axis data as a bar graph and right-axis data as a line graph. Individual umpires are on the horizontal axis:

By hovering my cursor over each “dot,” I obtained and entered the data on each umpire, converted the graphs above to p-charts, and added a third chart for the combined rates:

According to the first chart, umpires 7, 9, 11, 12, 14, 28, 75, and 82 had above-average mistake rates, and umpires 35, 38, 52, 71, and 78 had below-average mistake rates.

According to the second chart, umpires 5, 26, 55, and 85 had above-average mistake rates, and umpires 9, 28, 30, and 72 had below-average mistake rates. Note that no umpire was either good at both or poor at both.

I was curious about the overall wrong call rate, so I combined them:

In this chart, umpires 11, 12, 14, 33, 55, and 82 had above-average mistake rates (umpires 11, 12, and 14 appeared previously, but not in both), and umpires 15, 35, 52, 71, and 88 had below-average mistake rates (umpires 35, 52, and 71 appeared previously, but not in both).

Of course there are the 10 lowest rates, but there is truly only a “top five” in accuracy (umpires 15, 35, 52, 71, and 88).
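For readers who want to try this on their own data, here is a minimal sketch of how a p-chart separates special causes from common cause. It assumes the textbook three-sigma binomial limits around the pooled rate, which may differ in detail from the charts described above. The function name `p_chart` and all of the counts are hypothetical, made up for illustration; they are not the 2014 umpire data.

```python
import math

def p_chart(wrong_calls, total_calls, sigma=3.0):
    """Classify each umpire against binomial control limits around the pooled rate."""
    # Center line: the pooled wrong-call rate over all umpires.
    p_bar = sum(wrong_calls) / sum(total_calls)
    results = []
    for wrong, total in zip(wrong_calls, total_calls):
        # Limits depend on each umpire's own denominator (number of calls).
        half_width = sigma * math.sqrt(p_bar * (1 - p_bar) / total)
        lower = max(0.0, p_bar - half_width)
        upper = min(1.0, p_bar + half_width)
        rate = wrong / total
        if rate > upper:
            verdict = "above average"
        elif rate < lower:
            verdict = "below average"
        else:
            verdict = "indistinguishable from average"
        results.append((rate, lower, upper, verdict))
    return p_bar, results

# Hypothetical counts for five umpires (NOT the 2014 data).
wrong_calls = [450, 480, 620, 470, 330]
total_calls = [3000, 3100, 3050, 2900, 3000]

p_bar, results = p_chart(wrong_calls, total_calls)
print(f"pooled rate: {p_bar:.3f}")
for number, (rate, lower, upper, verdict) in enumerate(results, start=1):
    print(f"umpire {number}: {rate:.3f}, limits [{lower:.3f}, {upper:.3f}] -> {verdict}")
```

In this made-up example only the third and fifth umpires fall outside their limits. The other three have different rates, but nothing in the data justifies ranking them against each other.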

How might these three p-charts change the current conversation?

One might wonder whether the two individual types of errors are related based on a theory that if someone were a “bad” umpire, he would have high rates of both and vice versa. What is the correlation between the two?

Correlation = –0.242 (p-value = 0.021, which is < 0.05: statistically significant... *or is it?*)

Let’s clear things up with a scatter plot, with a trend line of course:

Many people don’t realize that a trend line is an implicit regression and that any regression has at least three diagnostics. The data point at the lower right happens to be a whopping outlier, which invalidates the analysis. In fact, after eliminating that point and looking at the correlation of what remains:

Correlation = –0.135 (p-value = 0.207, which is > 0.05)
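To see how a single point can manufacture a correlation, here is a small sketch that computes the Pearson correlation with and without one extreme observation. The seven data pairs are hypothetical percentages invented for illustration, not the umpire data, and `pearson_r` is just a hand-rolled helper so the example needs no extra libraries.

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical error rates (percent); the last pair is the lower-right outlier.
rate_a = [11, 12, 13, 14, 15, 16, 25]
rate_b = [13, 15, 12, 14, 13, 14, 10]

r_all = pearson_r(rate_a, rate_b)
r_trimmed = pearson_r(rate_a[:-1], rate_b[:-1])

print(f"with the outlier:    r = {r_all:.3f}")
print(f"without the outlier: r = {r_trimmed:.3f}")
```

With these made-up numbers the correlation is strongly negative when the outlier is included and essentially zero once it is removed. A plot shows that at a glance; the summary number alone does not.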

As Ellis Ott used to say, “First, you plot the data, then you plot the data, then you plot the data.”

Given any set of numbers:

• 10 percent will be the top 10 percent, and a different 10 percent will be the bottom 10 percent.

• Pick any arbitrary percentage ending in 0 or 5: that percentage will be the top group, and a different group of the same size will be the bottom.

• 10 people will be the top 10, and 10 different people will be the bottom 10.

Our ranking-obsessed society continues its quest to find the best and worst of everything. As I hope this has shown, *there is no pre-set percentage of outliers*, and there is also the possibility of no outliers!

I remember an illustration in one of Deming’s books where he took a figure similar to my p-charts and wrote on the chart about the performances between the common cause limits: “These cannot be ranked.” Based on the given data, they are indistinguishable from each other and from the overall average.

More data might shed further light: some umpires currently near either limit might then have a big enough denominator to be declared above or below average.
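A rough sketch of why the denominator matters: the limits tighten as the number of calls grows, so the same observed rate can move from inside the limits to outside them. The 15 percent center line borrows the article's overall figure; the observed rate of 16.5 percent, the call counts, and the helper `upper_limit` are assumptions for illustration. The sketch also assumes the center line stays put as data accumulate, which real data need not do.

```python
import math

def upper_limit(p_bar, calls, sigma=3.0):
    """Upper three-sigma binomial limit for a given number of calls."""
    return p_bar + sigma * math.sqrt(p_bar * (1 - p_bar) / calls)

p_bar = 0.15           # assumed center line
observed_rate = 0.165  # hypothetical umpire just above the average

for calls in (3000, 12000):
    limit = upper_limit(p_bar, calls)
    status = "above average" if observed_rate > limit else "inside the limits"
    print(f"{calls:>6} calls: upper limit {limit:.4f} -> {status}")
```

Under these assumptions the umpire is indistinguishable from average at 3,000 calls and above average at 12,000 calls, with no change at all in the observed rate.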

There will also be the poor person whose performance, as in the umpire analysis above, could, for example, go from 15th best (No. 15) to 15th worst (No. 75)—through no fault of his own or change in his performance—provided the others maintained their current “process” as well.
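The rank shuffle described above is easy to simulate. This sketch gives every umpire the *same* true wrong-call rate, generates two seasons, and compares the rankings. The umpire count (90), calls per season (3,000), true rate (15 percent), and random seed are all assumptions chosen for illustration.

```python
import random

def simulate_season(rng, n_umpires, calls, true_rate):
    """Observed wrong-call rate per umpire when everyone has the same true rate."""
    return [
        sum(rng.random() < true_rate for _ in range(calls)) / calls
        for _ in range(n_umpires)
    ]

def ranks(rates):
    """Rank 1 = lowest observed rate; ties are broken by umpire order."""
    order = sorted(range(len(rates)), key=lambda i: rates[i])
    rank = [0] * len(rates)
    for position, umpire in enumerate(order, start=1):
        rank[umpire] = position
    return rank

rng = random.Random(2014)  # fixed seed so the run is repeatable
season_1 = ranks(simulate_season(rng, 90, 3000, 0.15))
season_2 = ranks(simulate_season(rng, 90, 3000, 0.15))

# How far did each umpire move in the rankings with no change in ability?
shifts = [abs(a - b) for a, b in zip(season_1, season_2)]
print("largest rank change:", max(shifts))
print("median rank change: ", sorted(shifts)[len(shifts) // 2])
```

Because every simulated umpire is identical by construction, any movement in the rankings is pure common-cause variation. Nobody got better or worse.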

What does this lack of *basic* knowledge about variation cost our society?

How would analyses like these change conversations to make subsequent actions productive rather than the status quo of increasing confusion, conflict, complexity, and chaos?

The article’s author felt that a 15-percent wrong call rate was too high. Well, it’s what the current system is perfectly designed to get. He may not like it and other people may not like it, but that’s what it is, and ranking to death won’t solve a thing. And (horrors!) note that half of the umpires were above average.

Until next time...

**Links:**

[1] http://www.beyondtheboxscore.com/2014/1/27/5341676/how-well-do-umpires-call-balls-and-strikes

[2] http://baseballsavant.com/pitchfx_search.php

[3] http://baseballsavant.com/apps/umpires.php