Davis Balestracci


Data Torturing in the Baseball World, Part 2

The queasy shifting of probable, common, and special cause

Published: Tuesday, April 12, 2016 - 13:07

In part one yesterday, we looked at stats of the Boston Red Sox bullpen, a typical example of baseball’s tendency to find special cause in just about anything. The Boston Globe article on which these two columns are based has been a gold mine for teaching many useful, basic concepts about variation. Today we’ll continue the analysis with a closer focus on common cause.

Once again, for those of you who aren’t interested in the statistical mechanics but want to be aware of how this type of analysis can drastically change one’s thinking, just skip to the bottom-line conclusions where indicated. For my non-U.S. readers, I hope you’ll be able to follow the analysis philosophy and see parallels to your favorite sports and news articles.

Any italics in direct quotes are mine, and if I make comments within a quote, I show that by inserting [DB:...].

‘A compelling microcosm’

From the Globe article: “The hallmark of bullpens is their inconsistency. The Mariners offer a compelling microcosm, having gone in the last four years, from a 3.39 ERA in 2012 to a 4.58 ERA in 2013 (a rise of 1.19 runs) to a 2.59 ERA in 2014 (a drop of 1.99 runs) to a 4.15 ERA (a rise of 1.56 runs) in 2015.”

The sportswriter might have a point, since each of these year-to-year differences in earned-run averages (ERAs) was greater than the 1.1 calculated in yesterday’s column. To prove it, he went on a fishing expedition in the last 10 years of data. (Was his choice of 10 years arbitrary?) I think the following qualifies as a “masterful” example of explaining probable common cause as special cause, while at times calling it common, then making a special-cause conclusion:

“While [the Mariners’ inconsistency] is an extreme snapshot, it’s far from isolated. Over the last 10 years [DB: 30 teams x 9 year-to-year differences = 270 ranges], there are 30 instances of teams whose relief ERAs changed by at least one run [DB:  ERA: lower = better]—with 16 of them representing improvements by at least a run, and 14 representing declines of at least one run [DB: Half went up, half went down. Sounds average to me—as well as common cause (< 1.1)].

“On average, teams saw their bullpen ERA change by 0.52 runs on a season-to-season basis over the last 10 years—meaning that a ‘normal’ ERA adjustment [from 4.24 to 3.71] could give the Sox at least an average bullpen [DB: Huh?], and with the possibility that it would be far from an outlier for the team to improve by, say, a full run, which would in turn suggest a bullpen that had gone from a weakness to a strength.”
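The 16-versus-14 split in the quote above can be checked against a plain coin flip. The sketch below is a rough illustration only; the function name is made up for this example, and it assumes each one-run change is equally likely to go up or down.

```python
from math import comb


def two_sided_binomial_p(k, n):
    """Exact two-sided p-value for k 'heads' in n fair coin flips."""
    # Add up every outcome at least as far from n/2 as the observed k.
    distance = abs(k - n / 2)
    hits = sum(comb(n, i) for i in range(n + 1) if abs(i - n / 2) >= distance)
    return hits / 2 ** n


# 16 improvements out of 30 one-run changes, from the quote.
p_value = two_sided_binomial_p(16, 30)
# 30 one-run changes out of 270 year-to-year differences.
share = 30 / 270
print(round(p_value, 3), round(share, 3))
```

A p-value this far from zero gives no reason to think improvements and declines differ in frequency, which is the "sounds average to me" point.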

Actually, Seattle is isolated. As you will see, it was the only bullpen with more than one special cause.

“....At least an average bullpen”: The Red Sox already have an average bullpen.

“....Far from an outlier”: He’s right. Changing by a run would be common cause, i.e., not an outlier. But his conclusion implies that nothing could in essence change, yet it would now be a strength. (A special-cause conclusion?)

If he can fish, I can fish—but more carefully and statistically. I didn’t want to rely solely on the differences between 2014 and 2015. Because the sportswriter initially looked at the ERA data for 2012 to 2015, I decided to start there.

Optional technicalities: I did a quick and dirty two-way analysis of variance (ANOVA) to see whether there were any differences by year and/or league, and there weren’t.   

Bottom line
I can look at the three year-to-year differences for each team (2012 to 2013, 2013 to 2014, and 2014 to 2015), which gives me 90 ranges to work with.

Taking the average of the 90 ranges of two:
R avg ~0.50.

Note how close this four-year average is to the sportswriter’s 10-year average of 0.52. (Consistent inconsistency?) What he didn’t realize was that this average range by itself isn’t very useful. It needs to be converted to the maximum difference between two consecutive years that’s due just to common cause. In that case:
R max = 3.268 (from theory, for use with the average of ranges of two) x 0.50 ~1.6

Two teams were much higher than that: Seattle, from 2013 to 2014 (–1.99); and Oakland, from 2014 to 2015 (+1.72). These need to be taken into consideration to get a more accurate answer.
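For readers who want to check the flagging step, here is a minimal Python sketch. It is not the author's actual calculation. The function names are invented for the example, the 0.50 average and the 3.268 factor are taken from the text, and only the five team rows printed in this article are included, not all 30 teams.

```python
RANGE_LIMIT_FACTOR = 3.268  # from the article, for ranges of two
FIRST_YEAR = 2012

# Bullpen ERAs for 2012 to 2015, copied from the team rows in this article.
ERAS = {
    "Seattle": [3.39, 4.58, 2.59, 4.15],
    "Oakland": [2.94, 3.22, 2.91, 4.63],
    "Houston": [4.46, 4.92, 4.80, 3.27],
    "Milwaukee": [4.66, 3.19, 3.62, 3.40],
    "Atlanta": [2.76, 2.46, 3.31, 4.69],
}


def year_to_year_ranges(eras):
    """Return the absolute differences between consecutive seasons."""
    return [round(abs(b - a), 2) for a, b in zip(eras, eras[1:])]


def flag_ranges(eras_by_team, r_max):
    """List (team, 'YYYY-YYYY', range) for every range above r_max."""
    flagged = []
    for team, eras in eras_by_team.items():
        for i, r in enumerate(year_to_year_ranges(eras)):
            if r > r_max:
                span = f"{FIRST_YEAR + i}-{FIRST_YEAR + i + 1}"
                flagged.append((team, span, r))
    return flagged


r_avg = 0.50                        # average of the 90 ranges, from the text
r_max = RANGE_LIMIT_FACTOR * r_avg  # about 1.6
flagged = flag_ranges(ERAS, r_max)
for team, span, r in flagged:
    print(team, span, r)
```

With these inputs, only Seattle's 2013 to 2014 range and Oakland's 2014 to 2015 range exceed the limit.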

Optional technicalities: It’s standard practice to begin a process of omitting special-cause ranges and recalculating until all the remaining ranges are within common cause.

Team      2012   2013   2014            2015
Seattle   3.39   4.58   2.59 (–1.99)    4.15
Oakland   2.94   3.22   2.91            4.63 (+1.72)

1. Eliminating these two, R avg now equals 0.465 and R max = 1.52, which then flags:

Team      2012   2013   2014            2015
Seattle   3.39   4.58   2.59 (–1.99)    4.15 (+1.56)
Houston   4.46   4.92   4.80            3.27 (–1.53)

Note the similar pattern to yesterday’s figures, when I used just the 2014–2015 data: Oakland, Seattle, and Houston get flagged on their 2014–2015 difference.

2. Eliminating these, R avg now equals 0.4398 and R max = 1.44, which then flags:

Team        2012   2013            2014   2015
Milwaukee   4.66   3.19 (–1.47)    3.62   3.40

3. Eliminating Milwaukee, R avg now equals 0.4276 and R max = 1.40

Largest remaining:

Team      2012   2013   2014   2015
Atlanta   2.76   2.46   3.31   4.69 (+1.38)

According to this analysis, 1.38 is not a special cause. However, a deeper—and subsequently confirmatory—analysis using ANOVA left little doubt that 4.69 was a special cause (just like yesterday’s results).

So, omitting this range, I get a final R max of 1.36.
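The successive eliminations can be retraced from the figures quoted above without the full data set. The sketch below is an illustration under stated assumptions, not the author's worksheet. It starts from the first recalculated average (0.465 over 88 ranges) and removes the ranges named at each later step. `screen_ranges` is a hypothetical helper showing the general loop for anyone who has all 90 ranges on hand.

```python
RANGE_LIMIT_FACTOR = 3.268  # from the article, for ranges of two


def drop_from_mean(mean, count, removed):
    """Return (new_mean, new_count) after removing values from an average."""
    new_count = count - len(removed)
    return (mean * count - sum(removed)) / new_count, new_count


def screen_ranges(ranges, factor=RANGE_LIMIT_FACTOR):
    """Hypothetical helper: drop ranges above the limit until none remain.

    Expects a non-empty list of non-negative ranges.
    """
    kept = list(ranges)
    while True:
        limit = factor * sum(kept) / len(kept)
        inside = [r for r in kept if r <= limit]
        if len(inside) == len(kept):
            return kept, limit
        kept = inside


# Start from the article's first recalculation: 88 ranges averaging 0.465.
r_avg, n = 0.465, 88
# Ranges removed at each later step, as listed in the article.
later_steps = [[1.56, 1.53], [1.47], [1.38]]

limits = []
for removed in later_steps:
    r_avg, n = drop_from_mean(r_avg, n, removed)
    limits.append(round(RANGE_LIMIT_FACTOR * r_avg, 2))

print(n, limits)
```

The recomputed limits round to 1.44, 1.40, and 1.36 with 84 ranges left. The intermediate averages differ from the quoted ones in the fourth decimal because the starting 0.465 is itself rounded. Note that the last removal (Atlanta's 1.38) rests on the ANOVA result described above, so a purely limit-driven loop such as `screen_ranges` would stop one step earlier.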

Two anomalies from yesterday:
• Given the 1.36, San Diego’s 2014 to 2015 difference of 1.29 seems to have been common cause.
• The previous R max of 1.1 based on 2014–2015 was probably low and quite variable in its estimate due to the use of only 25 ranges. Today’s 2012 to 2015 analysis ends up using 84 ranges, which makes it more reliable and accurate.

Neat trick to avoid all this eliminating and recalculating: One can alternatively use the median range of the original 90 differences at the outset as a good initial estimate of what constitutes an outlier. In this case:
R med = 0.375

And from this:
R max = 0.375 x 3.865 (from theory, for use with the median of ranges of two) ~1.45

That is very close to the final answer using the average range with successive eliminations, so the median range is oftentimes “one stop shopping.” Applied to the final data with the six outliers eliminated, the median range matched the R avg result.
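Why the median needs no eliminating and recalculating can be seen in a small sketch. The ranges below are made-up values for illustration only, not ERA data, and the function names are invented for the example. The two factors are the ones quoted in the text.

```python
from statistics import mean, median

MEAN_RANGE_FACTOR = 3.268    # from the article, used with the average range
MEDIAN_RANGE_FACTOR = 3.865  # from the article, used with the median range


def limit_from_mean(ranges):
    """Upper limit for ranges of two, based on the average range."""
    return MEAN_RANGE_FACTOR * mean(ranges)


def limit_from_median(ranges):
    """Upper limit for ranges of two, based on the median range."""
    return MEDIAN_RANGE_FACTOR * median(ranges)


# Hypothetical ranges: a quiet baseline, then the same list plus one outlier.
baseline = [0.1, 0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.6, 0.7]
with_outlier = baseline + [2.5]

for label, data in [("baseline", baseline), ("with outlier", with_outlier)]:
    print(label, round(limit_from_mean(data), 2), round(limit_from_median(data), 2))
```

In this toy example, the single large range pulls the average-based limit upward while the median-based limit does not move, which is why the median gives a usable first estimate before any outliers are removed.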

Using a box plot to do an analysis on the original 90 actual differences yields the information that any range greater than ~1.5 is a special cause.

Bottom line
My approach of using several analyses simultaneously to seek convergence was successful. Three different, simple approaches (along with some slight help from ANOVA) yield a very similar conclusion: Two consecutive years’ ERAs can differ by ~1.4 due just to common cause.

‘Lightning in a bottle’

Here’s another quote from the Globe article: “The Sox’s biggest one-year improvement of the last decade came between 2006 and 2007, when the relief ERA dropped by 1.41 runs en route to a championship, primarily thanks to a) lightning in a bottle...; b) a breakthrough... in middle relief; and c) drastic defensive improvement that permitted a bullpen group with modest strikeout numbers to record outs.”

Given 10 years, isn’t one of the differences going to be the largest? This is what’s called “cherry picking,” but we can test it. That difference of 1.41 is a borderline special cause and needs to be examined more closely.

Optional technicalities: I did an I-chart of the Red Sox bullpen ERA from 2000 to 2015, as shown in figure 1:

Figure 1: Red Sox bullpen ERA

Looking at this graph, I wondered whether 1.41 indicated a distinct shift in overall bullpen performance, i.e., the possibility that the bullpens of 2007 to 2015 have been consistently better than those of 2000 to 2006. (Was this due to new coaching staff, or a more consistent philosophy or approach? Was this around the time when bullpen strategy began to tilt more toward “one inning” (or even one batter) “specialists”?)

Using a simple t-test, I got the surprising (to me) p-value of 0.012 (only an approximate 1 percent risk that this difference might not be real).

Using this along with the standard deviation estimate from all the data in the previous analysis (~0.37), figure 2 reveals:

Figure 2: Red Sox bullpen ERA (using the same scale as the graph in figure 1)

It also seems to confirm that the standard deviation estimate of ~0.37 is reasonable (and hence, so is an R max of 1.36 or ~1.4).
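The link between the ~0.37 standard deviation and the R max of 1.36 can be sketched with the conventional constants for ranges of two. This is one standard route, assumed here for illustration; the article does not say exactly how the estimate was computed. The value 1.128 is the usual bias-correction constant (often written d2) and is not quoted in the text.

```python
D2 = 1.128                  # standard constant: average range of two = D2 * sigma
RANGE_LIMIT_FACTOR = 3.268  # from the article, for ranges of two

r_max_final = 1.36                              # final limit from the article
r_avg_final = r_max_final / RANGE_LIMIT_FACTOR  # implied final average range
sigma_hat = r_avg_final / D2                    # implied standard deviation

print(round(r_avg_final, 3), round(sigma_hat, 2))
```

Working backward from 1.36 gives an average range near 0.416 and a standard deviation near 0.37, so the two figures are consistent with each other.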

Another angle: I was curious and made an assumption that a u-chart ANOM could be used to compare Boston’s 2006 and 2007 ERAs. Considering “runs” as somewhat discrete random events and “innings” as the window of opportunity, I got the results seen in figure 3:

Figure 3: ANOM comparing Red Sox bullpen ERA, 2006 vs. 2007

Based on the u-chart being an appropriate analysis, the 2007 bullpen ERA does seem to be significantly lower than 2006, with a risk of less than 1 percent (because the results are outside the second set of red lines at 1.8 SL).

Bottom line
The 1.41 drop seems to be a special cause—but is it for the reasons he cites? Aren’t a) “lightning in a bottle” and b) “a breakthrough” based on random luck?

Regarding c), the “drastic defensive improvement” from 2006 to 2007, the nature of fielding percentage lends itself beautifully to a p-chart ANOM. Figure 4 shows 2006:

Figure 4: ANOM of fielding performance comparing the 30 teams

The teams are listed in descending order (so the team order will be different for 2007). Look who’s got the highest fielding percentage as a true special cause: Boston (0.989)!

Let’s take a look at the 2007 data as an ANOM (figure 5):

Figure 5: ANOM of fielding performance comparing the 30 teams

Boston is team No. 3 (0.986) on the horizontal axis—and average.

To review a key point: Statistically (based on these data only), there is no difference between teams 2 through 29.

What planet was the writer on to conclude c)?

‘A considerable amount of luck’

The Globe article concludes: “The lesson? Bullpen improvement can happen even without adding a single ‘name’ relief arm. That said, there’s a considerable amount of luck involved in getting the sort of performances from unheralded relievers that allow a bullpen transformation.”

To paraphrase statistically: Alleged “improvement” due to common cause = luck (and it is). Everything that could possibly go right does go right. Sort of like those rare days when you get all the green lights driving to work. Try to reproduce it? You can’t! You know it’s going to happen again, but when? You don’t know!

Common cause “lightning in a bottle”: Given what amounts to usually 10 or so good teams of relatively equal ability, it’s going to happen to someone—for an entire season or, increasingly, even for some mediocre teams during the playoffs (think wildcards)—but you can’t say just who until the end. However, that doesn’t stop people from trying to explain it as special cause using opportunistic data torturing!

Final thoughts from the article

Quoting the Red Sox general manager: “What you really try to do is... project some people’s performance taking a step forward, through scouting and analytics, and try to go that way.... [T]here’s so much inconsistency in bullpen performances throughout the years. [DB: No kidding!] So the good arm just doesn’t settle, because you can have a good arm and still get hit.... I think sometimes you have to look at the year before.” [DB: Given two different numbers, one will be larger.]

Perhaps a plot of an individual’s performance over more than one year might be better for prediction, especially to predict a special cause drop-off in performance?

“....[E]ven in an area where dramatic improvement is possible, the path to achieve it is, for now, obscure.” [DB: Especially when people keep treating common cause as special cause.]

So why not use some statistical thinking applications to find true special causes that help focus and motivate better questions for prediction?

Common cause inconsistency is predictably, consistently inconsistent. People might not like its level, but it is what it is. Perhaps in the end, winning the World Series is somewhat of a lottery.

A ‘what if…?’ to ponder
What if a Green or Black Belt certification exam consisted of simply passing out this or a similar article with the only instructions being, “Apply any statistics you have learned to statements made in this article”?

I’ve said it before, and I’ll say it again: There is no app for critical thinking.


About The Author

Davis Balestracci

Davis Balestracci is a past chair of ASQ’s statistics division. He has synthesized W. Edwards Deming’s philosophy as Deming intended—as an approach to leadership—in the second edition of Data Sanity (Medical Group Management Association, 2015), with a foreword by Donald Berwick, M.D. Shipped free or as an ebook, Data Sanity offers a new way of thinking using a common organizational language based in process and understanding variation (data sanity), applied to everyday data and management. It also integrates Balestracci’s 20 years of studying organizational psychology into an “improvement as built in” approach as opposed to most current “quality as bolt-on” programs. Balestracci would love to wake up your conferences with his dynamic style and entertaining insights into the places where process, statistics, organizational culture, and quality meet.