Data Torturing in the Baseball World, Part 2
The queasy shifting of probable, common, and special cause
Davis Balestracci
Published: Tuesday, April 12, 2016 - 12:07
In part one yesterday, we looked at statistics for the Boston Red Sox bullpen, a typical example of baseball's tendency to find special cause in just about anything. The Boston Globe article on which these two columns are based has been a gold mine for teaching many useful, basic concepts about variation. Today we'll continue the analysis with a closer focus on common cause.

Once again, for those of you who aren't interested in the statistical mechanics but want to see how this type of analysis can drastically change one's thinking, just skip to the bottom-line conclusions where indicated. For my non-U.S. readers, I hope you'll be able to follow the analysis philosophy and see parallels to your favorite sports and news articles. Any italics in direct quotes are mine, and if I make comments within a quote, I show that by inserting [DB: ...].

'A compelling microcosm'

From the Globe article: "The hallmark of bullpens is their inconsistency. The Mariners offer a compelling microcosm, having gone in the last four years from a 3.39 ERA in 2012 to a 4.58 ERA in 2013 (a rise of 1.19 runs) to a 2.59 ERA in 2014 (a drop of 1.99 runs) to a 4.15 ERA (a rise of 1.56 runs) in 2015."

The sportswriter might have a point, since each of these year-to-year differences in earned-run averages (ERAs) was greater than the 1.1 calculated in yesterday's column. To prove it, he went on a fishing expedition in the last 10 years of data. (Was his choice of 10 years arbitrary?) I think the following qualifies as a "masterful" example of explaining probable common cause as special cause, while at times calling it common, then making a special-cause conclusion:

"While [the Mariners' inconsistency] is an extreme snapshot, it's far from isolated. Over the last 10 years [DB: 30 teams × 9 year-to-year differences = 270 ranges], there are 30 instances of teams whose relief ERAs changed by at least one run [DB: ERA: lower = better]—with 16 of them representing improvements by at least a run, and 14 representing declines of at least one run [DB: Half went up, half went down. Sounds average to me—as well as common cause (< 1.1)].

"On average, teams saw their bullpen ERA change by 0.52 runs on a season-to-season basis over the last 10 years—meaning that a 'normal' ERA adjustment [from 4.24 to 3.71] could give the Sox at least an average bullpen [DB: Huh?], and with the possibility that it would be far from an outlier for the team to improve by, say, a full run, which would in turn suggest a bullpen that had gone from a weakness to a strength."

Actually, Seattle is isolated. As you will see, it was the only bullpen with more than one special cause.

"...At least an average bullpen": The Red Sox already have an average bullpen.

"...Far from an outlier": He's right. Changing by a run would be common cause, i.e., not an outlier. But his conclusion implies that nothing would in essence have changed, yet the bullpen would now be a strength. (A special-cause conclusion?)

If he can fish, I can fish—but more carefully and statistically. I didn't want to rely solely on the differences between 2014 and 2015. Because the sportswriter initially looked at the ERA data for 2012 to 2015, I decided to start there: I can look at the three year-to-year differences for each team (2012 to 2013, 2013 to 2014, and 2014 to 2015), which gives me 90 ranges to work with.

Optional technicalities: I did a quick and dirty two-way analysis of variance (ANOVA) to see whether there were any differences by year and/or league, and there weren't.

Bottom line

Taking the average of the 90 ranges of two consecutive years: R avg ≈ 0.50. Note how close this four-year average is to the sportswriter's 10-year average of 0.52. (Consistent inconsistency?)
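For readers who want to try the "optional technicalities" step themselves, here is a minimal sketch of such a quick two-way ANOVA in Python. The file name and column names are hypothetical; any long-format table with one row per team-season would do:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical long-format data: one row per team-season, with columns
# 'bullpen_era', 'year', and 'league' ('AL' or 'NL').
era = pd.read_csv("bullpen_era_2012_2015.csv")

# Two-way ANOVA: does bullpen ERA differ by year and/or league?
model = smf.ols("bullpen_era ~ C(year) + C(league)", data=era).fit()
print(anova_lm(model, typ=2))  # large p-values -> no evidence of a difference
```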
What he didn't realize is that this average range by itself isn't very useful. It needs to be converted to the maximum difference between two consecutive years that is due just to common cause. In that case:

R max = 3.268 × 0.50 ≈ 1.6

(3.268 comes from theory; it is the standard constant applied to the average of ranges of two values.)
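Here is a minimal sketch of that whole calculation, assuming a dictionary of each team's 2012–2015 bullpen ERAs (only two teams are shown for illustration; the real analysis uses all 30):

```python
# Convert year-to-year ERA changes into R avg and the maximum
# consecutive-year difference attributable to common cause.
D4 = 3.268  # control-chart constant for ranges of subgroups of size 2

eras = {
    "Seattle": [3.39, 4.58, 2.59, 4.15],
    "Oakland": [2.94, 3.22, 2.91, 4.63],
    # ... remaining 28 teams
}

# One range per consecutive pair of seasons, for every team.
ranges = [abs(b - a)
          for seasons in eras.values()
          for a, b in zip(seasons, seasons[1:])]

r_avg = sum(ranges) / len(ranges)   # ~0.50 with all 30 teams included
r_max = D4 * r_avg                  # ~1.6: the common-cause limit
print(f"R avg = {r_avg:.3f}, R max = {r_max:.2f}")
```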
Two teams were much higher than that: Seattle, from 2013 to 2014 (–1.99); and Oakland, from 2014 to 2015 (+1.72):

Team      2012   2013   2014          2015
Seattle   3.39   4.58   2.59 (–1.99)  4.15
Oakland   2.94   3.22   2.91          4.63 (+1.72)
These need to be taken into consideration to get a more accurate answer.

Optional technicalities: It's standard practice to begin a process of omitting special-cause ranges and recalculating until all the remaining ranges are within common cause.

1. Eliminating these two, R avg now equals 0.465 and R max = 1.52, which then flags:

Team      2012   2013   2014          2015
Seattle   3.39   4.58   2.59 (–1.99)  4.15 (+1.56)
Houston   4.46   4.92   4.80          3.27 (–1.53)
Note the similar pattern to yesterday's figures, when I used just the 2014–2015 data: Oakland, Seattle, and Houston were flagged on their 2014–2015 differences.

2. Eliminating these, R avg now equals 0.4398 and R max = 1.44, which flags Milwaukee:

Team       2012   2013          2014   2015
Milwaukee  4.66   3.19 (–1.47)  3.62   3.40
3. Eliminating Milwaukee, R avg now equals 0.4276 and R max = 1.40. The largest remaining range:

Team     2012   2013   2014   2015
Atlanta  2.76   2.46   3.31   4.69 (+1.38)

According to this analysis, 1.38 is not a special cause. However, a deeper—and subsequently confirmatory—analysis using ANOVA left little doubt that Atlanta's 4.69 was a special cause (just like yesterday's results). So, omitting this range, I get a final R max of 1.36.
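A minimal sketch of that successive-elimination loop, assuming a list of the original 90 ranges. Note that the loop captures only the R max screening: run on these data it stops at roughly 1.40, and the call on Atlanta's 1.38 required the separate ANOVA described above.

```python
# Drop any range beyond the current limit, recompute R avg and R max,
# and repeat until every remaining range is within common cause.
D4 = 3.268  # constant for ranges of subgroups of size 2

def common_cause_limit(ranges):
    remaining = list(ranges)
    while True:
        r_max = D4 * sum(remaining) / len(remaining)
        keepers = [r for r in remaining if r <= r_max]
        if len(keepers) == len(remaining):
            return r_max, remaining   # converged: all within common cause
        remaining = keepers
```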
Two anomalies from yesterday:

• Given the 1.36, San Diego's 2014 to 2015 difference of 1.29 seems to have been common cause.
• The previous R max of 1.1, based on the 2014–2015 data alone, was probably low and quite variable as an estimate because it used only 25 ranges. Today's 2012 to 2015 analysis ends up using 84 ranges, which makes it more reliable and accurate.
A neat trick to avoid all this eliminating and recalculating: One can alternatively use the median range of the original 90 differences at the outset as a good initial estimate of what constitutes an outlier. In this case, R med = 0.375,
and from this: R max = 3.865 × 0.375 ≈ 1.45. (3.865 comes from theory; it is the constant applied to the median of ranges of two values.) This is very close to the final answer obtained using the average range with successive eliminations, which makes the median-range approach oftentimes "one-stop shopping." Applying the median range to the final data, with the six outliers eliminated, matched the R avg result. And using a box plot to analyze the original 90 actual differences yields the same information: Any range greater than ~1.5 is a special cause.
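A sketch of both shortcuts, again assuming a list of the 90 ranges:

```python
import statistics

def median_range_limit(ranges):
    # Median range scaled by the analogous theoretical constant for
    # the median of ranges of subgroup size 2.
    return 3.865 * statistics.median(ranges)

def boxplot_upper_fence(ranges):
    # Classic box-plot outlier fence: Q3 + 1.5 * IQR (~1.5 per the
    # analysis above).
    q = statistics.quantiles(ranges, n=4)   # returns [Q1, Q2, Q3]
    return q[2] + 1.5 * (q[2] - q[0])
```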
Bottom line
My approach of using several analyses simultaneously to seek convergence was successful: Three different, simple approaches (along with some slight help from ANOVA) yield a very similar conclusion: Two consecutive years' ERAs can differ by ~1.4 due just to common cause.

'Lightning in a bottle'

Here's another quote from the Globe article: "The Sox's biggest one-year improvement of the last decade came between 2006 and 2007, when the relief ERA dropped by 1.41 runs en route to a championship, primarily thanks to a) lightning in a bottle...; b) a breakthrough... in middle relief; and c) drastic defensive improvement that permitted a bullpen group with modest strikeout numbers to record outs."

Given 10 years, isn't one of the differences going to be the largest? This is what's called "cherry picking," but we can test it. That difference of 1.41 is a borderline special cause and needs to be examined more closely.

Optional technicalities: I did an I-chart of the Red Sox bullpen ERA from 2000 to 2015, as shown in figure 1.

Figure 1: Red Sox bullpen ERA

Looking at this graph, I wondered whether 1.41 indicated a distinct shift in overall bullpen performance, i.e., the possibility that the bullpens of 2007 to 2015 have been consistently better than those of 2000 to 2006. (Was this due to a new coaching staff, or a more consistent philosophy or approach? Was this around the time when bullpen strategy began to tilt more toward one-inning, or even one-batter, "specialists"?) Using a simple t-test, I got the surprising (to me) p-value of 0.012 (only an approximate 1% risk that this difference might not be real).
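For the curious, the I-chart limits behind a graph like figure 1 are conventionally the mean plus or minus 2.66 times the average moving range. A minimal sketch, assuming a list of season ERAs (the variable names are hypothetical):

```python
# Individuals (I) chart: center line at the mean, limits at
# mean +/- 2.66 * (average moving range).
def i_chart_limits(series):
    moving_ranges = [abs(b - a) for a, b in zip(series, series[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    center = sum(series) / len(series)
    return center - 2.66 * mr_bar, center, center + 2.66 * mr_bar

# Hypothetical usage, given Red Sox bullpen ERAs for 2000-2015:
# lcl, cl, ucl = i_chart_limits(red_sox_bullpen_era)
#
# The shift test mentioned above is a plain two-sample t-test:
# from scipy import stats
# t, p = stats.ttest_ind(era_2000_2006, era_2007_2015)  # p ~ 0.012 per the text
```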
Using this along with the standard deviation estimate from all the data in the previous analysis (~0.37), figure 2 reveals the shift:

Figure 2: Red Sox bullpen ERA (using the same scale as the graph in figure 1)

It also seems to confirm that the standard deviation estimate of ~0.37 is reasonable (and hence, so is an R max of 1.36, or ~1.4).

Another angle: I was curious and made an assumption that a u-chart ANOM could be used to compare Boston's 2006 and 2007 ERAs. Considering "runs" as somewhat discrete random events and "innings" as the window of opportunity, I got the results seen in figure 3:

Figure 3: ANOM comparing Red Sox bullpen ERA, 2006 vs. 2007

Based on the u-chart being an appropriate analysis, the 2007 bullpen ERA does seem to be significantly lower than 2006's, with a risk of less than 1 percent (because the results are outside the second set of red lines at 1.8 SL).
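A sketch of that u-chart comparison, assuming earned-run and innings totals for the two seasons. The decision-line multiplier h depends on the ANOM risk chosen (the figure's "1.8 SL" lines reflect one such choice), so treat the default below as a placeholder:

```python
import math

# u-chart: earned runs treated as discrete events, innings as the
# area of opportunity. Each point is compared with limits around the
# pooled rate u_bar.
def u_chart_points(runs, innings, h=3.0):   # h: placeholder multiplier
    u_bar = sum(runs) / sum(innings)        # pooled runs per inning
    results = []
    for r, n in zip(runs, innings):
        limit = h * math.sqrt(u_bar / n)
        results.append((r / n, u_bar - limit, u_bar + limit))
    return results

# Hypothetical usage:
# u_chart_points(runs=[earned_2006, earned_2007],
#                innings=[ip_2006, ip_2007])
```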
Bottom line

The 1.41 drop seems to be a special cause—but is it for the reasons he cites? Aren't a) "lightning in a bottle" and b) "a breakthrough" based on random luck?

Regarding c), the "drastic defensive improvement" from 2006 to 2007: The nature of fielding percentage lends itself beautifully to a p-chart ANOM. Figure 4 shows 2006:

Figure 4: ANOM of 2006 fielding performance comparing the 30 teams

The teams are listed in descending order (the team order will be different for 2007). Look who has the highest fielding percentage as a true special cause: Boston (0.989)!

Let's take a look at the 2007 data as an ANOM (figure 5):

Figure 5: ANOM of 2007 fielding performance comparing the 30 teams

Boston is team No. 3 (0.986) on the horizontal axis—and average. To review a key point: Statistically (based on these data only), there is no difference among teams 2 through 29. What planet was the writer on to conclude c)?
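A sketch of the p-chart ANOM calculation behind figures 4 and 5, assuming each team's successful chances and total chances; the multiplier h is again a placeholder for the appropriate ANOM decision-line constant:

```python
import math

# p-chart ANOM: fielding percentage = successful chances / total
# chances (successes = putouts + assists; totals also include errors),
# compared with limits around the pooled percentage p_bar.
def p_chart_anom(successes, totals, h=3.0):   # h: placeholder multiplier
    p_bar = sum(successes) / sum(totals)
    out = []
    for s, n in zip(successes, totals):
        half_width = h * math.sqrt(p_bar * (1 - p_bar) / n)
        out.append((s / n, p_bar - half_width, p_bar + half_width))
    return out
```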
'A considerable amount of luck'

The Globe article concludes: "The lesson? Bullpen improvement can happen even without adding a single 'name' relief arm. That said, there's a considerable amount of luck involved in getting the sort of performances from unheralded relievers that allow a bullpen transformation."

To paraphrase statistically: Alleged "improvement" due to common cause = luck (and it is). Everything that could possibly go right does go right. Sort of like those rare days when you get all the green lights driving to work. Try to reproduce it? You can't! You know it's going to happen again, but when? You don't know!

Common-cause "lightning in a bottle": Given what amounts to usually 10 or so good teams of relatively equal ability, it's going to happen to someone—for an entire season or, increasingly, even for some mediocre teams during the playoffs (think wild cards)—but you can't say just who until the end. However, that doesn't stop people from trying to explain it as special cause using opportunistic data torturing!

Final thoughts from the article

Quoting the Red Sox general manager: "What you really try to do is... project some people's performance taking a step forward, through scouting and analytics, and try to go that way.... [T]here's so much inconsistency in bullpen performances throughout the years. [DB: No kidding!] So the good arm just doesn't settle, because you can have a good arm and still get hit.... I think sometimes you have to look at the year before." [DB: Given two different numbers, one will be larger.]

Perhaps a plot of an individual's performance over more than one year might be better for prediction, especially for anticipating a special-cause drop-off in performance.

"....[E]ven in an area where dramatic improvement is possible, the path to achieve it is, for now, obscure." [DB: Especially when people keep treating common cause as special cause.]

So why not use some statistical thinking applications to find true special causes that help focus and motivate better questions for prediction? Common-cause inconsistency is predictably, consistently inconsistent. People might not like its level, but it is what it is. Perhaps, in the end, winning the World Series is somewhat of a lottery.

A 'what if...?' to ponder

What if a Green or Black Belt certification exam consisted of simply passing out this or a similar article, with the only instructions being, "Apply any statistics you have learned to statements made in this article"?

I've said it before, and I'll say it again: There is no app for critical thinking.