Davis Balestracci


Data Torturing in the Baseball World, Part 1

Explaining anything as special cause

Published: Monday, April 11, 2016 - 16:35

In honor of baseball season, I’m going to apply some simple statistical thinking to my favorite sport in a two-part series today and tomorrow. I want anyone to be able to enjoy this, so I’ll mark any technical statistics as optional reading. For those of you interested only in the interpretations, I'll offer the “bottom line” conclusions, many of which I think will surprise you.

Maybe you can’t do the math and don’t even want to, but you should at least realize the importance of understanding these types of analyses and have access to someone who can do them. In many similar daily situations you encounter, anything else would be data insanity.

With baseball’s innate tendency to explain common cause as special, opportunistic data torturing abounds. For example, “Could the Red Sox bullpen make a leap of an improvement?”, a Nov. 6, 2015, article from The Boston Globe about my favorite team, the Boston Red Sox, raised just one too many red flags. I smiled as I read it and couldn’t resist digging deeper to find the data to test the sportswriter’s statements.

There’s a wonderful site with just about any baseball statistic you could want if you search long enough. I went to its section on bullpens, which had three pages’ worth of stats, figuring that if they went through the trouble to compile them, they must somehow contribute to bullpen performance.

‘Worst in the majors’

Here’s a quote from the Globe article: “... After all, the Sox bullpen was by many measures the worst in the majors last year, with the absence of Koji Uehara and late struggles of Junichi Tazawa leaving the team bereft of the sort of strikeout-per-inning arms that have become a staple of the game.”

I took my best guess and used nine of the compiled stats:
1. Earned-run average (ERA—lower is better)
2. Blown save percentage (lower is better)
3. Walks per nine innings (minus intentional walks—lower is better)
4. Strikeouts per nine innings (higher is better)
5. Batting average against (lower is better)
6. Something called OPS (sum of on-base percentage and slugging percentage—lower is better)
7. Home runs given up per nine innings (lower is better)
8. Steal success rate (lower is better)
9. Pitches per plate appearance (lower is better)

I got the 30-team rankings for each measure (1 to 30, best to worst) and summed them for an analysis. Any score from 9 to 270 (lower is better) is possible, with the calculated average being (9 × 15.5) = 139.5. But one needs to determine how much common cause there is around that average.
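As a minimal sketch of the arithmetic behind the rank sum, the following Python computes its possible range and average. It also computes the spread a team's score would show if its nine ranks were drawn purely at random; that last figure is only a naive baseline added for illustration and is an assumption of this sketch, not the ANOVA-based variation used in the analysis described here. The function and variable names are hypothetical.

```python
from math import sqrt

N_TEAMS = 30     # each measure ranks the teams 1..30
N_MEASURES = 9   # nine bullpen statistics are summed per team


def rank_sum_summary(n_teams: int, n_measures: int) -> dict:
    """Range, mean, and chance-only spread for a sum of independent ranks."""
    mean_rank = (n_teams + 1) / 2        # 15.5 for 30 teams
    var_rank = (n_teams ** 2 - 1) / 12   # variance of one rank drawn uniformly from 1..n
    return {
        "min": n_measures * 1,           # ranked first on every measure
        "max": n_measures * n_teams,     # ranked last on every measure
        "mean": n_measures * mean_rank,
        # Naive baseline only: assumes the nine ranks are independent and random
        "sd_if_random": sqrt(n_measures * var_rank),
    }


summary = rank_sum_summary(N_TEAMS, N_MEASURES)
print(summary)
```

The minimum, maximum, and mean match the 9, 270, and 139.5 quoted above; the random-rank standard deviation is shown only to give a feel for how wide chance variation in such a score can be.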

Optional technicalities: I did a nonparametric analysis of variance (ANOVA) on the individual rankings, from which I obtained the variation to perform an analysis of means (ANOM) on the sum of the nine ranks for each team. Here is the result:

Figure 1: Analysis of means (ANOM) for sum of nine rankings

Very important for everyone: This ANOM type of analysis to expose special causes is woefully underutilized in most improvement work. Everyone needs to have a basic grasp of it. W. Edwards Deming often used this technique, invented by Ellis Ott, and was emphatic that any points between the two common cause limits (in this case, 71.7 and 207.3) could not be ranked. This is a concept that initially is difficult to wrap one’s arms around. The 25 teams (or 26, depending on how one interprets Baltimore, which is team No. 3 on the horizontal axis in figure 1) between those two limits are indistinguishable from each other—and the overall average.

Some might think that Boston (team No. 4) is below average because its rank sum score (197) is greater than the average of 139.5. However, to be truly below average requires a score greater than 207. Based on this snapshot of data, Boston is not a special cause.
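The decision rule can be sketched in a few lines of Python. The limits (71.7 and 207.3) and Boston's score (197) are taken from the text; the function name and the wording of the labels are hypothetical.

```python
LOWER_LIMIT = 71.7    # lower common cause limit from the ANOM
UPPER_LIMIT = 207.3   # upper common cause limit from the ANOM


def classify_rank_sum(score: float,
                      lower: float = LOWER_LIMIT,
                      upper: float = UPPER_LIMIT) -> str:
    """Label a rank-sum score; lower scores are better."""
    if score < lower:
        return "special cause: above average"
    if score > upper:
        return "special cause: below average"
    # Anything between the limits cannot be ranked against the others
    return "common cause: indistinguishable from average"


boston = classify_rank_sum(197)
print(boston)
```

A score of 197 sits above the average of 139.5 but below the upper limit, so it is labeled common cause.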

Bottom line
• “Below average” bullpens: Atlanta (No. 2), Colorado (No. 9), Detroit (No. 10)
• “Above average” bullpens: Pittsburgh (No. 22) and maybe Baltimore (No. 3)

As Deming would say, these 30 teams form a “system,” and individual teams are either inside the system (common cause) or outside the system (special cause in either direction).

And then there’s the other half of the quote from the Globe article trying to explain an alleged difference in strikeout rates during the absences of Uehara and Tazawa.

About those strikeout rates—p-chart ANOM

Optional technicalities: Figures 2 and 3 below are p-chart ANOMs (p = proportion/percentage) of bullpen performances. Figure 2 compares the Boston bullpen’s 2014 and 2015 strikeout rates (strikeouts/total outs). In addition to the standard three-standard-deviation limits, I put in even less conservative criteria (5% and 1% significance limits).
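For readers who want to see the mechanics, here is a rough Python sketch of p-chart-style limits around a pooled proportion. It uses the textbook binomial standard error with a plain multiplier `z`; the exact ANOM constants behind the figures are not given in the text, so the multiplier is an assumption. The counts are hypothetical placeholders, not the actual Red Sox data.

```python
from math import sqrt


def p_chart_limits(events, trials, z=3.0):
    """Return the pooled proportion and a (rate, lower, upper) tuple per group."""
    p_bar = sum(events) / sum(trials)   # overall proportion across all groups
    rows = []
    for x, n in zip(events, trials):
        # Binomial standard error for a group with n trials, scaled by z
        half_width = z * sqrt(p_bar * (1 - p_bar) / n)
        rows.append((x / n, p_bar - half_width, p_bar + half_width))
    return p_bar, rows


# Hypothetical strikeout and total-out counts for two seasons (NOT real data)
strikeouts = [400, 380]
outs = [1500, 1500]

p_bar, rows = p_chart_limits(strikeouts, outs)
for rate, lo, hi in rows:
    verdict = "inside" if lo <= rate <= hi else "outside"
    print(f"rate={rate:.3f} limits=({lo:.3f}, {hi:.3f}) -> {verdict}")
```

Passing a smaller `z` narrows the limits, which is the idea behind the less conservative 5% and 1% criteria; a formal ANOM would use its own tabulated critical values instead.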

Figure 2: P-chart ANOM comparison of Red Sox strikeout rates

Bottom line
No difference in Boston bullpen’s 2014 and 2015 strikeout rates. The data lie between even the narrowest decision limits.

The graph in figure 3 compares Major League Baseball's (MLB) bullpens’ total strikeout rates for 2014 and 2015 by combining the rates of all 30 bullpens (using the same criteria as figure 2):

Figure 3: P-chart ANOM of Major League Baseball (MLB) strikeout rates

Bottom line
Similarly, the 2014 and 2015 data both lie within the narrowest limits—no year-to-year difference.

I wouldn’t be a bit surprised if someone has said, “The Red Sox followed the overall trend of the major league bullpen strikeout rate for the 2015 season by being down slightly from 2014.” Sorry, just not true.

Using the same p-chart ANOM technique, how does Boston compare to the other 29 bullpens in terms of its individual strikeout rate for 2014 and 2015 (figures 4 and 5)?

Figure 4: P-chart ANOM of 2014 strikeout rates

Figure 5: P-chart ANOM of 2015 strikeout rates

Bottom line
• Boston (team No. 4) is between the limits both years, so it was average for both seasons—no difference.
• I didn’t realize how strikeout-dominant the New York Yankees (team No. 19) were in both 2014 and 2015.
• The L.A. Dodgers (team No. 14) were also truly above average for 2015.
• Detroit (team No. 10) and Minnesota (team No. 17) were truly below average in 2015 (Minnesota in 2014, as well).

‘What are the odds?’

Here’s another quote from the Globe article: “Red Sox relievers finished with a 4.24 ERA last year. The league average bullpen had a 3.71 mark. What are the odds of bridging that divide of 0.53 earned runs per nine innings? Excellent, actually.”

I needed an estimate of the standard deviation to be able to perform an ANOM on the 2015 individual team bullpen ERAs.

Optional technicalities: I analyzed ERAs using the variation from the combined 2014/2015 data. An initial ANOVA showed no difference either by year or league. It also identified five outliers in terms of the difference between 2014 and 2015.

Bottom line
Any difference between 2014 and 2015 that is greater than ~1.1 in either direction is considered significant. This occurred for five bullpens:

Team        2014   2015   Diff
Atlanta     3.31   4.69   +1.38
Houston     4.80   3.27   -1.53
Oakland     2.91   4.63   +1.72
San Diego   2.73   4.02   +1.29
Seattle     2.59   4.15   +1.56

Boston was 3.33 in 2014 and 4.24 in 2015, for a difference of +0.91, which is common cause.
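The year-to-year check can be reproduced with a short Python sketch. The ERAs come from the table above, Boston's 2015 figure is taken as the 4.24 quoted from the Globe, and the ~1.1 threshold is the one stated in the text. The function and variable names are hypothetical.

```python
THRESHOLD = 1.1  # approximate outlier criterion for a 2014-to-2015 change

# (2014 ERA, 2015 ERA) per bullpen, as given in the article
era = {
    "Atlanta": (3.31, 4.69),
    "Houston": (4.80, 3.27),
    "Oakland": (2.91, 4.63),
    "San Diego": (2.73, 4.02),
    "Seattle": (2.59, 4.15),
    "Boston": (3.33, 4.24),
}


def flag_changes(era_by_team, threshold=THRESHOLD):
    """Map each team to (difference, is_special_cause)."""
    result = {}
    for team, (y2014, y2015) in era_by_team.items():
        diff = round(y2015 - y2014, 2)   # round away floating-point noise
        result[team] = (diff, abs(diff) > threshold)
    return result


flags = flag_changes(era)
for team, (diff, special) in flags.items():
    label = "special cause" if special else "common cause"
    print(f"{team:10s} {diff:+.2f}  {label}")
```

The comparison uses the absolute value of the difference, so Houston's improvement is flagged just as the four deteriorations are.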

Optional technicalities
• Because the initial ANOVA showed no difference by either year or league, I also did a simpler, SPC-type analysis using only the individual year-to-year ranges of each team (i.e., the absolute value of the individual teams’ 2014–2015 difference).
• I used both the median and the average of these ranges to screen out outliers, repeating until the two pretty much concurred. The conclusions matched those of the more formal ANOVA, both in terms of outlier criteria (difference > 1.1) and approximate standard deviation (~0.29).
• I also used a nonparametric box plot analysis of the actual individual team year-to-year differences (not absolute value), and it determined that an outlying range was greater than ~1.1–1.2.
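The range-based screening in the second bullet can be sketched roughly as follows. This version uses only the average range, with the standard SPC constants for ranges of two values (3.268 for the upper range limit and 1.128 to convert an average range to a standard deviation); the median-based variant and the exact stopping rule are not spelled out in the text, so this is one plausible reading rather than the procedure actually used. The ranges in the example are hypothetical, since the full 30-team data are not reproduced here.

```python
from statistics import mean

D4_FACTOR = 3.268  # upper range limit = 3.268 x average range (ranges of two values)
D2_FACTOR = 1.128  # sigma estimate = average range / 1.128 (ranges of two values)


def screen_ranges(ranges):
    """Drop ranges above the upper range limit repeatedly until none remain."""
    kept = list(ranges)
    outliers = []
    while True:
        limit = D4_FACTOR * mean(kept)
        flagged = [r for r in kept if r > limit]
        if not flagged:
            break
        outliers.extend(flagged)
        kept = [r for r in kept if r <= limit]
    return {"limit": limit, "sigma": mean(kept) / D2_FACTOR, "outliers": outliers}


# Hypothetical absolute year-to-year differences (NOT the real team data)
example_ranges = [0.2, 0.3, 0.4, 0.1, 0.5, 0.3, 2.0]
result = screen_ranges(example_ranges)
print(result)
```

With these constants, an upper range limit near 1.1 corresponds to an average range of about 0.34 and a standard deviation of about 0.30, which is in line with the ~0.29 reported above.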

Bottom line
Three different techniques concluded that a difference greater than ~1.1 was significant, and a good-enough estimate of the standard deviation is 0.29.

This standard deviation of 0.29 was then used for the ANOM comparing the 2015 ERAs of the 30 teams (figure 6):

Figure 6: ANOM comparing 2015 ERAs of bullpens

• Special cause “high” ERAs: Atlanta (team No. 2), Colorado (team No. 9), Oakland (team No. 20)
• Special cause “low” ERAs: Kansas City (team No. 12), Pittsburgh (team No. 22), St. Louis (team No. 26)
• Boston (4.24): Above the average of 3.71, but not a special cause. The team was no different from the other 23 teams between the limits (or 3.71, for that matter).
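A rough sketch of how such a comparison can be coded is shown below. The average (3.71) and standard deviation (0.29) are from the text, but the decision limits behind figure 6 are not stated, so the three-standard-deviation multiplier here is an assumption; a formal ANOM would use its own critical value. Only the six 2015 ERAs quoted in this article are included, and the names are hypothetical.

```python
MEAN_ERA = 3.71    # 2015 average bullpen ERA quoted in the article
SIGMA = 0.29       # common cause standard deviation estimated in the article
MULTIPLIER = 3.0   # ASSUMPTION: three-standard-deviation limits


def classify_era(era, mean=MEAN_ERA, sigma=SIGMA, multiplier=MULTIPLIER):
    """Label a team ERA relative to limits around the overall average."""
    half_width = multiplier * sigma
    if era > mean + half_width:
        return "special cause: high"
    if era < mean - half_width:
        return "special cause: low"
    return "common cause"


# 2015 bullpen ERAs that appear in the article
known_2015 = {
    "Boston": 4.24,
    "Atlanta": 4.69,
    "Oakland": 4.63,
    "Houston": 3.27,
    "San Diego": 4.02,
    "Seattle": 4.15,
}

labels = {team: classify_era(value) for team, value in known_2015.items()}
for team, label in labels.items():
    print(f"{team:10s} {known_2015[team]:.2f}  {label}")
```

Under this assumption the limits are roughly 2.84 and 4.58, which flags Atlanta and Oakland as high and leaves Boston inside, consistent with the bullets above for the teams whose ERAs are quoted.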

Bottom line from both analyses

Actually, the Globe sportswriter came to the right conclusion in terms of the odds of “bridging that divide of 0.53” being excellent, but for the wrong reasons. Based on the 2014–2015 data analysis and its calculated common cause standard deviation of 0.29:
• There was no divide. Boston’s 4.24 was statistically indistinguishable from 3.71.
• Boston could easily go from 4.24 to as low as 3.14 (a difference of 1.1) just due to common cause, which wouldn’t necessarily indicate improvement.
• But he’s right that the odds are indeed excellent... just due to chance.

He then states, “The hallmark of bullpens is their inconsistency.” Once again, true, but for the wrong reasons. I hope I’ve shown that variation can be routinely “consistently inconsistent” within a predictable, but humanly unacceptable, range. Rather than accept this, the sportswriter then went on a fishing expedition into bullpen stats to “explain” his position further. More about that tomorrow in part two, as well as a surprising conclusion from yet another, totally different ANOM method analyzing the 2015 ERAs.


About The Author


Davis Balestracci

Davis Balestracci is a past chair of ASQ’s statistics division. He has synthesized W. Edwards Deming’s philosophy as Deming intended—as an approach to leadership—in the second edition of Data Sanity (Medical Group Management Association, 2015), with a foreword by Donald Berwick, M.D. Shipped free or as an ebook, Data Sanity offers a new way of thinking using a common organizational language based in process and understanding variation (data sanity), applied to everyday data and management. It also integrates Balestracci’s 20 years of studying organizational psychology into an “improvement as built in” approach as opposed to most current “quality as bolt-on” programs. Balestracci would love to wake up your conferences with his dynamic style and entertaining insights into the places where process, statistics, organizational culture, and quality meet.


Thanks for your insights

Mr. Balestracci,

As a "stat geek", I have baseball to thank for my interest in statistical analysis. I also subscribed to SABR for a time and still keep up with them. I dislike listening to baseball broadcasters, former MLB players and managers, and so-called analysts who spout useless trivia to "bolster" their cases but who don't know the meaning of "statistical significance", as in your example.

In fact, I'd love to see you present your insights on the "MLB Network". Think about it.

Best wishes.