The quote “Baseball been berry, berry good to me” comes from one of my favorite Saturday Night Live skits from the late 1970s. Garrett Morris played Chico Escuela, a retired Hispanic baseball player who knew very little English. His pat answer for most questions—“Baseball been berry, berry good to me”—became embedded in the American lexicon. Baseball is indeed “berry, berry good” to anyone who loves statistics. For more than 100 years, virtually everything that has ever happened in professional baseball has been recorded, collated, analyzed, tortured, and twisted into the most bizarre statistical oddities you might imagine. If you wish to know who holds the record for home runs on his birthday while batting left-handed on the road, someone has probably already determined the answer.
Binary logistic regression (BLR) is generally preferred over ordinary linear regression when the dependent variable takes on only two values (usually zero and 1). For example, suppose we wish to model the relationship between the number of lifetime home runs a Major League player hit and his Hall of Fame status, considering only players known for hitting and excluding pitchers, umpires, coaches, and other members of the baseball Hall of Fame not known for hitting.
Currently, there are 90 players who have hit 300 or more home runs and are eligible to be elected to the Hall of Fame. We can plot the data with lifetime home runs on the x-axis and Hall of Fame status on the y-axis: 1 if the player is in the Hall of Fame, zero if he is not. Ordinary linear regression yields the straight line shown in figure 1; binary logistic regression yields the S-shaped curve in figure 1.
From figure 1, one can quickly see that ordinary linear regression yields a nonsensical result because the probability of being elected to the Hall of Fame can be less than zero at about 250 home runs and greater than unity at about 625 home runs. In contrast, the binary logistic regression curve approaches zero and unity asymptotically, yielding a much more realistic model.
Figure 1: Pr Hall of Fame (HOF) vs. lifetime home runs (HR)—linear and binary logistic regressions
The general form of the logistic regression equation is:
ln (P / (1 - P)) = β0 + β1 X1 + β2 X2 + … + βn Xn
where P = probability of being in the Hall of Fame, and the Xs are the independent variables.
There can be any number of independent variables. This example has only one X: the number of lifetime home runs. Although you might recognize the right-hand side of this equation from linear regression models you have seen before, the left-hand side is not the standard Y, but the natural log of the odds, P / (1 - P), known as the logit function.
The above formula can be rearranged to give:
P = exp(β0 + β1 X1 + β2 X2 + …+ βn Xn) / (1 + exp(β0 + β1 X1 + β2 X2 + … + βn Xn))
where exp is the exponential function.
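In code, the two forms are simply inverses of each other. A minimal sketch in Python (function names are mine, for illustration):

```python
import math

def logistic(z):
    """Map a linear predictor z = b0 + b1*x1 + ... + bn*xn to a probability P."""
    return math.exp(z) / (1 + math.exp(z))

def logit(p):
    """Inverse of the logistic: the natural log of the odds P / (1 - P)."""
    return math.log(p / (1 - p))

# Round trip: logit(logistic(z)) recovers z
print(logit(logistic(1.5)))
```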
For our example of home runs and Hall of Fame status, the logistic regression equation is:
P(HOF) = exp(0.0143*HR - 5.95) / (1 + exp(0.0143*HR - 5.95))
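Plugging the fitted coefficients into this equation is straightforward. A quick sketch (the coefficients come from the equation above; everything else is illustrative):

```python
import math

def p_hof(hr):
    """P(Hall of Fame) from the single-predictor equation in the text."""
    z = 0.0143 * hr - 5.95
    return math.exp(z) / (1 + math.exp(z))

# The curve crosses 50% where z = 0, i.e., at 5.95 / 0.0143, about 416 home runs.
for hr in (300, 416, 500, 714):
    print(hr, round(p_hof(hr), 3))
```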
When there is only one independent variable, as is the case here, it is easy enough to plot P vs. home runs as shown by the S-shaped curve in figure 1. Because we chose zero and 1 as the binary response, the result gives us a proportion or probability.
How good is the correlation? For the ordinary linear regression, the r-square value is 0.224, which is not very good. With binary logistic regression, a statistic called McFadden's pseudo-r square is calculated instead. Values below 0.20 are considered poor, values between 0.20 and 0.40 fair, and values above 0.40 good. In our example in figure 1, the pseudo-r square is 0.197.
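McFadden's pseudo-r square compares the maximized log-likelihood of the fitted model with that of an intercept-only (null) model: pseudo-r square = 1 - LL(model) / LL(null). A minimal sketch with made-up probabilities (not the article's data):

```python
import math

def log_likelihood(probs, outcomes):
    """Bernoulli log-likelihood of 0/1 outcomes given predicted probabilities."""
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for p, y in zip(probs, outcomes))

def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo-r square: 1 - LL(model) / LL(null)."""
    return 1 - ll_model / ll_null

# Hypothetical example: four players, two in the Hall of Fame
outcomes = [0, 0, 1, 1]
null_p = sum(outcomes) / len(outcomes)   # the null model predicts the base rate for everyone
ll_null = log_likelihood([null_p] * len(outcomes), outcomes)
ll_model = log_likelihood([0.1, 0.2, 0.8, 0.9], outcomes)
print(mcfadden_r2(ll_model, ll_null))
```

A model no better than the base rate scores zero; a model that predicts every outcome perfectly approaches one.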
Perhaps another independent variable can be added to the analysis to improve the model. Batting average (BA) comes to mind as another major lifetime statistic. When this statistic is added to the model, the pseudo-r square is 0.394, bordering on a good correlation.
The logistic regression equation is:
P(HOF) = exp(0.0179*HR + 0.065*BA - 25.86) / (1 + exp(0.0179*HR + 0.065*BA - 25.86))
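A sketch of the two-predictor model. Note that the magnitude of the BA coefficient suggests batting average is entered in points (e.g., 342 for a .342 average); that is my assumption, not something stated in the text:

```python
import math

def p_hof2(hr, ba_points):
    """P(Hall of Fame) from the two-predictor equation in the text.
    ba_points: batting average in points, e.g., 342 for .342 (an assumption
    inferred from the size of the coefficient)."""
    z = 0.0179 * hr + 0.065 * ba_points - 25.86
    return math.exp(z) / (1 + math.exp(z))

# Babe Ruth: 714 home runs, .342 lifetime batting average
print(p_hof2(714, 342))
```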
Figure 2 shows the relationship between the linear combination of home runs and batting average and the probability of Hall of Fame membership.
Figure 2: Logistic regression of Pr Hall of Fame (HOF) vs. lifetime home runs (HR) and batting average (BA)
Now that we have two independent variables in the model, we can plot the data as seen in figure 3 to see the interaction between home runs and batting average in terms of being in the Hall of Fame. If your batting average is less than 0.225, it will be very difficult to be enshrined unless you have a massive number of round-trippers. On the other hand, you don’t need many dingers to be in the Hall of Fame if your batting average is 0.325 or better. Babe Ruth’s lifetime batting average, coupled with 714 home runs, virtually guaranteed his induction.
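The trade-off in figure 3 can be made concrete by solving the two-predictor equation for the home-run total at which P(HOF) = 0.5, i.e., where the linear predictor equals zero. Again, this assumes batting average is entered in points:

```python
def hr_for_even_odds(ba_points):
    """Home runs needed for P(HOF) = 0.5 in the two-predictor model,
    found by setting 0.0179*HR + 0.065*BA - 25.86 = 0.
    ba_points: batting average in points, e.g., 225 for .225 (assumed)."""
    return (25.86 - 0.065 * ba_points) / 0.0179

# Each additional point of batting average lowers the home-run hurdle
for ba in (225, 275, 325):
    print(ba, round(hr_for_even_odds(ba)))
```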
Figure 3: Lifetime home runs (HR) and batting average (BA) vs. P Hall of Fame (HOF)
Another important batting statistic is, of course, runs batted in (RBIs). Any batter who leads his league in home runs, batting average, and RBIs in a single season wins the coveted "Triple Crown." This feat has not been accomplished since Carl Yastrzemski of the Boston Red Sox did it in 1967. When RBIs are added to the binary logistic regression, some interesting things happen to the logistic model. One of the other two independent variables loses statistical significance, and the pseudo-r square becomes much better than fair! I leave it to the reader to work out this interesting result.
There are, of course, some flaws in using current baseball statistics to predict Hall of Fame status. Excluded from the data are the players who have more than 300 home runs, but are not yet eligible. Plus, there are eligible players who will be inducted sometime in the future but in the meantime are assigned a value of zero for Hall of Fame status. In addition, the data set excluded any player with fewer than 300 lifetime home runs. Despite these and other possible flaws in the data set, these baseball statistics serve well to help us understand logistic regression techniques.
Besides baseball, binary (as well as ordinal and binomial) logistic regression is useful in real life. Binary logistic regression is often used in the medical profession, for example, to determine the relative risk factors for lung cancer (i.e., age, sex, smoking, lifestyle). Weather forecasts often rely on logistic regression: "a 40-percent chance of rain" means that, given the prevailing conditions, the logistic model predicts a 40-percent likelihood of rain. In the world of manufacturing, binary logistic regression can help identify the conditions that cause quality flaws that are either present or absent and hard to quantify. For example, an automobile manufacturer may need to determine what conditions cause paint to peel or not peel after curing.