Saturday, November 27, 2010

Improving a team's Pitching

I have already written two posts on the best way of improving a team, and improving a team's hitting. In this post, I want to do much of the same as the hitting post, but this time on pitching statistics. I am again going to run a linear regression model to determine which statistics are best correlated with pitching performance, which will show us which statistics can be best used to improve pitching.

In this model, instead of trying to estimate runs scored, I am going to use ERA as the dependent variable. Using runs against is a possibility, but since we are estimating the effect of statistics on pitching, and not pitching and defense, using runs against would include the effect of defense, so it is not an appropriate DV in this scenario. We again need to be careful in our selection of independent variables so as to avoid collinearity.

Pitching statistics are almost the opposite of hitting statistics. Good hitters are generally grouped into two categories: those that can get on base, and those that can hit for power. Good pitchers are those who do not allow very many baserunners and do not allow many home runs. We can measure these qualities with two statistics. The first is walks plus hits per inning pitched (WHIP), which measures the average number of baserunners a pitcher allows per inning. The second is home runs allowed, which does not encompass all extra base hits, but should separate pitchers who do and do not allow many home runs and will hopefully be a decent proxy for all extra base hits. Finally, I am also going to include strikeouts as a predictor, because pitchers with high strikeout rates are valuable, and a pitcher with more strikeouts may allow fewer runs because he has to rely less on his defense. When we run the regression, we get the following:

                  Estimate        Std. Error     t value    Pr(>|t|)
Intercept    3.2986958     0.4207396    7.840      6.45e-14 ***
WHIP        0.4431320     0.2483891    1.784      0.0753 . 
SO            -0.0014423    0.0001842     -7.829    6.97e-14 ***
HR            0.0117201     0.0008172     14.342   < 2e-16 ***
R2 = 0.551

As you can see from the R2 value, this regression explains a lot less variability than the hitting regression. However, if we replace WHIP with the number of hits and walks given up, we get a much better fit:

                  Estimate        Std. Error     t value    Pr(>|t|)  
Intercept    -3.410e+00   2.855e-01    -11.942    <2e-16 ***
Hits           3.815e-03      1.547e-04    24.661     <2e-16 ***
BB            2.463e-03      1.414e-04    17.424     <2e-16 ***
SO            -4.711e-05     1.063e-04    -0.443      0.658  
HR            5.298e-03      4.567e-04    11.601     <2e-16 ***
R2 = 0.8906

Now, the R2 value is almost as high as in the hitting regressions. All of the variables are significant except for strikeouts, so when we take strikeouts out of the regression we get the following:

                  Estimate        Std. Error    t value    Pr(>|t|)  
Intercept    -3.5115171   0.1703878   -20.61     <2e-16 ***
Hits           0.0038502     0.0001321    29.14     <2e-16 ***
BB            0.0024598     0.0001410    17.45     <2e-16 ***
HR            0.0053015     0.0004561    11.62     <2e-16 ***
R2 = 0.8905

We can see how insignificant strikeouts were in the regression, because when we remove them the R2 value decreases by only 0.0001 (0.01 percentage points). We can now determine which variables impact pitching the most. One more hit given up is associated with a 0.00385 increase in ERA, one more walk given up is associated with a 0.00246 increase in ERA, and one more home run given up is associated with a 0.00530 increase in ERA. Since there are vastly different numbers of hits, walks, and home runs given up, we must also look at the mean of each to determine which will most affect ERA. The mean number of hits given up by a team in a single season is 1469.9, the mean walks is 540.2, and the mean home runs is 172.0. If we multiply the means by the coefficients, we get that, on average, hits will increase team ERA by 5.66, walks will increase ERA by 1.33, and home runs will increase ERA by 0.91. Obviously, we are only looking at statistics that will negatively impact (increase) ERA, so the numbers will look very high, as we are not inputting statistics such as outs or double plays that will positively impact (lower) ERA.
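The mean-times-coefficient arithmetic can be checked with a few lines of Python. This is only a sketch: the coefficients, intercept, and season means are copied from the final regression in this post, while the variable names are made up for illustration.

```python
# Coefficients and intercept from the final regression (ERA on hits, walks,
# and home runs) and the mean season totals per team, as quoted in the post.
intercept = -3.5115171
coefficients = {"Hits": 0.0038502, "BB": 0.0024598, "HR": 0.0053015}
season_means = {"Hits": 1469.9, "BB": 540.2, "HR": 172.0}

# Average contribution of each statistic to team ERA: coefficient * mean.
contributions = {
    stat: coefficients[stat] * season_means[stat] for stat in coefficients
}

for stat, value in contributions.items():
    print(f"{stat}: {value:.2f}")

# Adding the (negative) intercept back gives the fitted ERA for a team
# that is exactly average in all three statistics.
era_at_means = intercept + sum(contributions.values())
print(f"Fitted ERA at the means: {era_at_means:.2f}")
```

The three contributions sum to far more than any real ERA because the negative intercept offsets them; the fitted value for an average team comes out to roughly 4.39.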

So from the results we can easily see that hits are the statistic that most impacts a team's ERA. The obvious solution for a team would be to give up fewer hits, but how? One way would be to acquire pitchers with greater command, possibly allowing those pitchers to "nibble" more and make hitters swing at worse pitches. This would probably increase walks, and we already saw that walks are also bad for ERA. A better solution would be to acquire pitchers with a low batting average against and also a low batting average on balls in play (BABIP - although it has been shown that BABIP fluctuates year to year and may not be consistent for any pitcher). Pitchers also want to give up fewer home runs, but if they can reduce the number of hits against them this should in turn reduce the number of home runs against them.

Thursday, November 25, 2010

Improving a team's Hitting

As a follow up to my last post, which shows how improving a team through hitting or pitching is equally valuable, I wanted to look at how a team should improve their hitting. There are different ways to score runs, and because higher run totals lead to higher win totals, I want to figure out what is the best way to improve a team's hitting performance, thus leading to more runs and accordingly more wins.

Similar to last post, I am going to run a linear regression model, except this time I am going to use "Runs scored" as the dependent variable. Why not simply use "Wins"? If we were to run a regression model with hitting statistics as predictors and Wins as the dependent variable, we would have a much higher standard error, which means that the R2 value would be much lower, showing that the variability in wins is not explained very well by the hitting statistics. So if we have runs as the DV, we need appropriate hitting statistics for the independent variables. This is trickier to figure out than expected, as we cannot have statistics that are strongly correlated with each other, or the regression model will experience "multicollinearity". What this means is that although the overall regression will predict the dependent variable nicely, we will not be able to tell which independent variables are accounting for the variability in the dependent variable. Although this may sound hard to prevent, it can be fairly straightforward, as a quick example will show. If we were to use on-base percentage, slugging percentage, and on-base plus slugging percentage as predictors for runs (or wins), our equation would have a multicollinearity problem. The overall regression may result in a low p-value, showing that we have predicted runs well, but each statistic individually would have a high p-value. We would not be able to tell which statistic is heavily influencing runs, because OPS is a redundant variable: it measures only what OBP and SLG are already measuring, so the best course of action is to remove it from the equation.

This regression demonstrates the collinearity issue:
                   Estimate    Std. Error     t value     Pr(>|t|)  
Intercept     -5.8651      0.2167         -27.070    <2e-16 ***
OBP            22.4591     17.5672       1.278       0.202   
SLG            15.0106     17.5936       0.853       0.394   
OPS            -4.2768      17.5879      -0.243       0.808

R2 = 0.9089
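To see why OPS causes trouble, it helps to remember that least squares has to invert the matrix X'X built from the predictor columns, and that matrix becomes singular when one column is an exact sum of two others. The sketch below uses made-up OBP and SLG values (not the data behind the regression above) and leaves out the intercept column to keep the determinant small enough to write by hand.

```python
# Hypothetical team-season values, invented purely for illustration.
obp = [0.320, 0.335, 0.341, 0.352, 0.328]
slg = [0.401, 0.432, 0.415, 0.455, 0.390]
ops = [o + s for o, s in zip(obp, slg)]  # OPS is defined as OBP + SLG

columns = [obp, slg, ops]


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


# Gram matrix X'X of the three predictor columns.
gram = [[dot(u, v) for v in columns] for u in columns]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


determinant = det3(gram)
print(abs(determinant) < 1e-9)  # True: X'X is singular up to rounding error
```

In the regression output above the estimates were still produced, presumably because the published OPS values are rounded and so are only almost equal to OBP + SLG; the near-singularity shows up instead as the very large standard errors.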

In short, we are going to need to carefully pick our independent variables so they do not experience collinearity. I am going to use the following statistics to try and predict runs: OBP, SLG, and stolen base %. Although there are many different statistics to use, I am using these three because they represent the three basic ways to improve your team's hitting: get on base more, hit for more power, or become a more successful team running the bases. When we run the regression we get the following:

                   Estimate    Std. Error     t value    Pr(>|t|)  
Intercept     -6.0689     0.2246          -27.022   < 2e-16 ***
OBP           17.9231     0.9630         18.612     < 2e-16 ***
SLG           10.7619     0.4947         21.756     < 2e-16 ***
SBperc       0.3956       0.1387         2.852      0.00463 **
R2 = 0.9111


So these three factors explain over 91% of the variability in Runs per Game. Although the R2 value is only slightly higher than the first regression that involved OPS, we can see that all of the statistics are now significant, as opposed to none of the statistics being significant. All three factors have significant effects on runs per game, and OBP has the largest coefficient. A ten-point increase in OBP (e.g. from .350 to .360) is associated with a 0.179 increase in runs per game. A ten-point increase in SLG (e.g. from .450 to .460) is associated with a 0.108 increase in runs per game. Finally, a one percentage point increase in SB% (e.g. from 70% to 71%) is associated with a 0.00396 increase in runs per game.
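As a rough check on those effect sizes, the fitted equation can be evaluated directly. The coefficients below are copied from the regression table; the function name and the sample team are hypothetical, and the sketch assumes SB% entered the model as a fraction (0.70 rather than 70), which is what the 0.00396 figure implies.

```python
def predicted_runs_per_game(obp, slg, sb_pct):
    """Fitted runs per game from the OBP / SLG / SB% regression.

    sb_pct is assumed to be a fraction, e.g. 0.70 for a 70% success rate.
    """
    return -6.0689 + 17.9231 * obp + 10.7619 * slg + 0.3956 * sb_pct


# A made-up team used only as a baseline for the comparisons.
base = predicted_runs_per_game(0.330, 0.420, 0.70)

# Change one input at a time and look at the change in fitted runs per game.
delta_obp = predicted_runs_per_game(0.340, 0.420, 0.70) - base  # +10 points of OBP
delta_slg = predicted_runs_per_game(0.330, 0.430, 0.70) - base  # +10 points of SLG
delta_sb = predicted_runs_per_game(0.330, 0.420, 0.71) - base   # +1 point of SB%

print(f"baseline: {base:.2f} runs per game")
print(f"OBP +.010: {delta_obp:+.3f}")
print(f"SLG +.010: {delta_slg:+.3f}")
print(f"SB% +1 point: {delta_sb:+.4f}")
```

Over a 162-game season, the OBP change alone would be worth about 29 extra runs under this model.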

So what does this mean? The best way to increase your team's hitting is to try and score more runs per game. The best way to score more runs per game is to increase OBP. So the best way to increase a team's hitting is to acquire players that will get on base more often, whether it be through a hit, a walk, or a hit-by-pitch. Acquiring players that hit for power will also positively impact a team's hitting, but not as much as players that get on base. So if a team had a limited budget and could only acquire one or two significant players, they should try and acquire those players that can most improve their team's OBP.

Tuesday, November 23, 2010

Improving a Team - Pitching or Hitting?

After the close of the baseball season in early November, teams look to build for next year through trades, free agency, and the draft (which is more for 3-5 years into the future). But how do you build a better team, and more specifically, what will allow you to have a better team? The old question of pitching vs. hitting is always addressed differently by different teams. Last year, the Giants used spectacular pitching with timely hitting to win the World Series, but just the year before the Yankees used a powerful lineup to bulldoze their way to a World Series win. So which is preferable - scoring more runs, or preventing more runs?

In order to answer that question, I looked at every team's statistics in the last 11 years (2000-2010), and gave each team a value for "Playoffs". A 1 meant that the team made the playoffs, a 0 meant the team did not make the playoffs. To estimate hitting, I used the statistic "runs per game", and to estimate pitching I used the statistic "runs against per game" (this really represents overall defense, including both fielding and pitching - to truly isolate pitching a more appropriate statistic would be something like ERA). I then ran a linear regression model, with RpG and RApG as predictors of the binary "Playoffs" statistic. The table below shows the results:

Coefficients:
                  Estimate       Std. Error      t value      Pr(>|t|)   
Intercept    0.28029       0.24027         1.167        0.244   
RpG           0.41595       0.03884         10.709      <2e-16 ***
RApG        -0.41884      0.03694        -11.339      <2e-16 ***
***: significant at the 0.001 level

So the regression line is: Playoffs = 0.28029 + 0.41595*RpG - 0.41884*RApG. Strictly speaking, the intercept is the predicted value for a team that scores and allows zero runs per game, but because the two slope coefficients are almost exactly equal and opposite, any team that scores as many runs as it allows ends up with a predicted probability close to the intercept, roughly 27%. We know that this is close to being correct, as there are 30 teams competing for 8 playoff spots, so the probability of any team making the playoffs, given that all teams are equal, is 0.2667. The coefficient of runs per game shows that a one-run increase in RpG is associated with a 41.6 percentage point higher probability of a team making the playoffs. Runs against per game is very similar except that the relationship is inverse: a one-run increase in RApG is associated with a 41.9 percentage point lower probability of a team making the playoffs. As an example, if a team scores 4.50 runs per game and allows 4.50 runs per game, the probability of the team making the playoffs is 0.2673. If they increase their runs per game to 5.50 (an increase of exactly 1), the probability of the team making the playoffs will increase by 0.416 to 0.6832. If they then increase their runs against per game to 5.50 (again an increase of 1), their probability of making the playoffs will decrease to 0.2644 (a decrease of 0.419).
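The worked example can be reproduced with a small function. The coefficients come straight from the table above; the function name is made up for illustration.

```python
def playoff_probability(rpg, rapg):
    """Fitted playoff probability from runs per game and runs against per game."""
    return 0.28029 + 0.41595 * rpg - 0.41884 * rapg


even_team = playoff_probability(4.50, 4.50)
better_offense = playoff_probability(5.50, 4.50)
back_to_even = playoff_probability(5.50, 5.50)

print(f"{even_team:.4f}")       # 0.2673
print(f"{better_offense:.4f}")  # 0.6832
print(f"{back_to_even:.4f}")    # 0.2644
```

Because the model is a straight line fitted to a 0/1 outcome, extreme inputs can produce values below 0 or above 1 (for example, 6.00 runs scored and 4.00 allowed), so the output is best read as a rough estimate rather than a true probability.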

What this all means is that scoring runs and preventing runs have a very similar impact on a team's success (success defined by a team making the playoffs). Runs against is very slightly more important, but the difference is most likely negligible. Obviously this was a quick study, and only based on a small sample, but we can see that teams should be more concerned with overall talent of acquisitions rather than worry about acquiring only players that will help their hitting or pitching.

Saturday, November 6, 2010

World Series MVP

(Quick note, I typed out this post in entirety before Blogger deleted the whole thing even though it is supposed to save, so this version is unfortunately a little shorter!)

Last Monday night, the San Francisco Giants won their first World Series since 1954, when Willie Mays and the New York Giants won. Ironically, there has never been a World Series won by the Giants with an MVP, as the MVP award did not begin until 1955, the year after the Giants last won. Unlike previous Giants teams, with superstars such as Mays, Willie McCovey, and Barry Bonds, this team was led by great pitching and managed to get just enough timely hits to squeak into the postseason on the last day. In this post, I want to determine who should have been the MVP of this team, and whether or not the voters picked the right player (Edgar Renteria ended up winning the award).

We need to ask two questions before we decide who the MVP should have been. First, what exactly is the MVP? Obviously, it is the "most valuable player", but what exactly does that mean? Does it mean the best overall player? Obviously not, as only one player in World Series history has won the WS MVP while playing for the losing team (Bobby Richardson in 1960, when the Yankees outscored the Pirates 55-27 but still managed to lose in seven games). I believe that the MVP is the player that best gives his team the chance to win each game, and as a result the series. The question now becomes: how do you measure exactly how "valuable" a player is to his team? I am going to use the Win Probability Added statistic, which is the sum of the changes in the probability of a player's team winning. More simply, WPA "looks at" each play and determines the team's probability of winning before and after the play occurs, and the difference is credited to or debited from the players involved. For example, if a team was winning in the ninth inning, their probability of winning would be fairly high, say 80%. If a player on the opposing team then hit a home run to tie the game, and the probability of the first team winning fell to say 55%, then the player who hit the home run would have a +0.25 WPA for that play, and the pitcher who gave up the home run would have a WPA of -0.25 for the play. Each team starts at a 50% chance of winning, and one team ends with a 100% chance of winning, so WPA measures exactly how much each player individually contributed to winning the game. It is heavily dependent on the leverage of the situation, as obviously a go-ahead home run in the ninth inning gives the team a better chance to win than a go-ahead home run in the first inning.
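The bookkeeping behind WPA can be sketched in a few lines. This is a simplified illustration, not how any particular stats site computes it: the function name and player labels are hypothetical, and it assumes every play is split between exactly one batter and one pitcher.

```python
from collections import defaultdict


def credit_play(totals, batter, pitcher, wp_before, wp_after):
    """Credit one play to the batter and debit it from the pitcher.

    wp_before and wp_after are the batting team's win probability
    immediately before and after the play.
    """
    change = wp_after - wp_before
    totals[batter] += change
    totals[pitcher] -= change


totals = defaultdict(float)

# The example from the text: the fielding team's win probability drops from
# 0.80 to 0.55, so the batting team's rises from 0.20 to 0.45.
credit_play(totals, "batter", "pitcher", 0.20, 0.45)

print(round(totals["batter"], 2), round(totals["pitcher"], 2))  # 0.25 -0.25
```

Calling `credit_play` for every play in a game, and then summing across games, gives each player's total WPA for a series.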

To determine which player should be the MVP, I summed the WPA for each player during each game in the World Series to determine the overall World Series WPA. I have created two tables below, for the position players and the pitchers, which rank the players in terms of overall WPA for the World Series, and also which games they appeared in.

Players:

Player               World Series WPA    Games Played In
Edgar Renteria        0.403              1, 2, 3, 4, 5
Aubrey Huff           0.147              1, 2, 3, 4, 5
Andres Torres         0.139              1, 2, 3, 4, 5
Cody Ross             0.108              1, 2, 3, 4, 5
Freddy Sanchez        0.094              1, 2, 3, 4, 5
Mike Fontenot         0.000              2
Juan Uribe           -0.046              1, 2, 3, 4, 5
Travis Ishikawa      -0.057              1, 4
Aaron Rowand         -0.058              2, 5
Pablo Sandoval       -0.081              3
Nate Schierholtz     -0.091              1, 2, 4
Buster Posey         -0.154              1, 2, 3, 4, 5
Pat Burrell          -0.423              1, 2, 3, 5

Pitchers:

Pitcher              World Series WPA    Games Pitched In
Matt Cain             0.495              2
Tim Lincecum          0.477              1, 5
Madison Bumgarner     0.477              1
Brian Wilson          0.131              1, 4, 5
Santiago Casilla      0.059              1
Javier Lopez          0.047              1, 2
Sergio Romo           0.021              1
Jeremy Affeldt        0.021              1, 3
Guillermo Mota        0.020              2, 3
Ramon Ramirez         0.009              1, 3
Jonathan Sanchez     -0.162              3

As you can see above, three of the four most valuable players on the Giants in the World Series happened to be pitchers. Edgar Renteria was the only significant hitter with a .403 WPA, Matt Cain had a .495 WPA, and Tim Lincecum and Madison Bumgarner both had a .477 WPA. The question now becomes, do you give the MVP vote to a pitcher such as Matt Cain, who pitched brilliantly and had the highest overall WPA, but only appeared in one game, or do you give it to Edgar Renteria, who had a lower WPA but played in every game? The voters decided to give it to Renteria, and I tend to agree with them (another issue would have been which pitcher to give it to? All three pitchers were very close in WPA, and I tend to think they would have given it to Lincecum as he pitched in two games and had the performance most fresh in the voters' minds). In this case, the voters' "gut instincts" actually agreed with the statistics.

One final note of interest is the Giants' overall 1.5 WPA (they won 4 games, each with a WPA of +0.5, and lost one game with a WPA of -0.5). Of that 1.5, approximately 1.6 WPA came from the pitchers, while the hitters actually had a negative impact on the probability of the Giants winning the World Series with a WPA of -0.1. So even though the Giants became the first team to score at least 20 runs in the first two World Series games, the pitching, just like all year, was the reason that they won the World Series. It is ironic, then, that a hitter still managed to win the MVP.