Thursday, July 14, 2011

Differentiating between Pitching Luck and Skill Part I

A few weeks ago, I did a couple of posts on differentiating between luck and skill for hitters. I now want to look at it from the other side: how to differentiate between luck and skill for pitchers. This time, instead of using batting average like I did for hitters, I am going to use ERA. Batting averages against pitchers have been shown to be wildly inconsistent, so a better dependent variable is the runs that a pitcher gives up, which are much more consistent and definitive over time. I don't want to look only at counting variables such as strikeouts, walks, and home runs, but also at batted ball statistics and detailed pitching statistics.

Just like last time, I want to introduce a bunch of variables and figure out which of them are important using Mallows' Cp and p-values. I tried running a stepwise regression on all of the variables together, but there were too many variables, so I ran two separate regressions and combined the results.
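For readers who have not met Mallows' Cp, here is a minimal sketch of the statistic in Python (the regressions in this post were run in R, so this is an illustration, not the code behind the results). The function name `mallows_cp` and every number in the example are hypothetical. The formula is the standard one: the candidate model's residual sum of squares divided by the full model's residual mean square, minus the number of observations, plus twice the number of fitted coefficients.

```python
def mallows_cp(sse_subset, mse_full, n_obs, n_params):
    """Mallows' Cp for a candidate regression model.

    sse_subset: residual sum of squares of the candidate model
    mse_full:   residual mean square of the full model (all candidate variables)
    n_obs:      number of observations (here, pitcher seasons)
    n_params:   fitted coefficients in the candidate model, counting the intercept
    """
    return sse_subset / mse_full - n_obs + 2 * n_params


# Hypothetical numbers, for illustration only (not taken from the post's data).
n_obs = 100
full_params = 10
full_sse = 900.0
mse_full = full_sse / (n_obs - full_params)

# For the full model, Cp always equals its own number of coefficients.
cp_full = mallows_cp(full_sse, mse_full, n_obs, full_params)

# A smaller model that fits almost as well scores lower, which is better.
cp_small = mallows_cp(950.0, mse_full, n_obs, 7)

print(cp_full, cp_small)
```

A candidate model is usually considered good when its Cp is small and close to its own number of coefficients, which is what a stepwise search is looking for.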

The first stepwise regression I ran is with counting and batted ball stats. The regression predicting ERA includes K/9, BB/9, HR/9, WHIP, GB/FB, LD%, GB%, FB%, and HR/FB. When we run the stepwise regression, we find that the best regression predicting ERA is ERA = K/9 + BB/9 + HR/9 + WHIP + GB/FB + FB%.

The second stepwise regression involves the "plate discipline" variables. These variables deal with things such as how often a batter swings and makes contact or the percentage of pitches in the strike zone. I collected nine of these variables from Fangraphs, divided into three categories. Swinging includes O-Swing %, the percentage of pitches outside the strike zone that a batter swings at; Z-Swing %, the percentage of pitches inside the strike zone that a batter swings at; and Swing %, the percentage of all pitches swung at. Contact includes O-Contact %, the percentage of swings at pitches outside the strike zone on which the batter makes contact; Z-Contact %, the same percentage for swings at pitches inside the strike zone; and Contact %, the percentage of all swings on which contact is made. Finally, accuracy includes Zone %, the percentage of pitches inside the strike zone; F-Strike %, the first-pitch strike percentage; and SwStr %, the percentage of all pitches that are swung at and missed.

When I ran a stepwise regression with all of these variables, the best regression output was ERA = Z-Swing % + Swing % + Z-Contact % + First Strike % + SwStr %.

Now that I have run two separate stepwise regressions, I can combine the results and run one more stepwise regression to make sure the model is the best it can be. When I do that, I find that Swing % is no longer needed. Another change I need to make concerns overlapping variables. Since WHIP includes walks in its calculation, WHIP and BB/9 partly measure the same thing, so I can't have both in the regression. Since WHIP also includes hits, which could be important in predicting ERA, I keep WHIP and remove BB/9.
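The overlap between WHIP and BB/9 can be made concrete with a small Python sketch. WHIP is walks plus hits per inning, and BB/9 is walks per nine innings, so WHIP is exactly (BB/9 + H/9) / 9. The season line below is hypothetical and the helper names are mine.

```python
def whip(walks, hits, innings):
    """Walks plus hits per inning pitched."""
    return (walks + hits) / innings


def per_nine(count, innings):
    """Rate of an event per nine innings."""
    return 9.0 * count / innings


# Hypothetical season line, for illustration only.
walks, hits, innings = 60, 180, 200.0

bb9 = per_nine(walks, innings)
h9 = per_nine(hits, innings)

whip_direct = whip(walks, hits, innings)
whip_from_rates = (bb9 + h9) / 9.0

# The two routes give the same WHIP, so BB/9 is already "inside" WHIP.
print(whip_direct, whip_from_rates)
```

Because every walk moves both numbers, a regression that includes both has trouble deciding which one deserves the credit, which is the reason for dropping one of them.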

The regression now looks like this: ERA = K/9 + HR/9 + WHIP + GB/FB + FB% + Z-Swing % + Z-Contact % + First Strike % + SwStr %. When I run a linear regression on this model, I find that the p-value for GB/FB rate is an astronomically high 0.864, so it is clearly not as important as I first thought. K/9 also has a very high p-value of 0.594, so that can also be taken out. We are left with seven variables that should, with relatively high confidence, predict a pitcher's ERA. The final regression model is as follows: ERA = WHIP + HR/9 + FB% + Z-Swing % + Z-Contact % + First Strike % + SwStr %. The output table from R is below.

Coefficients:
              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   -6.89402   0.92749       -7.433   2.83e-13  ***
WHIP           3.82090   0.12118       31.532   < 2e-16   ***
HRper9         0.95049   0.05343       17.791   < 2e-16   ***
Fbperc         0.49971   0.24764        2.018   0.043950  *
Zswingperc     0.95022   0.42860        2.217   0.026914  *
Zcontactperc   3.55909   0.80202        4.438   1.04e-05  ***
Fstrikeperc    1.32946   0.39701        3.349   0.000851  ***
Swstrperc      2.44227   1.26266        1.934   0.053452  .

R-Squared = 0.8071

The R-squared value for the regression is actually quite good, showing that over 80% of the variation in ERA can be explained by the seven independent pitching variables.
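To see what the fitted model says about a given pitcher, the coefficients from the R output can be plugged into a small function. This is a Python sketch, not the R code used for the post. It assumes the percentage variables were entered as fractions (0.37 rather than 37), which is the scale on which the intercept gives realistic ERAs. The function name, argument names, and the sample stat line are all hypothetical.

```python
# Coefficients copied from the R output in the post.
COEFFICIENTS = {
    "intercept": -6.89402,
    "whip": 3.82090,
    "hr_per_9": 0.95049,
    "fb_pct": 0.49971,
    "z_swing_pct": 0.95022,
    "z_contact_pct": 3.55909,
    "f_strike_pct": 1.32946,
    "sw_str_pct": 2.44227,
}


def predict_era(whip, hr_per_9, fb_pct, z_swing_pct,
                z_contact_pct, f_strike_pct, sw_str_pct):
    """Predicted ERA from the seven-variable model.

    ASSUMPTION: all *_pct arguments are fractions between 0 and 1.
    """
    inputs = {
        "whip": whip,
        "hr_per_9": hr_per_9,
        "fb_pct": fb_pct,
        "z_swing_pct": z_swing_pct,
        "z_contact_pct": z_contact_pct,
        "f_strike_pct": f_strike_pct,
        "sw_str_pct": sw_str_pct,
    }
    era = COEFFICIENTS["intercept"]
    for name, value in inputs.items():
        era += COEFFICIENTS[name] * value
    return era


# Hypothetical, roughly average-looking stat line (not a real pitcher).
era_example = predict_era(
    whip=1.30, hr_per_9=1.0, fb_pct=0.37, z_swing_pct=0.65,
    z_contact_pct=0.88, f_strike_pct=0.58, sw_str_pct=0.085,
)
print(round(era_example, 2))  # roughly 3.94
```

An ordinary stat line landing near an ordinary ERA is a useful sanity check on the scale of the inputs.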

An increase of one in WHIP is associated with an increase of 3.821 runs in ERA. This one makes a lot of logical sense. Giving up one more baserunner per inning should really hurt your ERA. Since ERA is based on 9 innings, one extra baserunner per inning would increase the runs allowed per inning by 3.821 / 9 = 0.425 runs. This number makes a lot of sense both intuitively and statistically. Looking at the expected runs matrix from The Book, we can see the effect of one extra baserunner per inning. If you take the expected runs for a certain base/out situation, subtract the expected runs for the same situation with one less runner, and combine all of the possibilities, you get a good estimate of the effect of extra baserunners. For example, the expected runs for no outs and no runners is 0.555, and the expected runs for no outs and a runner on first is 0.953. The difference between those is 0.398. If we calculate all of the differences, we find that the expected increase in runs per inning is 0.4129 with one more baserunner per inning. Now this is not rigorous math, but it is a simple way of showing that the coefficient for WHIP makes a lot of sense.
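The arithmetic in that comparison is short enough to write out. The Python sketch below uses only numbers quoted above: the WHIP coefficient, the two run-expectancy values from The Book, and the 0.4129 estimate. The variable names are mine.

```python
WHIP_COEFFICIENT = 3.82090          # from the R output

# ERA is runs per nine innings, so divide by 9 to get runs per inning.
runs_per_inning = WHIP_COEFFICIENT / 9

# Run expectancy with no outs: bases empty vs. runner on first.
re_empty_no_outs = 0.555
re_first_no_outs = 0.953
example_difference = re_first_no_outs - re_empty_no_outs

# Run-expectancy estimate over all base/out states, as quoted in the post.
book_estimate = 0.4129
gap = runs_per_inning - book_estimate

print(round(runs_per_inning, 4), round(example_difference, 3), round(gap, 4))
```

The regression's per-inning figure and the run-expectancy estimate differ by only about a hundredth of a run.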

An increase of one HR/9 is associated with an increase of 0.9505 runs in ERA. Clearly, giving up more home runs is the fastest way to increase your ERA. However, this value does seem somewhat low. In January, I found that the true value of a home run hit in 2010 was 1.406 runs. So each home run allowed should cost about 1.41 runs, which means one more home run per nine innings should increase your ERA by about 1.41. Unearned runs play a part in decreasing that value, but they shouldn't decrease it by close to half a run.
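One possible reading of the gap, offered as an assumption and not something the regression output confirms: a home run is also a hit, so it raises WHIP as well as HR/9, and the model may be splitting the cost of a home run between the two coefficients. The Python sketch below adds the two pieces together for one extra home run per nine innings, holding everything else fixed.

```python
HR9_COEFFICIENT = 0.95049     # from the R output
WHIP_COEFFICIENT = 3.82090    # from the R output
LINEAR_WEIGHT_2010 = 1.406    # run value of a home run quoted in the post

# One extra home run per nine innings is one extra hit per nine innings,
# which raises WHIP (baserunners per inning) by 1/9.
whip_increase = 1.0 / 9.0

direct_effect = HR9_COEFFICIENT
indirect_effect = WHIP_COEFFICIENT * whip_increase
total_effect = direct_effect + indirect_effect

print(round(indirect_effect, 3), round(total_effect, 3))  # about 0.425 and 1.375
```

On this reading the model prices a home run at about 1.375 runs in total, which is much closer to the 1.406 figure than the HR/9 coefficient alone.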

An increase of one in a pitcher's fly ball percentage is associated with an increase of 0.5 runs in ERA. The percentage variables appear to have been entered as fractions (0.348 rather than 34.8), since that is the scale on which the fitted intercept produces realistic ERAs. "One" here is therefore the whole span from 0% to 100%, and a one-percentage-point increase is worth about 0.005 runs. As I found in my post on hitting luck vs. skill, a higher fly ball percentage leads to a lower batting average for hitters, which seems to contradict this result. However, fly balls are associated with a much higher slugging percentage than ground balls, and the chance of a fly ball becoming a home run is a great risk to ERA. As an example, Javier Vazquez had a great season in 2009, with a 2.87 ERA and a FB% of only 34.8%. When he moved to the Yankees in 2010, his fly ball rate jumped to 47% and his ERA blew up to 5.32. So far in 2011, he has a 48.1% FB% and an ERA of 5.23, pretty much in line with his 2010 stats. Although FB% is clearly not the only reason why his ERA jumped, it contributed, especially because his HR/FB rate jumped from 10.1% in 2009 to 14.0% last year.

A one-percentage-point increase in a pitcher's swing percentage in the strike zone is associated with an increase of about 0.0095 runs in ERA (the coefficient of 0.95 covers the full 0% to 100% range). If those extra swings were swings and misses, this would be a good thing, but it is possible that hitters are simply hacking more often because the pitches look much better to hit. Pitchers that are truly successful will be able to get outs by pitching to the corners and making the batter swing only at a good "pitcher's pitch". A pitcher constantly painting the corners will make the batter take more pitches as he looks for better pitches to hit, before being forced to swing with two strikes.

A one-percentage-point increase in a pitcher's contact percentage in the strike zone is associated with an increase of about 0.036 runs in ERA (a coefficient of 3.559 over the full range). Obviously, if hitters are hitting a higher percentage of pitches, then they are most likely seeing the ball better and hitting it more squarely. This would definitely lead to a higher ERA. Although the coefficient may seem very high, contact percentages are pretty consistent, so a big jump is rare and would lead to a much higher ERA.

A one-percentage-point increase in a pitcher's first strike percentage is associated with an increase of about 0.013 runs in ERA (a coefficient of 1.33 over the full range). This is really the first debatable result in the regression. One would think that throwing more first-pitch strikes would lead to a lower ERA, but that is not the case. One plausible explanation can be found in this table, which shows the hitting splits for all of MLB in 2010 on different counts. The slash stats for hitters on the first pitch of an at-bat are a robust .334/.340/.534. That is well above league average, so if a hitter hits a first pitch he is going to have more success overall. Throwing more first-pitch strikes leads to more hittable pitches and thus a higher ERA. However, throwing fewer first-pitch strikes leads to pitchers getting behind in the count, and when that happens hitters hit .302/.473/.498. Although BA and SLG are lower, the OBP is much higher (mostly due to walks), and it is the statistic that is most important in creating runs. This coefficient needs to be looked at in more depth, but right now the regression says that more first-pitch strikes lead to a higher ERA, so we are going to take that as a given.

A one-percentage-point increase in a pitcher's swinging strike percentage is associated with an increase of about 0.024 runs in ERA (a coefficient of 2.442 over the full range), although with a p-value of 0.053 this is the weakest of the seven coefficients. The explanation is almost identical to the one for swing percentage in the strike zone. The more strikes a batter swings at, the fewer strikes he is simply taking. Strikes that aren't swung at have no negative consequences (other than maybe stolen bases) because the ball has no chance of being put in play, so pitchers should want the swinging strike percentage to be lower, because it will lead to a lower ERA. However, having a lower swinging strike percentage does not necessarily lead to a lower ERA. A pitcher must have great control in order to take advantage of a hitter.
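To get a feel for the size of the five percentage coefficients, it helps to express them per percentage point. The Python sketch below does that under one loud assumption: that the percentage variables went into the regression as fractions (0.348 rather than 34.8), which is the scale on which the fitted intercept yields realistic ERAs. If they were instead entered as whole percentages, each coefficient would already be a per-point effect. The helper name `era_change` is mine.

```python
# Coefficients on the five percentage variables, copied from the R output.
PERCENT_COEFFICIENTS = {
    "FB%": 0.49971,
    "Z-Swing%": 0.95022,
    "Z-Contact%": 3.55909,
    "F-Strike%": 1.32946,
    "SwStr%": 2.44227,
}


def era_change(coefficient, percentage_points):
    """ERA change for a shift of `percentage_points`.

    ASSUMPTION: the model saw the rate as a fraction, so a shift of
    1 percentage point is a change of 0.01 in the input variable.
    """
    return coefficient * (percentage_points / 100.0)


for name, coefficient in PERCENT_COEFFICIENTS.items():
    print(f"{name}: {era_change(coefficient, 1.0):+.4f} runs per percentage point")

# Javier Vazquez's fly ball rate went from 34.8% (2009) to 47% (2010).
vazquez_fb_effect = era_change(PERCENT_COEFFICIENTS["FB%"], 47.0 - 34.8)
print(f"Vazquez FB% shift alone: {vazquez_fb_effect:+.3f} runs")
```

Under this assumption the fly ball shift by itself moves Vazquez's predicted ERA by well under a tenth of a run, which fits the point that FB% was not the only reason his ERA jumped.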

In my next post, I will explore different pitchers' luck and skill, just like I did for hitters. Now that the model has been defined, it will again show a pitcher's predicted ERA. The change from career ERA to predicted ERA will show the improvements the pitcher has made that season and will be defined as skill. The difference between predicted ERA and actual ERA will show the pitcher's luck. It will be interesting to look at certain examples of pitchers who are lucky or not, and whether they fit a certain stereotype. Maybe ground ball pitchers have, on the whole, a lower ERA than their fly ball counterparts. I will show examples of certain pitchers, and we will be able to figure out whether each is truly a good pitcher or has simply gotten lucky.
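As a preview of that split, here is a minimal Python sketch of the bookkeeping. The sign conventions are my assumption (positive skill means the model thinks the pitcher is better than his career norm, positive luck means his actual ERA is lower than the model expects), and the function name and the sample numbers are hypothetical.

```python
def skill_and_luck(career_era, predicted_era, actual_era):
    """Split a season's ERA change into a skill part and a luck part.

    ASSUMED sign conventions:
      skill > 0: predicted ERA is below career ERA (the pitcher improved)
      luck  > 0: actual ERA is below predicted ERA (the pitcher got lucky)
    """
    skill = career_era - predicted_era
    luck = predicted_era - actual_era
    return skill, luck


# Hypothetical pitcher, for illustration only.
career_era, predicted_era, actual_era = 4.20, 3.80, 3.40
skill, luck = skill_and_luck(career_era, predicted_era, actual_era)

# The two parts add up to the total gap between career and actual ERA.
print(round(skill, 2), round(luck, 2), round(skill + luck, 2))
```

With these conventions the two pieces always sum to career ERA minus actual ERA, so the model only decides how the total is divided between them.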
