Tuesday, June 14, 2011

Ichiro Without Speed

After finishing my posts on luck vs. skill, I decided to look at a couple more individual players who would have very interesting regression lines. The first one that came to mind was Ichiro. He came to MLB from Japan in 2001, and since then he has had 10 straight 200 hit seasons and never hit lower than .303. However, this year he is hitting .256, and only on pace for about 177 hits. There has been a lot of talk about what is wrong with him, including this article on Fangraphs which shows his BABIP on different types of hits.

Clearly Ichiro is an outlier in the regression model. We know that even before looking at his luck and skill, because his unique hitting approach allows him to get hits on balls that almost every other major leaguer would be out on. Since 2002 (the first year advanced batted ball data was available, unfortunately we cannot capture his rookie season), Ichiro has 407 infield hits, by far the most in MLB, and has the highest infield hit % (IFH/GB) in baseball. As such, although his batted ball data may not look all that impressive, he still manages to get a ton of hits.

However, although we expect him to be an outlier, his batting average graph is still very surprising.

Ichiro AVG, 2002-2011
Ichiro's career .328 batting average is much higher than any of his predicted averages, and until this year his actual averages were also always considerably higher. In his most incredible season, 2004, he had a predicted batting average of .263, which would mean his "skill" cost him 65 points relative to his career average, yet his luck accounted for 109 points, leaving him with an incredible batting average of .372. It's impossible for a hitter to hit .370 over an entire season by just getting lucky, yet that's exactly what this graph is showing. So obviously there is something going on.

The difficulty in predicting Ichiro's average relative to league average is easy to pinpoint. His speed gets him hits, and until this year that speed made up for whatever batted ball statistics he had in any given year. An interesting way to look at it is that if Ichiro had league-average speed, his batted ball statistics show him to be about a career .260-.270 hitter. In actuality he is a .328 career hitter, so that speed has increased his batting average by about 60 points. So what is happening this year? Has he lost a step, or maybe just gotten slightly unlucky?

There are a couple of ways we can determine if he has lost a step. He has stolen 16 bases so far this year and been caught only 4 times, which is not far off his 162-game averages of 39 steals and 9 CS. So at first glance, he does not seem to be any slower. But if we look further into the statistics, we can see an interesting trend. His infield hit % is down over 5% this year compared to last year and 2.5% off his career rate. This has already cost him about 17 hits this year; getting those hits back would bump his average up about 22 points to .278, still not .300 but much closer.

Is this decrease in IFH% due to luck or skill? Unfortunately, we can't exactly quantify the difference, but we do see that over his career, his IFH% has fluctuated between 9.5% in 2005 and 16% in 2009. That is a huge difference, and one major reason why he had the worst batting average of his career (.303) in 2005 and his second best (.352) in 2009. So maybe ground balls are just not quite finding the holes that they normally do. This is plausible, but we are getting into a large enough sample size (300 PAs) that the IFH% should start regressing towards the mean. If it doesn't, then he has definitely lost a step.

One other contributing factor is that Ichiro has yet to hit a home run this year. He has never been a big power guy, but he has hit at least 6 homers in every season, and is currently looking like he might not hit more than 3 or 4. With his FB% at 20.8%, the lowest total since 2004, he looks like he is becoming even more of a pure singles hitter. This could be resulting in outfielders playing even more shallow, taking hits away that used to drop in front of them as they are not afraid of balls going over their heads.

So why is Ichiro having such a poor season? We can see that although he may be getting unlucky, it is also due to age slowly creeping up on him. It is affecting his power and probably also slightly affecting his speed. Ichiro may bounce back the rest of the season and end up hitting .300, but it is much more likely that he will end the year hitting .280-.290. Unfortunately, the regression model does not help much in predicting his batting averages by season, but it is very interesting to look at what type of player he would be with just league-average speed.

Saturday, June 4, 2011

Differentiating between Luck and Skill Part II

The last post explained how we used batted ball statistics to determine a batter's skill in his batting average. In this post, I want to show how much of a batter's difference from his career mean is due to luck and how much is due to skill.

The first thing to do is to figure out a hitter's career batting average. This is, with a large enough sample size, his "average skill". So a hitter with ten seasons in the majors with a career .300 batting average is a ".300 hitter". If he is batting over .300, then he is having a better season, and there is some amount of skill and luck as to why he is batting better. If he is batting under .300, then maybe he is getting unlucky, or maybe he is losing some skill as he ages.

Once we get a hitter's career average, as well as his actual batting average for every season and his predicted average for every season from the regression equation we have already run, we can graph all three. The career average is the baseline, and the difference between the career average and the predicted average is the batter's difference due to skill. The difference between the hitter's predicted average and his actual average is the difference due to luck. Anyone familiar with statistics will recognize the procedure: the differences between a variable's mean and its predicted values make up the "Sum of Squares Explained", and the differences between the predicted values and the observed values make up the "Sum of Squares Residual". The batter's skill is explained, and his luck is a residual, or error, which is unexplained.

The first example I want to use is a player very familiar to the Blue Jays: Vernon Wells. He spent his entire career in Toronto until being traded to the Angels this past offseason. He had a couple of good seasons, as well as some bad seasons, so he should be a good example, showing variance in both actual and predicted AVGs. He is a career .277 hitter, so the blue line in the graph below is his "average skill" as a hitter. The red line shows his observed averages over each season from 2002-2011 (excluding 2008 when he missed a lot of time due to injuries), and the green line is the regression's predicted values for his average.

Vernon Wells AVG from 2002-2011
There are a lot of interesting things to see on the graph. Although Wells is a career .277 hitter, he has had only two full seasons hitting above that (as well as hitting .300 in 2008 in limited action). In 2003, his second full season, he hit .317 even though his predicted average was only .282. This means that of the extra 40 points of batting average above his career mean (.317-.277), 5 were due to skill and 35 were due to luck, or randomness. The regression saw him not as a .317 hitter, but more of a .282 hitter, and was correct, as the next season he batted only .272. It is interesting to note that in both of his first two full seasons, his batted ball statistics suggested that he was a .282 hitter, but he hit 42 points higher in his second season. That seems to show the luck he had in his second year. This also happened in 2005-06, when the regression predicted he was a .271 hitter, but he hit .269 in 2005 and then .303 in 2006. Unfortunately, Wells was seen as a much better hitter than he actually was, and was rewarded with a huge contract that the Jays had to unload for players with less ability than Wells.

Another thing to notice on the graph is how far Vernon has fallen off this year. He is currently hitting only .183 (through Thursday's games), but the regression is expecting him to be hitting .247. This means that although he has lost about 30 points of skill, from .277 to .247, he has also been very unlucky, hitting 64 points lower than expected. This is still a small sample size, so we will need to see how he performs the rest of the season to truly judge whether or not he has simply lost it or if he was unlucky.

Now that we have seen a good example, we can move on to what this entire exercise was all about: figuring out how much of Jose Bautista's increase in average so far this year is due to luck and how much is due to skill. Bautista is a career .252 hitter, so any average above that means that he has either been getting lucky or has become more skilled. Last year, the regression predicted that he would hit .261, and he actually hit .260, showing that all of his increase in batting AVG was skill, and he was actually slightly unlucky (by 1 point, more due to randomness than anything else).

Jose Bautista AVG from 2010-2011

And that leads us to this year. Bautista is currently hitting .363 (through Thursday's games), an amazing 111 points above his career average. The regression predicted that he would be hitting .326 at this point, so 74 points are due to skill and only 37 are due to luck. This means that 2/3 of his increase in batting average is entirely due to skill. Obviously it is still a somewhat small sample size, and we need to see how he finishes the year, but we can say with certainty that this increase in batting average is not due to some fluke. All of the keys to increasing batting average that I mentioned in the last post are evident in Bautista this year. He has increased his GB/FB rate, increased his line drive % by 5%, decreased his FB% by 8%, increased his HR/FB rate by 8%, and decreased his strikeout percentage by 3%. These have all led to a predicted increase in average, and thus we can see that the increase is mostly due to skill.
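
The split used throughout this post is simple arithmetic, so here is a minimal sketch of it in Python. The function name `decompose` is my own label for illustration, and the inputs are just the career, predicted, and actual averages quoted above for Wells and Bautista.

```python
def decompose(career_avg, predicted_avg, actual_avg):
    """Split a season's deviation from career AVG into (skill, luck), in points (.001)."""
    skill = round((predicted_avg - career_avg) * 1000)  # explained by batted ball stats
    luck = round((actual_avg - predicted_avg) * 1000)   # regression residual
    return skill, luck

# (career, predicted, actual) as quoted in this post
seasons = {
    "Wells 2003": (0.277, 0.282, 0.317),
    "Wells 2011": (0.277, 0.247, 0.183),
    "Bautista 2010": (0.252, 0.261, 0.260),
    "Bautista 2011": (0.252, 0.326, 0.363),
}

for label, (career, predicted, actual) in seasons.items():
    skill, luck = decompose(career, predicted, actual)
    print(f"{label}: skill {skill:+d}, luck {luck:+d}, total {skill + luck:+d}")
```

The two pieces always add up to the total difference from the career average, which is why the graph can show them as stacked gaps between the three lines.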

This post was a mission to answer the question "can we differentiate between luck and skill in a batter's AVG?", and the answer ended up being an emphatic yes. We showed that much of a hitter's variation in batting average can be attributed to skill, especially in the case of Jose Bautista. As I said in this post on Bautista, "Bautista probably won't end the year hitting .350, but we can reasonably expect him to finish the year hitting .320 or so." That was my gut feeling, and it is nice to know that the numbers back up that statement. He may well end up hitting over .350 this year, but it is more reasonable to expect him to hit somewhere around .330. He has fundamentally changed his approach in the last two years, and last year reaped the benefits by hitting 54 home runs. This year, he is still hitting home runs, but has now also become a high average hitter, almost entirely due to skill.

Friday, June 3, 2011

Differentiating between Luck and Skill Part I

In my last post on Jose Bautista, I noted that "the difference in batting average from his career to this season is due to both luck and skill, and unfortunately we can't exactly differentiate the two." After finishing the post, I realized that there was a way to differentiate between luck and skill, although it would take some time to figure out.

To accomplish this task, I decided to get hitting statistics and run regressions showing which statistics predict AVG and BABIP. These stats aren't hits, home runs, or RBIs. Rather, I looked at stats such as line drive, ground ball, and fly ball percentages, and home run per fly ball ratios. These stats should more accurately show whether a player is getting lucky, or has made an adjustment and is now reaping the benefits.

The batted ball statistics that I am using are only available back to 2002, so my data set is all hitters that qualified for the batting title from 2002 through 2010. I decided not to include 2011, as most hitters have only played around 50 games, and the smaller sample size may skew results, albeit slightly. There are a total of 1,397 player-seasons in the data set, each with at least 500 plate appearances.

The first thing to do is get the best regression model. Batting average could be modeled very well through statistics such as BABIP, sac flies, and home runs, but the goal of this analysis is to model average using batted ball statistics. As such, the regression equation will not explain as much of the data (lower R² value), but should better explain how much of the batting average is due to luck and how much is due to skill. To figure out the best regression model, I included all of the variables I thought would be useful (GB/FB ratio, LD%, GB%, FB%, HR/FB ratio, and SO), and ran a best subsets analysis. I did not include home runs, because HR/FB ratio represents how often a player will hit a home run, and including home runs would just be introducing a redundant, highly correlated variable. The model also recognized this kind of redundancy by recommending the regression model AVG = GB/FB + LD% + FB% + HR/FB + SO, omitting GB%. This was measured using Mallows' Cp, and also makes sense intuitively: GB/FB, GB%, and FB% are three variables of which any one can be computed from the other two, so including all three adds no new information.

Now that I had the proper regression equation, I could run a multiple linear regression to determine the coefficients for each of the independent variables.

Coefficients:
             Estimate      Std. Error   t-value    Pr(>|t|)
Intercept     0.2579       1.899e-02     13.581    <2e-16 ***
GB/FB         5.233e-03    4.113e-03      1.272    0.2036
LD%           0.2365       2.492e-02      9.491    <2e-16 ***
FB%          -0.05494      2.899e-02     -1.895    0.0583 .
HR/FB         0.2129       1.151e-02     18.492    <2e-16 ***
SO           -3.791e-04    2.166e-05    -17.503    <2e-16 ***
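
To make the equation concrete, here is a small sketch that plugs a stat line into the coefficients above. It assumes the rate stats (LD%, FB%, HR/FB) enter the model as fractions (0.19 for 19%) and that SO is a season strikeout total; the function name and the sample batter are hypothetical, not taken from the data set.

```python
# Coefficients copied from the regression output above.
COEFFICIENTS = {
    "intercept": 0.2579,
    "gb_fb": 5.233e-03,
    "ld_pct": 0.2365,
    "fb_pct": -0.05494,
    "hr_fb": 0.2129,
    "so": -3.791e-04,
}

def predict_avg(gb_fb, ld_pct, fb_pct, hr_fb, so):
    """Predicted AVG; rates are assumed to be fractions, so is a strikeout count."""
    c = COEFFICIENTS
    return (c["intercept"]
            + c["gb_fb"] * gb_fb
            + c["ld_pct"] * ld_pct
            + c["fb_pct"] * fb_pct
            + c["hr_fb"] * hr_fb
            + c["so"] * so)

# Hypothetical batter: 1.2 GB/FB, 19% LD, 36% FB, 10% HR/FB, 100 strikeouts.
print(f"{predict_avg(1.2, 0.19, 0.36, 0.10, 100):.3f}")
```

With rates entered as fractions the prediction lands in a realistic batting-average range, which is the main reason for assuming that scaling here.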


Although the R² value of the regression is only 0.3437, the goal of the regression is not to explain most of the variance in batting averages. It is to use advanced batted ball statistics to try to best model batting averages. The main statistics that we are interested in are the coefficients on each of the independent variables. I want to quickly interpret these coefficients, as well as provide some reasoning as to why these values make intuitive sense.

An increase of 1 in the ratio of ground balls to fly balls is associated with an increase of 5.23 points in a player’s batting average (e.g. from .250 to .25523). Ground balls have a higher probability of becoming hits than fly balls, so hitting more ground balls should lead to more hits. However, this will not affect batting average all that much, as 90% of the GB/FB rates are between 0.75 and 1.75. A player that tried to increase this ratio without fundamentally changing his swing would only see a small increase, and thus only a small increase in batting average.
A one-percentage point increase in a player’s line drive percentage is associated with an increase of 2.37 points in a player’s batting average. Line drives are the batted balls that most often fall for hits, so hitting more line drives will result in a higher batting average. An increase in LD% is due almost entirely to skill, as it means that a player is squaring the ball up better. It's not possible to simply be lucky and have a significantly higher line drive percentage for an extended period of time.
A one-percentage point increase in a player’s fly ball percentage is associated with a decrease of 0.55 points in a player’s batting average. Fly balls fall for hits the least often, so hitting more fly balls will have a negative impact on batting average. The magnitude of this coefficient is about one quarter that of line drive %, which can be explained in part by frequency: fly balls happen about twice as often as line drives, so one additional line drive is a bigger change than one additional fly ball. On that basis, line drives have about twice the effect on batting average that fly balls do.
A one-percentage point increase in a player’s home run per fly ball rate is associated with an increase of 2.13 points in a player’s batting average. The more fly balls that leave the yard, the fewer that get caught, so the higher the batting average.

Finally, an increase of 10 strikeouts is associated with a decrease of 3.79 points in a player’s batting average. Obviously, if the batter does not put the ball in play, he cannot get a hit, so the more often a player strikes out, the lower his average will be. The following table summarizes the interpretation of the coefficients.

An increase of 1 in the following:    Leads to a change in AVG of the following:
GB/FB rate                            + 5.23 points
LD%                                   + 2.37 points
FB%                                   - 0.55 points
HR/FB rate                            + 2.13 points
Strikeouts                            - 0.38 points

Overall, the key to being a good hitter is to hit lots of line drives, hit more ground balls than fly balls, make sure the fly balls you do hit have a higher than league average chance of becoming home runs, and avoid striking out. All of these traits are due mostly to skill, as batters must make conscious adjustments to make more contact as well as better contact.
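
The "points" in the summary come from scaling each coefficient by the size of the step being described. A quick sketch of that conversion, assuming a step of one percentage point (0.01) for the rate stats and a step of one for GB/FB and strikeouts; the step sizes are my reading of the interpretation above.

```python
POINT = 0.001  # one point of batting average

coefficients = {"GB/FB": 5.233e-03, "LD%": 0.2365, "FB%": -0.05494,
                "HR/FB": 0.2129, "SO": -3.791e-04}

# Assumed step sizes: 1 unit of ratio, 1 percentage point, or 1 strikeout.
steps = {"GB/FB": 1.0, "LD%": 0.01, "FB%": 0.01, "HR/FB": 0.01, "SO": 1.0}

points = {name: coefficients[name] * steps[name] / POINT for name in coefficients}

for name, value in points.items():
    print(f"{name:6s} {value:+.3f} points per step")
```

Multiplying the strikeout figure by 10 gives the roughly 3.8 points per 10 strikeouts mentioned in the text.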

In my next post, I am going to show how we can use this model to determine whether batters get lucky or whether their increases in batting average are due to skill. We can graphically show how a batter's increase (or decrease) from his career average can be broken down into both luck and skill partitions. We will be able to see whether Jose Bautista's average this year is due to luck or skill, which will help us determine whether his average is sustainable or not.

Sunday, May 29, 2011

Jose Bautista's Hot Start

I haven't written a post in a while, but I am now home for the summer and will hopefully be writing a few posts a week. I wanted to write one today on Jose Bautista. I wrote about him at the end of last season here, and I wanted to do a study on why he is even better than last year.

The major difference between this year's version of Bautista and last year's is his much better batting average. He is still managing to hit a ton of home runs, but after hitting only .260 last season, he is now hitting .353 (all statistics through Saturday's games), good enough for second in the AL. One of the biggest reasons behind this increase in batting average is that his BABIP (batting average on balls in play) has increased from .233 to .321 this year. His career rate of .273 suggests that he was somewhat unlucky last year, and has gotten lucky this year. This may be misleading, as his changed swing naturally leads to more fly balls, usually meaning a lower BABIP. It seems as though last year he was simply trying to hit the ball out of the park, while this year he has become more of a line drive hitter while still hitting home runs. This can be seen in his line drive %, which was only 14.4% last year and is up to 17.1% this year. Line drive % helps show how "lucky" a hitter is getting, as line drives usually end up falling for hits, while ground balls and fly balls are more frequently outs. The increase in LD% shows that Bautista hasn't actually been any luckier this season. He is simply hitting the ball much harder in a higher percentage of at-bats and is being rewarded with a higher BABIP and subsequently a higher batting average.

We can demonstrate what could have happened in previous seasons had Bautista hit as many line drives, leading to a higher BABIP. BABIP is calculated as (hits - home runs) divided by (at-bats - strikeouts - home runs + sac flies). Last year, Bautista had 569 at-bats, 148 hits of which 54 were home runs, 116 strikeouts, and 4 sac flies. If we set his hits total as unknown, we can solve for the number of hits he would have had with a different BABIP. He had a total of 94 non-HR hits last year, and if he had even had his career BABIP of .273, he would have produced 110 non-HR hits. This would have left him with a batting average of (164/569) = .288. If he had this year's .321 BABIP, he would have had a batting average of .322, much closer to this year's .353. The difference in batting average from his career to this season is due to both luck and skill, and unfortunately we can't exactly differentiate the two, but we do know that Bautista has become a better hitter, and is hitting for a higher average at least partly due to skill.
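
The what-if above can be reproduced in a few lines. This is a sketch using the BABIP formula and the 2010 totals quoted in the paragraph; the function names are my own.

```python
def babip(hits, home_runs, at_bats, strikeouts, sac_flies):
    """Batting average on balls in play."""
    return (hits - home_runs) / (at_bats - strikeouts - home_runs + sac_flies)

def avg_at_babip(target_babip, home_runs, at_bats, strikeouts, sac_flies):
    """Batting average if the same balls in play had fallen in at target_babip."""
    balls_in_play = at_bats - strikeouts - home_runs + sac_flies
    non_hr_hits = target_babip * balls_in_play
    return (non_hr_hits + home_runs) / at_bats

# Bautista's 2010 line as quoted above.
AB, H, HR, SO, SF = 569, 148, 54, 116, 4

print(f"2010 BABIP:             {babip(H, HR, AB, SO, SF):.3f}")
print(f"AVG at career .273:     {avg_at_babip(0.273, HR, AB, SO, SF):.3f}")
print(f"AVG at this year's .321: {avg_at_babip(0.321, HR, AB, SO, SF):.3f}")
```

Holding home runs, strikeouts, and sac flies fixed isolates the effect of BABIP alone, which is why a gap to this year's actual average remains.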

So where is the other 30 point difference coming from? The BABIP formula shows us that it comes from the number of home runs and strikeouts a hitter accumulates (also sac flies, but they are minimal and can be ignored). As discussed in the last post, a large determinant of the number of home runs a hitter has is his HR/FB ratio, which is mainly due to luck. Obviously, hitters like Bautista who make a conscious effort to hit fly balls very hard will have higher HR/FB rates, and thus more home runs. The league average is usually around 10.6% (actually, the HR/FB rate is now below 9% for 2011), and last year Bautista ended the year with a 21.7% rate, more than double league average. This year, Bautista has a 31.3% HR/FB rate, which is insane. This means that roughly one of every three fly balls he hits ends up over the wall for a home run. This is by far the highest rate in the majors this year, with Lance Berkman having the second highest rate at 23.4%, which means Bautista's rate is about 34% higher than Berkman's. Even with Bautista's violent swing, this rate is bound to regress at least somewhat towards the mean. This shows that Bautista has gotten somewhat lucky this year, but it is impossible to determine what his final HR/FB rate will be, so we cannot determine exactly how lucky.

The last determinant from the BABIP formula is strikeouts. Bautista has dramatically improved in this area, decreasing his strikeout rate from 20.4% last year to 17.3% this year. This may not sound like much, but over a full season of 500 or so at-bats, that's 16 more balls put in play, and with a BABIP of .321, five more hits. That's about 10 extra points on his batting average over a full season, simply due to striking out less. What makes this even more impressive is that while he is striking out less, mainly due to the fact that he decreased his swinging strike percentage from 7.7% to 6.7%, he still has the ability to hit the ball extremely hard, actually harder this year. In almost every case, a hitter will sacrifice power in order to make more contact, yet Bautista has managed to become better at both. This is certainly not luck, and we can attribute this part of his increased batting average entirely to skill.
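
Here is the strikeout arithmetic as a short sketch. It assumes the strikeout rate is measured per at-bat and uses the round 500 at-bat season from the paragraph; the function name is hypothetical.

```python
def gain_from_fewer_strikeouts(at_bats, k_rate_before, k_rate_after, babip):
    """Return (extra balls in play, extra hits, extra AVG points)."""
    extra_balls_in_play = at_bats * (k_rate_before - k_rate_after)
    extra_hits = extra_balls_in_play * babip
    extra_points = 1000 * extra_hits / at_bats
    return extra_balls_in_play, extra_hits, extra_points

balls, hits, points = gain_from_fewer_strikeouts(500, 0.204, 0.173, 0.321)
print(f"{balls:.1f} more balls in play, {hits:.1f} more hits, {points:.1f} points")
```

Note that the result depends on the BABIP used for the extra balls in play, so a lower BABIP would shrink the gain proportionally.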

What does this all mean? I have presented a lot of different statistics, and what I hoped to accomplish was to show that while Bautista has gotten a little lucky in his huge increase in AVG this year, it has mainly been from skill. Although at first you may want to credit luck for his increased BABIP, he hasn't simply had more balls "find holes" this year. He has been hitting more line drives, which shows that he has become a better hitter. Yes, his home run total has been somewhat inflated by his incredible HR/FB rate, so maybe we shouldn't expect him to hit 40+ home runs the rest of the season and instead expect him to finish with 50 or so home runs. If his HR/FB rate regresses even all the way to last year's rate (which I don't expect it will), he should still end up with 48 home runs. Finally, he has managed to swing and miss at fewer pitches, which decreases strikeouts and allows him to put more balls in play. This is due to a systematic adjustment, and is not lucky at all. Although we know Bautista won't end the year hitting .350, with all of these factors we just discussed, we can reasonably expect him to finish the year hitting .320 or so. Many people expected Bautista to slump this year and not be able to hit as well as last year, but he has managed to play even better, and should end up with better numbers this season than last, which is hard to believe.

Wednesday, March 2, 2011

Blue Jays Batting Order

Now that the offseason has wound down and spring training games have begun, it is time to look forward to the 2011 season. An important question for this year (and any year) is what will the batting order look like? This article breaks down how exactly a lineup should be constructed to optimize players' talents. Basically, the old-school thoughts on building a lineup (speed at #1, bunter #2, power hitters #3-5, worst hitters #6-9) are mostly incorrect, according to sabermetric research. The article does a nice job of explaining who should hit where and why. It does note that a specific permutation of players in a batting order does not make a huge difference, only about a maximum of one win per season. However, it is a fun exercise to construct a projected lineup.

This post was inspired by this post, which detailed the optimized lineup for the Indians. I want to do the same thing with the Blue Jays. The data I am using is the Cairo projections (found here), which project a player's upcoming season based on a weighted average of the player's past few seasons. The statistic that is used to figure out the optimal lineup is wOBA, or weighted on-base average. Two good explanations of the statistic can be found here and here. In short, wOBA combines on-base percentage and slugging percentage into one statistic, scaled to OBP, so it is easy to understand. Why not simply use on-base plus slugging? For one, OPS weighs OBP and SLG equally, while in reality OBP is more important. wOBA is calculated using the actual run values of each event, so it is a better measure than OPS of how valuable each outcome is.

Although it sounds confusing, and the math behind it is, the end result is one simple number which tells you how valuable a player is to his team while hitting. If this sounds similar to Wins Above Replacement, it's because wOBA is used to calculate the hitting aspect of WAR.

Now, onto the data. Here are the splits for all of the Blue Jays involved in the Cairo projections.


Player               Projected wOBA   Vs L    Vs R
Jose Bautista        .373             .367    .375
Travis Snider        .339             .314    .344
J.P. Arencibia       .335             .350    .327
Adam Lind            .331             .293    .345
Edwin Encarnacion    .331             .349    .325
Juan Rivera          .329             .343    .323
Luis Figueroa        .329             .327    .330
Randy Ruiz           .328             .339    .322
Yunel Escobar        .326             .335    .322
Rajai Davis          .325             .339    .318
Aaron Hill           .321             .336    .316
Jason Lane           .312             .323    .306
Chris Aguila         .309             .317    .304
Mike McCoy           .307             .320    .298
John McDonald        .288             .302    .281
Callix Crabbe        .285             .288    .283
Jose Molina          .283             .297    .276
 
I am going to build lineups against both LH and RH pitchers, as the projections include platoon splits. The optimal lineup, according to sabermetrics, fills the batting-order slots in the priority order #1, #4, #2, #5, #3, #6, #7, #8, #9 in terms of avoiding outs. So the hitter with the highest wOBA will bat first, the second highest will bat 4th, the third highest 2nd, and so on all the way to 9th.
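
As a rough sketch of that slotting rule: sort the nine chosen hitters by wOBA and hand out batting-order slots in the priority order 1, 4, 2, 5, 3, 6, 7, 8, 9. The nine names and wOBA values are the vs. LHP projections from this post; choosing which nine players start (one per position) is done by hand and is not part of the code, and the function name is my own.

```python
SLOT_PRIORITY = [1, 4, 2, 5, 3, 6, 7, 8, 9]  # most important slot first

def assign_slots(hitters):
    """hitters: list of (name, woba). Returns {batting slot: name}."""
    # sorted() is stable, so hitters tied on wOBA keep their input order.
    ranked = sorted(hitters, key=lambda h: h[1], reverse=True)
    return {slot: name for slot, (name, _) in zip(SLOT_PRIORITY, ranked)}

vs_lhp = [
    ("Jose Bautista", 0.367), ("J.P. Arencibia", 0.350),
    ("Edwin Encarnacion", 0.349), ("Juan Rivera", 0.343),
    ("Luis Figueroa", 0.327), ("Randy Ruiz", 0.339),
    ("Yunel Escobar", 0.335), ("Rajai Davis", 0.339),
    ("Aaron Hill", 0.336),
]

lineup = assign_slots(vs_lhp)
for slot in sorted(lineup):
    print(slot, lineup[slot])
```

Ruiz and Davis are tied at .339, so which of them lands in the #3 slot depends only on the order they are listed in.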

Lineup vs. LHP

#   Name                 Position   wOBA
1   Jose Bautista        RF         .367
2   Edwin Encarnacion    3B         .349
3   Randy Ruiz           1B         .339
4   J.P. Arencibia       C          .350
5   Juan Rivera          LF         .343
6   Rajai Davis          CF         .339
7   Aaron Hill           2B         .336
8   Yunel Escobar        SS         .335
9   Luis Figueroa        DH         .327

Lineup vs. RHP

#   Name                 Position   wOBA
1   Jose Bautista        CF         .375
2   Travis Snider        LF         .344
3   J.P. Arencibia       C          .327
4   Adam Lind            1B         .345
5   Luis Figueroa        2B         .330
6   Edwin Encarnacion    3B         .325
7   Juan Rivera          RF         .323
8   Randy Ruiz           DH         .322
9   Yunel Escobar        SS         .322

We can see some interesting things in the two lineups. Seven players appear in both lineups, although maybe not who you would think: Bautista, Encarnacion, Ruiz, Arencibia, Rivera, Escobar, and Figueroa. Rajai Davis and Aaron Hill are only hitting against lefties, and Lind and Snider are only hitting against righties.

The projections are obviously not perfect, if Luis Figueroa is starting every day for the Jays, but they do provide some insight into where certain hitters should hit. With a league average wOBA of .321 last year in the majors, the Jays' lineups should be much better than average again this year, even with the loss of Vernon Wells and John Buck.

Monday, January 31, 2011

30 HRs or 30 saves?

I have done two posts, the true value of a home run and the true value of a save. These posts sprang out of the question: which is worth more, 30 home runs or 30 saves?

We found that the true value of a home run was 1.406 runs, and that the true value of a save was 0.11415 WPA. So how do we compare these two variables in different units? Eventually, we want to set a dollar value to each event, but we must first translate each into a win value.

It has been estimated that a win is worth somewhere between 9.5 and 10 runs. There are many different explanations how that was calculated and why it is so, but for simplicity I am just going to accept the argument that 10 runs = 1 win. We can now change the run value of a home run into a win. One home run is worth 0.1406 wins, so 30 home runs would be worth 4.218 wins.

Although it is usually not helpful to sum up WPA, in this case, it is the best we can do to approximate the value of a save. We found that the average save is worth 0.114 wins, so 30 saves would be worth 3.425 wins, using WPA. If we use WPA/LI, the average save was worth 0.0614 wins, so 30 saves would now only be worth 1.841 wins.

We have found out that, mathematically, 30 home runs are clearly worth more than 30 saves. We can now figure out how much each is worth in dollars. It has been estimated that each win is worth about $4.5 million on the open market (so each win above replacement will cost approximately $4.5 million to replace, although obviously a player with 8 wins above replacement is not going to be paid $36 million per year). So the value of 30 home runs, on the open market, is $18.98 million. This seems to be an unrealistic number, but there are players such as Jayson Werth who hit 27 home runs last year and received a 7-year, $126 million contract (average of $18 million/year) this offseason from the Nationals.

30 saves measured by WPA are worth 3.425 wins, or $15.41 million, and 30 saves measured by WPA/LI are worth $8.28 million. This dollar amount for WPA/LI is much more realistic than the amount for home runs. One example is Bobby Jenks, who compiled 27 saves last year and got a 2-year, $12 million contract this offseason.
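
The whole conversion chain (runs to wins to dollars) fits in a short sketch. The constants are the estimates quoted in this post; the WPA/LI value is the rounded 0.0614, so the last digit of that line can differ slightly from the figures in the text.

```python
RUNS_PER_WIN = 10.0        # simplifying assumption adopted in this post
DOLLARS_PER_WIN = 4.5e6    # open-market estimate quoted in this post

RUNS_PER_HR = 1.406
WPA_PER_SAVE = 0.11415
WPA_LI_PER_SAVE = 0.0614   # rounded value from this post

def wins_to_dollars(wins):
    return wins * DOLLARS_PER_WIN

hr_wins = 30 * RUNS_PER_HR / RUNS_PER_WIN
save_wins_wpa = 30 * WPA_PER_SAVE
save_wins_wpa_li = 30 * WPA_LI_PER_SAVE

for label, wins in [("30 HR", hr_wins),
                    ("30 saves (WPA)", save_wins_wpa),
                    ("30 saves (WPA/LI)", save_wins_wpa_li)]:
    print(f"{label}: {wins:.3f} wins, ${wins_to_dollars(wins) / 1e6:.2f} million")
```

Changing `RUNS_PER_WIN` to 9.5 or `DOLLARS_PER_WIN` to another market estimate shows how sensitive the comparison is to those two assumptions.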

So, the question of 30 home runs or 30 saves has clearly been answered. Home runs are either only slightly more valuable, or much more valuable than saves, depending on your view of relief pitchers. I believe that the math agrees with intuition here, as it seems as though it would be much easier (and cheaper) to acquire a player that will get 30 saves as opposed to a player that will hit 30 home runs. The marginal difference between an average closer (like Frank Francisco for the Jays) and another pitcher in the bullpen (say, Jason Frasor) is much smaller than the marginal difference between a player like Aaron Hill and a bench player, such as John McDonald.

In conclusion, I want to show one more example. This is a list of the 18 players who hit at least 30 home runs last year. The average Wins Above Replacement for the players was 4.39. If we look only at Batting Wins (WAR with the defense and running statistics removed), the players still have an average of 3.58 wins. This is a list of the 14 pitchers who saved at least 30 games last year. They have an average WAR of 2.09 wins. I believe that this shows that the pitchers who save 30 games are less valuable to their teams than the players who hit 30 home runs, which we have seen over the past three posts.

Sunday, January 30, 2011

The True Value of a Save

This post will be a little different than the previous post, in that it is much more difficult to quantify the value of a save than the value of a home run. For home runs, we can fairly easily calculate the differences between each one, as there are only 24 different ways for a home run to occur (the base-out states). There are many, many more ways for a save to unfold. There are one and two inning saves; one, two, or three run leads; and a number of different base-out combinations throughout a save attempt.

To determine the true value of a save, we are going to look at all 1,204 saves in 2010, and determine the WPA of each save. A description of WPA can be found here, but put simply, it is the probability of a team winning after an event minus the probability of that team winning before the event. It shows how much a player contributed to his team winning the game. Every save from last year can be found here, sorted by WPA. The true value of a save will then be the average WPA for all saves.
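The definition above can be sketched in a few lines. This is a simplified illustration, not the actual dataset: each save is treated as a single event running from the moment the reliever enters to the final out, and the win probabilities in the list are hypothetical.

```python
from statistics import mean

def wpa(win_prob_before: float, win_prob_after: float) -> float:
    """Win probability added: probability after the event minus before."""
    return win_prob_after - win_prob_before

# Hypothetical saves: (win probability when the reliever enters,
# win probability after the final out). A converted save ends at 1.0.
hypothetical_saves = [(0.80, 1.00), (0.90, 1.00), (0.95, 1.00)]

save_wpas = [wpa(before, after) for before, after in hypothetical_saves]
average_save_wpa = mean(save_wpas)

print(f"average WPA per save: {average_save_wpa:.3f}")
```

Running the same average over the WPA of every 2010 save is what produces the "true value of a save" used in this post.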

The most valuable save last year was recorded by Andy Sonnanstine of Tampa Bay, with a WPA of 0.662. The least valuable save was recorded by Matt Harrison of Texas, with a WPA of only 0.001 (he actually pitched 3 innings in a blowout game, which is one of the obscure ways a reliever can get a save). The average of all saves last year was a WPA of 0.114. What this means is that the average reliever recording a save will increase his team's win probability by about 11 percentage points (from 88.6% to 100%).

Unfortunately, there are many debates going on (such as here) as to whether or not WPA is an accurate measurement of a relief pitcher's value. The probability of a team winning when leading going into the 9th inning has not changed whatsoever from 1952 to 2010 (which is pretty amazing!). Naturally, this calls into question the value of the modern day closer. So instead of using WPA, many sabermetricians use WPA/LI, otherwise known as Context Neutral Wins, which is described here. LI is the leverage index of a given play: a play in a tie game in the 9th inning carries much more pressure than a play in the 1st inning. WPA on its own does not adjust for that context, so a reliever could be drastically overvalued merely because he pitches in higher-leverage situations.

WPA/LI takes care of this problem by neutralizing the leverage of the situation. As a result, a player's contribution will almost always be smaller, especially for relievers. If we look at the WPA/LI for all of the saves from 2010, the average WPA/LI is now only 0.061 (roughly half of the WPA value).
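To illustrate what neutralizing leverage does, here is a minimal sketch. It assumes the commonly described form of the statistic, in which each play's WPA is divided by that play's leverage index and the results are summed. The plays and their LI values below are hypothetical, and the function name is just for illustration.

```python
def context_neutral_wins(plays):
    """Sum of WPA / LI over plays, given (wpa, leverage_index) pairs."""
    return sum(play_wpa / li for play_wpa, li in plays)

# Hypothetical ninth inning: three outs, each recorded in a
# high-leverage spot (LI above 1).
ninth_inning = [(0.05, 2.0), (0.03, 1.5), (0.04, 2.0)]

raw_wpa = sum(play_wpa for play_wpa, _ in ninth_inning)
neutral = context_neutral_wins(ninth_inning)

print(f"WPA:    {raw_wpa:.3f}")
print(f"WPA/LI: {neutral:.3f}")
```

Because closers mostly pitch with LI above 1, dividing by LI shrinks their totals, which is why the average save is worth less by WPA/LI than by WPA.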

So the problem now becomes: which statistic do we use, WPA or WPA/LI? This really depends on your own beliefs. If you believe that closers are really good pitchers who can do things other relief pitchers cannot do, especially in high-pressure situations, then you would want to use WPA. Personally, I believe that closers are only marginally better pitchers than their bullpen counterparts, and as such are getting some undue credit. So I believe that using WPA/LI is better, especially considering that many closers are failed starting pitchers. Still, I will use both statistics in comparing home runs and saves. My next post will finally answer the question of whether 30 home runs or 30 saves are more valuable.

Thursday, January 27, 2011

The True Value of a Home Run

I have been meaning to do a post on the true value of a home run for a while, but I kept it on the back burner until I was asked this question: which is worth more, 30 home runs or 30 saves? In this post, I am going to examine the true value of a home run, and in the next post I will examine exactly how much a save is worth, so I can compare the two.

The data I am going to use is for all teams in the 2010 regular season. The first thing to do is to find the number of home runs hit in each base-out state, which can be found from baseball-reference:
RUNNERS       HR_OUTS_0  HR_OUTS_1  HR_OUTS_2
None          1220       811        617
1st           248        318        312
2nd           74         133        147
3rd           9          39         52
1st and 2nd   54         119        142
1st and 3rd   27         49         47
2nd and 3rd   9          30         30
Bases Loaded  23         43         60

The total number of home runs hit last year was 4613, and over half of those were solo home runs. It was very rare for players to hit home runs with no outs and a runner on third, as that situation usually requires a leadoff triple, or a double and a steal.

The next step is to find the expected runs matrix for 2010 (from Baseball Prospectus):
RUNNERS       EXP_R_OUTS_0  EXP_R_OUTS_1  EXP_R_OUTS_2
None          0.49154       0.26151       0.10374
1st           0.85877       0.50512       0.2282
2nd           1.10113       0.67765       0.3215
3rd           1.35798       0.93308       0.34192
1st and 2nd   1.42099       0.88181       0.45503
1st and 3rd   1.80042       1.0982        0.46571
2nd and 3rd   1.96584       1.38849       0.58205
Bases Loaded  2.36061       1.51185       0.77712

We can use these two matrices together to determine the true value of a home run. The equation we will use is: value of a home run = (expected runs at the end of the play) - (expected runs at the beginning of the play) + (runs scored during the play). In other words, we take the change in expected runs (after minus before) and then add the runs that actually scored on the play. For example, a leadoff out is worth 0.26151 - 0.49154 + 0 = -0.23003, meaning the team's expected runs for the inning drop by 0.23.
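The formula can be written as a one-line function. The sketch below uses values from the 2010 expected runs matrix above; the function name is made up for illustration. It relies on the fact that a home run always leaves the bases empty with the same number of outs.

```python
def run_value(exp_runs_before: float, exp_runs_after: float,
              runs_scored: int) -> float:
    """Change in expected runs on a play, plus the runs that scored."""
    return exp_runs_after - exp_runs_before + runs_scored

# Leadoff out: bases empty, 0 outs -> bases empty, 1 out, nobody scores.
leadoff_out = run_value(0.49154, 0.26151, 0)

# Home run with a runner on 1st and 0 outs: the inning resets to
# bases empty with 0 outs, and 2 runs score.
two_run_homer = run_value(0.85877, 0.49154, 2)

print(f"leadoff out:   {leadoff_out:+.5f}")
print(f"two-run homer: {two_run_homer:+.5f}")
```

The same function covers any play, not just home runs, as long as the before and after base-out states are known.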

This matrix shows the true value of a home run for each base-out state. Obviously, when there are no runners on base, the value of a home run will be 1, as the beginning and end states will be the same.
RUNNERS       Value_OUTS_0  Value_OUTS_1  Value_OUTS_2
None          1             1             1
1st           1.63277       1.75639       1.87554
2nd           1.39041       1.58386       1.78224
3rd           1.13356       1.32843       1.76182
1st and 2nd   2.07055       2.3797        2.64871
1st and 3rd   1.69112       2.16331       2.63803
2nd and 3rd   1.5257        1.87302       2.52169
Bases Loaded  2.13093       2.74966       3.32662

For any given number of outs, the most valuable home runs are grand slams, as they score 4 runs. Home runs hit with two outs are more valuable than those hit with 0 or 1 out, because there are fewer chances remaining in the inning to drive in the runners on base by other means.

Finally, we need to multiply the matrix containing the number of home runs hit by the matrix showing the true value of a home run, cell by cell, to find the total run value for each base-out state.
RUNNERS       Runs_OUTS_0  Runs_OUTS_1  Runs_OUTS_2
None          1220         811          617
1st           404.92696    558.53202    585.16848
2nd           102.89034    210.65338    261.98928
3rd           10.20204     51.80877     91.61464
1st and 2nd   111.8097     283.1843     376.11682
1st and 3rd   45.66024     106.00219    123.98741
2nd and 3rd   13.7313      56.1906      75.6507
Bases Loaded  49.01139     118.23538    199.5972

To find the true value of a home run, we simply add up all of the runs (6485) and divide by the total number of home runs hit (4613) to find the average value of a home run: 1.406 runs. What this means is that the average home run hit in 2010 was worth 1.406 runs for the player's team. We will use this number later to figure out exactly how much each home run is worth in a dollar amount, and whether or not it is worth more than a save.
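For anyone who wants to reproduce the whole calculation, here is a short sketch that starts from the two input tables in this post (home run counts and the 2010 expected runs matrix) and works through to the average. The variable names are made up for illustration; the numbers are copied from the tables above.

```python
# Rows are base states in the order used in the tables; columns are 0, 1, 2 outs.
RUNNERS_ON = [0, 1, 1, 1, 2, 2, 2, 3]  # None, 1st, 2nd, 3rd, 1st and 2nd, 1st and 3rd, 2nd and 3rd, Bases Loaded

HR_COUNTS = [
    [1220, 811, 617], [248, 318, 312], [74, 133, 147], [9, 39, 52],
    [54, 119, 142], [27, 49, 47], [9, 30, 30], [23, 43, 60],
]

EXPECTED_RUNS = [
    [0.49154, 0.26151, 0.10374], [0.85877, 0.50512, 0.2282],
    [1.10113, 0.67765, 0.3215], [1.35798, 0.93308, 0.34192],
    [1.42099, 0.88181, 0.45503], [1.80042, 1.0982, 0.46571],
    [1.96584, 1.38849, 0.58205], [2.36061, 1.51185, 0.77712],
]

total_home_runs = 0
total_run_value = 0.0
for counts, exp_runs, runners in zip(HR_COUNTS, EXPECTED_RUNS, RUNNERS_ON):
    for outs in range(3):
        # After a home run the bases are empty with the same number of outs,
        # and every runner plus the batter scores.
        value = EXPECTED_RUNS[0][outs] - exp_runs[outs] + runners + 1
        total_home_runs += counts[outs]
        total_run_value += counts[outs] * value

average_hr_value = total_run_value / total_home_runs

print(f"home runs: {total_home_runs}")
print(f"total run value: {total_run_value:.2f}")
print(f"average value of a home run: {average_hr_value:.3f} runs")
```

This is a count-weighted average of the value matrix, so the common states (solo home runs) pull the result toward 1 while the rare, valuable ones (grand slams) pull it up.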

Thursday, December 23, 2010

Using Markov Chains to Evaluate the Hitting of the 2009 Toronto Blue Jays

This is a paper I wrote for one of my statistics classes last year. The goal of the paper was to figure out the run probabilities for each of the 24 base-out states in baseball (8 base states, 3 out states). I have included the matrices I used to create the run expectancy table if you would like more background info on how exactly the table was created.

Matrices with probabilities:
Markov Chains - Probabilities

Markov Chains Paper:
Markov Chains