Tuesday, June 14, 2011

Ichiro Without Speed

After finishing my posts on luck vs. skill, I decided to look at a couple more individual players who would have very interesting regression lines. The first one that came to mind was Ichiro. He came to MLB from Japan in 2001, and since then he has had 10 straight 200-hit seasons and has never hit lower than .303. However, this year he is hitting .256 and is on pace for only about 177 hits. There has been a lot of talk about what is wrong with him, including this article on Fangraphs which shows his BABIP on different types of hits.

Clearly Ichiro is an outlier in the regression model. We know that even before looking at his luck and skill, because his unique hitting approach allows him to get hits on balls that almost every other major leaguer would be out on. Since 2002 (the first year advanced batted ball data was available, so unfortunately we cannot capture his rookie season), Ichiro has 407 infield hits, by far the most in MLB, and has the highest infield hit % (IFH/GB) in baseball. As such, although his batted ball data may not look all that impressive, he still manages to get a ton of hits.

However, although we expect him to be an outlier, his batting average graph is still very surprising.

Ichiro AVG, 2002-2011
Ichiro's career .328 batting average is much higher than any of his predicted averages, and until this year his actual averages were also always considerably higher. In his most incredible season, 2004, he had a predicted batting average of .263, which would mean his "skill" cost him 65 points relative to his career average, yet his luck accounted for 109 points, leaving him with an incredible batting average of .372. It's impossible for a hitter to hit .370 over an entire season by just getting lucky, yet that's exactly what this graph is showing. So obviously there is something going on.

The difficulty in predicting Ichiro's average relative to league average is easy to pinpoint. His speed gets him hits, and until this year that speed made up for whatever his batted ball statistics looked like in a given year. An interesting way to look at it is that if Ichiro had league-average speed, his batted ball statistics show him to be about a career .260-.270 hitter. In actuality he is a .328 career hitter, so that speed has increased his batting average by about 60 points. So what is happening this year? Has he lost a step, or maybe just gotten slightly unlucky?

There are a couple of ways we can determine if he has lost a step. He has stolen 16 bases so far this year and been caught only 4 times, which is not far off his 162-game averages of 39 steals and 9 CS. So at first glance, he does not seem to be any slower. But if we look further into the statistics, we can see an interesting trend. His infield hit % is down over 5% this year compared to last year and 2.5% off his career rate. This has already cost him about 17 hits this year; getting those hits back would bump his average up about 22 points to .278, still not .300 but much closer.

Is this decrease in IFH% due to luck or skill? Unfortunately, we can't exactly quantify the difference, but we do see that over his career, his IFH% has fluctuated between 9.5% in 2005 and 16% in 2009. That is a huge difference, and one major reason why he had the worst batting average of his career (.303) in 2005 and his second best (.352) in 2009. So maybe ground balls are just not quite finding the holes that they normally do. This is plausible, but we are getting into a large enough sample size (300 PAs) that the IFH% should start regressing towards the mean. If it doesn't, then he has definitely lost a step.
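One rough way to get a feel for how much IFH% can move by chance alone is a simple binomial model, where each ground ball is an independent trial with a fixed chance of becoming an infield hit. This is a simplification that the post itself does not use, and the rate and ground-ball count below are hypothetical placeholders, not Ichiro's actual figures.

```python
import math


def ifh_standard_error(true_rate: float, ground_balls: int) -> float:
    """Standard error of an observed infield-hit rate under a binomial model.

    Assumes every ground ball is an independent trial with the same
    probability of becoming an infield hit (a simplification).
    """
    return math.sqrt(true_rate * (1.0 - true_rate) / ground_balls)


# Hypothetical inputs for illustration only, not Ichiro's real numbers.
assumed_rate = 0.125          # somewhere inside the 9.5%-16% range quoted above
assumed_ground_balls = 150    # placeholder for a partial-season sample

se = ifh_standard_error(assumed_rate, assumed_ground_balls)
print(f"one standard error: {se * 100:.1f} percentage points")
```

Under these assumptions, a drop of a couple of percentage points over part of a season sits within the range that chance alone could produce, which is why the post waits for a larger sample before concluding that he has lost a step.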

One other contributing factor is that Ichiro has yet to hit a home run this year. He has never been a big power guy, but he has hit at least 6 homers in every season, and is currently looking like he might not hit more than 3 or 4. With his FB% at 20.8%, the lowest total since 2004, he looks like he is becoming even more of a pure singles hitter. This could be resulting in outfielders playing even more shallow, taking hits away that used to drop in front of them as they are not afraid of balls going over their heads.

So why is Ichiro having such a poor season? We can see that although he may be getting unlucky, it is also due to age slowly creeping up on him. It is affecting his power and is probably also slightly affecting his speed. Ichiro may bounce back the rest of the season and end up hitting .300, but it is much more likely that he will end the year hitting .280-.290. Unfortunately, the regression model does not help much in predicting his batting averages by season, but it is very interesting to look at what type of player he would be with just league-average speed.

Saturday, June 4, 2011

Differentiating between Luck and Skill Part II

The last post explained how we used batted ball statistics to determine a batter's skill in his batting average. In this post, I want to show how much of a batter's difference from his career mean is due to luck and how much is due to skill.

The first thing to do is to figure out a hitter's career batting average. This is, with a large enough sample size, his "average skill". So a hitter with ten seasons in the majors with a career .300 batting average is a ".300 hitter". If he is batting over .300, then he is having a better season, and there is some amount of skill and luck as to why he is batting better. If he is batting under .300, then maybe he is getting unlucky, or maybe he is losing some skill as he ages.

Once we get a hitter's career average, as well as his actual batting average for every season and his predicted average for every season from the regression equation we have already run, we can graph all three. The career average is the baseline, and the difference between the career average and the predicted average is the batter's difference due to skill. The difference between the hitter's predicted average and his actual average is the difference due to luck. Anyone familiar with statistics will recognize the procedure: the difference between a variable's mean and its predicted value is what goes into the "explained sum of squares", and the difference between the predicted value and the observed value is what goes into the "residual sum of squares". The batter's skill is explained, and his luck is a residual, or error, which is unexplained.
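For readers less familiar with the sums-of-squares idea, here is a minimal sketch of the textbook identity behind it: with a least-squares fit, total variation splits exactly into an explained part and a residual part. The data are made up for illustration and use a single predictor. The graphs in this post work with the plain per-season differences from the career average, not squared sums, so this is the statistical analogue and not the exact calculation used here.

```python
# Synthetic (made-up) data: line drive rate vs. batting average for six seasons.
ld_rate = [0.17, 0.18, 0.19, 0.20, 0.21, 0.22]
avg = [0.262, 0.255, 0.281, 0.270, 0.296, 0.288]

n = len(avg)
x_bar = sum(ld_rate) / n
y_bar = sum(avg) / n

# Ordinary least squares with one predictor and an intercept.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ld_rate, avg))
s_xx = sum((x - x_bar) ** 2 for x in ld_rate)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
predicted = [intercept + slope * x for x in ld_rate]

# Total variation = explained ("skill") + residual ("luck").
ss_total = sum((y - y_bar) ** 2 for y in avg)
ss_explained = sum((p - y_bar) ** 2 for p in predicted)
ss_residual = sum((y - p) ** 2 for y, p in zip(avg, predicted))

print(f"total:                {ss_total:.6f}")
print(f"explained + residual: {ss_explained + ss_residual:.6f}")
```

The two printed values agree up to floating-point rounding, which is the sense in which "skill" and "luck" together account for all of the variation around the mean.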

The first example I want to use is a player very familiar to Blue Jays fans: Vernon Wells. He spent his entire career in Toronto until being traded to the Angels this past offseason. He had a couple of good seasons, as well as some bad seasons, so he should be a good example, showing variance in both actual and predicted AVGs. He is a career .277 hitter, so the blue line in the graph below is his "average skill" as a hitter. The red line shows his observed averages over each season from 2002-2011 (excluding 2008 when he missed a lot of time due to injuries), and the green line is the regression's predicted values for his average.

Vernon Wells AVG from 2002-2011
There are a lot of interesting things to see on the graph. Although Wells is a career .277 hitter, he has had only two full seasons hitting above that (as well as hitting .300 in 2008 in limited action). In 2003, his second full season, he hit .317 even though his predicted average was only .282. This means that of the extra 40 points of batting average above his career mean (.317-.277), 5 were due to skill and 35 were due to luck, or randomness. The regression saw him not as a .317 hitter but as more of a .282 hitter, and it was correct, as the next season he batted only .272. It is interesting to note that in both of his first two full seasons, his batted ball statistics suggested that he was a .282 hitter, yet he hit 42 points higher in the second season than in the first. That seems to show the luck he had in his second year. This also happened in 2005-06, when the regression predicted he was a .271 hitter, but he hit .269 in 2005 and then .303 in 2006. Unfortunately Wells was seen as a much better hitter than he actually was, and was rewarded with a huge contract that the Jays had to unload for players of lesser ability than Wells.

Another thing to notice on the graph is how far Vernon has fallen off this year. He is currently hitting only .183 (through Thursday's games), but the regression is expecting him to be hitting .247. This means that although he has lost about 30 points of skill, from .277 to .247, he has also been very unlucky, hitting 64 points lower than expected. This is still a small sample size, so we will need to see how he performs the rest of the season to truly judge whether he has simply lost it or has just been unlucky.

Now that we have seen a good example, we can move on to what this entire exercise was all about: figuring out how much of Jose Bautista's increase in average so far this year is due to luck and how much is due to skill. Bautista is a career .252 hitter, so any average above that means that he has either been getting lucky or has become more skilled. Last year, the regression predicted that he would hit .261, and he actually hit .260, showing that all of his increase in batting AVG was skill, and he was actually slightly unlucky (by 1 point, more due to randomness than anything else).

Jose Bautista AVG from 2010-2011

And that leads us to this year. Bautista is currently hitting .363 (through Thursday's games), an amazing 111 points above his career average. The regression predicted that he would be hitting .326 at this point, so 74 points are due to skill and only 37 are due to luck. This means that 2/3 of his increase in batting average is due to skill. Obviously it is still a somewhat small sample size, and we need to see how he finishes the year, but we can say with reasonable confidence that this increase in batting average is not just a fluke. All of the keys to increasing batting average that I mentioned in the last post are evident in Bautista this year. He has increased his GB/FB rate, increased his line drive % by 5%, decreased his FB% by 8%, increased his HR/FB rate by 8%, and decreased his strikeout percentage by 3%. These have all led to a predicted increase in average, and thus we can see that the increase is mostly due to skill.
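The luck and skill splits quoted in this post can be reproduced with a few lines of arithmetic. The sketch below uses only the career, predicted, and actual averages given in the text; the function name `decompose` is just an illustrative label.

```python
def decompose(career: float, predicted: float, actual: float) -> tuple[int, int]:
    """Return (skill, luck) in points of batting average (1 point = .001)."""
    skill = round((predicted - career) * 1000)  # career baseline -> predicted
    luck = round((actual - predicted) * 1000)   # predicted -> actual
    return skill, luck


# (career, predicted, actual) as quoted in this post.
seasons = {
    "Wells 2003": (0.277, 0.282, 0.317),
    "Wells 2011": (0.277, 0.247, 0.183),
    "Bautista 2010": (0.252, 0.261, 0.260),
    "Bautista 2011": (0.252, 0.326, 0.363),
}

for label, (career, predicted, actual) in seasons.items():
    skill, luck = decompose(career, predicted, actual)
    print(f"{label}: skill {skill:+d}, luck {luck:+d}, total {skill + luck:+d}")

# Share of Bautista's 2011 gain that the model attributes to skill.
skill, luck = decompose(*seasons["Bautista 2011"])
skill_share = skill / (skill + luck)
print(f"Bautista 2011 skill share: {skill_share:.2f}")
```

Because skill and luck are defined as consecutive differences, they always add up to the full gap between the actual and career averages.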

This post set out to answer the question "can we differentiate between luck and skill in a batter's AVG?", and the answer turned out to be an emphatic yes. We showed that much of a hitter's variation in batting average can be attributed to skill, especially in the case of Jose Bautista. As I said in this post on Bautista, "Bautista probably won't end the year hitting .350, but we can reasonably expect him to finish the year hitting .320 or so." That was my gut feeling, and it is nice to know that the numbers back up that statement. He may well end up hitting over .350 this year, but it is more reasonable to expect him to hit somewhere around .330. He has fundamentally changed his approach in the last two years, and last year reaped the benefits by hitting 54 home runs. This year, he is still hitting home runs, but has now also become a high average hitter, due almost entirely to skill.

Friday, June 3, 2011

Differentiating between Luck and Skill Part I

In my last post on Jose Bautista, I noted that "the difference in batting average from his career to this season is due to both luck and skill, and unfortunately we can't exactly differentiate the two." After finishing the post, I realized that there was a way to differentiate between luck and skill, although it would take some time to figure out.

To accomplish this task, I decided to get hitting statistics and run regressions showing what statistics will predict AVG and BABIP. These stats aren't hits, home runs, or RBIs. Instead, I looked at stats such as line drive, ground ball, and fly ball percentages, and home run per fly ball ratios. These stats should more accurately show whether a player is getting lucky, or has made an adjustment and is now reaping the benefits.

The batted ball statistics that I am using are available only back to 2002, so my data set is all hitters that qualified for the batting title from 2002 through 2010. I decided not to include 2011, as most hitters have only played around 50 games, and the smaller sample size may skew results, albeit slightly. There are a total of 1397 player-seasons in the data set, each with at least 500 plate appearances.

The first thing to do is get the best regression model. Batting average could be modeled very well through statistics such as BABIP, sac flies, and home runs, but the goal of this analysis is to model average using batted ball statistics. As such, the regression equation will not explain as much of the data (lower R² value), but should better explain how much of the batting average is due to luck and how much is due to skill. To figure out the best regression model, I included all of the variables I thought would be useful (GB/FB ratio, LD%, GB%, FB%, HR/FB ratio, and SO), and ran a best subsets analysis. I did not include home runs, because HR/FB ratio already represents how often a player will hit a home run, and including home runs would just introduce a redundant variable. The best subsets analysis pointed the same way for ground balls, recommending the regression model AVG = GB/FB + LD% + FB% + HR/FB + SO, omitting GB%. This was measured using Mallows' Cp, and it also makes sense intuitively: GB/FB is just GB% divided by FB%, so given any two of the three variables the third can be solved for, and including all three adds no new information.
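For anyone who has not met Mallows' Cp, here is a minimal sketch of the standard formula, Cp = SSE_p / s² - n + 2p, where s² is the error variance estimated from the full model and p counts the intercept. Only the sample size of 1397 comes from this post. The two SSE values are hypothetical placeholders chosen for illustration, since the post does not report them.

```python
def mallows_cp(sse_subset: float, n_params: int,
               sse_full: float, n_params_full: int, n_obs: int) -> float:
    """Mallows' Cp for a candidate model with n_params (including intercept)."""
    sigma2_full = sse_full / (n_obs - n_params_full)  # error variance, full model
    return sse_subset / sigma2_full - n_obs + 2 * n_params


N_OBS = 1397  # size of the data set reported in the post

# Hypothetical sums of squared errors, for illustration only.
SSE_FULL = 0.7000    # all 6 predictors + intercept = 7 parameters
SSE_NO_GB = 0.7002   # GB% dropped: 5 predictors + intercept = 6 parameters

cp_full = mallows_cp(SSE_FULL, 7, SSE_FULL, 7, N_OBS)
cp_no_gb = mallows_cp(SSE_NO_GB, 6, SSE_FULL, 7, N_OBS)

print(f"Cp, full model:   {cp_full:.2f}")
print(f"Cp, without GB%:  {cp_no_gb:.2f}")
```

A candidate model looks good when its Cp is small and close to its own parameter count. Dropping a nearly redundant variable barely raises the SSE, so the smaller model ends up with the lower Cp, which is the pattern the best subsets analysis picked up for GB%.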

Now that I had the proper regression equation, I could run a multiple linear regression to determine the coefficients for each of the independent variables.

Coefficients:
             Estimate     Std. Error   t-value   Pr(>|t|)
Intercept    0.2579       1.899e-02    13.581    <2e-16  ***
GB/FB        5.233e-03    4.113e-03     1.272    0.2036
LD%          0.2365       2.492e-02     9.491    <2e-16  ***
FB%         -0.05494      2.899e-02    -1.895    0.0583  .
HR/FB        0.2129       1.151e-02    18.492    <2e-16  ***
SO          -3.791e-04    2.166e-05   -17.503    <2e-16  ***
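To make the fitted equation concrete, here is a minimal sketch that plugs a stat line into the coefficients above. It assumes that LD%, FB%, and HR/FB are entered as fractions (0.20 for 20%) and that SO is a season strikeout total, which is what the size of the coefficients suggests. The example hitter is hypothetical.

```python
# Coefficients copied from the regression output above.
COEF = {
    "intercept": 0.2579,
    "gb_fb": 5.233e-03,
    "ld": 0.2365,
    "fb": -0.05494,
    "hr_fb": 0.2129,
    "so": -3.791e-04,
}


def predict_avg(gb_fb: float, ld: float, fb: float, hr_fb: float, so: float) -> float:
    """Predicted batting average from batted ball statistics.

    Assumes ld, fb and hr_fb are fractions between 0 and 1,
    and so is the number of strikeouts in the season.
    """
    return (COEF["intercept"]
            + COEF["gb_fb"] * gb_fb
            + COEF["ld"] * ld
            + COEF["fb"] * fb
            + COEF["hr_fb"] * hr_fb
            + COEF["so"] * so)


# Hypothetical stat line, for illustration only.
example = predict_avg(gb_fb=1.2, ld=0.20, fb=0.36, hr_fb=0.10, so=100)
print(f"predicted AVG: {example:.3f}")
```

The difference between this predicted figure and a hitter's actual average is the part the model cannot explain from batted ball data.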


Although the R² value of the regression is only 0.3437, the goal of the regression is not to explain most of the variance in batting averages. It is to use advanced batted ball statistics to try to best model batting averages. The main statistics that we are interested in are the coefficients on each of the independent variables. I want to quickly interpret these coefficients, as well as provide some reasoning as to why these values make intuitive sense.

An increase of 1 in the ratio of ground balls to fly balls is associated with an increase of 5.23 points in a player’s batting average (e.g. from .250 to .25523). Ground balls have a higher probability of becoming hits than fly balls, so hitting more ground balls should lead to more hits. However, this will not affect batting average all that much, as 90% of the GB/FB rates are between 0.75 and 1.75. A player that tried to increase this ratio without fundamentally changing his swing would see only a small increase in the ratio, and thus only a small increase in batting average.
A one-percentage point increase in a player’s line drive percentage is associated with an increase of 2.37 points in a player’s batting average. Line drives are the batted balls that most often fall for hits, so hitting more line drives will result in a higher batting average. An increase in LD% is due almost entirely to skill, as it means that a player is squaring the ball up better. It's not possible to simply be lucky and have a significantly higher line drive percentage for an extended period of time.
A one-percentage point increase in a player’s fly ball percentage is associated with a decrease of 0.55 points in a player’s batting average. Fly balls fall for hits the least often, so hitting more fly balls will have a negative impact on batting average. The magnitude of this coefficient is about four times smaller than that of line drive %. Part of that gap comes from frequency: fly balls happen about twice as often as line drives, so one more line drive is a bigger relative change than one more fly ball. Adjusting for that, line drives have about twice the effect on batting average that fly balls do.
A one-percentage point increase in a player’s HR/FB rate is associated with an increase of 2.13 points in a player’s batting average. The more fly balls that leave the yard, the fewer that get caught, so the higher the batting average.

Finally, an increase of 10 strikeouts is associated with a decrease of 3.79 points in a player’s batting average. Obviously, if the batter does not put the ball in play, he cannot get a hit, so the more often a player strikes out, the lower his average will be. The following table summarizes the interpretation of the coefficients.

An increase of 1 in the following:    Leads to a change in AVG of the following:
GB/FB rate                            + 5.23 points
LD% (percentage point)                + 2.37 points
FB% (percentage point)                - 0.55 points
HR/FB rate (percentage point)         + 2.13 points
Strikeouts                            - 0.38 points
Overall, the key to being a good hitter is to hit lots of line drives, hit more ground balls than fly balls, make sure the fly balls you do hit have a higher than league average chance of becoming home runs, and avoid striking out. All of these traits are due mostly to skill, as batters must make conscious adjustments to make more contact as well as better contact.
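The "points" figures in the summary can be recovered directly from the regression coefficients. The sketch below assumes that one unit means a change of 1.0 in the GB/FB ratio, one percentage point (0.01) for LD%, FB%, and HR/FB, and one strikeout for SO. That unit mapping is my reading of the interpretation above, not something the regression output states.

```python
# Regression coefficients from the output earlier in this post.
COEF = {
    "GB/FB": 5.233e-03,
    "LD%": 0.2365,
    "FB%": -0.05494,
    "HR/FB": 0.2129,
    "SO": -3.791e-04,
}

# Assumed size of "an increase of 1" for each variable.
UNIT = {
    "GB/FB": 1.0,    # one full unit of the ratio
    "LD%": 0.01,     # one percentage point
    "FB%": 0.01,     # one percentage point
    "HR/FB": 0.01,   # one percentage point
    "SO": 1.0,       # one strikeout
}

# One point of batting average is .001, so multiply by 1000.
points = {name: COEF[name] * UNIT[name] * 1000 for name in COEF}

for name, value in points.items():
    print(f"{name:<6} {value:+.3f} points")
```

Changing a `UNIT` entry shows what a larger adjustment would be worth, for example `0.05` for a five-percentage-point jump in line drive rate.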

In my next post, I am going to show how we can use this model to determine whether batters get lucky or whether their increases in batting average are due to skill. We can graphically show how a batter's increase (or decrease) from his career average can be broken down into both luck and skill partitions. We will be able to see whether Jose Bautista's average this year is due to luck or skill, which will help us determine whether his average is sustainable or not.