Friday, June 3, 2011

Differentiating between Luck and Skill Part I

In my last post on Jose Bautista, I noted that "the difference in batting average from his career to this season is due to both luck and skill, and unfortunately we can't exactly differentiate the two." After finishing the post, I realized that there was a way to differentiate between luck and skill, although it would take some time to figure out.

To accomplish this task, I decided to gather hitting statistics and run regressions showing which statistics predict AVG and BABIP. Rather than hits, home runs, or RBIs, I looked at stats such as line drive, ground ball, and fly ball percentages, and the home run per fly ball ratio. These stats should show more accurately whether a player is getting lucky or has made an adjustment and is now reaping the benefits.

The batted ball statistics that I am using are only available back to 2002, so my data set is all hitters that qualified for the batting title from 2002 through 2010. I decided not to include 2011, as most hitters have only played around 50 games, and the smaller sample size may skew results, albeit slightly. There are a total of 1,397 player-seasons in the data set, each with at least 500 plate appearances.

The first thing to do is find the best regression model. Batting average could be modeled very well through statistics such as BABIP, sac flies, and home runs, but the goal of this analysis is to model average using batted ball statistics. As such, the regression equation will not explain as much of the variance (lower R² value), but it should better explain how much of the batting average is due to luck and how much is due to skill. To figure out the best regression model, I included all of the variables I thought would be useful (GB/FB ratio, LD%, GB%, FB%, HR/FB ratio, and SO) and ran a best subsets analysis. I did not include home runs, because HR/FB ratio already represents how often a player will hit a home run, and including home runs would just introduce a redundant variable. The best subsets analysis flagged a similar redundancy by recommending the regression model AVG = GB/FB + LD% + FB% + HR/FB + SO, omitting GB%. This was measured using Mallows' Cp, and it also makes sense intuitively: GB/FB is just GB% divided by FB%, so given any two of the three, the third can be computed and adds no new information.
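
For readers who have not seen a best subsets search, here is a minimal sketch of the idea in Python. It is not the software I used, and the data are synthetic: the predictor names x1, x2, x3 and the helper functions ols_sse, design, and mallows_cp are all made up for illustration. It fits every subset of candidate predictors by least squares and scores each with Mallows' Cp = SSE_p / MSE_full - n + 2p, where p counts the intercept. Picking the lowest Cp is one simple rule; a common guideline is to prefer a small Cp that is close to p.

```python
import itertools
import random

def ols_sse(X, y):
    """Residual sum of squares from a least-squares fit of y on X.

    X is a list of rows, each already starting with 1.0 for the intercept.
    """
    k = len(X[0])
    # Build the augmented normal equations [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def design(data, names):
    """Design matrix with an intercept column followed by the named predictors."""
    return [[1.0] + [row[name] for name in names] for row in data]

def mallows_cp(data, y, subset, candidates):
    """Mallows' Cp for `subset`, using the full candidate model to estimate MSE."""
    n = len(y)
    p = len(subset) + 1            # parameters in the subset model (with intercept)
    p_full = len(candidates) + 1   # parameters in the full model
    mse_full = ols_sse(design(data, candidates), y) / (n - p_full)
    return ols_sse(design(data, subset), y) / mse_full - n + 2 * p

# Synthetic data (an assumption for illustration): y depends on x1 and x2 only.
random.seed(0)
candidates = ["x1", "x2", "x3"]
data = [{name: random.random() for name in candidates} for _ in range(200)]
y = [0.25 + 0.2 * row["x1"] - 0.1 * row["x2"] + random.gauss(0, 0.01)
     for row in data]

# Score every non-empty subset of the candidates.
results = {}
for size in range(1, len(candidates) + 1):
    for subset in itertools.combinations(candidates, size):
        results[subset] = mallows_cp(data, y, list(subset), candidates)

best = min(results, key=results.get)
cp_full = results[tuple(candidates)]

for subset, cp in sorted(results.items(), key=lambda item: item[1]):
    print(subset, round(cp, 2))
```

By construction the full model always scores Cp equal to its own parameter count, so the interesting comparisons are among the smaller subsets.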

Now that I had the proper regression equation, I could fit a multiple linear regression to estimate the coefficient on each of the independent variables.

Coefficients:
             Estimate     Std. Error   t-value   Pr(>|t|)
Intercept     0.2579      1.899e-02     13.581   <2e-16  ***
GB/FB         5.233e-03   4.113e-03      1.272   0.2036
LD%           0.2365      2.492e-02      9.491   <2e-16  ***
FB%          -0.05494     2.899e-02     -1.895   0.0583  .
HR/FB         0.2129      1.151e-02     18.492   <2e-16  ***
SO           -3.791e-04   2.166e-05    -17.503   <2e-16  ***
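
To make the fitted equation concrete, here is a small sketch that plugs a stat line into the coefficients above. Two assumptions are mine: LD%, FB%, and HR/FB are entered as fractions (0.20 for 20%), which is the scale on which the coefficients produce sensible batting averages, and SO is a raw season strikeout total. The function name predict_avg and the example stat line are hypothetical and do not describe any real player.

```python
# Coefficients copied from the regression output above.
COEF = {
    "intercept": 0.2579,
    "gb_fb": 5.233e-03,
    "ld_pct": 0.2365,
    "fb_pct": -0.05494,
    "hr_fb": 0.2129,
    "so": -3.791e-04,
}

def predict_avg(gb_fb, ld_pct, fb_pct, hr_fb, so):
    """Predicted batting average from batted ball stats.

    Assumes ld_pct, fb_pct, and hr_fb are fractions (0.20 means 20%)
    and so is a raw season strikeout count.
    """
    return (COEF["intercept"]
            + COEF["gb_fb"] * gb_fb
            + COEF["ld_pct"] * ld_pct
            + COEF["fb_pct"] * fb_pct
            + COEF["hr_fb"] * hr_fb
            + COEF["so"] * so)

# Hypothetical stat line: 1.2 GB/FB, 20% LD, 38% FB, 10% HR/FB, 100 strikeouts.
example = predict_avg(gb_fb=1.2, ld_pct=0.20, fb_pct=0.38, hr_fb=0.10, so=100)
print(f"{example:.3f}")  # prints 0.274
```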


Although the R² value of the regression is only 0.3437, the goal of the regression is not to explain most of the variance in batting averages; it is to use advanced batted ball statistics to model batting averages as well as possible. The main statistics that we are interested in are the coefficients on each of the independent variables. I want to quickly interpret these coefficients and provide some reasoning as to why the values make intuitive sense.

An increase of 1 in the ratio of ground balls to fly balls is associated with an increase of 5.23 points in a player’s batting average (e.g. from .250 to .25523). Ground balls have a higher probability of becoming hits than fly balls, so hitting more ground balls should lead to more hits. However, this will not affect batting average all that much, as 90% of the GB/FB rates are between 0.75 and 1.75. A player who tried to increase this ratio without fundamentally changing his swing would only see a small increase in the ratio, and thus only a small increase in batting average. Note also that this coefficient is not statistically significant (p = 0.2036), so it should be read with caution.
A one-percentage point increase in a player’s line drive percentage is associated with an increase of 2.37 points in a player’s batting average. Line drives are the batted balls that most often fall for hits, so hitting more line drives will result in a higher batting average. An increase in LD% is due almost entirely to skill, as it means that a player is squaring the ball up better. It's not possible to simply be lucky and have a significantly higher line drive percentage for an extended period of time.
A one-percentage point increase in a player’s fly ball percentage is associated with a decrease of 0.55 points in a player’s batting average. Fly balls fall for hits least often, so hitting more fly balls will have a negative impact on batting average. The magnitude of this coefficient is about four times smaller than that of LD%. Part of the explanation is frequency: fly balls happen about twice as often as line drives, so one additional line drive is a larger relative change than one additional fly ball. Adjusting for that, line drives still have about twice the effect on batting average that fly balls do.
A one-percentage point increase in a player’s home run per fly ball ratio is associated with an increase of 2.13 points in a player’s batting average. The more fly balls that leave the yard, the fewer that get caught, so the higher the batting average.

Finally, an increase of 10 strikeouts is associated with a decrease of 3.79 points in a player’s batting average. Obviously, if the batter does not put the ball in play, he cannot get a hit, so the more often a player strikes out, the lower his average will be. The following table summarizes the interpretation of the coefficients.

An increase of:                        Is associated with a change in AVG of:
GB/FB rate (by 1.0)                    + 5.23 points
LD% (by 1 percentage point)            + 2.37 points
FB% (by 1 percentage point)            - 0.55 points
HR/FB rate (by 1 percentage point)     + 2.13 points
Strikeouts (by 1)                      - 0.38 points

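
The arithmetic behind these conversions can be sketched as follows. A "point" of batting average is .001, so each value is the coefficient times the size of the step, times 1000. The step sizes are my reading of the units (1.0 for GB/FB, one percentage point for the percentage stats entered as fractions, one strikeout for SO), and the names STEP and points_per_step are hypothetical.

```python
# Coefficients from the regression output (intercept omitted).
COEF = {
    "GB/FB": 5.233e-03,
    "LD%": 0.2365,
    "FB%": -0.05494,
    "HR/FB": 0.2129,
    "SO": -3.791e-04,
}

# Size of the change considered for each stat (an assumption about units):
# percentages are stored as fractions, so one percentage point is 0.01.
STEP = {"GB/FB": 1.0, "LD%": 0.01, "FB%": 0.01, "HR/FB": 0.01, "SO": 1.0}

def points_per_step(stat):
    """Change in batting average, in points (.001), for one step of `stat`."""
    return COEF[stat] * STEP[stat] * 1000

for stat in COEF:
    print(f"{stat:6s} {points_per_step(stat):+.3f} points")
```

Multiplying the strikeout figure by 10 recovers the 3.79 points per 10 strikeouts quoted earlier.
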
Overall, the key to being a good hitter is to hit lots of line drives, hit more ground balls than fly balls, make sure the fly balls you do hit have a better-than-league-average chance of becoming home runs, and avoid striking out. All of these traits are due mostly to skill, as batters must make conscious adjustments to make more contact as well as better contact.

In my next post, I am going to show how we can use this model to determine whether batters get lucky or whether their increases in batting average are due to skill. We can graphically show how a batter's increase (or decrease) from his career average can be broken down into both luck and skill partitions. We will be able to see whether Jose Bautista's average this year is due to luck or skill, which will help us determine whether his average is sustainable or not.








