Thursday, December 23, 2010

Using Markov Chains to Evaluate the Hitting of the 2009 Toronto Blue Jays

This is a paper I wrote for one of my statistics classes last year. The goal of the paper was to figure out the run probabilities for each of the 24 base-out states in baseball (8 base states, 3 out states). I have included the matrices I used to create the run expectancy table if you would like more background info on how exactly the table was created.

Matrices with probabilities:
Markov Chains - Probabilities

Markov Chains Paper:
Markov Chains

Tuesday, December 7, 2010

Predicting MLB Salaries through Offensive Statistics

This is a paper I wrote for my Econometrics class on predicting MLB Salaries through offensive statistics before and after Moneyball was written. It is a pretty long paper (15 pages plus figures, graphs, etc.), but it nicely blends economics and baseball.

MLB Salaries

Saturday, November 27, 2010

Improving a team's Pitching

I have already written two posts on the best way of improving a team, and improving a team's hitting. In this post, I want to do much of the same as the hitting post, but this time on pitching statistics. I am again going to run a linear regression model to determine which statistics are best correlated with pitching performance, which will show us which statistics can be best used to improve pitching.

In this model, instead of trying to estimate runs scored, I am going to use ERA as the dependent variable. Using runs against is a possibility, but since we are estimating the effect of statistics on pitching alone, and not on pitching and defense together, runs against would include the effect of defense, so it is not an appropriate DV in this scenario. We again need to be careful in our selection of independent variables so as to avoid collinearity.

Pitching statistics are almost the opposite of hitting statistics. Good hitters are generally grouped into two categories: those that can get on base, and those that can hit for power. Good pitchers are those who do not allow very many baserunners and do not allow many home runs. We can measure these qualities with two statistics. The first is walks plus hits per inning pitched (WHIP), which measures the average number of baserunners a pitcher allows per inning. The second is home runs allowed, which will not encompass all extra base hits, but should separate pitchers who do and do not allow many home runs and will hopefully be a decent proxy for all extra base hits. Finally, I am also going to include strikeouts as a predictor, because pitchers with high strikeout rates are valuable, and a pitcher with more strikeouts may allow fewer runs because he has to rely less on his defense. When we run the regression, we get the following:

                  Estimate        Std. Error     t value    Pr(>|t|)
Intercept    3.2986958     0.4207396    7.840      6.45e-14 ***
WHIP        0.4431320     0.2483891    1.784      0.0753 . 
SO            -0.0014423    0.0001842     -7.829    6.97e-14 ***
HR            0.0117201     0.0008172     14.342   < 2e-16 ***
R2 = 0.551

As you can see from the R2 value, this regression explains a lot less variability than the hitting regression. However, if we replace WHIP with the number of hits and walks given up, we get a much better fit:

                  Estimate        Std. Error     t value    Pr(>|t|)  
Intercept    -3.410e+00   2.855e-01    -11.942    <2e-16 ***
Hits           3.815e-03      1.547e-04    24.661     <2e-16 ***
BB            2.463e-03      1.414e-04    17.424     <2e-16 ***
SO            -4.711e-05     1.063e-04    -0.443      0.658  
HR            5.298e-03      4.567e-04    11.601     <2e-16 ***
R2 = 0.8906

Now, the R2 value is almost as high as in the hitting regressions. All of the variables are significant except for strikeouts, so when we take strikeouts out of the regression we get the following:

                  Estimate        Std. Error    t value    Pr(>|t|)  
Intercept    -3.5115171   0.1703878   -20.61     <2e-16 ***
Hits           0.0038502     0.0001321    29.14     <2e-16 ***
BB            0.0024598     0.0001410    17.45     <2e-16 ***
HR            0.0053015     0.0004561    11.62     <2e-16 ***
R2 = 0.8905

We can see how insignificant strikeouts were in the regression, because when we remove them the R2 value decreases by only 0.0001 (0.01 percentage points). We can now determine which variables impact pitching the most. One more hit given up is associated with a 0.00385 increase in ERA, one more walk given up is associated with a 0.00246 increase in ERA, and one more home run given up is associated with a 0.00530 increase in ERA. Since there are vastly different numbers of hits, walks, and home runs given up, we must also look at the mean of each to determine which will most affect ERA. The mean number of hits given up by a team in a single season is 1469.9, the mean walks is 540.2, and the mean home runs is 172.0. If we multiply the means by the coefficients, we get that, on average, hits will increase team ERA by 5.66, walks will increase ERA by 1.33, and home runs will increase ERA by 0.91. Obviously, we are only looking at statistics that will negatively impact (increase) ERA, so the numbers will look very high, as we are not inputting statistics such as outs or double plays that will positively impact (lower) ERA; in the regression, the negative intercept of about -3.51 offsets them.
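As a quick check on the arithmetic above, here is a minimal Python sketch that multiplies each coefficient from the final regression table by the season mean quoted in the text. The variable and function names are my own for illustration; only the numbers come from the post.

```python
# Coefficients from the final regression table (ERA ~ Hits + BB + HR).
INTERCEPT = -3.5115171
coefficients = {"Hits": 0.0038502, "BB": 0.0024598, "HR": 0.0053015}

# Mean team-season totals quoted in the text.
season_means = {"Hits": 1469.9, "BB": 540.2, "HR": 172.0}


def era_contributions(coefs, means):
    """Return coefficient * mean for each statistic."""
    return {stat: coefs[stat] * means[stat] for stat in coefs}


contributions = era_contributions(coefficients, season_means)
for stat, value in contributions.items():
    print(f"{stat}: {value:.2f}")

# Adding the intercept gives the ERA the model predicts for a team
# with exactly average hits, walks, and home runs allowed.
era_at_means = INTERCEPT + sum(contributions.values())
print(f"Predicted ERA at the means: {era_at_means:.2f}")
```

The last line shows why the individual contributions look so large: once the negative intercept is added back, the prediction for an average team lands at an ordinary-looking ERA.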

So from the results we can easily see that hits are the statistic that most impacts a team's ERA. The obvious solution for a team would be to give up fewer hits, but how? One way would be to acquire pitchers with greater command, possibly leading to those pitchers being able to "nibble" more, making hitters swing at worse pitches. This would probably increase walks, and we already saw that walks are also bad for ERA. A better solution would be to acquire pitchers with a low batting average against and also a low batting average on balls in play (BABIP - although it has been shown that BABIP fluctuates year to year and may not be consistent for any pitcher). Pitchers also want to give up fewer home runs, but if they can reduce the number of hits against them, this should in turn reduce the number of home runs against them.

Thursday, November 25, 2010

Improving a team's Hitting

As a follow up to my last post, which shows how improving a team through hitting or pitching is equally valuable, I wanted to look at how a team should improve their hitting. There are different ways to score runs, and because higher run totals lead to higher win totals, I want to figure out what is the best way to improve a team's hitting performance, thus leading to more runs and accordingly more wins.

Similar to last post, I am going to run a linear regression model, except this time I am going to use "Runs scored" as the dependent variable. Why not simply use "Wins"? If we were to run a regression model with hitting statistics as predictors and Wins as the dependent variable, we would have a much higher standard error and a much lower R2 value, showing that the variability in wins is not explained very well by the hitting statistics alone. So if we have runs as the DV, we need appropriate hitting statistics for the independent variables. This is trickier to figure out than expected, as we cannot have statistics that are strongly correlated with each other, or the regression model will experience "multicollinearity". What this means is that although the overall regression will predict the dependent variable nicely, we will not be able to tell which independent variables are accounting for the variability in the dependent variable. Although this may sound hard to prevent, it can be fairly straightforward, as a quick example will show. If we were to use on-base percentage, slugging percentage, and on-base plus slugging percentage as predictors for runs (or wins), our equation would have a multicollinearity problem. The overall regression may result in a low p-value, showing that we have predicted runs well, but each statistic individually would have a high p-value. We would not be able to tell which statistic is heavily influencing runs, because OPS is a redundant variable: it is simply the sum of OBP and SLG, so it measures what they are already measuring, and the best course of action is to remove it from the equation.

This regression demonstrates the collinearity issue:
                   Estimate    Std. Error     t value     Pr(>|t|)  
Intercept     -5.8651      0.2167         -27.070    <2e-16 ***
OBP            22.4591     17.5672       1.278       0.202   
SLG            15.0106     17.5936       0.853       0.394   
OPS            -4.2768      17.5879      -0.243       0.808

R2 = 0.9089
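To see why the individual coefficients in a regression like this cannot be trusted, here is a small Python sketch. It assumes OPS is exactly OBP + SLG (in real data, rounding makes the relationship only nearly exact), and the three (OBP, SLG) pairs are hypothetical teams made up for illustration. Moving weight from the OPS coefficient onto the OBP and SLG coefficients leaves every prediction unchanged, so the data cannot distinguish between the two sets of coefficients.

```python
def predict(intercept, b_obp, b_slg, b_ops, obp, slg):
    """Linear prediction where OPS is defined as OBP + SLG."""
    ops = obp + slg
    return intercept + b_obp * obp + b_slg * slg + b_ops * ops


# Coefficients from the table above: intercept, OBP, SLG, OPS.
original = (-5.8651, 22.4591, 15.0106, -4.2768)

# Fold the OPS coefficient into OBP and SLG, leaving OPS with zero weight.
k = original[3]
shifted = (original[0], original[1] + k, original[2] + k, original[3] - k)

# Hypothetical (OBP, SLG) pairs, not real teams.
teams = [(0.320, 0.400), (0.340, 0.430), (0.360, 0.460)]

differences = []
for obp, slg in teams:
    a = predict(*original, obp, slg)
    b = predict(*shifted, obp, slg)
    differences.append(abs(a - b))
    print(f"OBP={obp:.3f} SLG={slg:.3f}: {a:.4f} vs {b:.4f}")
```

The shifted coefficients (about 18.18 for OBP and 10.73 for SLG, with OPS dropped) give the same predictions as the original three-variable fit, which is why the standard errors on the individual terms are so large.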

In short, we are going to need to carefully pick our independent variables so they do not experience collinearity. I am going to use the following statistics to try and predict runs: OBP, SLG, and stolen base %. Although there are many different statistics to use, I am using these three because they represent the three basic ways to improve your team's hitting: get on base more, hit for more power, or become a more successful team running the bases. When we run the regression we get the following:

                   Estimate    Std. Error     t value    Pr(>|t|)  
Intercept     -6.0689     0.2246          -27.022   < 2e-16 ***
OBP           17.9231     0.9630         18.612     < 2e-16 ***
SLG           10.7619     0.4947         21.756     < 2e-16 ***
SBperc       0.3956       0.1387         2.852      0.00463 **
R2 = 0.9111


So these three factors explain over 91% of the variability in Runs per Game. Although the R2 value is only slightly higher than in the first regression that involved OPS, we can see that all of the statistics are now significant, as opposed to none of the statistics being significant. All three factors have significant effects on runs per game, and OBP has the largest coefficient. A ten-point increase in OBP (e.g. from .350 to .360) is associated with a 0.179 increase in runs per game. A ten-point increase in SLG (e.g. from .450 to .460) is associated with a 0.108 increase in runs per game. Finally, a one percentage point increase in SB% (e.g. from 70% to 71%) is associated with a 0.00396 increase in runs per game.
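The fitted equation can be evaluated directly to see what each change is worth. This is a minimal Python sketch using the coefficients from the table above; it assumes that SBperc was entered as a proportion (0.70 for 70%), which the post does not state explicitly, and the function name is my own.

```python
# Coefficients from the OBP + SLG + SBperc regression table.
INTERCEPT, B_OBP, B_SLG, B_SB = -6.0689, 17.9231, 10.7619, 0.3956


def predicted_runs_per_game(obp, slg, sb_rate):
    """Evaluate the fitted line; sb_rate is assumed to be a proportion."""
    return INTERCEPT + B_OBP * obp + B_SLG * slg + B_SB * sb_rate


base = predicted_runs_per_game(0.350, 0.450, 0.70)

# Change in predicted runs per game from moving one input at a time.
gain_obp = predicted_runs_per_game(0.360, 0.450, 0.70) - base
gain_slg = predicted_runs_per_game(0.350, 0.460, 0.70) - base
gain_sb = predicted_runs_per_game(0.350, 0.450, 0.71) - base

print(f"+.010 OBP: {gain_obp:.3f} runs per game")
print(f"+.010 SLG: {gain_slg:.3f} runs per game")
print(f"+1 point SB%: {gain_sb:.4f} runs per game")
```

Over a 162-game season, multiplying each per-game gain by 162 gives a rough sense of the extra runs involved.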

So what does this mean? The best way to increase your team's hitting is to try and score more runs per game. The best way to score more runs per game is to increase OBP. So the best way to increase a team's hitting is to acquire players that will get on base more often, whether it be through a hit, a walk, or a hit-by-pitch. Acquiring players that hit for power will also positively impact a team's hitting, but not as much as players that get on base. So if a team had a limited budget and could only acquire one or two significant players, they should try and acquire those players that can most improve their team's OBP.

Tuesday, November 23, 2010

Improving a Team - Pitching or Hitting?

After the close of the baseball season in early November, teams look to build for next year through trades, free agency, and the draft (which is more for 3-5 years into the future). But how do you build a better team, and more specifically, what will allow you to have a better team? The old question of pitching vs. hitting is always addressed differently by different teams. This year, the Giants used spectacular pitching with timely hitting to win the World Series, but just the year before the Yankees used a powerful lineup to bulldoze their way to a World Series win. So which is preferable - scoring more runs, or preventing more runs?

In order to answer that question, I looked at every team's statistics in the last 11 years (2000-2010), and gave each team a value for "Playoffs". A 1 meant that the team made the playoffs, a 0 meant the team did not make the playoffs. To estimate hitting, I used the statistic "runs per game", and to estimate pitching I used the statistic "runs against per game" (this really represents overall defense, including both fielding and pitching - to truly isolate pitching a more appropriate statistic would be something like ERA). I then ran a linear regression model, with RpG and RApG as predictors of the binary "Playoffs" statistic. The table below shows the results:

Coefficients:
                  Estimate       Std. Error      t value      Pr(>|t|)   
Intercept    0.28029       0.24027         1.167        0.244   
RpG           0.41595       0.03884         10.709      <2e-16 ***
RApG        -0.41884      0.03694        -11.339      <2e-16 ***
***: significant at the 0.001 level

So the regression line is: Playoffs = 0.28029 + 0.41595*RpG - 0.41884*RApG. Strictly, the intercept is the predicted value for a team that scores and allows zero runs per game, but because the two slope coefficients nearly cancel, it also approximates the prediction for any team that scores as many runs as it allows: roughly a 28% chance of making the playoffs. We know that this is close to being correct, as there are 30 teams competing for 8 playoff spots, so the probability of any team making the playoffs, given that all teams are equal, is 0.2667. The coefficient of runs per game shows that a one-run per game increase in RpG is associated with a 41.6 percentage point higher probability of a team making the playoffs. Runs against per game is very similar except that it works in the opposite direction, as a one-run per game increase in RApG is associated with a 41.9 percentage point lower probability of a team making the playoffs. As an example, if a team scores 4.50 runs per game and allows 4.50 runs per game, the probability of the team making the playoffs is 0.2673. If they increase their runs per game to 5.50 (an increase of exactly 1), the probability of the team making the playoffs will increase by 0.416 to 0.6832. If they then increase their runs against per game to 5.50 (again an increase of 1), their probability of making the playoffs will decrease to 0.2644 (a decrease of 0.419).
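The worked example can be reproduced by plugging the values into the regression line. This is a short Python sketch; the function name is my own, and the coefficients are the ones from the table above.

```python
def playoff_probability(rpg, rapg):
    """Fitted line: Playoffs = 0.28029 + 0.41595*RpG - 0.41884*RApG."""
    return 0.28029 + 0.41595 * rpg - 0.41884 * rapg


even_team = playoff_probability(4.50, 4.50)
better_offense = playoff_probability(5.50, 4.50)
both_higher = playoff_probability(5.50, 5.50)

print(f"4.50 scored, 4.50 allowed: {even_team:.4f}")
print(f"5.50 scored, 4.50 allowed: {better_offense:.4f}")
print(f"5.50 scored, 5.50 allowed: {both_higher:.4f}")
```

One caveat with fitting a straight line to a 0/1 outcome: nothing keeps the output between 0 and 1. For instance, 6.00 runs scored and 4.00 allowed gives a value above 1, so the numbers are best read as rough probabilities for typical teams rather than at the extremes.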

What this all means is that scoring runs and preventing runs have a very similar impact on a team's success (success defined by a team making the playoffs). Runs against is very slightly more important, but the difference is most likely negligible. Obviously this was a quick study, and only based on a small sample, but we can see that teams should be more concerned with overall talent of acquisitions rather than worry about acquiring only players that will help their hitting or pitching.

Saturday, November 6, 2010

World Series MVP

(Quick note, I typed out this post in entirety before Blogger deleted the whole thing even though it is supposed to save, so this version is unfortunately a little shorter!)

Last Monday night, the San Francisco Giants won their first World Series since 1954, when Willie Mays and the New York Giants won. Ironically, no Giants championship team had ever had a World Series MVP, as the MVP award did not begin until 1955, the year after the Giants last won. Unlike previous Giants teams, with superstars such as Mays, Willie McCovey, and Barry Bonds, this team was led by great pitching and managed to get just enough timely hits to squeak into the postseason on the last day. In this post, I want to determine who should have been the MVP of this team, and whether or not the voters picked the right player (Edgar Renteria ended up winning the award).

We need to ask two questions before we decide who the MVP should have been. First, what exactly is the MVP? Obviously, it is the "most valuable player", but what exactly does that mean? Does it mean the best overall player? Obviously not, as only one player in World Series history has won the WS MVP while playing for the losing team (Bobby Richardson in 1960, when the Yankees outscored the Pirates 55-27 but still managed to lose in seven games). I believe that the MVP is the player that best gives his team the chance to win each game, and as a result the series. The question now becomes: how do you measure exactly how "valuable" a player is to his team? I am going to use the Win Probability Added statistic, which is the sum of the changes in the probability of a player's team winning. More simply, WPA "looks at" each play and determines the team's probability of winning before and after the play occurs, and the difference is credited or debited to the players involved. For example, if a team was winning in the ninth inning, their probability of winning would be fairly high, say 80%. If a player on the opposing team then hit a home run to tie the game, and the probability of the first team winning fell to, say, 55%, then the player who hit the home run would have a +0.25 WPA for that play, and the pitcher who gave up the home run would have a WPA of -0.25 for the play. Each team starts at a 50% chance of winning, and one team ends with a 100% chance of winning, so WPA measures how much each player individually contributed to winning the game. It is heavily dependent on the leverage of the situation, as a go-ahead home run in the ninth inning obviously gives the team a better chance to win than a go-ahead home run in the first inning.
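The bookkeeping behind WPA can be sketched in a few lines of Python. This is only an illustration of the crediting rule described above, using the home run example from the text; the function and the player labels are hypothetical, and real WPA also requires a win expectancy table to supply the before and after probabilities.

```python
from collections import defaultdict


def credit_play(totals, batter, pitcher, wp_before, wp_after):
    """Credit one play, given the batting team's win probability before and after."""
    change = wp_after - wp_before
    totals[batter] += change   # the hitter gains what his team gained
    totals[pitcher] -= change  # the pitcher is debited the same amount
    return change


totals = defaultdict(float)

# The leading team was at 80%, so the batting team was at 20%.
# After the tying home run the leading team is at 55%, the batting team at 45%.
change = credit_play(totals, "hitter", "pitcher", 0.20, 0.45)

print(f"Hitter WPA: {totals['hitter']:+.2f}")
print(f"Pitcher WPA: {totals['pitcher']:+.2f}")
```

Because every credit is matched by an equal debit, the WPA across both teams always sums to zero, and the winning team's players always sum to +0.5 for the game.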

To determine which player should be the MVP, I summed the WPA for each player during each game in the World Series to determine the overall World Series WPA. I have created two tables below, for the position players and the pitchers, which rank the players in terms of overall WPA for the World Series, and also which games they appeared in.

Players:

Player               World Series WPA    Games Played In
Edgar Renteria       0.403               1, 2, 3, 4, 5
Aubrey Huff          0.147               1, 2, 3, 4, 5
Andres Torres        0.139               1, 2, 3, 4, 5
Cody Ross            0.108               1, 2, 3, 4, 5
Freddy Sanchez       0.094               1, 2, 3, 4, 5
Mike Fontenot        0.000               2
Juan Uribe           -0.046              1, 2, 3, 4, 5
Travis Ishikawa      -0.057              1, 4
Aaron Rowand         -0.058              2, 5
Pablo Sandoval       -0.081              3
Nate Schierholtz     -0.091              1, 2, 4
Buster Posey         -0.154              1, 2, 3, 4, 5
Pat Burrell          -0.423              1, 2, 3, 5

Pitchers:

Pitcher              World Series WPA    Games Pitched In
Matt Cain            0.495               2
Tim Lincecum         0.477               1, 5
Madison Bumgarner    0.477               1
Brian Wilson         0.131               1, 4, 5
Santiago Casilla     0.059               1
Javier Lopez         0.047               1, 2
Sergio Romo          0.021               1
Jeremy Affeldt       0.021               1, 3
Guillermo Mota       0.020               2, 3
Ramon Ramirez        0.009               1, 3
Jonathan Sanchez     -0.162              3

As you can see above, three of the four most valuable players on the Giants in the World Series happened to be pitchers. Edgar Renteria was the only hitter with a significant WPA at .403, while Matt Cain had a .495 WPA and Tim Lincecum and Madison Bumgarner both had a .477 WPA. The question now becomes, do you give the MVP vote to a pitcher such as Matt Cain, who pitched brilliantly and had the highest overall WPA, but only appeared in one game, or do you give it to Edgar Renteria, who had a lower WPA but played in every game? The voters decided to give it to Renteria, and I tend to agree with them (another issue would have been which pitcher to give it to: all three pitchers were very close in WPA, and I tend to think they would have given it to Lincecum as he pitched in two games and had the performance most fresh in the voters' minds). In this case, the voters' "gut instincts" actually agreed with the statistics.

One final note of interest is the Giants' overall 1.5 WPA (they won 4 games, each with a WPA of +0.5, and lost one game with a WPA of -0.5). Of that 1.5, approximately 1.6 WPA came from the pitchers, while the hitters actually had a negative impact on the probability of the Giants winning the World Series with a WPA of -0.1. So even though the Giants became the first team to score at least 20 runs in the first two World Series games, the pitching, just like all year, was the reason that they won the World Series. Ironic, then, that a hitter still managed to win the MVP.
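The split between pitching and everything else can be checked with a short Python sketch. The pitcher values are copied from the pitchers table in this post; the variable names are my own.

```python
# World Series WPA for each Giants pitcher, from the pitchers table.
pitcher_wpa = {
    "Matt Cain": 0.495, "Tim Lincecum": 0.477, "Madison Bumgarner": 0.477,
    "Brian Wilson": 0.131, "Santiago Casilla": 0.059, "Javier Lopez": 0.047,
    "Sergio Romo": 0.021, "Jeremy Affeldt": 0.021, "Guillermo Mota": 0.020,
    "Ramon Ramirez": 0.009, "Jonathan Sanchez": -0.162,
}

wins, losses = 4, 1
# A win is worth +0.5 WPA to the team and a loss -0.5.
team_wpa = 0.5 * wins - 0.5 * losses

pitching_wpa = sum(pitcher_wpa.values())
# Whatever is left over after pitching is the hitting side.
remaining_wpa = team_wpa - pitching_wpa

print(f"Team WPA: {team_wpa:.1f}")
print(f"Pitching WPA: {pitching_wpa:.3f}")
print(f"Remaining (non-pitching) WPA: {remaining_wpa:.3f}")
```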

Thursday, October 28, 2010

Alex Anthopoulos and the Playoffs (Part 2)

Yesterday I posted part one, describing how we can measure the effectiveness of the Blue Jays' front office. I also analyzed the Roy Halladay trade, which showed that the Jays lost approximately 7.9 games because of the trade. In the post today, I want to look at the remaining meaningful trades made by the front office this year, in mostly chronological order.

To recap quickly, Alex Anthopoulos was promoted on October 3. His first move, trading Halladay, was completed on December 16. Just a week later, on December 23, he made his second significant trade. The Mariners, looking for a shutdown late-inning relief pitcher, believed that Brandon League could help the team as they challenged for a playoff berth this year (that turned out well!). In return, the Jays received Brandon Morrow, a highly anticipated starter with outstanding "stuff". The Mariners drafted him 5th overall in 2006, but never seemed to be able to decide whether to turn him into a starter or try to have him become a closer. Morrow never panned out in his years with Seattle, so he was traded to the Jays. He finally had his breakout season this year, showing occasional flashes of brilliance (see here) while finishing 10-7 with an ERA of 4.49. He ended the 2010 season with a WAR of 1.6, while League struggled, posting a -0.1 WAR. Assuming an "average" replacement for League in the bullpen, the Jays gained 1.5 wins from the trade. The trade looked so one-sided, even when it was made, that many people wondered if it was part of the Halladay trade (where Cliff Lee ended up going to Seattle), but that was not true. The Jays simply made a great trade, which paid off this year and should pay off for many years to come.


After the two big December trades, Anthopoulos mostly tinkered with the lineup until the season started, with only a couple of minor moves. On January 20, the Jays traded for relief pitcher Merkin Valdez, sending cash considerations to the Giants. Valdez ended the year with a -0.1 WAR. On February 6, AA traded for Dana Eveland, a pitcher with the A's. Eveland ended the year with a -0.8 WAR, but assuming an average replacement player, we have already counted his stats in the Roy Halladay trade (see part one), so we cannot count them again. Eveland pitched poorly enough to be sent to the Pirates on June 1 for Ronald Uviedo, who did not pitch in the MLB this year. There were also some insignificant moves scattered throughout the year where AA acquired minor league players such as Zach Johnson and Casey Fien.

The next significant trade was made on April 15. The Jays acquired outfielder Fred Lewis from the Giants for a player to be named later or cash. Lewis ended the year with a 0.8 WAR, which looks like it helped the team, but we must also take some other things into consideration. The plate appearances that Lewis had would have been given to someone else if he had not been acquired. The likely recipients of the PAs would have been Travis Snider and Dewayne Wise, with most going to Snider. In 319 PAs, Snider had a 0.9 WAR, so assuming that 319 of the 440 PAs that Lewis had would have gone to Snider, and that Snider would have performed at the same rate, those 319 PAs could have been worth 0.9 wins. If we give the other 121 PAs to Wise, who had a 0.1 WAR in 118 PAs, he probably would have produced about 0.1 more wins. So overall, those 440 PAs would have produced about 1.0 wins if Lewis had not been acquired, and the net result of the trade was actually -0.2 wins.
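The plate appearance reallocation above works out as follows in Python. This is a rough sketch under the same assumption the post makes, that each player would keep producing WAR at the same rate per plate appearance; the variable names are my own.

```python
# Season totals quoted in the text.
lewis_pa, lewis_war = 440, 0.8
snider_pa, snider_war = 319, 0.9
wise_pa, wise_war = 118, 0.1

# Hand Lewis's plate appearances to the two likely replacements.
pa_to_snider = 319
pa_to_wise = lewis_pa - pa_to_snider  # the remaining 121

# Scale each replacement's WAR by the extra plate appearances he would get.
replacement_war = (snider_war / snider_pa) * pa_to_snider \
    + (wise_war / wise_pa) * pa_to_wise

net_wins = lewis_war - replacement_war

print(f"Replacement WAR over {lewis_pa} PAs: {replacement_war:.2f}")
print(f"Net wins from acquiring Lewis: {net_wins:+.1f}")
```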

The final significant trade was made on July 14, when the Jays traded Alex Gonzalez to the Braves for Yunel Escobar and Jo-Jo Reyes (who did not pitch in the majors in 2010 for the Jays). Gonzalez had gotten off to a hot start with 17 home runs in the first 85 games of the year, but hit only 6 more for the Braves. This trade was a classic case of selling high, with the 33-year-old Gonzalez having a career year, while Escobar was in the doghouse in Atlanta but still had a lot of potential. Before the trade, Gonzalez had a 2.8 WAR, while Escobar only had a 0.9 WAR. However, after the trade, Gonzalez only had a 0.9 WAR while Escobar had a 1.0 WAR. So the net gain of the trade was 0.1 wins, and considering the trade was made with the future in mind, it turned out very well for this year (and should help out over the next couple of years with Gonzalez slowing down and Escobar just hitting his peak).

Now that we have looked at each individual trade, we can see the overall result of the trading. Here is a quick summary:
Halladay trade: -7.9 wins
Morrow trade: +1.5 wins
Valdez trade: -0.1 wins
Lewis trade: -0.2 wins
Escobar trade: +0.1 wins

So the trades had an overall net of -6.6 wins (6.6 more losses). This means that had none of the trades been made, the Jays would have won between 6 and 7 more games this year. Looking at the AL East standings this year:

TB: 96-66 (won division)
NYY: 95-67 (won wild card)
BOS: 89-73
TOR: 85-77
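The net effect of the trades and the resulting no-trade win total can be tallied with a few lines of Python. The trade values and win totals are the ones listed above; the variable names are my own.

```python
# Estimated 2010 win impact of each trade, from the summary above.
trade_effects = {
    "Halladay": -7.9,
    "Morrow": 1.5,
    "Valdez": -0.1,
    "Lewis": -0.2,
    "Escobar": 0.1,
}

# Actual 2010 win totals in the AL East, as listed above.
actual_wins = {"TB": 96, "NYY": 95, "BOS": 89, "TOR": 85}

net_effect = sum(trade_effects.values())

# Undoing every trade removes the net effect from the actual record.
no_trade_wins = actual_wins["TOR"] - net_effect

print(f"Net effect of all trades: {net_effect:+.1f} wins")
print(f"Estimated wins with no trades: {no_trade_wins:.1f}")
```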

So the Jays would have won between 91 and 92 games, putting them ahead of Boston but still out of the playoff picture. If we want to look at the best case scenario for this year only, then Halladay would not have been traded. If the Jays were going for the playoffs this year, they probably would not have traded Gonzalez, as he was the "rental" player in the deal, or the one that could help his team win now, not in the future. So if we assume only the Morrow trade was made (which may not be true, because the Mariners only traded Morrow after trading for Cliff Lee), the Jays would have gained 8.1 wins over their actual record: undoing the Halladay, Valdez, Lewis, and Escobar trades is worth 7.9 + 0.1 + 0.2 - 0.1 wins, while the 1.5 wins from the Morrow trade are already part of the 85-77 record. That would have put them at about 93 wins, ahead of Boston, but still about two games behind the Yankees. Given that the Yankees somewhat tanked during the stretch in order to draw the Twins in the first round of the playoffs, they probably could have won more games if they were in a "real" race for the playoffs, and not just the division. So, we can say with some confidence that the Jays still probably would not have made the playoffs (but you never know, if most of those extra wins had come against the Yankees, maybe they would have!).

Even in this fairly simple analysis, we can see that if the Jays had really tried hard to make the playoffs this year, they still would have fallen just a little short, and would be sitting in a far worse position for the upcoming years. Seeing that almost every move this year besides the Halladay trade helped, or at least didn't hurt the Jays, the front office did a good job both strengthening the team now and building for the future. The jury is still out on the Halladay trade, and will be for a while until all of Drabek, D'Arnaud, and Gose have played significant time in (hopefully) the majors. But the overall conclusion of this analysis is that Alex Anthopoulos and the rest of the front office did a very nice job with the trades this year, and hopefully the Jays will reap the benefits in the near future.

Wednesday, October 27, 2010

Alex Anthopoulos and the Playoffs (Part 1)

This post started out as an analysis of how the Blue Jays could have possibly made the playoffs this year if they had not traded Roy Halladay, but it evolved into something much bigger. I want to look at all of the trades in the last 12 months and determine how well the Jays did in each trade for this year only, and then see if they could have made the playoffs by not trading anyone, or by only making certain trades.

The starting date of this analysis is October 3, 2009, when the Jays, among other moves, fired J.P. Ricciardi and promoted Alex Anthopoulos to general manager. This signaled a shift in the strategy of the front office, moving from signing older, high-priced veterans like Frank Thomas and A.J. Burnett to making the scouting staff the largest in the majors and developing the farm system. Although it has only been a year, the results are already palpable, and the future is looking bright for the Jays.

The first major move of Anthopoulos's reign as GM was to trade away Roy Halladay, who was set to become a free agent after this year and would not return to Toronto as he was in search of a playoff-caliber team. It was a difficult task, with one of the franchise's most popular players ever being traded, along with a very small market as many teams could not afford Halladay. Eventually, Philadelphia became just about the only option, which meant that they would have leverage over the Jays. Finally, on December 16, the Jays traded Doc to the Phillies for prospects Kyle Drabek, Travis D'Arnaud, and Michael Taylor. They then traded Taylor to the A's for Brett Wallace, another highly sought after prospect who had been involved in the Matt Holliday trade. Later on this year, on July 29, AA traded Wallace to Houston for yet another prospect, Anthony Gose, whom the Jays had been trying to acquire all along, but the Phillies had refused to include him before shipping him to the Astros in the Roy Oswalt trade.

To determine the outcome of this trade (as well as every other trade), I am going to only look at this year's production. So even though the Halladay trade was to acquire prospects that will be major league ready in 2-3 years, I want to see how the trades played out this year. I am going to use the Wins Above Replacement statistic to measure each player's value to his team. Although it sounds simple, it can get complex (as it will in the Halladay trade), because we have to account for both the traded player's WAR as well as the WAR of the players who replaced the traded player.

This year, Roy Halladay had an overall WAR of 6.5, but -0.4 of that was due to offense, so if he had been playing for the Jays (with the DH), he would have had a WAR of 7.3. I am assuming all of the values would remain constant no matter what team the player is on. That was the easy part. Now we need to figure out which pitchers had starts this year that Halladay would have had if he pitched in Toronto. Doc made 33 starts this year, and conveniently, the top 5 starters for the Jays this year (Romero, Marcum, Cecil, Morrow, and Rzepczynski) had 129 starts, with 6 other pitchers recording the remaining 33 starts of the 162-game season. So presuming that Halladay would have made those other 33 starts, we can figure out exactly how much the Jays lost when they traded Halladay away.

The six other Blue Jays pitchers who started at least one game this year were (with starts in parentheses): Brian Tallet (5 starts), Jesse Litsch (9), Dana Eveland (9), Brad Mills (3), Shawn Hill (4), and Kyle Drabek (3). If we add up the WAR from each pitcher's starts, we can determine the total wins lost due to the trade. Tallet had an overall WAR of -1.4, but in his 5 starts his WAR was -0.35. Litsch pitched only as a starter, and had an overall WAR of -0.1. Eveland had an overall WAR of -0.8. In Mills' three starts, he had a WAR of 0.13. Hill had a WAR of 0.4, and finally Drabek had a WAR of 0.1. The total WAR for the 33 starts was -0.6, which means that if Roy Halladay had made the 33 starts instead of these pitchers, the Jays would have won about 7.9 more games.
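The replacement calculation can be laid out in a few lines of Python. The starts and WAR values are the ones quoted above, and the 7.3 figure for Halladay is the DH-adjusted WAR used in this post; the variable names are my own.

```python
# (starts, WAR as a starter) for the six pitchers who filled the other starts.
replacement_starters = {
    "Tallet": (5, -0.35),
    "Litsch": (9, -0.1),
    "Eveland": (9, -0.8),
    "Mills": (3, 0.13),
    "Hill": (4, 0.4),
    "Drabek": (3, 0.1),
}

halladay_war_as_jay = 7.3  # DH-adjusted value used in the post

total_starts = sum(starts for starts, _ in replacement_starters.values())
replacement_war = sum(war for _, war in replacement_starters.values())

# Wins lost is Halladay's value minus what the replacements produced.
wins_lost = halladay_war_as_jay - replacement_war

print(f"Replacement starts: {total_starts}")
print(f"Replacement WAR: {replacement_war:.2f}")
print(f"Wins lost by trading Halladay: {wins_lost:.1f}")
```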

The Halladay trade was just the first of many trades this year by AA, so tomorrow I am going to post part 2, which will evaluate the rest of the trades and determine whether or not the Jays could have made the playoffs this year with certain trades. (See here for part 2)

Sunday, October 24, 2010

Fact of the Week XI: Most and Least Valuable Blue Jays Postseason Games

As the playoffs move past the championship series and on to the World Series, I wanted to take a quick look at the most and least valuable postseason games by a Blue Jays hitter and pitcher. Considering the Jays have only played 41 playoff games ever, and haven't made the playoffs since they won the World Series back-to-back in 1992 and 1993, all of these games will be from the mid-80s and early 90s.

The most valuable game ever by a Blue Jays hitter was by Devon White, in game 4 of the 1993 World Series. His Win Probability Added (WPA) for that game was 0.719, which is extraordinarily high. It is actually the 11th most valuable game ever in the postseason by a player. In the 15-14 Blue Jays win, White went 3-for-5 with 2 runs and 4 RBIs. The big hit came in the top of the 8th inning, with 2 runners on and 2 out, with the Jays losing 14-13. White hit a two run triple to put the Jays on top 15-14, and the play increased the Jays' win expectancy from 24% to 74%.

The least valuable game ever by a Blue Jays hitter was John Olerud's, in game 2 of the 1992 World Series. His WPA for the game was -0.273, as he went 0-for-4 with a strikeout. The biggest blow came in the top of the eighth with 1 out and runners on first and third and the Jays trailing 4-3. Olerud popped out to the third baseman, which resulted in a decrease in win expectancy of 16%. Coincidentally, the Jays still ended up winning the game 5-4, in large part because of Ed Sprague, who hit a pinch-hit, two-run home run in the top of the 9th to give the Jays the 5-4 lead. His one at-bat in the game increased the Jays' win expectancy by 67%, which was actually the second-most valuable postseason game by a Jays hitter behind Devon White.

For pitchers, the most valuable postseason game ever pitched was by David Cone in game 2 of the ALCS in 1992. Cone pitched 8 innings, giving up 5 hits and only one run, and the Jays won 3-1, resulting in a game score of 71 and a win probability added of 0.413.

The least valuable postseason pitching performance was by Todd Stottlemyre in game 4 of the 1993 World Series. This was the 15-14 Jays win in which Devon White had the most valuable position player game. Stottlemyre started the game for the Jays and went 2 innings, giving up 6 runs, and had a WPA of -0.486. Interestingly enough, the fourth worst pitching performance was by Al Leiter, who came in to relieve Stottlemyre, pitching 2.2 innings and giving up another 6 runs. Amazingly, the Jays still won, and in both cases of the least valuable performances they had comeback victories which overshadowed the bad performances by players.

Hopefully in the next couple of years the Blue Jays will be able to add some players to any of the lists above, which will mean that they made the playoffs again after a long drought.

Sunday, October 17, 2010

Fact of the Week X: Multi Home Run Postseason Games

In game one of the NLCS last night, Cody Ross hit two home runs off of Roy Halladay. It is extremely hard to hit two home runs in a game in the majors, let alone two against Roy Halladay in the postseason. Ross became only the 5th player to hit two home runs in the first game of a League Championship Series. As a comparison, there have been 10 players with at least 2 home runs in an LCS game 2, 8 players in a game 3, 7 in a game 4, 2 in a game 5 or 7, and only 1 in a game 6. This is presumably based upon the starting pitcher on the opposing team, as the ace of the staff will usually pitch games 1, 4 or 5, and possibly 7. The lack of multi-homer games in the later series games (5-7) is probably due to the lack of overall games (if a series ends in a sweep there will not be any game 5, 6, or 7), the quality of starting pitching, and the strategy of managers playing for one-run innings, which means more sacrifice bunts and fewer at-bats.

Ross also became only the 4th player to hit 2 home runs in a postseason game while batting 8th in the lineup. More impressively, the other three players were all hitting with a position player behind them (a game involving the DH), so Ross became the first player to ever hit 2 home runs with the pitcher batting behind him.

Finally, he became only the 19th player to hit at least 2 home runs in his first five postseason games in either the World Series or LCS (18 other players have hit at least two in their first five games in the LDS). All of these statistics are fairly impressive, and become more impressive when you consider that he did it against Roy Halladay, who threw the second no-hitter in postseason history in his last start.

Thursday, October 14, 2010

Predicting Playoff Series

Now that the first round of the playoffs is over, we are left with only 4 teams: the Yankees, the Rangers, the Phillies, and the Giants. In the American League Championship Series, the Rangers are hosting the Yankees, while in the NLCS the Phillies are hosting the Giants. The Phillies and Yankees are the prohibitive favorites to win their respective series and advance to the World Series for a rematch of last year. But what are the chances of each of the four teams winning their series?

To calculate the odds of each team advancing, I used three different statistics. The first was the team's regular season record in 162 games, which I converted to a probability between 0 and 1 (each team is actually between 55% and 60%). The second is their Pythagorean Win-Loss record, which uses runs scored and runs against to predict what each team's record should have been. I again converted the Pythagorean Win-Loss record to a probability. Finally, I used the team's overall record, combining the regular season record and playoff record through the Division Series, and converted it to a probability. These three estimates will all be used to predict the probability of each team advancing, as three estimates should produce a more accurate result than one. The summary for each team is found below:
 
Texas Rangers
Regular Season record: 90 wins - 72 losses (win probability = 0.55556)
Pythagorean W-L = 91-71 (0.56173)
Overall record: 93-74 (0.55689) - Won their first round series 3-2.

New York Yankees
Regular Season record: 95-67 (0.58642)
Pythagorean W-L = 97-65 (0.59877)
Overall Record: 98-67 (0.59394) - Won their first round series 3-0.

Philadelphia Phillies
Regular Season record: 97-65 (0.59877)
Pythagorean W-L = 95-67 (0.58642)
Overall Record: 100-65 (0.60606) - Won their first round series 3-0.

San Francisco Giants
Regular Season record: 92-70 (0.56790)
Pythagorean W-L = 94-68 (0.58025)
Overall Record: 95-71 (0.57229) - Won their first round series 3-1.
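
As a rough sketch of the conversion used in the summaries above, each record becomes a probability by dividing wins by total games. The function name and the dictionary layout are only illustrative assumptions; the records themselves are the Pythagorean W-L figures listed above.

```python
def win_probability(wins, losses):
    """Convert a win-loss record into a win probability between 0 and 1."""
    return wins / (wins + losses)

# Pythagorean W-L records as listed in the team summaries above.
pythagorean_records = {
    "Rangers": (91, 71),
    "Yankees": (97, 65),
    "Phillies": (95, 67),
    "Giants": (94, 68),
}

for team, (wins, losses) in pythagorean_records.items():
    print(team, round(win_probability(wins, losses), 5))
```

The same function works for the regular season and overall records; only the win and loss totals change.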

The next part gets a little tricky, as I used a statistical software package to compare the win probability means. I generated 1000 random data points for each win probability, and each data set had a normal distribution of mean = win probability and standard deviation = 1. These data sets are very close to the standard normal distribution of mean = 0 and standard deviation = 1 (if you want to know more about the standard normal distribution, you can see Wikipedia). What this means is that 68% of the data points will be between win probability - 1 and win probability + 1, and since we are creating two different data sets, there should be some difference between the sets if they have different means. The standard deviations represent the differences in performances, as a team will play a different game every night, and not always perform the same. I used 1000 data points so that I could sufficiently determine whether or not there was a difference between the teams, as the more data points, the more precise the estimate. I then subtracted the two data points for the teams that were playing against each other, and figured out a difference. The difference was "home team" - "away team", so if the difference was >= 0, it meant the home team was "better" than the away team, or in my analysis, that the home team won the series. If the difference of "home team" - "away team" < 0, then it meant that the away team would win the series. Finally, I totaled the differences to determine how many times out of 1000 each team would win the series. I did this for each set of win probabilities (regular season, Pythagorean, and total) to find three estimates to the probability of each team winning a series, found below.
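
The procedure described above can be sketched roughly as follows. This is only an approximation of the idea, not the statistical package actually used: the function name, the `seed` parameter, and the use of Python's `random` module are all assumptions made for illustration.

```python
import random

def simulate_series(home_p, away_p, trials=1000, sd=1.0, seed=None):
    """Estimate how often the home team 'wins' under the scheme described above.

    Each trial draws one normal value per team (mean = win probability,
    standard deviation = sd) and counts a home win when home - away >= 0.
    """
    rng = random.Random(seed)  # seed is optional, for repeatable runs
    home_wins = 0
    for _ in range(trials):
        home_draw = rng.gauss(home_p, sd)
        away_draw = rng.gauss(away_p, sd)
        if home_draw - away_draw >= 0:
            home_wins += 1
    return home_wins / trials

# ALCS using regular season records: Rangers (home) vs. Yankees (away).
rangers_share = simulate_series(0.55555, 0.58642)
print(f"Rangers win series: {rangers_share:.1%}, Yankees: {1 - rangers_share:.1%}")
```

Because the draws are random, each run gives a slightly different split, which is why the three estimates above differ by a point or two.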

ALCS
Using regular season record: Rangers win series 474 times = 47.4%, Yankees win series 52.6%
Using Pythagorean regular season record: Rangers win series 495 times = 49.5%, Yankees win 50.5%
Using regular season and playoff record: Rangers win series 483 times = 48.3%, Yankees win 51.7%

NLCS
Using regular season record: Phillies win series 527 times = 52.7%, Giants win series 47.3%
Using Pythagorean regular season record: Phillies win series 506 times = 50.6%, Giants win series 49.4%
Using regular season and playoff record: Phillies win series 529 times = 52.9%, Giants win series 47.1%

So in the ALCS, the Yankees win the series between 50.5% and 52.6% (and on average 51.6%) of the time according to this estimate. In the NLCS, the Phillies win the series between 50.6% and 52.9% (and on average 52.07%) of the time. These estimates are slightly lower than some other estimates (like here). This is probably because a standard deviation of 1 is very large relative to the gaps between the teams' win probabilities (at most about 0.04), so each simulated matchup comes out close to a coin flip. Using more data points makes the estimate more precise, but it does not move it away from 50%; only a smaller standard deviation would do that.

So my picks are that both series should be pretty close, but just like popular opinion, the Phillies and Yankees should prevail and meet in the World Series for a second straight year.

Friday, October 8, 2010

Fact of the Week IX: Postseason Debuts

Earlier this week, the 2010 baseball playoffs began. On Wednesday night, the Phillies and Reds matched up, with Roy Halladay finally starting the first playoff game of his career after being traded by the Jays this past offseason. He didn't disappoint, throwing only the second no-hitter in playoff history (after Don Larsen's perfect game in the 1956 World Series), and only allowing one baserunner on a 5th inning walk to Jay Bruce.

On Thursday night, the Giants and Braves faced off in the first game of their Division Series. Tim Lincecum threw a complete game shutout, giving up only two hits and a walk while striking out 14 batters, two away from the postseason record. Both of the games pitched were amazing, and they were actually only two of 37 complete game shutouts in the postseason since 1977 (the first year of the Blue Jays). They were the first since 2007, when Josh Beckett threw a complete game shutout with the Red Sox, and the second and third since 2004.

These two performances have made the first round of the playoffs already utterly breathtaking, and hopefully the next couple of weeks can live up to the potential that Halladay and Lincecum have shown in their first postseason starts.

In fact, the games were actually rated by Game Score as the fourth and fifth best pitched games in the postseason of all time. Halladay's game score of 94 tied Don Larsen, and Lincecum's game score of 96 was only behind three pitching performances: Babe Ruth in 1916 (a score of 97, boy could he pitch!), Dave McNally in 1969 (97), and Roger Clemens in 2000, with a game score of 98.

What made these performances even better was the fact that both pitchers had never pitched in the postseason before. If we look at the greatest pitching performances in a postseason debut, they rank second and third all-time behind Babe Ruth's 1916 performance (which was utterly amazing, a 14 inning complete game in which he only allowed 1 run in a 2-1 win). So these two games were the best postseason debuts for pitchers in almost 100 years. The fact that they happened on back-to-back nights was incredible.

Sunday, October 3, 2010

National League Playoff Race

The race for the final two playoff spots in the National League has come down to the final day. Currently, the Giants lead the Padres by one game in the NL West, while the Padres and Braves are tied for the wild card. Tomorrow, the Padres and Giants face off for the NL West title, and possibly a wild card berth, while the Phillies and Braves square off with the Braves trying to get the last wild card spot. It will be an exciting day, with a possible three-team playoff (the first one in history) on Monday and Tuesday. In this post I want to determine the probability of each of the three teams (Giants, Padres, and Braves) making the playoffs and facing the Phillies and Reds.

There are four possible scenarios for tomorrow's games:

Scenario 1: Giants and Braves win, Padres lose
Scenario 2: Padres win, Giants and Braves lose
Scenario 3: Giants win, Padres and Braves lose
Scenario 4: Padres and Braves win, Giants lose

For the purpose of this exercise, I am going to assume the probability of any team winning any game is .5 (a reasonable assumption, but probably not entirely accurate). So the probability of each scenario above happening is 0.25. In the first scenario, the Giants and Braves will make the playoffs, and San Diego will go home. This means that the probability of the Giants and Braves making the playoffs is 1.0, and the probability of the Padres making the playoffs is 0.0. The second scenario is similar, with the Padres and Giants both making the playoffs (prob. = 1.0) and the Braves going home (prob. of playoffs = 0.0).

The third scenario is where things start to get a little trickier. In this scenario, San Francisco wins, so they have a 1.0 probability of making the playoffs, while the Padres and Braves will face off in a one-game playoff on Monday. So the Padres and Braves will both have a probability of 0.5 of making the playoffs (in each case, 2 out of the 3 teams make the playoffs, so each scenario's probability should add up to 2.0).

The final scenario is the most difficult, as there is now a three-team playoff. The first playoff would be Monday, when the Giants and Padres play (winner goes to the playoffs), and then Tuesday the loser would play the Braves for the final spot. The easiest way to think about this is that in scenario 4, there are "four" possible scenarios given the two outcomes of each of the two playoff games. The four scenarios would work like this: a) Giants win, then Padres win, b) Giants win, then Braves win, c) Padres win, then Giants win, and finally d) Padres win, then Braves win. So in this final scenario, the probability of San Francisco and San Diego making the playoffs is 0.75 (they make the playoffs in 3 of the four cases), and the probability of Atlanta making the playoffs is 0.5 (they make the playoffs in 2 of the four cases).

The final step is to figure out the probability of each team making the playoffs. To do this, we multiply the probability of the team making the playoffs in each scenario with the probability of the scenario.

Probability(Giants making playoffs) = 1*.25 + 1*.25 + 1*.25 + .75*.25 = .9375 = 15/16
Probability(Padres making playoffs) = 1*.25 + .5*.25 + .75*.25 = .5625 = 9/16
Probability(Braves making playoffs) = 1*.25 + .5*.25 + .5*.25 = .50 = 8/16

So overall, there are 16 possible scenarios for the playoffs that will determine the two playoff teams in the next three days. In 15 of them, the Giants make the playoffs (the only way they will not make the playoffs is if they lose tomorrow and the Braves win tomorrow, then they lose consecutive one-game playoffs to the Padres and Braves, and the probability of this happening is 0.5^4, or 1/16). In 9 scenarios, the Padres will make the playoffs, and in 8 scenarios the Braves will make the playoffs. We can see that the total probability of the teams making the playoffs is 32/16 = 2, so the probabilities are consistent with two of the three teams making the playoffs.
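
The bookkeeping above can be double-checked with a short sketch that uses exact fractions. The structure (one dictionary per scenario and a helper for the three-team tiebreak) is just one assumed way to lay it out, and it keeps the assumption that every game is a 50/50 proposition.

```python
from fractions import Fraction

TEAMS = ("Giants", "Padres", "Braves")
HALF = Fraction(1, 2)
QUARTER = Fraction(1, 4)

def three_team_tiebreak():
    """Scenario 4: Giants vs. Padres on Monday, loser vs. Braves on Tuesday."""
    odds = {team: Fraction(0) for team in TEAMS}
    for monday_winner, monday_loser in (("Giants", "Padres"), ("Padres", "Giants")):
        for tuesday_winner in (monday_loser, "Braves"):
            # Each of the four Monday/Tuesday combinations has probability 1/4.
            odds[monday_winner] += QUARTER
            odds[tuesday_winner] += QUARTER
    return odds

# Probability of each team making the playoffs within each scenario.
scenarios = [
    {"Giants": 1, "Padres": 0, "Braves": 1},        # Scenario 1
    {"Giants": 1, "Padres": 1, "Braves": 0},        # Scenario 2
    {"Giants": 1, "Padres": HALF, "Braves": HALF},  # Scenario 3: one-game playoff
    three_team_tiebreak(),                          # Scenario 4
]

# Each scenario is equally likely (1/4), so weight and add.
overall = {team: sum(QUARTER * s[team] for s in scenarios) for team in TEAMS}

for team in TEAMS:
    print(team, overall[team])
```

The three totals add up to 2, matching the check that exactly two of the three teams make the playoffs.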

What does this mean? Basically, the race is a two-team race for the final spot, with one team all but clinched. The Giants are all but assured of a playoff berth, while the Padres and Braves are left fighting for the last spot (almost always the wild card, but potentially the Padres could win the NL West and the Braves could win the wild card, although the probability is only 1/16). The Padres do have a slight edge because they can potentially win the division, while the Phillies have already locked up the NL East, leaving the Braves with only the wild card possibility.

So keep your eyes on the games tomorrow. The Braves and Phillies play at 1:30 and the Padres and Giants play at 4, so even by 4 pm tomorrow we will have a much clearer picture of the possible playoff spots. It will be an interesting day, and possibly an interesting 2 or 3 days, all to probably get beaten by the Phillies and Reds in the first round.

Friday, October 1, 2010

Fact of the Week VIII: Home Run Records

The Blue Jays have accomplished a couple of notable feats in the past few days with home runs. The first happened on Wednesday night when John Buck homered, his 20th of the season, to become the 6th different Blue Jay to hit 20 home runs this year. They became only the 18th team in MLB history to have six hitters with at least 20 home runs each. If Edwin Encarnacion manages to hit two home runs in the last three games of the season (he currently has 18 HRs), the team will become only the 5th team to ever have seven players with at least 20 home runs, joining the 1996 Orioles, 2005 Rangers, 2009 Yankees, and the 2000 Blue Jays. No National League team has ever done it, possibly because they would need seven out of eight of their starting position players to do it as opposed to seven out of nine in the American League. One final note is that Alex Gonzalez, traded to the Braves midseason for Yunel Escobar, had 17 home runs as a Jay (only 6 so far with the Braves) and Escobar (who had 0 home runs as a Brave) has 4 home runs as a Jay. So the shortstop position for the Jays also has hit 20+ home runs, which means that the Jays possibly could have become the first team to ever have 8 players hit at least 20 home runs each (if they hadn't traded for Escobar, which I'm glad they did, and if EE hits 2 homers in the next three games).

On a related note, after Thursday night's blowout of the Twins, in which the Jays hit six home runs, they now have 253 home runs as a team so far this year. That is the 4th highest total in MLB history, trailing only the 1997 Mariners (264 HRs), the 2005 Rangers (260 HRs), and the 1996 Orioles (257 HRs). If they continue their pace of 1.591 HRs per game over the final three games, they should end up with 257.77 home runs on the season, which would be good for third all-time. It is interesting to see many of the same teams on both of these lists (the '97 Mariners also had 6 players with at least 20 home runs). One great home run hitter will not vault your team into the home run record books; the team needs at least 5-6 hitters who can each hit at least 20 home runs, with usually one or two of those players hitting 30 or more (Vernon Wells with 31 and Jose Bautista with 54 so far on this year's team).

One final note of interest: also in last night's game, Jose Bautista hit his 53rd and 54th home runs of the season. The first was an upper-deck grand slam, the second an opposite field home run, the first home run to right field in his career! (Unfortunately HitTracker hasn't quite updated his home run total, so we cannot view his home run scatter plot with the one outlier to right field.) Although pitchers next year may try to beat him with outside pitches, he has again proved that he can hit pretty much anything you throw to home. Sure, he would much rather hit an inside fastball (as the first home run showed!), but if you must continue to pound the outside corner, he will either take a good swing and put the ball into right field, or simply walk. Speaking of walks, Bautista needs one more walk this weekend to become only the 14th player to ever have 50 home runs and 100 walks in one season.

So although this weekend will not bring playoff drama for the Blue Jays, there are a couple of milestones that they could reach with good performances. Hopefully they can win a couple of games, hit a few more home runs, and end 2010 with a bang.

Sunday, September 26, 2010

Jose Bautista

Jose Bautista has had himself an incredible year this year, as on Friday night he hit his 51st and 52nd home runs of the season, which is already 5 better than George Bell's previous franchise record of 47 in a season. He has become the first player since 2007 to hit at least 50 home runs and is also currently second in the AL in walks. He has seemingly come out of nowhere to start hitting home runs left and right, and as a result many people have questioned the legitimacy of his season. The question that everyone wants to know is, how can a 29-year old player, with a career high of 16 home runs in a season, suddenly hit 50+? Only one player in history, Cecil Fielder, had ever hit 50 home runs in a season without having previously hit at least 20 in a season. I want to try and explain in this post how it is possible for Bautista to have such a breakout season, without involving the dreaded s-word.

Bautista himself claims that the increase in his home run total is due to regular playing time instead of being a utility player (increased confidence), better pitches to hit, and a change in his swing. I am going to mainly focus on the latter: how he could hit so many more home runs by simply making changes to his swing to maximize his strengths and minimize his weaknesses. There are basically two main factors in hitting home runs: hitting a lot of fly balls, and getting lucky (or getting stronger and getting lucky) by hitting a higher percentage of those fly balls out of the ballpark. The first factor, Fly Ball %, is a hitter by hitter case, as some hitters are ground ball hitters, while others are fly ball hitters. The second factor, HR/FB, is mainly due to luck. Similar to BABIP, a hitter has some control over his HR/FB rate, but it fluctuates around the league average of 10.6% (if you want to read more about HR/FB, you can visit the Sabermetrics Library here).

If you have watched Bautista hit any home runs at all this year, you can see how he has changed his swing. He simply waits for a pitch in one location (almost always inside). If the pitch is not there he will not swing at it, but if it is there he will take an extremely hard swing. As a result, he is sacrificing contact for power (surprisingly enough, he is hitting for a career high in batting average this year). This means that he is very patient at the plate, drawing 98 walks so far this year, and will simply mash any mistake pitch. He has also added a pronounced uppercut to his swing, adding loft to the balls that he hits, and thus increasing his FB%. He has also increased his HR/FB rate this year, partly by luck, but also because he is swinging for the fences every time he steps in the batter's box.

This picture shows all (well the first 49) of Bautista's 2010 home run landing spots, courtesy of Hit Tracker. He has not hit a single home run to the right of dead center field this year, nor any other year. This is again a critical part of Bautista's success: he looks to hit inside pitches for power to left field, while he either doesn't swing at or looks to hit for contact the pitches on the outer half of the plate.

What I want to do now is figure out how many of Bautista's home runs this year are from his conscious adjustment at the plate, and how many are mainly due to luck. To do this, I am going to compare his FB% and HR/FB rates from last year and this year, holding one constant while determining how many home runs the other rate contributed (it will make more sense when I introduce the numbers).

The first thing to figure out is Bautista's predicted 2010 numbers based on his past career numbers. Given his amount of plate appearances this year (currently 649, we are going to assume he will get to 675 by the end of the year), we can predict his home runs, FB%, and HR/FB. His career FB% (before 2010) is 42.8% - this means that 42.8% of the balls he puts in play are fly balls (as opposed to ground balls or line drives). It is important to not misinterpret this number as the % of PA that are fly balls - that would grossly inflate his predicted home runs. His career HR/FB rate (again, before 2010) is 10.4%, which is very similar to the league average of 10.6%. So if he were to have 675 PA this year with his average rates, he would hit 19.54 home runs this year.

The next thing to calculate is his predicted 2010 numbers using this year's splits (we could almost use his numbers right now, but there are 8 games left). With 675 PA, an FB% of 54.8%, and an HR/FB of 21.8%, he is predicted to hit 54.08 home runs. In 2010 he has increased his FB% by 12 percentage points (from 42.8% to 54.8%) and his HR/FB rate by 11.4 percentage points (from his career 10.4% to 21.8%). These two increases result in a 34.54 increase in home runs this year as opposed to his career average. The question I want to answer is what percentage of that increase is due to skill, and what percentage is due to "luck"?

To determine how much is due to Bautista's swing adjustment, we want to hold his HR/FB rate constant (keep it at his career average) and set his FB% to 54.8%, his rate for this year. This will show the effect of Bautista increasing his fly ball rate, which is a "skill", without increasing his HR/FB rate, which is due mostly to luck. So with 675 PA, FB% of 54.8%, and HR/FB of 10.4%, Bautista would have hit 25.83 home runs this season. This means that 6.29 extra home runs (25.83-19.54) came purely from Bautista's changed swing at the plate.

To figure out how many home runs came from Bautista being "lucky", we are going to hold his FB% constant at a career average of 42.8%, and increase his HR/FB rate to 21.8%, his rate this year. In 675 PA he would then hit 41.08 home runs, which means that 21.54 more home runs came from his HR/FB rate increasing, which has a lot to do with luck. However, since he has changed his swing to become much more powerful, he would have an increase in his HR/FB rate anyway, but for the purpose of this study we will credit these home runs to luck.

Finally, as you may have figured out from the math, there are a couple of home runs which are unaccounted for. If we take the base of 19.54 home runs, and add the increases of 6.29 and 21.54, Bautista would have hit 47.37. But we previously stated that his projection is 54.08 home runs, so we are missing 6.71 home runs. These home runs are found through the "interaction" term, which is when both Bautista's FB% and HR/FB increase. We can safely credit these to "skill", as I believe that his HR/FB rate has increased because of the change in his swing.
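
To make the bookkeeping explicit, here is a small sketch of the decomposition. It starts from the four home run projections quoted above rather than recomputing them from the rates, and the variable names are only illustrative.

```python
# Projected home runs over 675 PA, as quoted in the text above.
base = 19.54        # career FB% (42.8%) and career HR/FB (10.4%)
fb_only = 25.83     # 2010 FB% (54.8%), career HR/FB
hr_fb_only = 41.08  # career FB%, 2010 HR/FB (21.8%)
both = 54.08        # 2010 FB% and 2010 HR/FB

fb_effect = fb_only - base         # extra home runs from hitting more fly balls
hr_fb_effect = hr_fb_only - base   # extra home runs from the higher HR/FB rate
interaction = both - base - fb_effect - hr_fb_effect  # what is left over

# Crediting the baseline, the fly ball effect, and the interaction to skill.
skill_total = base + fb_effect + interaction

print(f"FB% effect:   {fb_effect:.2f}")
print(f"HR/FB effect: {hr_fb_effect:.2f}")
print(f"Interaction:  {interaction:.2f}")
print(f"Skill total:  {skill_total:.2f}")
```

The interaction term exists because the two rates multiply: raising both at once adds more home runs than the two separate increases added together.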

What this all means is that overall, Bautista has hit at least 32.54 home runs this season due to his skill and new swing mechanics, while at most he has hit 21.54 home runs due to luck, although that number is probably much lower. So what he is doing this year should not be a fluke, even if his HR/FB rate drops all the way back down to his career average of 10.4% next year, he should still hit at least 30 home runs. The large majority of his home runs this year have indeed come because of his adjustments at the plate, and possibly because of other intangible measures such as improved confidence and better pitches to hit (although that could be measured in an exhaustive study).

What I would like to conclude is that: a) Bautista's season is no fluke, he should return to the 30 or 40 home run club next season, b) I cannot say that he is not taking steroids, but I can say that they are not the reason he has hit so many home runs this season, and c) this season is going to cost the Blue Jays (or some other team) a lot of money, and I do believe that Bautista has at least a couple more good years left in him. I hope this post clears up at least a little bit of the shock and disbelief at Bautista's incredible season, but it is good to know that there are ways to measure why and how he is hitting all of these home runs.

Friday, September 24, 2010

Fact of the Week VII: 2010 - Year of the Pitcher?

As has already been reported in many, many places, 2010 has been known as the year of the pitcher. (You can view just some examples from ESPN, Fanhouse, and Time.) Although 1968 is known as THE year of the pitcher, because pitchers benefited from an expanded strike zone and a higher mound, 2010 has become the year of the return of dominant pitching.

It started early on, when Ubaldo Jimenez threw the first no-hitter on April 17. Then Dallas Braden and Doc Halladay threw perfect games within three weeks of each other in May. Edwin Jackson threw a no-hitter in June, and finally Matt Garza threw yet another no-hitter in July. In all, there have been 5 no-hitters and 23 one-hitters so far in 2010. This is actually the record for most no or one-hitters in a season, passing 1988 which had 26 no or one-hitters. So we can see that 2010 has been a year filled with dominant pitching performances.

However, another amazing thing about this season is how often there have been games where both teams have pitched extremely well. This shows up in the number of 1-0 games we have seen this year: there have already been 59 this season, which is tied for the 6th most in a single season, and the most in any season since 1976.

So we can see that there have been a very high number of extraordinary pitching performances this year. We could compare different stats such as ERA or WHIP to see how 2010 stacks up compared to different years in terms of overall pitching performance, but that is not what I wanted to figure out. I just wanted this post to show that 2010 has in fact been the year of the return of dominant pitching.

Saturday, September 18, 2010

Fact of the Week VI: Home Runs Streaks (Part II)

This fact of the week is a follow-up to my post on Wednesday detailing the Blue Jays' current home run streak. Jose Bautista hit a home run in the 6th inning of tonight's win over the Red Sox, which is significant in two ways. First, it is his 48th of the season, which is a new Blue Jays single season record, passing George Bell's 47 in 1987. Second, it keeps the Blue Jays' home run streak alive, now at 18. It is still short of the 23 they posted in 2000, but it is slowly creeping closer.

I just wanted to do some quick calculations on how likely the Jays streak is now that it is at 18 games and counting. They have now hit 228 home runs in 147 games, hitting at least one in 109 games for a probability of 0.7415. For 18 games, their probability is 0.7415^18, which equals 0.459%. This means that they would hit home runs in 18 consecutive games once every 217.8 "sets" of 18 games, with a total of 145 sets of 18 games per season. So given their home run productivity this year, they were predicted to hit home runs in 18 straight games 0.6659 times. This is interesting, considering their prediction for 16 straight games was 1.138, which shows just how hard it is to continue streaks like these for longer and longer periods of time.

Now I wanted to do a calculation based on the longest home run streak by a team ever. This details the longest home run streaks by any team since 1920 (there would be no long streaks before then with the dead ball era anyway). The record is held by the 2002 Texas Rangers, who hit home runs in 27 consecutive games. I wanted to quickly calculate the odds of that team accomplishing such a feat, as I feel they will be quite low. The 2002 Rangers were nothing out of the ordinary, finishing 72-90, dead last in the AL West, except for one thing: they could mash home runs (their top two HR hitters were A-Rod with 57 and Rafael Palmeiro with 43....hmm). They hit 230 home runs, hitting at least one in 122 out of 162 games. So the probability of them hitting a home run in any given game was 0.753, and the probability of them hitting a home run in 27 straight games was 0.753^27, which is equal to 0.000473, or 0.0473%. This means that, on average, they would hit home runs in 27 games straight once out of every 2,114.4 sets of 27 games. Given that there are 136 "sets" of 27 games, they would be able to accomplish this feat an average of 0.0643 times that season. What this means is that even with their home run production (which is one of the top home run totals ever), the chance of them hitting home runs in 27 straight games at some point in the season was roughly 6% at most.
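
The "sets" count above is an expected number of 27-game windows, and overlapping windows tend to succeed together, so it overstates the chance of at least one streak. As a cross-check, here is a sketch of a different technique from the one used above: an exact calculation that tracks the current streak length game by game. It assumes every game is independent with the same home run probability, and the function name is hypothetical.

```python
def prob_of_streak(p, games, streak):
    """Probability of at least one run of `streak` straight successes in `games` tries."""
    # state[j] = probability the current run is j games long with no full streak yet
    state = [0.0] * streak
    state[0] = 1.0
    reached = 0.0
    for _ in range(games):
        new_state = [0.0] * streak
        for run_length, mass in enumerate(state):
            new_state[0] += mass * (1 - p)   # no home run: the run resets
            if run_length + 1 == streak:
                reached += mass * p          # the streak is completed
            else:
                new_state[run_length + 1] += mass * p
        state = new_state
    return reached

# 2002 Rangers: homered in 122 of 162 games, record streak of 27.
print(prob_of_streak(122 / 162, 162, 27))
```

Under these assumptions the result should come out below the 0.0643 window count, since a season with one qualifying window usually has several overlapping ones.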

So we will see how long the Jays can continue their current streak. I predicted on Wednesday that the streak would end sometime this weekend, and I still believe that will probably be the case. It was a great night for the Jays, with the win over Boston, the continuation of the streak, and Bautista breaking the single season Jays record for home runs.

Wednesday, September 15, 2010

Home Run Streaks

Last night, the Jays hit a home run in their 16th consecutive game, which seems to be a fairly impressive streak. I wanted to find out whether it is in fact impressive, and just how difficult it is to do.

First of all, this streak is now the second longest HR streak by the Jays in club history, surpassed only by the 23 games in a row in 2000. It is also the second longest HR streak in MLB so far this year. So while this isn't exactly uncharted territory, they do have a good streak going. What makes this streak really interesting is that, out of all the Jays' HR streaks of at least 13 games, it is the only one where they have a losing record (they are currently 5-11 in the streak; the next worst is when they went 7-7 in 14 games in 1996). Also interesting is that they have only scored 73 runs in the 16 games, which works out to 4.56 runs scored per game, the lowest of the ten times they have hit home runs in at least 13 games straight.

Another interesting fact is that while they have hit 31 home runs in the 16 games (1.94/game, as opposed to 1.51 HR/game the rest of the season), they are scoring fewer runs per game than over the rest of the season (they averaged 4.65 runs/game in their first 129 games and are averaging 4.56 runs/game in the last 16). So while they are hitting more home runs, the runs produced by those home runs seem to be just about the only runs they are scoring.

The last thing I wanted to do was figure out how difficult it is to hit home runs in 16 straight games. The Jays have hit 226 home runs so far this year, and have hit home runs in 107 of the 145 games they have played (here is a summary of every home run they have hit so far if you are so interested). So the probability of them hitting a home run in any given game is 0.738, or 73.8%. Treating each game as independent, the probability of them hitting a home run in n straight games is simply 0.738^n, as the probability of them hitting a home run in two games is 0.738*0.738, in three games 0.738*0.738*0.738, and so on up to n. So the probability of them hitting a home run in 16 straight games is 0.738^16, which is equal to 0.00774, or 0.774%. What this means is that out of 1000 "sets" of 16 games, the Blue Jays would hit a home run in each game 7.74 times, or one out of every 129.2 sets. Considering that there are 147 sets of 16 games in each season (games 1-16, 2-17, 3-18, ..., 146-161, 147-162), we can see that this should happen about 1.138 times this season.
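For anyone who wants to rerun the arithmetic, here is a minimal Python sketch of the calculation above. The function name and its arguments are my own, and it makes the same assumption as the text: every game is independent with the same home run probability.

```python
def streak_stats(games_with_hr, games_played, streak_len, season_games=162):
    """Return (p_game, p_streak, windows, expected) for a home run streak."""
    # Share of games so far with at least one home run.
    p_game = games_with_hr / games_played
    # Probability of a home run in streak_len straight games (independent games).
    p_streak = p_game ** streak_len
    # Number of overlapping "sets" of streak_len games in a season.
    windows = season_games - streak_len + 1
    # Expected number of sets with a home run in every game.
    expected = windows * p_streak
    return p_game, p_streak, windows, expected


p_game, p_streak, windows, expected = streak_stats(107, 145, 16)
print(f"P(HR in a given game)    = {p_game:.3f}")
print(f"P(HR in 16 straight)     = {p_streak:.3%}")
print(f"one in every             = {1 / p_streak:.1f} sets")
print(f"sets of 16 in a season   = {windows}")
print(f"expected times in season = {expected:.3f}")
```

The results can differ from the figures in the text in the last digit or so, since the code carries the unrounded 107/145 instead of 0.738. Changing the arguments to `streak_stats(109, 147, 18)` reruns it for an 18-game streak.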

So what we can see by looking at the math is that although this streak of home runs is impressive, it is certainly not out of the ordinary, and mathematically it probably should have happened at least once this season. Now, keep in mind that the Blue Jays are hitting home runs at a mindblowing pace this year (on pace for 252.5), in fact very close to the record for most home runs by a team in a single season (the 1997 Seattle Mariners hold the record with 264 HRs). So the chance of the 2010 Blue Jays hitting home runs in 16 straight games is a lot higher than it would be for any other Blue Jays team. That is why this is the second longest streak in club history. It remains to be seen how long they can continue the streak, but don't be surprised if it ends tonight or during the weekend series with the Red Sox.