Going to WAR With Baseball Statistics


Syria isn’t the only place where we ask “Why War?” Some ask the same question about baseball. In baseball lingo, “WAR” stands for Wins Above Replacement, an advanced statistic that played a prominent role in last season’s MVP debate, and will be prominent again as this year’s voting nears. 

In 2012, Miguel Cabrera won the Triple Crown; he was widely known as the league’s best hitter. However, Mike Trout was clearly the superior base runner and defensive player. Historically, there was no good way to determine whether Trout’s base running and defense overcame Cabrera’s hitting in an overall evaluation. This was simply a matter of opinion, akin to the old question of DiMaggio vs. Williams. Such unresolvable debates kept bars in business.   

Except we now live in the information age, and “sabermetricians” (the advanced stats gurus) seemed to shape the debate with the development of WAR. WAR yields a single number for each player’s overall contributions by combining separate assessments of hitting, base running and defense (with additional adjustments, and with separate evaluation of pitching). 

The several component measures of WAR are each expressed in the common unit of runs (added or lost). These are added together to yield a final run value. Runs are converted into wins, and wins are compared with a hypothetical minimum-salary “replacement” player. Thus, the final term: Wins Above Replacement (player). 
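The arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the add-then-convert idea, not any site's actual formula: the component values, the replacement level, and the ten-runs-per-win conversion are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Assumption: roughly ten runs per win, a common rule of thumb.
RUNS_PER_WIN = 10.0


@dataclass
class RunComponents:
    """Runs added (+) or lost (-) relative to an average player."""
    batting: float
    baserunning: float
    fielding: float


def wins_above_replacement(player: RunComponents,
                           replacement_runs: float,
                           runs_per_win: float = RUNS_PER_WIN) -> float:
    """Add the run components, convert to wins, compare with replacement."""
    player_runs = player.batting + player.baserunning + player.fielding
    player_wins = player_runs / runs_per_win
    replacement_wins = replacement_runs / runs_per_win
    return player_wins - replacement_wins


# Hypothetical players, not real season totals.
slugger = RunComponents(batting=50.0, baserunning=-2.0, fielding=-8.0)
all_rounder = RunComponents(batting=35.0, baserunning=8.0, fielding=12.0)

# Assumption: a replacement player is 20 runs worse than average.
REPLACEMENT_RUNS = -20.0

slugger_war = wins_above_replacement(slugger, REPLACEMENT_RUNS)
all_rounder_war = wins_above_replacement(all_rounder, REPLACEMENT_RUNS)
print(slugger_war, all_rounder_war)
```

In this made-up case the all-around player finishes ahead of the better hitter, but only because of the base-running and fielding numbers fed in. The final figure can be no more trustworthy than those inputs.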

The three well-known WAR formulations (from Baseball-Reference.com, fangraphs.com and Baseball Prospectus) all ranked Trout first in the American League in 2012, with Cabrera third or fourth. Thus, some in the mainstream media insisted that Trout, rather than Cabrera, should have been MVP. Similarly, in 2013, Trout leads the American League in all three WAR formulations, with Cabrera second or fourth.

But is WAR valid in ranking players over one season? For WAR to provide an accurate assessment of players’ relative values, each of the three individual metrics — for hitting, base running, and defense — must be accurate. (This is putting aside the issue of intangibles — leadership, hustle and other factors that elevate players like Pete Rose and Derek Jeter beyond their statistics). Defensive and base-running performance will form the crux of this analysis because they are generally considered the most difficult to measure.

The three WAR formulations use different defense and base-running measuring systems, but they track similar ideas. The best-known defensive system is probably Ultimate Zone Rating (UZR), which we can analyze as a representative system. Briefly, UZR evaluates a fielder on every play by comparing the outcome of the play to the historical results of other players on similar or, ideally, identical batted balls: a certain kind of ball (defined by velocity, trajectory, etc.) hit to a specific spot on the field. This is an attempt to isolate the player’s ability as the single variable corresponding with the play’s result. This analysis is repeated for all batted balls for the season, and the final tabulation yields a quantitative assessment of each player’s defensive performance.
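The play-by-play comparison can be sketched roughly as follows. This is a deliberate simplification and not UZR's actual calculation: the function name, the idea of a single historical out rate per batted ball, and the sample numbers are all hypothetical, and the real system also converts plays into runs.

```python
def plays_above_expected(plays):
    """Sum, over all plays, the fielder's result minus the historical out rate.

    Each play is a pair (made_out, historical_out_rate), where the rate is
    the share of similar batted balls that other fielders turned into outs.
    """
    total = 0.0
    for made_out, historical_out_rate in plays:
        result = 1.0 if made_out else 0.0
        total += result - historical_out_rate
    return total


# Hypothetical plays: a tough catch, a missed routine ball, a coin-flip play.
sample_plays = [(True, 0.25), (False, 0.75), (True, 0.5)]
season_credit = plays_above_expected(sample_plays)
print(season_credit)
```

A hard catch earns a large credit and a missed routine play a large debit, so everything depends on how well the historical rate matches the ball that was actually hit.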

Does UZR accurately capture a fielder’s performance? We can’t know for sure, but we do know that UZR and similar systems produce some seemingly bizarre results. According to UZR, Alfonso Soriano, universally regarded as a poor outfielder, turned in the single best defensive season (in 2007) of any player from 2002 to 2012. In fact, two years before his unprecedented defensive season, Soriano was graded as terrible by UZR. In 2013, at age 37 and virtually immobile, he bounced back for a strong rating. According to UZR, Jeter went from terrible to very good and back to terrible quite quickly. Albert Pujols and Jason Heyward had better defensive seasons than Omar Vizquel’s best. In Baseball-Reference.com’s defensive measurements, Willie Mays does not have one of the best 500 seasons ever, a list that includes Roberto Clemente only once. Joe DiMaggio was barely better than average.

It is possible that these are novel and accurate insights into defense. Maybe Soriano really was intermittently brilliant in the field. Or maybe there are just a few anomalous results that mean nothing. For our purposes, the important point is that UZR single-season data are not reliable enough to generate confidence in a WAR measurement.

While of course there is some change in player ability, the greater the difference in individual player assessment from one year to the next, the less confident we are in the value of a single year’s measurement. Unfortunately, the year-to-year correlation coefficient (0.5) of UZR data is considered only moderately strong by statisticians (and weak-to-moderate by some). UZR has about the same year-to-year correlation coefficient as ERA (.51) and batting average (.56), two statistics widely criticized for too much randomness.
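For readers who want to see what a year-to-year correlation coefficient is, here is a minimal sketch using the standard Pearson formula. The eight pairs of ratings are invented for illustration and are not real UZR data. The last line relies on a general statistical fact: squaring the coefficient gives the share of variance in one year's ratings that the other year's ratings account for, so 0.5 corresponds to about a quarter.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical defensive ratings (runs) for the same eight fielders.
year_one = [12.0, -5.0, 3.0, 8.0, -10.0, 0.0, 6.0, -2.0]
year_two = [4.0, 2.0, -6.0, 10.0, -3.0, 5.0, -1.0, -8.0]

r = pearson(year_one, year_two)
print(f"year-to-year r = {r:.2f}")

# Share of variance accounted for when r is 0.5.
print(0.5 ** 2)
```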

The modest year-to-year correlation coefficient speaks to the enormity of the task of measuring defense. The effort to capture countless variables in order to isolate the fielder’s performance is daunting. The same holds for the base-running metrics. For example, to evaluate a runner’s progress on a base hit (e.g., did he go from first to third on a single?) you have to consider exactly where the ball landed, its velocity and geometric course, how the fielder caught it (backhand, coming in, going out), who the fielder was or what kind of throw he made, the game context (the score and inning determine the runner’s and fielder’s aggressiveness), and much more. The difficulty in capturing the large number of base-running variables explains why the year-to-year correlation coefficient for Ultimate Base Running (fangraphs.com’s base-running WAR component) is also only 0.5.

Thus, WAR derives from (at least) two independent statistics of questionable accuracy.

These limitations, particularly as they pertain to single-season WAR evaluation and MVP voting, are widely known among sabermetricians. In the Baseball Prospectus book “Extra Innings,” Dan Turkenkopf observes that “the more [WAR] is influenced by its fielding component, the more skeptical we should be. It’s more likely to be the product of uncertain data, or the influences of a small sample.”

Sabermetricians continually try to improve their data. More cameras and other equipment are being placed in ballparks. Perhaps more information about the velocity and arc of batted balls will improve defensive metrics. But the father of sabermetrics, Bill James, remains skeptical: “We’ve had these cameras pointed at pitchers for years and haven’t learned a damn thing that is useful. … I suspect the same thing would be true with respect to fielding.” 

Indeed, there is a limit to the benefit of more information in this context: an inevitable trade-off between precision and sample size. If we are able to determine that a batted ball travelled at exactly X miles per hour, at an arc of Y degrees, and landed within a four-square-foot area in right field, on a sunny, 90-degree day, and we know the right fielder began the play at another four-square-foot area, we will have controlled for many variables and thus have greater accuracy when comparing the result of the play with other plays sharing those characteristics. The problem is that we will find very few plays with all those exact characteristics.

What is gained in precision is lost in sample size, and smaller sample size increases the chances of error. We can increase the sample size by easing the precision — enlarge the squares to sixteen square feet, for example. But every relaxation of precision increases the extent to which we are comparing unlike batted balls, thereby decreasing the usefulness of the comparison. This trade-off cannot be eliminated, and it means that acquiring more data via more technology will not lead to a linear increase in knowledge, and possibly very little increase at all (as James proposes).
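A small simulation makes the trade-off concrete. It is a sketch under loud assumptions: the landing spots are random points scattered evenly over a hypothetical 100-by-100-foot patch of outfield, not real batted-ball data, and the only "characteristic" being matched is where the ball landed.

```python
import random


def simulate_landings(n, width_ft, depth_ft, seed=0):
    """Scatter n hypothetical landing spots evenly over a rectangle."""
    rng = random.Random(seed)
    return [(rng.uniform(0, width_ft), rng.uniform(0, depth_ft))
            for _ in range(n)]


def comparable_plays(landings, target, side_ft):
    """Count landings inside a square of the given side centered on target."""
    half = side_ft / 2
    target_x, target_y = target
    return sum(1 for x, y in landings
               if abs(x - target_x) <= half and abs(y - target_y) <= half)


landings = simulate_landings(20000, 100.0, 100.0)
target = (50.0, 50.0)

# A 2 ft x 2 ft square is four square feet; 4 ft x 4 ft is sixteen.
small_sample = comparable_plays(landings, target, 2.0)
large_sample = comparable_plays(landings, target, 4.0)
print(small_sample, large_sample)
```

Enlarging the square can only add comparable plays, never remove them, but each added play landed farther from the ball in question. Matching on velocity, arc and weather as well would shrink each sample further still.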

Thus, in the end, we’re really back to the beginning. Cabrera versus Trout remains a largely subjective determination. WAR is a good idea but it utilizes several unreliable measurements that make its application to end-of-season MVP voting dubious. For all anyone knows, WAR fits H.L. Mencken’s observation that “for every complex problem there is an answer that is clear, simple, and wrong.”

Sheldon Hirsch is author of the forthcoming book "Hot Hands, Draft Hype, and DiMaggio’s Streak: Debunking America’s Favorite Sports Myths" from the University Press of New England.
