
Metaprimer: UZR

Second in a series without end. I wrote in wOBA at the leadoff position.  I guess that makes UZR the bat handler.  And for now I'm going to sidestep similar systems like Plus/Minus, SAFE, or PMR.  The fact that UZR is on fangraphs makes it by far the most relevant.  Thank MGL next time you see him in a thread somewhere.

I know I was asked to do a WAR primer next, but I want to take on the major components before I look at position player WAR.  I think going slow will make it easier to make sense of everything when we throw it all in together.  Not to mention some guy named Dave Cameron took basically the same approach at fangraphs.  Whatever that is.

Where to start with fielding?  While errors or fielding percentage get some play, I think defense is assessed by the typical fan with the ever popular Eyeball Test.  This makes sense.  But let's first figure out...

What's Wrong With Errors?

Last time we started by dissecting batting average.  Errors/fielding percentage is much the same.  It's not that they don't tell us anything.  They do.  And, as we'll see, UZR incorporates errors and is actually able to assign them a run value.  But, like batting average, they simply exclude too much important stuff to do the work we want them to do.

Most importantly, they only measure skill after a player has reached the ball.  This is by definition: the rule of thumb official scorers use usually boils down to "did it bounce off his glove?"  The official rulebook definition looks like this:

The official scorer shall charge an error against any fielder:
(1) whose misplay (fumble, muff or wild throw) prolongs the time at bat of a batter, prolongs the presence on the bases of a runner or permits a runner to advance one or more bases, unless, in the judgment of the official scorer, such fielder deliberately permits a foul fly to fall safe with a runner on third base before two are out in order that the runner on third shall not score after the catch...

It goes on from there.  On and on.  Anyone who's watched the game has questioned an official scorer's decision on an error before.  So in addition to leaving out all defensive plays where a ball does not deflect off a glove, the definition itself is hard to nail down.  Dividing errors by chances to make the rate stat fielding percentage doesn't fix any of this.

The Eyeball Test

As I said, I think most fans are fairly aware of the issues with errors and have their own version of The Eyeball Test.  As it turns out, this is super useful.  Tom Tango has incorporated this fact into what he calls the Fan Scouting Report.  In his own words:

There is an enormous amount of untapped knowledge here. There are 70 million fans at MLB parks every year, and a whole lot more watching the games on television. When I was a teenager, I had no problem picking out Tim Wallach as a great fielding 3B, a few years before MLB coaches did so. And, judging by the quantity of non-stop standing ovations Wallach received, I wasn't the only one in Montreal whose eyes did not deceive him. Rondel White, Marquis Grissom, Larry Walker, Andre Dawson, Hubie Brooks, Ellis Valentine. We don't need stats to tell us which of these does not belong.

Tango solicits input from whomever he can get and averages these individual scouting reports into a total score.  James Surowiecki would certainly approve.  Moreover, random fan, a focal point of the sabermetric community values your own personal Eyeball Test.  It's kind of flattering, really.  But before you run off to scouting school, remember that it doesn't weight your eyes any more or less than anyone else's.

The wisdom of crowds is a well-accepted tenet, so checking out the FSRs for fielding evaluations is an absolute must.  Tango generally insists they're better than UZR, in the sense that they have fewer big misses; this is good data.  But if we want a wOBA analog that can be used in a WAR calculation, we want a rate stat denominated in runs.  FSR is expressed on a 1-5 scale where 3 is average.  For example, here are all the shortstops with at least five votes.  With a few assumptions this can be turned into runs above average per X games, but you'd have to do that yourself.  I'd love fangraphs to come up with something, but so far not so much.  In their current formulation, the FSRs serve as an excellent sanity check on the various other fielding stats.  For example, they're great for determining the appropriate mean to regress to.  That said, we're after a white whale.  FSRs are closer to desert sand in hue.
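To give a sense of what that conversion might look like, here's a minimal sketch.  The midpoint of 3 comes from the FSR scale itself; the runs-per-scale-point constant is entirely my own assumption, not anything Tango has published, so treat the output as illustrative only.

```python
# Sketch: turning a 1-5 FSR score into runs above average per 150 games.
# The 3.0 midpoint is from the FSR scale; RUNS_PER_POINT is an ASSUMED
# conversion factor, not part of the official Fan Scouting Report.

RUNS_PER_POINT = 10.0  # assumed: one full FSR point ~ 10 runs per 150 games

def fsr_to_runs_per_150(fsr_score, midpoint=3.0, runs_per_point=RUNS_PER_POINT):
    """Convert a 1-5 FSR score into runs above average per 150 games."""
    return (fsr_score - midpoint) * runs_per_point

# A 4.2 shortstop comes out around +12 runs/150 under these assumptions.
print(round(fsr_to_runs_per_150(4.2), 1))
```

Change the constant and the whole scale stretches or shrinks with it, which is exactly why this needs real calibration before anyone should cite it.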

Enter The U

First, credit where credit's due: UZR is Mitchel "MGL" Lichtman's baby.  Praise be his name.*  Okay, onward.

It's important to think about what we should be looking for in a more specifically quantitative fielding metric.  Think about it in terms of wOBA.  Tango's invention works well because we can compare everyone to some average and because it's denominated in runs.  This is obviously way easier with hitting, since the records are evident.  No one wonders whether or not the back of a baseball card is correct in the number of hits attributed to the player.  A walk's a walk.  A hit's a hit.  But there's no record of how far the fielder ran to make the catch, how much time he had to do it in, or where on the field he did it.  In essence, we want to figure out whether or not he should have made the catch.  But there's plenty in doubt aside from the putout itself.  This is why fans have to resort to the Eyeball Test in the first place.

But what if we had

Armies of new college grads who collectively would watch every game and keep a detailed log of what happened to every batted ball: what kind of pitch was hit, where the ball was hit, how hard it was hit, who fielded it and how it was or wasn't turned into an out.

It turns out there are two companies paying for exactly that: Baseball Info Solutions, founded by John Dewan (inventor of plus/minus), and STATS Inc.  Now, annoyingly, it turns out that there are some discrepancies between the two.  Run UZR on Andruw Jones (before he got fat) with BIS data, and you'll find he was an incredible center fielder.  Do the same with STATS data, and he was an average one.  Remember what I was saying about the FSRs as a sanity check?

The discrepancies kind of distract from the real point, though.  Which is this:

The data used in calculating UZR was observed by human eyes and recorded as such — usually by multiple people per game, to weed out bias. In other words, UZR is based on eyeball data. It just takes a heap of such data and compiles it into a workable statistic.

UZR, like FSR, is at its core about processing human observations.  You can object to exactly how they're made, but if we're looking for something rigorous, something worth citing, then we need to look at things systematically.  If there are discrepancies, they can be resolved with better observations.  And while we work on that, why not use what we already know is useful?  Like Tango says:

All data provides value, as long as you can pick out the biases.

In sum: the basis of UZR is folks watching games.  And I guarantee you they aren't getting paid a ton.  They're doing it because they love baseball.  Probably as much as you, Mr. Yu Z. R. Skeptik.

The Nitty Gritty

Sister site Bless You Boys' Mike Rogers has been running his own primer series.  He begins thusly:

UZR splits the field into 78 different slices called zones. Don't worry, only 64 of those are used in the UZR formula. You figure out the average number of balls in play in each zone and then the rate at which plays made are recorded in each zone. This will give you a baseline average for the position. Now, you do this on an individual basis and graded against what the average fielder would do. If a player comes out with less plays made recorded in their zone compared to league average, they have a negative zone rating. As well, they'll have a positive zone rating if the player records more outs in the zone than the average defender at the position.

Hmm.  Split how?  Like so:


64 of 78 are used because

...infield line drives, infield pop flies, and outfield foul balls are ignored. Pitchers and catchers are not included.

While I couldn't find exact reasons given, these are smart assumptions.  Pitchers and catchers mostly field bunts, swunts and the like, while infield line drives are almost exclusively the result of positioning.  Infield pop flies are almost always outs and outfield foul territory differs significantly from park to park.

You may also be wondering how MGL came up with 78 instead of, say, 50 zones.  Or 4.  Basically, there's a trade-off between getting sufficient samples per zone and enough differentiation between positions that share zones.  RAB illustrates this with an example:

To make things a bit clearer, we’re just trying to determine which player was responsible for which hits. So if there are 1,000 hits and 1,500 outs in Zone 56, we want to know how many of those outs the third baseman converted, and how many the shortstop converted. Using this ratio, we can determine the responsible party for the hits. So, if the third baseman made 70% of the outs recorded from Zone 56, 1,050 in this example, he’s also responsible for 70% of the hits, or 700. That’s the baseline we apply to individual fielders.

If a zone is drawn too big, determining individual responsibility becomes problematic.  The point of zones is to give a point of reference in real terms between two fielders.  If every zone is significantly shared, the result is an insufficient gradient.
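The RAB example above is simple enough to write down directly.  The function name and dictionary layout here are my own; the numbers (1,000 hits, 1,500 outs, a 70/30 split of outs between the third baseman and shortstop) are straight from the quote.

```python
# The Zone 56 example in code: assign each fielder a share of the zone's
# hits proportional to his share of the zone's outs.

def hit_responsibility(zone_hits, outs_by_fielder):
    """Split a zone's hits among its fielders by their share of outs made."""
    total_outs = sum(outs_by_fielder.values())
    return {fielder: zone_hits * outs / total_outs
            for fielder, outs in outs_by_fielder.items()}

# Zone 56: 1,000 hits and 1,500 outs; the 3B made 70% of the outs (1,050).
shares = hit_responsibility(1000, {"3B": 1050, "SS": 450})
print(shares)  # the 3B is on the hook for 700 of the hits, the SS for 300
```

You can see the zone-size trade-off right in the inputs: make the zone huge and the out split drifts toward 50/50, and the ratio stops telling you anything about either fielder.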

So far I've only outlined a range measurement that boils down to outs/chances. How do we go from range to runs? Remember linear weights?  The same concept goes to work here.  Using the specific details of the batted ball (which zone did it land in, was it a GB/FB/LD, etc.), we can generate an average expected value for the batted ball.  A line drive rocket to the gap probably won't often be caught, but it also likely won't be a HR or a single.  Let's say 50% of the time it's a double, 40% it's a triple and 10% of the time it gets caught. 

So the average value of that batted ball = .5*run value for 2B + .4*RV3B + .1*RVout = .5*.75 + .4*1 + .1*(-.3).  That's the value of that batted ball to the batter's team on average.  The difference between the safe run value and the out value is the runs saved by the catch.  So the frequency of that catch above average times the runs the catch saves times opportunities gives you runs above average for a fielder in a given zone.  Divide by opportunities and it becomes a rate stat like wOBA.  To make it more usable by the average fan, the result is often prorated to 150 games.  This isn't actually that different from calculating fielding percentage, except it's compared to some average.  The concept is the same, but the measurements are far more useful.
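The arithmetic above fits in a few lines.  The linear weight values and outcome probabilities are the ones from the example in the text; the 15% catch rate, 10% league rate and 100 opportunities in the last step are hypothetical numbers I added to show the runs-above-average step.

```python
# The worked example: expected run value of the gap line drive, and the
# runs a catch saves relative to letting it fall.

RUN_VALUES = {"2B": 0.75, "3B": 1.0, "out": -0.3}     # linear weights from the text
OUTCOME_PROBS = {"2B": 0.5, "3B": 0.4, "out": 0.1}    # 50% double, 40% triple, 10% caught

# Average value of this batted ball to the batter's team.
avg_value = sum(OUTCOME_PROBS[o] * RUN_VALUES[o] for o in OUTCOME_PROBS)

# Expected value when the ball falls safe (hit outcomes renormalized)...
p_safe = 1 - OUTCOME_PROBS["out"]
safe_value = (OUTCOME_PROBS["2B"] * RUN_VALUES["2B"]
              + OUTCOME_PROBS["3B"] * RUN_VALUES["3B"]) / p_safe
# ...so each catch saves the gap between falling safe and being an out.
runs_saved_per_catch = safe_value - RUN_VALUES["out"]

# Hypothetical fielder: catches this ball 15% of the time vs. a 10% league
# rate, over 100 such opportunities.
raa = (0.15 - 0.10) * runs_saved_per_catch * 100
print(round(avg_value, 3), round(raa, 1))
```

Divide that last number by opportunities and you have the per-chance rate stat described above; multiply a rate by 1350 innings' worth of chances and you have the familiar per-150-games figure.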

MGL doesn't stop there, however, and makes a number of adjustments.  For example, we know that managers adjust their lineups based on the handedness of the pitcher and that most balls put in play are pulled.  So if a pitching staff happens to be particularly RHP heavy, there will be more balls hit to the right side of the infield.  An average infielder on the right side of this hypothetical team will make more plays than the average fielder that has nothing to do with his ability to field.  The effect is fairly small, but it's still a clear bias in the data worth picking out. 

From there, there are calculations for contributions for outfield arms, double plays turned and errors made.  Add everything up and the result is the UZR figure found on fangraphs.

Um. So?

Well, for one this means that the problems of UZR have to do with attribution and observation quality.  At the team level, UZR works very well.  We more or less know the value of the batted balls thanks to linear weights and we know whether or not those batted balls get caught.  It's just a matter of deciding who was supposed to and figuring out where to draw the line between line drives and fly balls.**

And at the individual level?  The main problem is sample size, not methodology.  Colin*** Wyers, remembering that the Deputy loves dots, investigated the persistence of UZR and concluded:

  • Everything regresses to the mean. A hitter in 300 PAs should be regressed roughly 50% to the mean. (Assuming all you have is those 300 PAs, of course.)

  • Defensive metrics are less reliable than offensive metrics. (Which - see above - are not as reliable as they are sometimes treated, when it comes to determining a player's inherent level of ability.)
  • An infielder's UZR is more reliable than an outfielder's UZR. This is partly because an outfielder sees fewer chances than an infielder, and partly because outfield defense is more difficult to measure than infield defense.

Tango usually says that 200 PA are about equal in terms of persistence to 400 BIP, where a SS/2B gets 5 BIP per 9 innings, a 3B/CF gets 4, and a LF/RF/1B gets 3.  Even if Alexei gets 150 games at SS, that's only 750 chances, still not equal to a full season at the plate.  The data just does not accumulate rapidly enough to make a single season's UZR especially meaningful.
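Tango's rule of thumb is easy to sanity-check in code.  The BIP rates and the 200 PA ~ 400 BIP equivalence are from the paragraph above; treating every game as 9 innings in the field is my own simplifying assumption.

```python
# BIP chances per 9 innings by position (Tango's rule of thumb from above).
BIP_PER_9 = {"SS": 5, "2B": 5, "3B": 4, "CF": 4, "LF": 3, "RF": 3, "1B": 3}

def season_chances(position, games):
    """BIP chances over a season, assuming each game is 9 innings in the field."""
    return BIP_PER_9[position] * games

chances = season_chances("SS", 150)
# 200 PA ~ 400 BIP in persistence, so halve the chances to get a PA equivalent.
pa_equivalent = chances * 200 / 400
print(chances, pa_equivalent)  # 750 chances ~ 375 PA worth of signal
```

A full-time shortstop's entire season of chances carries about as much signal as 375 plate appearances, and a corner outfielder's carries even less, which is the whole sample-size complaint in one line.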

Sorry!  I'm just trying to cover everything!  Anyway, Mike Rogers gives us some nice guidelines:

  1. 1 year of UZR data is on par with about 50-55 [75 is more accurate****] games worth of offense.  Would you judge Miguel Cabrera's talents at the plate on just his games from April 1st through June? I wouldn't, and neither would you (or so I hope). So don't do it with defense. Personally, if I have three years of UZR data for a player, I'd rather have four. If I have four years of UZR data, I'd rather have five. I don't believe that you can have enough.
  2. One full year of defensive data is at least 1200 innings worth of data.
  3. Do not use UZR per 150 games (UZR/150; found on Fangraphs' player pages) if at all possible. It's way too misleading.
  4. If Player A is a -10 one year, +10 the next year and then +0 the next year, he's likely an average fielder. Large swings in year-to-year data isn't out of the norm, but you should always use an average (preferably, a weighted average) and be conservative with it.
  5. When possible, use multiple defensive systems to grade a player (UZR, John Dewan's Plus/Minus system, etc).

Eyeball the numbers, throw in a mental regression, be sure to check the FSRs and, if you're looking up somebody on your own team, remember that you're likely to overrate him. 

If you want something slightly more rigorous, for players with 3+ seasons, I've been doing the following:

Add up total UZR, divide by innings and multiply by 1350.  That gives you the average over however many seasons of data, prorated to 150 games.  So that's presumably about where he was at the midway point of those seasons.  From there we can add an age adjustment.  Fielders peak around 22-24 and from there lose half a run per season.  Let's say the player has 5 seasons starting with his age 25 season, totaling 30 runs above average.  That's a roughly estimated true talent of +6/150 at the 2.5-season midpoint.  So the age adjustment is 2.5*.5 = 1.25 runs, giving a rough projection of +4.75/150.  Round up and we'll call Player X a true talent +5/150 fielder at his position.  For reference, the equation looks like:

(UZR/IP)*1350 + (Seasons/2)*(-.5)

That isn't Tango/MGL/Saberauthority approved, it just made sense to me based on what I know.  It's quick and dirty, but the framework has a fairly consistent logic.  Tweak it if you don't like it.
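If you'd rather tweak it in code than on paper, here's the same quick-and-dirty formula as a function.  The function name is mine; the logic and the Player X numbers are exactly the ones walked through above.

```python
# The quick-and-dirty formula above: prorate career UZR to 150 games
# (1350 innings), then subtract half a run per season past the midpoint.

def true_talent_per_150(total_uzr, innings, seasons):
    """Rough true-talent UZR/150 with the post-peak aging adjustment."""
    midpoint_rate = (total_uzr / innings) * 1350
    age_adjustment = (seasons / 2) * 0.5   # half a run lost per season of decline
    return midpoint_rate - age_adjustment

# Player X: +30 runs over 5 full seasons (1,350 innings each).
print(round(true_talent_per_150(30, 5 * 1350, 5), 2))  # 4.75, call it +5/150
```

Note this bakes in the assumption that the player is past his fielding peak for the whole sample; for a player whose seasons straddle age 22-24, the adjustment should really run the other way for the early years.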




*is variously also known as Mitchell Lichtman, Mitchel Litchmann, Mmmitchel Llllitchman and Lover of the Jump.  to my knowledge he has not expressed a preference between these.  i merely made an arbitrary selection.

**eventually the hope is we'll have continuous functions with smoothing corrections instead of discrete zones and GB/LD/FB designations, replacing human observation with precision cameras able to give us specific vectors.

***I finally know how the Daves (Cameron, Gassko, Studeman, etc.) feel.  Not quite the beautiful unique snowflake I thought I was.

**** see