In a few posts here, I've mentioned the sabermetrics course, SABR101x, that Andy Andres taught at EdX.org. Now that it's done, I thought I'd share a few of the more fundamental things I picked up from the course.
Personally, I had no problems with anything presented from a technical standpoint. I've also worked as a programmer using SQL for several years, so I might not be the best judge here.
The course didn't pull punches with the technology. You were taught SQL, the primary language for extracting data from databases, and R, the primary language used by data analysts at the desktop.
The SQL instruction is fairly thorough and should allow just about anyone to use a database. I do have one very technical complaint about how SQL and databases are presented in the class, but it shouldn't have affected the learning of most people.
R is taught using R Studio. This is a good choice since it is a very nice, cross-platform environment for working with R, and I wouldn't suggest working with R without it. The class goes through all the basics of working with CSV files, manipulating data sets, and creating usable graphs with them.
For me, there wasn't anything too unusual in the class. Rather than talking about specific advanced metrics for players, it stuck with the basic concepts of runs and their relationship to wins. Bill James's Pythagorean winning percentage was discussed, along with some of the early formulas for modeling run creation.
The discussion of replacement level and the average player was very helpful for me and really opened my eyes to what those terms mean.
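For anyone who hasn't seen it, the Pythagorean idea fits in a few lines. This is a minimal sketch in Python rather than the R the course used, with the classic exponent of 2; the function name and the team totals are made up for illustration.

```python
def pythagorean_win_pct(runs_scored: float, runs_allowed: float,
                        exponent: float = 2.0) -> float:
    """Pythagorean expectation: RS^x / (RS^x + RA^x)."""
    if runs_scored < 0 or runs_allowed < 0:
        raise ValueError("run totals must be non-negative")
    if runs_scored == 0 and runs_allowed == 0:
        raise ValueError("at least one run total must be positive")
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


# Hypothetical team: 700 runs scored and 650 allowed over 162 games.
expected_pct = pythagorean_win_pct(700, 650)
expected_wins = expected_pct * 162
print(f"expected win pct: {expected_pct:.3f}, expected wins: {expected_wins:.1f}")
```

The exponent is left as a parameter because later refinements of the formula use values a bit below 2.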
My a-ha moment from the class
We spend a lot of time here talking about the "average" batter or pitcher. But what is average? Below is the distribution of fWAR in the AL for this season for batters with 10 or more plate appearances.
As you can see, the distribution of fWAR is very right skewed: most batters are bunched up on the left side of the graph, with a long tail stretching out to the right. The biggest group of batters in the graph has an fWAR between -1 and 0. The median fWAR in this sample is 0.2 with a mean of 0.7133. In other words, half of all AL batters have a lower fWAR than Adrian Nieto or Adam Dunn. Assuming WAR is distributed normally, fewer than 16% of AL batters have more than a 2.0 WAR right now. The White Sox have two guys that fit that description, Jose Abreu and Alexei Ramirez, which is a bit better than would be expected. Conor Gillaspie and Adam Eaton are just below at 1.9.
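The two statistical ideas in that paragraph can be checked with a small sketch. It is in Python rather than R, and the fWAR values are invented for illustration, not the real AL data. The 16% figure looks like the one-standard-deviation tail of a normal distribution, which is my reading rather than something the numbers above spell out.

```python
from statistics import NormalDist, mean, median

# Hypothetical fWAR values, NOT the real AL sample: lots of players near
# zero and a handful far out to the right.
sample_fwar = [-1.2, -0.6, -0.4, -0.2, -0.1, 0.0, 0.1, 0.2,
               0.3, 0.5, 0.9, 1.9, 2.6, 4.1, 5.5]

sample_mean = mean(sample_fwar)
sample_median = median(sample_fwar)

# A mean well above the median is the usual sign of a long right tail.
print(f"mean={sample_mean:.2f}, median={sample_median:.2f}")
print("right skewed" if sample_mean > sample_median else "not right skewed")

# Under a normal distribution, the share of values more than one standard
# deviation above the mean is 1 - Phi(1), which is just under 16%.
share_above_one_sd = 1 - NormalDist().cdf(1.0)
print(f"share above +1 SD: {share_above_one_sd:.4f}")
```

Since the real distribution is skewed, the normal assumption is only a rough guide to how rare a 2.0 WAR batter is.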
This really changed how I think about the average MLB player. He isn't one of a bunch of guys sitting with a WAR around 2.0, ready to fend off any challenger. He's someone scraping by with a WAR near zero, ready to be replaced at any time.
For me, this really changes how I look at the front office. It isn't about acquiring a bunch of guys that are already producing. Those guys are scarce and therefore expensive. It's really about acquiring guys that create WAR out of thin air. This certainly applies to the guys at the top for the Sox: two Cuban signings and two guys that fell out of favor with their original organizations.
This can be applied to managing too. Adrian Nieto is the person Jim has brought up regularly. He's running slightly positive this season. For a guy that's coming straight from A-ball, being positive is almost magical, but as Jim has pointed out several times, it's about finding the situations where he can be successful and only using him there. That seems to be the key to managing your roster, but to do that, you need a roster flexible enough to support it.
In case you were wondering, the lowest fWAR in this sample is Michael Choice at -1.9. Second worst is Nick Swisher at -1.6.
Oh, yeah, I created this graph in R Studio too.
The course is done. This is great! What can I do now?
This is where the course kind of left people hanging on SABR201x. One of the big things to do on your own is to set up your own database, unless you want to keep using the EdX one. If you are interested, the easiest way is downloading WAMP if you have Windows or MAMP if you have a Mac for a relatively painless MySQL install. You can then grab the Lahman database from Sean Lahman's site, import it into your database, and query to your heart's content.
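To give a feel for the import-then-query routine without installing anything, here is a minimal sketch. It swaps in Python's built-in SQLite for MySQL, and the rows are made up; only the column names (yearID, teamID, W, L, R, RA) follow the Lahman Teams table.

```python
import csv
import io
import sqlite3

# Made-up rows shaped like a few columns of the Lahman Teams table.
teams_csv = """yearID,teamID,W,L,R,RA
2013,AAA,90,72,750,650
2013,BBB,81,81,700,700
2013,CCC,63,99,600,720
"""

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.execute(
    "CREATE TABLE Teams (yearID INTEGER, teamID TEXT, "
    "W INTEGER, L INTEGER, R INTEGER, RA INTEGER)"
)

# Import step: read the CSV and bulk-insert the rows by column name.
rows = list(csv.DictReader(io.StringIO(teams_csv)))
conn.executemany(
    "INSERT INTO Teams VALUES (:yearID, :teamID, :W, :L, :R, :RA)", rows
)

# Query step: run differential per team for one season, best first.
query = """
    SELECT teamID, W, R - RA AS run_diff
    FROM Teams
    WHERE yearID = ?
    ORDER BY run_diff DESC
"""
results = conn.execute(query, (2013,)).fetchall()
for team_id, wins, run_diff in results:
    print(team_id, wins, run_diff)
conn.close()
```

With a real MySQL install the SQL stays essentially the same; only the connection and import steps differ.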
Also, if you want to do anything with FIP or wRC, you might want to grab this table from Fangraphs. You could, of course, calculate your own, but this is so much easier.
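As an example of why the constants matter, here is a sketch of the standard FIP formula in Python. The pitcher line and the constant are placeholders I made up; the real constant changes every season and would come from a table like the one mentioned above.

```python
def fip(hr: int, bb: int, hbp: int, k: int, ip: float,
        fip_constant: float) -> float:
    """Fielding Independent Pitching:
    (13*HR + 3*(BB + HBP) - 2*K) / IP + season constant.
    """
    if ip <= 0:
        raise ValueError("innings pitched must be positive")
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + fip_constant


# Hypothetical pitcher line and a placeholder constant, not real values.
# Note: ip must be true innings (200 and 1/3 is 200.333..., not 200.1).
example_fip = fip(hr=20, bb=50, hbp=5, k=200, ip=200.0, fip_constant=3.10)
print(f"FIP: {example_fip:.3f}")
```

The constant just shifts FIP onto the same scale as league ERA, which is why grabbing it from a published table is easier than working it out yourself.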
The next logical step after getting the database set up is stopping the "write a query, export the data to a CSV, process the data in R Studio" loop, and just connecting to MySQL directly from R Studio and building your data tables that way. This is left as an exercise for the reader.
Overall, the course was a great experience. I recommend it for anyone who wants to increase their sabermetrics understanding. If you are coming into the class with a technical background, nothing presented should give you any concerns. I thought the technical information was presented quickly, but thoroughly. I think the pace could be an issue for anyone with less experience with databases.