Saturday, September 19, 2015

Greetings! Here are two tools to help you free soccer data.

As I work on a "revise and resubmit" for an academic journal, I come across many new (at least to me) sports economics models and ideas. My first thought is always to apply these new concepts to my drug of choice, Major League Soccer (MLS). Rather than wait a year or so until many small ideas coalesce into a paper that will be presented to a room of perhaps 20 sports economists, I think this time I'll share the little steps along the way with anyone who will listen. I needed a place to put the thoughts and ideas I've been having about soccer. So here we are: my soccer scratch pad online, an electronic version of what I've been doing on my Fridays for most of the summer.

The path to understanding and carrying out soccer analytics is not straightforward; it's sort of the wild west at the moment. One surprising problem I came across was the difficulty in obtaining simple MLS data. Want to see the goalscorers' names and minutes for the 4-3 goal-fest of Columbus vs. New England in August of 2012? When you try to get the box scores from mlssoccer.com, you'll get an "access denied" page. Is this a technical problem? Is OPTA restricting access to the data so they can charge us for the information? Your guess is as good as mine, since my emails go unanswered.

Luckily, we have at least two weapons in our arsenal to free the data. First, the Wayback Machine, brought to you by the beautiful people at the Internet Archive. It's an amazing tool that lets you view webpages as they were back in time. So we can see what the box score page for the Columbus vs. New England game looked like in 2012. The data is freed!
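For anyone who would rather script the lookup than click through the site, here is a minimal Python sketch that asks the Wayback Machine's availability endpoint for the snapshot closest to a given date. The helper names (`build_query`, `closest_snapshot`, `find_snapshot`) are my own, and the response layout is what I understand the endpoint to return, so treat the parsing as an assumption to check against a live response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Wayback Machine availability endpoint (returns JSON).
WAYBACK_API = "https://archive.org/wayback/available"


def build_query(page_url, timestamp):
    """Build the lookup URL for the snapshot closest to `timestamp` (YYYYMMDD)."""
    return WAYBACK_API + "?" + urlencode({"url": page_url, "timestamp": timestamp})


def closest_snapshot(response):
    """Return the archived URL from a parsed response, or None if nothing is archived.

    Assumes the response looks like
    {"archived_snapshots": {"closest": {"available": true, "url": "..."}}}.
    """
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def find_snapshot(page_url, timestamp):
    """Query the endpoint over the network and return the closest archived URL."""
    with urlopen(build_query(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `find_snapshot("mlssoccer.com", "20120831")` should return the capture of the site's front page nearest to the end of August 2012, if one exists. Both arguments are placeholders: substitute the address of the box score page you actually want and the date of the match.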

Second, we have a great database of soccer knowledge at SoccerStats.us. There are a handful of issues with the data. Some goals are listed twice, and many goals scored in stoppage time are listed as being scored in the 45th or 90th minute when they were actually scored in, say, the 46th or 93rd minute. I'm guessing this has to do with the way the data was scraped from the original sites (many goals may be listed as 45+ or 90+). With the first tool I was able to figure out what the real minutes were, but it was a grueling process.
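The two cleaning steps can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the record fields (`match`, `scorer`, `minute`) are hypothetical and do not reflect how SoccerStats.us actually stores its data. A label like "90+" with no added minutes is flagged rather than guessed at, because the real minute still has to be looked up by hand.

```python
import re

# Matches labels such as "67", "45+", "90+3" or "90+3'".
MINUTE_RE = re.compile(r"^\s*(\d+)\s*(?:\+\s*(\d*))?\s*'?\s*$")


def parse_minute(label):
    """Split a minute label into (base minute, added minutes).

    "67"   -> (67, 0)
    "90+3" -> (90, 3)
    "90+"  -> (90, None)   # stoppage time, exact minute unknown
    """
    m = MINUTE_RE.match(label)
    if m is None:
        raise ValueError(f"unrecognized minute label: {label!r}")
    base = int(m.group(1))
    if m.group(2) is None:   # no plus sign at all
        return base, 0
    if m.group(2) == "":     # plus sign with nothing after it
        return base, None
    return base, int(m.group(2))


def drop_duplicate_goals(goals):
    """Keep the first copy of each (match, scorer, minute label) record."""
    seen = set()
    unique = []
    for goal in goals:
        key = (goal["match"], goal["scorer"], goal["minute"])
        if key not in seen:
            seen.add(key)
            unique.append(goal)
    return unique
```

One caveat on the duplicate check: a player who really did score twice under the same "90+" label would be collapsed into one goal, so anything it drops in stoppage time deserves a second look against the archived box score.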

Last night I used the data to get measures of how competitive MLS is. It is a common opinion that the league is more competitive than the European leagues and perhaps more competitive than other American sports leagues. But there's no need for conjecture! This hypothesis can be tested using the simple methods described in the fantastic book by Rodney Fort and James Quirk, Pay Dirt, and, if you prefer a more academic treatment, in the Journal of Economic Literature article by the same authors.
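To give a flavor of what such a test can look like, here is a Python sketch of one measure that is common in the sports economics literature: the ratio of the actual standard deviation of teams' winning percentages to the "idealized" standard deviation of a league where every game is a coin flip, 0.5 / sqrt(games played). Whether this is the measure behind the graphs mentioned here is an assumption on my part. Counting a draw as half a win and using the population standard deviation are also assumptions, and the idealized formula ignores draws, which matters for soccer.

```python
import math
from statistics import pstdev


def win_percentages(records):
    """Map each team to its winning percentage.

    `records` maps team -> (wins, draws, losses).
    Assumption: a draw counts as half a win.
    """
    return {
        team: (w + 0.5 * d) / (w + d + l)
        for team, (w, d, l) in records.items()
    }


def balance_ratio(records):
    """Actual SD of winning percentages divided by the idealized SD.

    A value near 1 suggests a league about as balanced as coin flips;
    larger values suggest less balance.
    """
    pcts = list(win_percentages(records).values())
    games = [sum(record) for record in records.values()]
    mean_games = sum(games) / len(games)
    idealized_sd = 0.5 / math.sqrt(mean_games)
    return pstdev(pcts) / idealized_sd
```

As a sanity check with made-up records, a league where every team finishes level, such as `{"A": (5, 0, 5), "B": (5, 0, 5)}`, gives a ratio of 0, while the more lopsided the final table, the larger the ratio gets.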

The graphs are made and the results are in, but my Saturday morning has become Saturday afternoon and this apartment isn't going to clean itself. So next time I'll show you what I came up with and include data and R code so that you can reproduce what I've done. Right now the analytics community, which I do not consider myself a part of (nor would they recognize me), is very stingy with its data and procedures. It is insanely frustrating to read articles that use new data and techniques to analyze matches, only to be hit with a paywall or an "access denied" message when trying to recreate the analysis. As such, I will provide my data and code for any who wish to learn. See you then.
