Friday, September 25, 2015

Competitive Balance in MLS

Last time I talked about how I obtained the MLS game data I'll be using in my research. The Wayback Machine is incredibly useful for finding things that used to be online. It's also a great reminder that whatever you put on the internet is likely permanent in some way. In this entry, I'm going to test how competitive MLS is by comparing the distribution of team outcomes to an "ideal" measure.

The motivation for this experiment comes from a similar analysis in Pay Dirt and Fort and Quirk (1995), both of which derive ideal distributions of win percentages for teams in equally balanced leagues. If we want to take their methods and apply them to soccer leagues, we must derive our own distribution of outcomes since the authors only consider leagues that have no ties.  Thus, instead of an ideal distribution for winning percentages, I'll derive an ideal distribution for percentage of possible points earned.

With 3 points awarded for a win, 1 for a tie, and 0 for a loss, the most a team could earn in a season with 30 games is 90. Therefore, if a team earned only 45 points in a season, it would have earned 45/90 × 100% = 50% of the points possible. Let's see what an equally balanced or "ideal" league would look like below.

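Consider an MLS season where each team has a certain number of games played (gp). If each team is of equal playing strength, then the probability of winning, losing, or tying any game is P_W = P_L = P_T = 1/3. That is, the outcome of each game follows a trinomial distribution. Let X denote the points earned from a single game (3, 1, or 0), and treat the games as independent. The expected number of points earned from gp games for any team is then

$$E[\mathrm{Pts}] = gp \cdot E[X] = gp\,(3P_W + 1P_T + 0P_L) = \frac{4}{3}\,gp$$

and the variance in points is

$$\mathrm{Var}[\mathrm{Pts}] = gp\,\left(E[X^2] - (E[X])^2\right)$$

where

$$E[X^2] = 9P_W + 1P_T + 0P_L = \frac{10}{3}$$

and

$$(E[X])^2 = \left(\frac{4}{3}\right)^2 = \frac{16}{9}$$

so that

$$\mathrm{Var}[\mathrm{Pts}] = gp\left(\frac{10}{3} - \frac{16}{9}\right) = \frac{14}{9}\,gp$$

Converting to percentage of points possible, the expected value becomes

$$E\left[\frac{\mathrm{Pts}}{3\,gp}\right] = \frac{4\,gp/3}{3\,gp} = \frac{4}{9}$$

and the variance becomes

$$\mathrm{Var}\left[\frac{\mathrm{Pts}}{3\,gp}\right] = \frac{14\,gp/9}{9\,gp^2} = \frac{14}{81\,gp}$$

Since the percentage of points earned is an average of independent draws from this trinomial distribution, it will be approximately normally distributed (by the central limit theorem) with mean 4/9 and standard deviation

$$\sigma = \sqrt{\frac{14}{81\,gp}} = \frac{\sqrt{14}}{9\sqrt{gp}}$$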
If you'd like to run a simulation of such an ideal league with any number of teams and games played, I've written R code for you to do so.
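For readers who would rather see the idea than download a script, here is a minimal Python sketch of such a simulation. It is not the R code mentioned above. It assumes, as the derivation does, that each team's games are independent draws with win, tie, and loss each having probability 1/3 (teams are not paired against each other), and the function names are made up for this illustration. The theoretical values follow from those probabilities: mean 4/9 and standard deviation sqrt(14/(81·gp)).

```python
import math
import random


def ideal_mean_sd(gp):
    """Theoretical mean and sd of the share of possible points
    when P_W = P_T = P_L = 1/3 and a season has gp games."""
    return 4.0 / 9.0, math.sqrt(14.0 / (81.0 * gp))


def simulate_team_shares(n_teams, gp, seed=0):
    """Simulate n_teams independent team-seasons of gp games each.

    Assumption: every game is an independent draw of 3, 1, or 0 points
    with equal probability; opponents are not modelled explicitly.
    Returns each team's share of the 3 * gp possible points.
    """
    rng = random.Random(seed)
    shares = []
    for _ in range(n_teams):
        points = sum(rng.choice((3, 1, 0)) for _ in range(gp))
        shares.append(points / (3 * gp))
    return shares


if __name__ == "__main__":
    gp = 30
    shares = simulate_team_shares(n_teams=5000, gp=gp, seed=42)
    sim_mean = sum(shares) / len(shares)
    sim_sd = math.sqrt(sum((s - sim_mean) ** 2 for s in shares) / (len(shares) - 1))
    theo_mean, theo_sd = ideal_mean_sd(gp)
    print(f"simulated:   mean={sim_mean:.4f}, sd={sim_sd:.4f}")
    print(f"theoretical: mean={theo_mean:.4f}, sd={theo_sd:.4f}")
```

With a few thousand simulated team-seasons the simulated mean and standard deviation should sit close to the theoretical ones, which is a quick sanity check on the formulas.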

One issue with comparing actual distributions of percentage of points earned to the ideal distribution is that the latter depends on the number of games in a season, which differs across seasons in MLS. For this reason, I've combined outcomes for all seasons with the same number of games. Also, since the first four years of MLS did not have ties (drawn games were settled by a shootout), the formula above does not apply to those years. Therefore, for 1996-2000, I use "hypothetical points", which are the points teams would have earned had there been no shootout. The unit of observation is then the percentage of points earned by a particular team in a season with the same number of games as all other observations. So how competitive is MLS exactly?


In the figure above, I use a kernel density estimate of the actual percentages earned (blue line) and compare this to the ideal distribution (black line). For the years where 32 games were played, it seems that the actual distribution of percentage of points earned is a little too heavy in the tails to be considered ideal. There are too many teams at the low end that earned around 20% of the possible points and too many at the high end that earned 60% or more. A Kolmogorov-Smirnov test rejects the hypothesis that the actual percentages came from the ideal distribution (D = 0.265, p-value = 2.699e-05), although admittedly the test is not perfect since the observed percentages are not continuous but can only move in steps of 1/96.
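To make the test concrete, here is a small Python sketch of a one-sample Kolmogorov-Smirnov comparison against the ideal normal distribution (mean 4/9, standard deviation sqrt(14/(81·gp)), which follows from equal 1/3 probabilities). It is an illustration, not the script used for the figures: the function names are hypothetical, the shares are invented rather than MLS data, and the p-value uses the asymptotic Kolmogorov series, so it will generally not match an exact small-sample p-value such as R's `ks.test` can report.

```python
import math


def normal_cdf(x, mean, sd):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def ideal_cdf(gp):
    """CDF of the ideal share-of-points distribution for a gp-game season."""
    mean = 4.0 / 9.0
    sd = math.sqrt(14.0 / (81.0 * gp))
    return lambda x: normal_cdf(x, mean, sd)


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of sample and the given cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Check the gap just after and just before each jump of the ECDF.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue_asymptotic(d, n, terms=100):
    """Asymptotic two-sided p-value: 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 lam^2),
    with lam = sqrt(n) * d. An approximation, best for larger n."""
    lam = math.sqrt(n) * d
    if lam < 0.2:  # the series converges slowly here; the p-value is ~1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# Synthetic point totals for a 32-game season, invented for illustration only.
points = [20, 31, 38, 41, 43, 45, 47, 52, 58, 66]
shares = [p / 96 for p in points]  # 96 = 3 * 32 possible points

d = ks_statistic(shares, ideal_cdf(32))
p = ks_pvalue_asymptotic(d, len(shares))
print(f"D = {d:.3f}, asymptotic p-value = {p:.3f}")
```

Because shares in a 32-game season live on a grid of width 1/96, ties in the data are possible, which is another reason to read any p-value from this kind of test as a rough guide.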

For the years with 30 games played, the actual distribution lines up quite nicely with the ideal! Interestingly, this includes the designated player era, which started in 2007. Many fans bemoaned the arrival of David Beckham as the arrival of an imbalanced league where only the big cities would be able to compete by paying millions of dollars for global stars. However, there doesn't seem to be much evidence of this being the case in the figure below. The Kolmogorov-Smirnov test here cannot reject the claim that the observed percentages did in fact come from the ideal distribution (D = 0.09, p-value = 0.513). Comparing this figure to the others in this post as well as those of the other sports leagues (NFL, NBA, MLB, NHL) in Pay Dirt and Fort and Quirk (1995) suggests that MLS is the most competitive professional sports league in all of US/Canada history and the only one that I am aware of to conform, at least temporarily, to a balanced ideal!
And then the bad news for people who like an any-given-Sunday league. In the most recent seasons, where 34 games were played, the distribution of percentages is no longer ideal (D = 0.20505, p-value = 0.003648). In fact, this may be the least competitive group of seasons. While there were minor tweaks to salary and designated player rules, nothing immediate stands out in this time period to suggest why the league has become so much less competitive. However, this has not had any noticeable effect on league attendance or revenues, as some may have feared it would.

Overall, we see that MLS is a relatively competitive league with at least one time frame of ideal levels of competition, as far as percentage of points earned goes. However, this level of competitiveness seems to be diminishing over time, as the most recent period deviates the most from what we may consider an ideal league to be.

For anyone interested in how the above was carried out, I've uploaded the data set and R script to recreate the Figures and Kolmogorov-Smirnov tests.  You could easily extend the code to match your favorite soccer league once you've downloaded the appropriate data set. 

Moving forward, I'm going to begin my investigation of the effects of the MLS shootout on team strategies. In the meantime, if you have any MLS-related question you think a bunch of data could answer, let me know and I'll see what I can come up with. Until then, friends.

Saturday, September 19, 2015

Greetings! Here are 2 tools to help you free soccer data.

As I work on a "revise and resubmit" for an academic journal, I come across many new (at least to me) sports economics models and ideas. My first thought is always to apply these new concepts to my drug of choice, Major League Soccer (MLS). Rather than wait a year or so until many small ideas coalesce into a paper that will be presented to a room of perhaps 20 sports economists, I think this time I'll share the little steps along the way with anyone who will listen. I needed a place to put the thoughts and ideas I've been having about soccer. So here we are. My soccer scratch pad online. An electronic version of what I've been doing on my Fridays most of the summer.

The path to understanding and carrying out soccer analytics is not straightforward; it's sort of the wild west at the moment.  One surprising problem I came across was the difficulty in obtaining simple MLS data.  Want to see the goalscorers' names and minutes for the 4-3 goal-fest of Columbus vs. New England in August of 2012?  When you try to get the box scores from mlssoccer.com, you'll get an "access denied" page.  Is this a technical problem? OPTA restricting access to data so they can charge us for the information?  Your guess is as good as mine as my emails go unanswered.

Luckily, we have at least two weapons in our arsenal to free the data. First, the Wayback Machine, brought to you by the beautiful people at the Internet Archive. It's an amazing tool that lets you view webpages as they were back in time. So we can see what the box score page for the Columbus vs. New England game looked like in 2012. The data is freed!
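If you want to look up archived pages programmatically rather than through the browser, the Internet Archive offers an availability endpoint that returns the snapshot closest to a given date as JSON. The Python sketch below only builds the query URL and parses a response. The page address and the sample response are hypothetical placeholders (not the real box score URL), and the helper names are made up for this illustration.

```python
import json
from urllib.parse import urlencode

# Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(page_url, timestamp):
    """Build the query URL asking for the snapshot of page_url
    closest to timestamp (format YYYYMMDD or YYYYMMDDhhmmss)."""
    query = urlencode({"url": page_url, "timestamp": timestamp})
    return AVAILABILITY_ENDPOINT + "?" + query


def closest_snapshot(payload):
    """Return the archived URL from a JSON response, or None if
    no snapshot is reported as available."""
    closest = json.loads(payload).get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Hypothetical page address, used only to show the shape of the query.
print(build_availability_url("example.com/some/boxscore", "20120801"))

# Hand-written sample response in the shape the endpoint returns.
sample = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20120815000000",
            "url": "http://web.archive.org/web/20120815000000/http://example.com/some/boxscore",
        }
    }
})
print(closest_snapshot(sample))
```

Fetching the query URL (for example with `urllib.request.urlopen`) and passing the response body to `closest_snapshot` would complete the loop. That step is left out here so the sketch runs without a network connection.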

Second, we have a great database of soccer knowledge at SoccerStats.us. There are a handful of issues with the data. Some goals are listed twice, and many goals scored in extra time are listed as being scored in the 45th or 90th minute when they were actually scored in, say, the 46th or 93rd minute. I'm guessing this has to do with the way the data was scraped from the original sites (many goals may be listed as 45+ or 90+). With the first tool I was able to figure out what the real minutes were, but it was a grueling process.

Last night I used the data to get measures of how competitive MLS is. It is a common opinion that the league is more competitive than the European leagues and perhaps more competitive than other American sports leagues. But there's no need for conjecture! This hypothesis can be tested using the simple methods described in the fantastic book written by Rodney Fort and James Quirk, Pay Dirt, and, if you prefer a more academic treatment, in the Journal of Economic Literature article by the same authors.

The graphs are made and the results are in, but my Saturday morning has become Saturday afternoon and this apartment isn't going to clean itself. So next time I'll show you what I came up with and include data and R code so that you can reproduce what I've done. Right now the analytics community, which I do not consider myself a part of, nor would they recognize me, is very stingy with its data and procedures. It is insanely frustrating to read articles using new data and techniques to analyze matches only to be hit with a paywall or "access denied" messages when trying to recreate the analysis. As such, I will provide my data and code for any who wish to learn. See you then.