Friday, September 25, 2015

Competitive Balance in MLS

Last time I talked about how I obtained the MLS game data I'll be using in my research.  The Wayback Machine is incredibly useful for finding things that used to be online.  It's also a great reminder that whatever you put on the internet is likely permanent in some way.  In this entry, I'm going to test how competitive MLS is by comparing the distribution of team outcomes to an "ideal" measure.

The motivation for this experiment comes from a similar analysis in Pay Dirt and Fort and Quirk (1995), both of which derive ideal distributions of win percentages for teams in equally balanced leagues. If we want to take their methods and apply them to soccer leagues, we must derive our own distribution of outcomes since the authors only consider leagues that have no ties.  Thus, instead of an ideal distribution for winning percentages, I'll derive an ideal distribution for percentage of possible points earned.

With 3 points awarded for a win, 1 for a tie, and 0 for a loss, the most a team could earn in a season with 30 games is 90.  Therefore, if a team earned only 45 points in a season, it would have earned 45/90 × 100% = 50% of the points possible.  Let's see what an equally balanced or "ideal" league would look like below.

Consider an MLS season where each team has a certain number of games played (gp).  If each team is of equal playing strength, then the probability of winning, losing, or tying any game is PW = PL = PT = 1/3.  That is, the outcome of each game is a single draw from a trinomial distribution, and I treat games as independent of one another.  The expected number of points earned from gp games for any team is then

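$$E[\text{points}] = gp\left(3\cdot\tfrac{1}{3} + 1\cdot\tfrac{1}{3} + 0\cdot\tfrac{1}{3}\right) = \frac{4}{3}\,gp$$

and the variance in points is

$$Var[\text{points}] = gp\left(E[X^2] - E[X]^2\right)$$

where X is the number of points earned in a single game,

$$E[X^2] = 3^2\cdot\tfrac{1}{3} + 1^2\cdot\tfrac{1}{3} + 0^2\cdot\tfrac{1}{3} = \frac{10}{3}$$

and

$$E[X]^2 = \left(\frac{4}{3}\right)^2 = \frac{16}{9}$$

so that

$$Var[\text{points}] = gp\left(\frac{10}{3} - \frac{16}{9}\right) = \frac{14}{9}\,gp$$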


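The per-game moments behind the expected value and variance of points can be double-checked with exact fractions.  This is only a small Python sketch (not my R script); the function name `ideal_moments` is made up for illustration, and it assumes independent games with each outcome having probability 1/3.

```python
from fractions import Fraction

# Points for a win, tie, and loss, each with probability 1/3 (equal strength).
POINTS = (3, 1, 0)
PROB = Fraction(1, 3)


def ideal_moments(gp):
    """Return (mean, variance) of season points over gp independent games."""
    mean_game = sum(PROB * x for x in POINTS)          # E[X] = 4/3
    mean_sq_game = sum(PROB * x * x for x in POINTS)   # E[X^2] = 10/3
    var_game = mean_sq_game - mean_game ** 2           # Var[X] = 14/9
    # Means and variances of independent games add up over the season.
    return gp * mean_game, gp * var_game


mean_pts, var_pts = ideal_moments(30)
print(mean_pts, var_pts)  # 40 140/3
```

So in a perfectly balanced 30-game season, a team would expect 40 points with a variance of 140/3 points squared.
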
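Converting to percentage of points possible, that is, dividing points by the 3gp points available, the expected value becomes

$$E\left[\frac{\text{points}}{3\,gp}\right] = \frac{(4/3)\,gp}{3\,gp} = \frac{4}{9}$$

and the variance becomes

$$Var\left[\frac{\text{points}}{3\,gp}\right] = \frac{(14/9)\,gp}{(3\,gp)^2} = \frac{14}{81\,gp}$$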

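Since the percentage of points earned is an average over gp independent draws from a trinomial distribution, it will be approximately normally distributed (by the central limit theorem) with mean 4/9 and standard deviation

$$\sigma = \sqrt{\frac{14}{81\,gp}} = \frac{\sqrt{14}}{9\sqrt{gp}}$$

which is about 0.076 for a 30-game season.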

If you'd like to run a simulation of such an ideal league with any number of teams and games played, I've written an R script for you to do so.
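For readers without R, here is a rough Python sketch of the same idea.  It is not the R script itself: the function name `simulate_ideal_league` and its parameters are hypothetical, and it assumes each team's games are drawn independently with win, tie, and loss each having probability 1/3, so opponents are not actually paired up.

```python
import random
import statistics


def simulate_ideal_league(n_teams, gp, seed=None):
    """Simulate each team's share of possible points in a balanced league.

    Every game is an independent draw of 3, 1, or 0 points with equal
    probability; teams are simulated separately rather than matched up.
    """
    rng = random.Random(seed)
    shares = []
    for _ in range(n_teams):
        points = sum(rng.choice((3, 1, 0)) for _ in range(gp))
        shares.append(points / (3 * gp))
    return shares


# A large number of teams makes the sample moments easy to compare to theory.
shares = simulate_ideal_league(n_teams=10_000, gp=30, seed=1)
print(round(statistics.mean(shares), 3))    # should be close to 4/9
print(round(statistics.pstdev(shares), 3))  # should be close to sqrt(14 / (81 * 30))
```

Changing `gp` shows how the spread narrows as seasons get longer, which is why seasons of different lengths cannot be compared against a single ideal curve.
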

One issue with comparing actual distributions of percentage of points earned to the ideal distribution is that the latter depends on the number of games in a season, which differs across seasons in MLS.  For this reason, I've combined outcomes for all seasons with the same number of games.  Also, since games in the first four years of MLS could not end in ties but were instead decided by a shootout, the formula above does not apply to those years.  Therefore, for 1996-2000, I use "hypothetical points", which are the points teams would have earned had there been no shootout.  The unit of observation is then the percentage of points earned by a particular team in a season with the same number of games as all other observations.  So how competitive is MLS exactly?


In the figure above, I use a kernel density estimate of the actual percentages earned (blue line) and compare it to the ideal distribution (black line).  For the years where 32 games were played, it seems that the actual distribution of percentage of points earned is a little too heavy in the tails to be considered ideal.  There are too many teams at the low end that earned around 20% of the possible points and too many at the high end that earned 60% or more.  A Kolmogorov-Smirnov test rejects that the actual distribution of percentages came from the ideal distribution (D = 0.265, p-value = 2.699e-05), although admittedly the test is not perfect, since the percentages do not come from a continuous distribution but can only move in steps of 1/96.
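To make the test concrete, here is a minimal Python sketch of a one-sample Kolmogorov-Smirnov comparison against the ideal normal distribution.  It is not the script behind the numbers in this post: the helper names are made up, the p-value uses the large-sample Kolmogorov series (statistical software may use a different method for small samples, so results can differ), and the sample is synthetic rather than MLS data.

```python
import math
import random


def normal_cdf(x, mean, sd):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def ks_statistic(sample, mean, sd):
    """One-sample Kolmogorov-Smirnov D against Normal(mean, sd)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mean, sd)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue_asymptotic(d, n):
    """Large-sample p-value from the Kolmogorov distribution series."""
    if d <= 0:
        return 1.0
    t = math.sqrt(n) * d
    total, k = 0.0, 1
    while True:
        term = math.exp(-2.0 * k * k * t * t)
        total += term if k % 2 == 1 else -term
        if term < 1e-12:  # remaining terms are negligible
            break
        k += 1
    return min(1.0, max(0.0, 2.0 * total))


# Synthetic illustration only (not MLS data): draws from the ideal normal itself.
rng = random.Random(0)
gp = 32
ideal_mean, ideal_sd = 4 / 9, math.sqrt(14 / (81 * gp))
sample = [rng.gauss(ideal_mean, ideal_sd) for _ in range(100)]
d = ks_statistic(sample, ideal_mean, ideal_sd)
print(d, ks_pvalue_asymptotic(d, len(sample)))
```

D is the largest vertical gap between the empirical and ideal cumulative distributions, so heavy tails like the ones described above push it up and the p-value down.
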

For the years with 30 games played, the actual distribution lines up quite nicely with the ideal! Interestingly, this includes the designated player era, which started in 2007.  Many fans bemoaned the arrival of David Beckham as the start of an imbalanced league where only the big cities would be able to compete by paying millions of dollars for global stars.  However, there doesn't seem to be much evidence of this being the case in the figure below. The Kolmogorov-Smirnov test here cannot reject the claim that the observed percentages did in fact come from the ideal distribution (D = 0.09, p-value = 0.513).  Comparing this figure to the others in this post as well as those of the other sports leagues (NFL, NBA, MLB, NHL) in Pay Dirt and Fort and Quirk (1995) suggests that MLS is the most competitive professional sports league in all of US/Canada history and the only one that I am aware of to conform, at least temporarily, to a balanced ideal!

And then there's the bad news for people who like an any-given-Sunday league.  In the most recent seasons where 34 games were played, the distribution of percentages is no longer ideal (D = 0.20505, p-value = 0.003648).  In fact, this may be the least competitive group of seasons.  While there were minor tweaks to salary and designated player rules, nothing immediately stands out in this time period to suggest why the league has become so much less competitive.  However, this has not had any noticeable effect on league attendance or revenues, as some may fear it would.

Overall, we see that MLS is a relatively competitive league with at least one time frame of ideal levels of competition, as far as percentage of points earned goes.  However, this level of competitiveness seems to be diminishing over time, as the most recent period deviates the most from what we may consider an ideal league to be.

For anyone interested in how the above was carried out, I've uploaded the data set and R script to recreate the figures and Kolmogorov-Smirnov tests.  You could easily extend the code to match your favorite soccer league once you've downloaded the appropriate data set.

Moving forward, I'm going to begin my investigation into the effects of the MLS shootout on team strategies.  In the meantime, if you have any MLS-related question you think a bunch of data could answer, let me know and I'll see what I can come up with.  Until then, friends.

2 comments:

  1. Why not model as Poisson? No need to smooth.

    1. Good suggestion! Using a Poisson would assume the mean and variance are the same. However, I checked the data and the variance is way smaller than the mean (less than a quarter of the mean). I think a discrete histogram with bins equal to 1/(points possible) would model the actual distribution perfectly but the continuous lines look nicer in pictures. Also, I'm copying what Quirk and Fort did.
