The Accuracy of Expected Goals in The Premier League

Skidmore College
Creative Matter
Economics Student Theses and Capstone Projects
Economics
Spring 5-5-2024

Max Mian
Recommended Citation
Mian, Max, "The accuracy of expected goals in the Premier League" (2024). Economics Student Theses
and Capstone Projects. 164.
https://creativematter.skidmore.edu/econ_studt_schol/164
Max Mian
Professor Das
Thesis
29/3/2024
ROUGH DRAFT: THE ACCURACY OF EXPECTED GOALS IN THE
PREMIER LEAGUE
For my topic I will be analyzing the accuracy of expected goals in the 2014-2015
Premier League season. Expected goals (xG) is a statistical metric used in soccer to quantify the
likelihood of a particular scoring opportunity resulting in a goal. It assesses the quality of a
goal-scoring opportunity based on various factors: distance of the shot, angle, location, and
players in the way. My main research question is: How accurately do expected goals predict a
team's finishing position in the 2014-15 Premier League season? Teams, coaches, and analysts
have become very dependent on the expected goals tool.
The goal of this metric is to provide a deeper evaluation of a player's and a team's
performance beyond just the number of goals scored. By assigning each shot a probability, a team
can better understand whether it is creating high-quality chances, getting unlucky, or
benefiting from luck. This analytical tool has recently become very popular, as the final score of
a game does not always reflect the chances a team had.
In my thesis I am going to explore how accurate xG really is in reflecting the actual
score and whether it is a dependable analytics tool for the sport. Overall, if we
were to take expected goals over a whole season for all teams and games, how accurately would
it describe the games and finishing positions of teams in the Premier League in the 2014-15
season? These are the main questions I will be aiming to answer in my thesis.
Researching this topic is relevant because the Premier League is the most watched soccer
league in the world and generates hundreds of millions in revenue. Finishing in a higher
position at the end of the season brings a team more revenue, so an analytics tool that gives
a team an advantage over its rivals can be very beneficial. Furthermore, analytics in soccer is
an area of research that is growing rapidly. Being able to add to the literature is exciting for
people, like myself, who are passionate about the sport and the science behind it.
To show how expected goals look in practice, I have provided Image 1, which shows what one
game looks like based on expected goals. The game, which I will be using in my research, is
Stoke City against Liverpool; Stoke City won 6 to 1. From the result alone one would think
that Stoke City was a much better team and created many more goal scoring opportunities than
Liverpool. But based on expected goals the score would be Stoke City 2, Liverpool 1. This is a
much smaller margin than the actual score; according to expected goals, Stoke City created
only slightly better chances than Liverpool, and judged purely on the chances created the
score should have been 2 to 1.
For the time being, games and seasons are not decided by the score that expected goals
gives an individual game. In the future this might change, as pundits, coaches, and analysts
already depend heavily on the expected goals created during a game. In other words, if a team
tied a game 1 to 1 but according to expected goals should have won 5 to 1, the coaches and team
will continue to play the same system and formation, trusting that in the long run the chances
they create (or their high expected goals) in future games
will translate to actual goals. Coaches will depend on certain playstyles and formations based on
their xG output.
The Premier League is one of the most profitable sports leagues in the world, where the
smallest of details can make a huge difference to a team's finishing position. Any
further insight into an opposing team or your own team is highly valued. Last year the Premier
League came out with a statistic showing that scoring one Premier League goal cost, on
average, 3 million US dollars (EPL). This included summer signings, renovations of facilities,
and sales of players. This shows how costly it is to operate a team in the Premier League.
Expected goals have a unique and important value that adds to the innovation of soccer.
They offer a more objective evaluation of a team's performance: instead of just looking
at goals scored, which are more heavily influenced by luck, expected goals represent the quality
of chances created, which helps assess how a team played. Through tactical analysis, coaches and
analysts can determine how effective different tactics and strategies were based on
chance creation, in other words, the expected goals for and against in a game. They do this by
understanding where and when scoring opportunities are being created and how likely
each chance is to be converted into a goal.
It also helps determine the individual performance of players. Expected goals can
be used to determine whether certain players are overperforming or underperforming. For example,
a striker might have 10 goals from five expected goals; this means they are overperforming
their xG by five goals, which is very impressive. Compare this to a striker on the same team who
has 11 goals from 16 xG; this striker is underperforming by five goals. Expected goals help
a team see on a quantifiable level who is playing better and which striker would benefit the
team's attacking play.
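The striker comparison above comes down to subtracting expected goals from actual goals. The short Python sketch below only illustrates that arithmetic; the function name and the labels "Striker A" and "Striker B" are hypothetical and stand in for the two strikers in the example.

```python
def xg_difference(goals, expected_goals):
    """Goals scored minus expected goals; a positive value means overperformance."""
    return goals - expected_goals


# The two strikers from the example: (goals, xG). The names are placeholders.
strikers = {"Striker A": (10, 5.0), "Striker B": (11, 16.0)}

for name, (goals, xg) in strikers.items():
    diff = xg_difference(goals, xg)
    if diff > 0:
        label = "overperforming"
    elif diff < 0:
        label = "underperforming"
    else:
        label = "on par"
    print(f"{name}: {goals} goals from {xg} xG -> {diff:+.1f} ({label})")
```

Note that the striker with more goals (11) is the one underperforming, which is the point of looking at xG rather than goals alone.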
Expected goal metrics can also be used in the world of soccer for scouting and
recruitment. Using the same steps as mentioned previously, teams can identify talented players
by observing whether they are overperforming their expected goals, looking at their ability to
create and convert scoring chances. Especially if a team plays a similar system, expected goals
provide a more comprehensive picture of the potential value a player could bring to the team if
they were signed. Expected goals are now being used as one of the leading tools to identify
talented players.
But expected goals stretch further than the understanding of coaches, staff, and
bookmakers. They also increase engagement with and understanding of a game, team, season, or
player. As xG provides deeper understanding than just the final scoreline, fans can use it as
another layer of discussion and analysis of the game. For many this increases the overall
understanding of a game, and for some more passionate fans it can increase the enjoyment of the
game. The rise of expected goals in the world of soccer has increased understanding of the game
so much that teams and coaches have come to depend massively on it to assess the outcome of a
game or season. In this thesis I will be analyzing how important expected goals are: using the
2014-15 Premier League season, I will determine how effective xG is at predicting teams'
finishing positions.
LITERATURE REVIEW
Because of the relevance and rise of expected goals I have had the luck to read many
papers related to this topic which have been written very recently. Because this topic is so
new, many researchers are able to test different models, since there is not a perfect model just
yet and it is constantly being updated to be more accurate. For my literature review I will be
explaining
papers in the order that I read them. This makes the most sense because the
first couple of papers explain the basic studies behind expected goals, while the ones I read a
little later develop more intricate models of expected goals.
The first paper I want to take a look at is one written by Alex Rathke in the Journal of
Human Sport and Exercise at the University of Alicante. This was one of the first papers to study
expected goals, as expected goals have only been around for about a decade. The goal of this
paper was to analyze what factors were associated with determining expected goals. The factors
analyzed were distance of shot and angle of shot. What is important to consider is that expected
goals have evolved a lot since then, which means more factors are now under consideration, but
for this paper those were the two main factors. They found that lower and higher league teams
over- or underachieved their expected goals compared to mid-table teams. Angle and shot
location, as imagined, had a major effect on calculating xG.
Rathke believed that showing players their xG could help them in attacking phases by
showing them from where and how to strike certain shots, or for defenders how they should be
positioned to prevent these shots. I thought it was pretty interesting how intricate and detailed
coaching can be due to expected goals. This paper is interesting because it breaks down the
accuracy of xG per team and player across different European professional leagues, and it is also
one of the first papers to research xG in such a detailed manner (Rathke 2017).
The next paper I looked at covered an area I had not read much on:
expected goals in women's soccer. In theory, the results should be similar, but I found it
interesting that after reading through so much literature this was the first one I found on women.
The women's football game has made huge advances in recent years. When the first official
FIFA Women's World Cup took place in 1991, it featured matches that lasted only 80 minutes
and the final was not even shown on TV. Less than 30 years later, more than 1 billion viewers
watched the 2019 World Cup final between the Netherlands and the USA. Concurrently, financial
investment in the women's club realm has increased the number of players who are able to play
professionally.
This research takes a first in-depth analytical look, using machine learning, at the technical
data that is now being collected from professional women's matches. Specifically, the authors
focus on shots, as the fundamental objective of football is to score more goals than your
opponent, and as Johan Cruijff famously said: "you can't score if you don't shoot." A natural way
to analyze shots and shot behavior is through the lens of the well-known expected goals (xG)
metric, which gives the probability that a shot will yield a goal. Most of this paper consists of
graphs and tables which show the different xG values across leagues; during my next presentation
I will make sure to show them. In short, the paper performed an extensive analysis of women's
football shots.
They identified interesting observations such as the fact that women tend to shoot from
different locations than men, have a higher shot conversion rate and their goals are differently
distributed across the season. They trained six different xG models on different data sets with
different machine learning algorithms and found that, in general, models from one gender are
applicable to shots from the other. However, when inspecting the models and shots, some
interesting differences arose in terms of what features are important and how the models value
certain types of shots. This paper is unique because I have found almost nothing else that has
been written on expected goals in the women's game (Bransen 2021).
One paper, published in the journal of sports research, fell perfectly in line as a continuation
of the previous paper and aims to answer some of the thesis questions I posed earlier.
This paper, Partida 2021, is one of the most important papers for continuing my research.
The goal of the paper was to examine the predictive capabilities of expected goals across the top
leagues in Europe. They did this by collecting the expected goals of 310 games from the
German, Spanish, and Italian leagues. They were able to collect data from games
that are publicly available, which is very helpful for me as my research will also rely on public
data. They collected their expected goals from understat.com and from there created statistical
models. This was a pleasant coincidence since the data I am looking at is from the same website,
which means this paper can be very useful to my work. Something a little different that they did
was take data from a betting website, as they were also comparing xG to the betting
probabilities of game outcomes.
They then created three different models based on the expected goals and compared
them. Two well-established probability models used binomial deviance, squared error, and
betting probabilities. One of the models took xG and subtracted xGA to determine how
successful teams are. This model is the most unbiased, as it only describes the difference in
the goal scoring opportunities that the home and away teams had, and from that concludes which
team had the better opportunities and hence should be the winner. In my research I will be using
the same equation. The best model suggests that expected goals are most accurate for teams with
home advantage. Two of the other models were profitable under very specific betting
conditions. One limitation they found is that their models and expected goals did not include
factors such as a team's defensive prowess (Partida 2021).
They concluded that expected goals can provide meaningful insight, but that was about it.
For the betting side of their research they concluded that the predictive capabilities of their
models were strong enough to be profitable under certain conditions. A further analysis of
profitability
in football betting could involve the investigation of the accuracy necessary to be profitable
when betting on all of the games, as well as the use of parlays (betting on multiple games at
once) as a betting strategy. To have the most accurate numbers possible, the researchers looked
at the confidence interval, bias, variance, and irreducible error.
This research focused on four different aspects of the game (and expected goals), which
made the paper a bit confusing, since it covers so many areas. For each one of their chapters a
different algorithm was created, which meant different outcomes were produced. But although they
had different outcomes, they all agreed that one limitation was the quality of each individual
goal scoring opportunity, meaning the variance between goal scoring opportunities was really
high; they suggested that future work should try to find a way to reduce this variance. Since
expected goals assume the 'average' shooter, we cannot determine the quality of the shooter.
Fortunately, there is work which has been done on this exact limitation, which I will talk about
later in this paper (Partida 2021).
The next paper I read looked at the difference in expected goals across four
different levels of German football: Bundesliga, Regionalliga, U19, and U17. They used the same
model that Brentford uses, a Premier League team which was one of the first to use xG in their
analysis of games. They collected data from two different seasons from all four tiers of
football. For their model they used a regression model of the kind that is widely used when
looking at chance creation and scoring.
The researchers found that players in higher levels tend to take riskier shots and aim for
top corners, even though the chance of missing is higher. They also found that distance from goal
contributions varied significantly between the four leagues. Interestingly, they found that as
goalkeepers get older they have a harder time saving shots further away from their body, so
bottom corner shots. This is interesting because goalkeeper quality has not been mentioned
before in xG, and going forward it might be a topic that is spoken about more. What I also found
interesting was that the xG was mostly the same across leagues; this makes sense as the level
within each of the 4 tiers is even relative to the level they play at. But the researchers also
found that in the highest tier, the Bundesliga, players took fewer shots from outside the box.
This may support a hypothesis I mentioned before: because of all this data, players might be
told, instead of going for the goal of the year and shooting from 40 yards out, to shoot from
closer, as the probability of a goal is higher. But at the same time it takes away from the
essence of the game.
The next couple of papers will look at the specifics of how xG is actually calculated,
in contrast to the past two papers, which created models to see whether expected
goals, or the models surrounding them, are even accurate.
For the next paper, I am going to introduce a paper by the analytics department at PSV
Eindhoven, a Dutch first division team that is regularly considered one of the best clubs in
the Netherlands. These next couple of papers will show what more specialized xG looks like. The
authors begin by explaining how the rise of analytics in soccer was directly influenced by
baseball and the whole Moneyball era of the Oakland A's, a team which looked at players through
a different lens and tried to see when and where a player is most efficient, instead of judging
him as simply good or bad.
This paper creates algorithms to see if the correlation between high-value chances and
expected goals is accurate. The study suggests that the skill of the player should be considered
in expected goals, because the variance of shots is larger when the skill of the player is lower
and vice versa (Eggels 2016).
The following paper studies something along the same lines but does so in the
professional soccer league in the U.S., the MLS. Published in the journal of sports analytics,
it looks at expected goals in the MLS, the top American professional soccer league. The authors
look at the accuracy of expected goals, and nothing jumps out as uncommon or weird. But what
makes this study interesting is that they add to the current xG model by including the rotation
of the ball. This may seem like a very minor detail, but it does have an effect on a shot and
the quality of a shot. In this study they aimed to investigate two issues. The first is building
a model which estimates the probability of a shot leading to a goal (this is where they add ball
rotation). The second is delving into possible relationships between a team's efficiency and
their actual expected goals.
In the study they quantified the probability of a chance being a goal (with their own xG
model). They would group shots by their predicted probability of being a goal and then compare
it to the fraction of shots from that group that ended up being goals. They then applied this
model to teams and specific players, which allowed them to draw conclusions about the offense
and defense of MLS teams (Fairchild 2018).
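The check described above, comparing a predicted goal probability with the share of shots in that group that were actually scored, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the choice of five equal-width probability groups, and the example shots are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def calibration_table(shots, n_bins=5):
    """Group shots by predicted probability and compare with the observed goal rate.

    shots: iterable of (predicted_probability, is_goal) pairs.
    Returns a list of (bin_low, bin_high, n_shots, mean_predicted, observed_rate).
    """
    bins = defaultdict(list)
    for prob, is_goal in shots:
        # Equal-width groups; a probability of exactly 1.0 goes in the top group.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((prob, is_goal))

    table = []
    for index in sorted(bins):
        group = bins[index]
        mean_predicted = sum(p for p, _ in group) / len(group)
        observed_rate = sum(1 for _, goal in group if goal) / len(group)
        table.append((index / n_bins, (index + 1) / n_bins,
                      len(group), mean_predicted, observed_rate))
    return table


# Hypothetical shots: (model probability, whether the shot was a goal).
example_shots = [
    (0.05, False), (0.08, False), (0.10, False), (0.12, True),
    (0.30, False), (0.35, True),
    (0.70, True), (0.75, True), (0.78, False),
]

for low, high, n, predicted, observed in calibration_table(example_shots):
    print(f"{low:.1f}-{high:.1f}: {n} shots, predicted {predicted:.2f}, observed {observed:.2f}")
```

If a model is well calibrated, the predicted and observed columns should be close within each group once there are enough shots.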
Interestingly enough, this past paper was not able to create a model or function which
showed what tailored xG should look like, but in this next paper the authors do exactly that.
The next academic paper I found was one written by scholars at the University of Cardiff
within their computer science department. They took the current model of xG and decided to
adjust it for players' positions and skill, a limitation which I have mentioned before. Their aim
was to evaluate models based on individual players, which would be useful in showing the
quality of players and their finishing. One thing I found useful about this paper is that it
talked a bit about how expected goals have evolved, even introducing what coaches thought of xG,
which for
context can be very useful. They also acquired their data from a publicly available data set,
StatsBomb. This paper went really in-depth, as it took shots taken by players and accounted for
their dominant foot, the keeper's distance from the center of goal, the number of opponents
within a 5-yard radius, and much more.
This paper also took the xG of multiple analytics companies so they would be able to
gather as much data as possible to determine which players are the best finishers. When looking
at players who surpassed their xG based on their model, one player was an obvious outlier: Lionel
Messi, recognized as one of the greats in the sport. They found that from an xG of 342 he scored
415 goals, surpassing his xG by 22%. The next closest player in their model was Luis Suarez,
another great finisher, who surpassed his xG by 15%. Something I found interesting is that they
made a Messi-adjusted xG model: basically, if every player was Messi, what would their xG be?
They created a whole table of players, but one player that stood out to me was Neymar, who
scored 52 goals from a proposed xG of 58, so he was underperforming his xG. But once they
adjusted for Messi's model, if Neymar was Messi, he would have an xG of 66. I just found it so
interesting how detailed they were, to the point that they could adjust for other players, which
was their main contribution to the existing literature.
Hewitt also mentioned that in the long run expected goals usually reflect accurately how
a player should be performing. Meaning if they overperformed their expected goals one year, it is
unlikely to continue that way, with the exception of Lionel Messi. It is interesting, but it
makes sense, that expected goals are more accurate when there are more observations,
because a player might be getting fortunate one season or seem to be operating well in a
system, but sooner or later opposing teams will modify their tactics to neutralize such players.
This paper is really important because it shows what xG should look like for specific
players. Although they only do so for one player, Messi, it is interesting to see what other
players' xG would look like if they were Messi. With that being said, hopefully in the future
they can make more models for specific players, because as of now we have an average xG which
just assumes the 'average' player (Hewitt 2023).
Another paper I read which had similar intentions was one written at the University of
Barcelona in collaboration with the soccer team F.C. Barcelona. The aim of
this project was to create a basic xG model and then build a second one with more information
about the shooter, so more tailored to the player, although the methodologies were a bit
different. They acquired their data through Opta, which is not publicly available.
The first model was built without specific information about a player, so the shots
taken were treated as if an average player took the shot. In the paper they explain their
results through tables and formulas. But the second part of the model involved qualitative
information to build xG data on the players they decided to pick out. To do this they decided to
use the ratings of a video game, FIFA, to give the players a score. Although FIFA is a complex
video game which has stats for every player, I would have expected the researchers to use a
different method to rate the players out of 100.
Similar to the past study, they found that for certain players a player-specific xG
model was more accurate than the average one that is used for
everyone. Some advice that they gave for future work was to start tracking everyone's shot
output at the beginning of the season, so that after a couple of seasons there is a huge amount
of data compiled which can then be applied to give every player their own rating and expected
goals.
The table that they showed, where each player has their own model, proved to be far more
accurate than the generic one that is used in regular games (Madrero 2020).
In conclusion, reading this related literature has helped me figure out what the gaps
are. For example, there is a lot of literature on shots and chance creation with many different
models, but little to no literature on the accuracy of expected goals in the Premier League and
for home and away teams. Many of the authors of these pieces concluded that what would make
expected goals more accurate is having specialized models for each player. The last two papers I
wrote about addressed this directly; for future work I believe more needs to be
done in this area to make expected goals as accurate as possible. With that being said, I am glad
there are some gaps in the literature which I am going to write about, with the goal of finding
conclusions and extending this work.
ANALYTICAL FRAMEWORK
Having data for my thesis is obviously make or break. Luckily all of my data is publicly
available at statsbomb.com, opta.com, and other websites with downloadable CSV data. The data I
have was all collected from a dataset on Kaggle, which holds all of the expected goal data
that I need. It has every expected goals and expected goals against result from the 2014-2015
Premier League season. Expected goal data is widely available to the public, and is similar
across sources since the methodology used to collect it is almost identical. The methodology was
created and shared by StatsBomb, an analytics company that provides data for professional teams
across many leagues.
My dataset contains all of the data needed for the 2014-2015 Premier League season. Reading
it left to right, as shown in Figure 1, the first columns are xG and xGA; these are the two most
important variables for my project, as they are my independent variables. The
next three columns are goals scored and goals scored against, which are not directly relevant to
my thesis but could potentially be important for describing the error, so I have kept them. Next
is the result of the game, so either a win, loss, or tie; regardless of the xG and xGA, this is
the actual result. This is important to have because when comparing it to my expected result I
will see whether they are the same or different. The following column is the points a team
gained from that game: if they won they got three points, if they tied they got one, and if they
lost they got none.
The next couple of columns describe the number of passes that a team had as well as the shots
that they took and gave up. Similar to the previous columns of actual goals scored or scored
against, these are not directly relevant to my paper, but in the long run they could help
explain my standard error if there is a lot of variance in my results. The last column is the
team name. In Figure 1 the only team name for the first 38 rows is Arsenal because I organized
the data in Excel by team name, and those first 38 rows are all of Arsenal's games that season,
meaning all of the columns to the left are the statistics for Arsenal's season. Unfortunately,
my data is a lot longer than Figure 1, so I am not able to show how extensive it is, as there
are 722 more rows of data. But Figure 1 is an example of what one team's data for the season
looks like. There are 19 more teams that are not shown, but their results (or rows) would be
structured the same way.
When teams play each other there can be three outcomes: a win, a tie, or a loss. Wins
account for 3 points, ties 1 point, and losses 0 points. Over the course of a season all these points
are added up and depending on how many points a team has they can become champions of the
league, qualify for European competitions, or get relegated. To be the champion you must be in
first place with the most points out of the 20 teams; if two or more teams have the same amount
of points the team with the highest goal differential, meaning goals scored minus goals conceded,
will be crowned champion of the Premier League. The top five teams qualify for European
competition for the following season under the same rules: they must be the five highest teams
on points, and if tied with other teams, goal differential applies. For relegation, the
bottom three teams with the fewest points get relegated to the second
division in England, the Championship; again, if tied on points with another team, the team with
the lowest goal differential finishes in the lower position.
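The ranking rules above (3 points for a win, 1 for a tie, 0 for a loss, with goal differential as the tiebreaker) can be sketched as a small Python function. This is an illustrative sketch only: the function name, the tuple layout, and the three placeholder teams are assumptions made for the example, not part of the dataset.

```python
# Points awarded per match outcome, as described in the text.
POINTS = {"win": 3, "tie": 1, "loss": 0}


def league_table(matches):
    """Rank teams by total points, breaking ties on goal differential.

    matches: iterable of (team, outcome, goals_for, goals_against) tuples,
             one tuple per team per game.
    Returns a list of (team, points, goal_differential), best team first.
    """
    totals = {}
    for team, outcome, goals_for, goals_against in matches:
        points, goal_diff = totals.get(team, (0, 0))
        totals[team] = (points + POINTS[outcome],
                        goal_diff + goals_for - goals_against)
    # Sort on (points, goal differential), highest first.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(team, points, goal_diff) for team, (points, goal_diff) in ranked]


# Hypothetical mini-season: Team A and Team B finish level on points.
matches = [
    ("Team A", "win", 2, 0), ("Team A", "tie", 1, 1),
    ("Team B", "win", 1, 0), ("Team B", "tie", 0, 0),
    ("Team C", "loss", 0, 2), ("Team C", "loss", 0, 1),
]

for position, (team, points, goal_diff) in enumerate(league_table(matches), start=1):
    print(position, team, points, goal_diff)
```

In this made-up example Team A and Team B both have 4 points, and Team A is placed first because its goal differential (+2) is higher than Team B's (+1).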
In my paper I am researching whether expected goals accurately predict the
finishing position of teams in the Premier League. Therefore, my 'Y' variable will be the
expected finishing position of a team; this makes the most sense as finishing positions in the
Premier League are all based on the number of goals you score and the number of goals scored
against you. For this project I am testing whether or not expected goals are an accurate tool to
measure goals. My two 'X' variables will be expected goals for and expected goals against (also
known as xG and xGA). Both of these variables are independent and are not affected by finishing
position; rather, they affect the finishing position. Partida 2021 had the same equation in their
research: of their three different models, one looked at what the result of a game would be if
they subtracted xGA from xG to find the expected result. Their average team model had
the same relationship of independent variables that I have included in my paper, where you
take the xG minus the xGA.
The difference between Partida 2021's work and mine is that they were not looking at games
over the course of a whole season, whereas I will be looking at the whole 2014/15 season and
drawing conclusions from that. Furthermore, the research they did was in Spain, Germany, and
Italy, while I will be looking at England, debatably the most competitive league: in the past
five years
leading up to Partida's 2021 work there were only 2 different league champions in Italy, 2 in
Spain, and 1 in Germany, compared to England, which had 5 different title winners in the same
five years. The Premier League is one of the most competitive leagues, which makes it
entertaining but also high in variance; therefore I am expecting my results to be volatile.
However, I cannot simply combine all the xG and xGA a team has had over the course of a
season to predict their expected finishing position. To make this more accurate I have to
take the xG and subtract the xGA for every game, to get an 'expected result' for each game.
Depending on whether the expected result gives me a win, loss, or tie, I combine the
results from a 38-game season to give a team an expected finishing position. In the Premier
League, teams play 38 games in their season as there are 20 teams: 19 games at home and 19
games away from home.
Initially I believed I would need a regression, but since the xG and xGA of a game are not
consistent independent variables, all I need is to subtract xGA from xG for every game every
team played that season to acquire an expected result. To do this, I needed help from STATA and
had to clean up my data in Excel. Many hours were spent cleaning my data, as when I initially
received the data the games were organized from most recent to least recent. What I believed
made most sense was organizing every team's results alphabetically from A to Z, meaning the
first 38 rows would be Arsenal's game results, the next 38 rows would be Aston Villa's results,
as they are next in line alphabetically, and so on.
I imported all of my data, as shown in Figure 1. This had all of the xG and xGA of every
team’s games in one place. Next, I had to create a new variable which would show
the net xG of each game, that is, xG-xGA. As shown in Figure 2, I created the variable ‘netxg’,
which gave every game the value of xG-xGA. But the
net expected goals of a game was not enough to draw a conclusion, so I needed the net expected
goals of a game to give me an expected result. For example, the first game Arsenal played
ended with 2 xG for Arsenal and 0 xGA, meaning Arsenal would acquire a win from
this game, because 2 xG-0 xGA is positive. As mentioned earlier, a win is worth 3
points, so Arsenal would receive 3 expected points from this game. With this logic I am
following the same steps that leagues follow in their actual league play, except that points
are awarded according to expected results rather than actual results.
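
The net xG step can be sketched in a few lines of code. The sketch below uses Python rather than the STATA commands used for this thesis, and the row layout, the function name add_netxg, and every value other than Arsenal's 2 xG and 0 xGA are assumptions made purely for illustration.

```python
# Hypothetical game rows. Only Arsenal's 2 xG and 0 xGA come from the text;
# the remaining values are made up for illustration.
games = [
    {"team": "Arsenal", "xg": 2, "xga": 0},
    {"team": "Arsenal", "xg": 1, "xga": 1},
    {"team": "Aston Villa", "xg": 0, "xga": 2},
]


def add_netxg(rows):
    """Return copies of the rows with a new field netxg = xg - xga."""
    return [{**row, "netxg": row["xg"] - row["xga"]} for row in rows]


with_net = add_netxg(games)
for row in with_net:
    print(row["team"], row["netxg"])
```

The original rows are left unchanged, so the raw xG and xGA can still be checked against the source data after the new variable is created.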
Therefore I needed to create another variable, which would stand for the expected result.
This would be my dependent variable, as it depends on the xG and xGA of a game.
The new variable was named ‘exresult’, as shown in Figure 2. I coded it so that every net xG of
1 or greater would give ‘exresult’ a value of 3, in other words 3 points. Every net xG
equal to 0 would give a value of 1; if there is a tie the net xG is zero regardless of the
xG and xGA. Lastly, every net xG value less than 0 (-1, -2, -3, and so on) would give
‘exresult’ a value of 0, because a negative net xG means that the team
would have lost the game according to expected goals and therefore gets no points. These
expected results are applied to every game for every team.
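
The coding rule for ‘exresult’ can be written out as a small function. This is a sketch in Python, not the STATA code behind Figure 2. It assumes net xG takes whole-number values, as in the examples in the text, and it raises an error for a positive value below 1 because the rule as stated does not cover that case.

```python
def exresult(netxg):
    """Map a game's net xG (xG - xGA) to expected points.

    Follows the rule described in the text:
      net xG of 1 or greater -> 3 points (expected win)
      net xG equal to 0      -> 1 point  (expected tie)
      net xG less than 0     -> 0 points (expected loss)
    """
    if netxg >= 1:
        return 3
    if netxg == 0:
        return 1
    if netxg < 0:
        return 0
    # Assumption: values strictly between 0 and 1 are not covered by the rule.
    raise ValueError(f"rule does not define a result for net xG = {netxg}")


for value in (2, 0, -1):
    print(value, exresult(value))
```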
Adding up all the expected results for every team in STATA will give me a point total for
every team. Given the point totals, I will be able to rank the teams based on their
expected points over the course of a 38 game season.
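
Totaling and ranking can be sketched as follows. This is an illustrative Python version rather than the STATA workflow used here; the function name expected_table and the sample point values are hypothetical, and breaking ties alphabetically is an assumption, since the text does not state how equal expected points were ordered.

```python
from collections import defaultdict


def expected_table(results):
    """Total expected points per team and rank teams from most to fewest.

    results: iterable of (team, expected_points) pairs, one pair per game.
    Returns a list of (position, team, total_points) tuples.
    """
    totals = defaultdict(int)
    for team, points in results:
        totals[team] += points
    # Sort by points, highest first; ties broken alphabetically (assumption).
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [(pos, team, pts) for pos, (team, pts) in enumerate(ranked, start=1)]


# Hypothetical per-game expected points, two games per team.
sample = [
    ("Arsenal", 3), ("Arsenal", 1),
    ("Aston Villa", 0), ("Aston Villa", 1),
    ("Burnley", 3), ("Burnley", 0),
]
table = expected_table(sample)
for position, team, points in table:
    print(position, team, points)
```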
Initially I wanted to include a subsection in my thesis to see if net expected goals
are more accurate for home or away teams. I believed this would be a great idea, as I thought
there would be a huge disparity between home and away games. Home games in soccer
massively favor the home team; therefore, I would assume home teams have more accurate
expected goals since they are creating better quality chances. On the other hand, away teams
would have a lower expected goals accuracy as they might be creating lower quality chances and
getting more ‘luck’. Hopefully this topic can be researched in future studies. One study,
Partida (2021), found that expected goals and expected goals against favored the home team.
Their reasoning was similar, also noting that teams at home press higher, so they win the ball
higher up the field and their shots come from a closer distance than those of away teams. If I
had continued my research and compared home and away games, I would have overlapped with work
that has already been done.
When calculating my xG-xGA formula there is also an element of standard
error, which will likely explain why teams are in different positions. The standard error is
going to explain some of the variance or irregularities, if there are any, in my formula. The
error will explain why some teams are in very different spots compared to their actual
finishing position. What makes this research so interesting is that some teams might be in very
different rankings for very different reasons, since each team is so unique. These will be
discussed a little later in my discussion of results and the potential shortcomings of this study.
DISCUSSION OF RESULTS
My formula produced some interesting outcomes. As mentioned
earlier, my formula of xG-xGA was run on every team’s games in the Premier League that season.
For every game I got an expected result of either 0 points, 1 point, or 3 points. I then
created 20 new columns with every team’s expected points, as shown in Figure 3. The teams’
expected points were organized in alphabetical order, from left to right. Therefore every
column has the expected points a team collected in each game of the 2014-15 season. As shown in
the first row of the first column, “arsenal_pts”, Arsenal played against Crystal Palace, where
the xG for Arsenal was 2, while the xG for Crystal Palace was 1. In other words, the xGA for
Arsenal was 1. Therefore 2-1=1, meaning that according to expected goals Arsenal won the game
and acquired 3 expected points. This process was done for every team’s games in this season.
After this was done, the expected points for each team over the season were totaled.
For instance, Arsenal’s expected points for the season totaled 68, Aston Villa’s totaled
26, and so on. Once all the teams’ expected points were totaled, the teams were ranked from
most points to fewest, as shown in Table 1. Almost all of the work
I have done this semester went into producing Table 1. But the best part of it all was
putting it all together. I put my expected finishing position table side by side with
the actual finishing position table of the 2014-2015 Premier League season and compared them;
this is shown in Table 2.
Table 2 is broken up into 5 columns. The first column has the actual finishing
position of each team, from most points to fewest, in the 2014-15 season. The second column has
the number of points each team got that season. The third column is the result of my
formula: the expected finishing position of each team. As in the first column, every team is
ranked from highest to lowest, here based on the expected points they totaled in the 2014-15
season. The second to last column has the expected points each team had that season, which led
them to finish higher or lower in the table. The last column is the difference between the
actual finishing position and the expected finishing position, which shows the reader how many
spots apart each team finished in the two tables.
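
The last column of Table 2 can be illustrated with a short sketch. The positions for Swansea, Stoke City and QPR below are the ones reported in the discussion of results; the dictionary layout, the function name position_differences, and the sign convention (expected position minus actual position) are assumptions for illustration and may not match how the column was built in Table 2.

```python
def position_differences(actual, expected):
    """Return expected position minus actual position for every team.

    A positive value means the team was expected to finish lower in the
    table than it actually did; a negative value means the opposite.
    """
    return {team: expected[team] - actual[team] for team in actual}


# Positions as reported in the text for three teams.
actual_pos = {"Swansea": 8, "Stoke City": 9, "QPR": 20}
expected_pos = {"Swansea": 18, "Stoke City": 8, "QPR": 13}

differences = position_differences(actual_pos, expected_pos)
for team, diff in differences.items():
    print(team, diff)
```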
Looking at the results, the first part of the table I would like to analyze is the points
versus the expected points. At first glance, the expected points are much lower than
the actual points in a season. The winner of the expected finishing table, Manchester City,
had 69 points, compared to the actual winner of the Premier League, Chelsea, who had 87
points to win the league that year. There are several reasons why this happened. The first
is that my formula produced a lot more draws than the regular season did. This means
points were split between teams instead of one team taking all 3 points. For example Chelsea,
who had 87 points in their actual season, dropped to 66 points, a 21 point drop. In their actual
season they had a total of 9 draws, compared to 16 draws in their expected season. This means
that in 7 games they went from potentially having 3 points to having 1, which lowered their
points and increased the points of the teams they tied with in the expected table. A team in a
similar situation was Manchester United, who in their actual season had 70 points. But in
the expected table their points almost halved to 37, according to the formula. Manchester
United tied 10 times in their actual season, while in the expected season they tied 17 games,
7 more. More will be explained on Manchester United and their finishing
position later in this paper, but this increase in draws makes a lot of sense. Manchester United
are a team that tend to score very late in games, as well as to score from chances that usually
do not go in; they are a team that benefited from random events.
The rise in the number of draws can also be seen in the difference in points between the
first place team and the last place team in both tables. The difference in points between
Chelsea and QPR in the actual table is much higher than the difference between Manchester City
and Newcastle in the expected table. In other words, the expected table shows teams being
closer in competition than the actual table reflects. This makes a lot of sense, as in real
soccer games a small error can cost a team a game or a season. Expected goals do not
account for defensive errors or mistakes. For instance, suppose a player shoots the
ball from 40 yards out and it slips through the goalkeeper’s hands and goes into the net,
although the goalkeeper should be making that save. The scoreboard will say one to zero, but the
expected goals scoreboard might count that shot as 0.001 expected goals. The game could end up
being one to one in expected goals while the actual score is one to zero. The scoreline might
not fairly reflect how the game played out, as both teams had equal scoring opportunities, but
the game would, in theory, be decided by the mistake the goalkeeper made, something which
expected goals do not account for. The team that benefited from the mistake would take the 3
points in the actual game despite equal scoring opportunities. Because they do not account for
errors in a game, expected goals predict teams to be more equal in skill and goal scoring
opportunities.
This means the variance in points can be attributed to the standard
error, but each team’s standard error is unique. Take Newcastle, for example: they
finished 15th in the real table with 39 points. In the expected table they finished dead last
with 17 points, a drop of 22 points, into relegation. How could this happen to a team?
Newcastle that season had a prolific striker, Papiss Demba Cisse, who scored 12 goals and
assisted 1 (Premier League) in only 22 games. His expected goals for that season were 8, which
means he outperformed his expected goals by a whole 4 goals. Although 4 goals might sound
like a small figure, it is not, especially for a striker who plays for a team that sits in the
bottom third of the table. Furthermore, for teams that sit so low in the table even one goal
can be the margin that wins a game and prevents them from getting relegated
that season. In addition, according to expected goals Newcastle were predicted to concede 11
more goals than they did. For these reasons they dropped to last in the expected table, instead
of finishing 15th as in the actual table.
The next part of the table I would like to look at is the actual standings of the
2014-15 season versus the expected standings from my model. Comparing them one by one,
only 4 out of 20 teams were in the same position in both the actual table and the
expected table, an accuracy of 20%. Looking at this percentage alone, anyone
would believe expected goals do not hold much accuracy. But it is not as simple as
looking at the standings with no context. There is also the fact that if one team is not in the
same place, at least one other team is automatically displaced as well.
Looking at it from a different perspective, 13 out of 20 teams were within 3 spots of their
actual finishing position, meaning that if the league went by the expected table, most teams
would not have a significantly different season. Of these teams, the only ones which would have
a significantly different outcome are Chelsea and Manchester City: Manchester City would be
crowned champions in the expected table, while Chelsea would drop from being champions to
finishing third. Even then, Manchester City moved up only one position and Chelsea finished
third, 3 points behind the champion of the expected table. 65% of teams being within three
spots of their actual finishing position is a promising result. Moreover, when looking at the
teams which had the biggest change in position, one could argue that the expected table has
accurately described how teams should have finished. Observing the teams that had the largest
change in position, it is no surprise that they are in the bottom half of the table, in other
words, the teams that are not as good.
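
The two accuracy figures above, 4 out of 20 (20%) and 13 out of 20 (65%), are both shares of teams whose position difference falls within a chosen number of spots. A minimal Python sketch of that calculation follows. The list of absolute differences is a placeholder chosen only so that the counts match the 4 and 13 reported in the text; it is not the set of values in Table 2, and the function name share_within is hypothetical.

```python
def share_within(differences, k):
    """Share of teams whose absolute position difference is at most k spots."""
    hits = sum(1 for diff in differences if abs(diff) <= k)
    return hits / len(differences)


# Placeholder absolute differences for 20 teams (NOT the values in Table 2):
# 4 exact matches, 9 more teams within 3 spots, 7 teams further away.
abs_diffs = [0, 0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 7, 10]

exact_share = share_within(abs_diffs, 0)
within_three_share = share_within(abs_diffs, 3)
print(f"same position: {exact_share:.0%}")
print(f"within 3 spots: {within_three_share:.0%}")
```

Reporting the share for several values of k side by side makes it clear how much the headline accuracy depends on the tolerance chosen.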
Take Swansea as an example: in the actual table they finished 8th, while in the
expected table they were predicted to finish 10 spots lower at 18th, with 33 points fewer than
in the actual table. Out of all of the teams, this was the one with the biggest change in
position. This change in position would mean they would be relegated and playing in the second
division of England. In the expected table they were predicted to have scored five goals fewer
and to have conceded seven more over the course of the season. Although over the course of 38
games five goals for and seven against do not sound like many, many soccer games are won
by the slightest of margins. Out of the 16 games Swansea won this season, 11 were
won by a one goal difference: 1-0, 2-1, 3-2, and so on. Swansea took winning by the
slightest of margins literally. In 9 of the 11 games they won by a one goal margin, Swansea
either lost or tied according to the expected model. This is not even considering
the games they tied or won by more than one goal. This shows that Swansea were very fortunate
to win many of these games and to finish in the position that they did in the actual table. The
expected table placed them a lot lower because, based on their scoring chances and the scoring
chances of opposing teams, they should have done much worse, but in the 2014-15 season one
could say they were on the receiving end of luck.
This luck can be explained through statistics rather than anecdotes.
This season Swansea had two important factors which allowed them to
finish higher than expected. The first was the goalkeeper: Swansea had Lukasz Fabianski
playing in net, who was voted the second best keeper in the Premier League that season (Sky).
This speaks volumes about his performances that season. Usually the goalkeeper of the winning
team, in this case Chelsea, is named best goalkeeper of the season, and second best would
usually go to a team which finished within the top 5 positions. Therefore, with Swansea
finishing 8th that year, it goes to show the number of goals he prevented opposing teams from
scoring. This is one of the main reasons Swansea’s xGA is so much higher than the actual goals
they conceded: they had a goalkeeper who was overperforming.
Another reason why Swansea finished so high that season was the overperformance of
their goalscorers: Wilfried Bony, Ki Sung-yueng, and Gylfi Sigurdsson. Wilfried Bony scored 9
goals that year, overperforming his expected goals by 2 (fbref). Ki Sung-yueng scored 8 goals
that year and overperformed his expected goal tally by 4, which is massively impressive as it
is double the expected amount. Lastly, Gylfi Sigurdsson scored 7 goals that year,
overperforming his expected goal tally by 2 goals. Most teams do not have their top goal scorer
outperforming their xG, while Swansea had their top three goal scorers overperforming. Although
these do not seem like very high numbers, as mentioned earlier, soccer is a low scoring game
where the slightest of margins can go a long way. Swansea had a perfect sequence of events
where their goalkeeper and goalscorers were playing in the best form of their careers all at
the same time.
For teams such as Swansea, who benefit from random events in one season,
expected goals and expected goals against will in the long run be accurately reflected in their
actual position, as mentioned by Hewitt (2023). This claim is backed up by the fact that two
seasons later Swansea fell to 15th, and the season after that they were relegated, finishing in
18th place (EPL). Although my model was very far off in predicting Swansea’s finishing
position compared with their actual position, this drop in position in the following seasons
suggests that in the 2014-15 season Swansea were an outlier, as that season they had a near
perfect alignment of players overperforming. Although this is not directly part of my research,
it would be interesting for future research to look at teams which outperform their expected
points in one season and see whether their actual finishing position worsens over the following
years, and why.
One team which did not benefit from random events was Queens Park
Rangers, known as QPR. QPR finished in last place in the actual table, which meant they were
relegated to the second division. But according to the expected table, QPR should have
finished seven places higher in 13th place, based on the goal scoring chances created by them
and their opponents. Over the course of the season QPR were expected to have scored 4 goals
more and conceded 8 goals fewer than they did in their actual season. Combined with the expected
results of the other teams, this means they would have been expected to finish in 13th place and
not be relegated. QPR were massively unfortunate in the 2014-15 season: out of the 24 games they
lost, in 8 of them they had either a better or the same xG as the opposing team. This is what
made them finish so much higher in the expected table; in these 8 of the 24 games that they
lost, they created goal scoring chances at least as good as those of the opposing team.
QPR were at least the equal of their opponents in these games but ended up being unfortunate.
Similar to Swansea, QPR’s fortune can be explained. QPR that
season had a prolific striker, Charlie Austin, who tallied a total of 17 goals (EPL).
These are impressive scoring statistics, especially for a team that finished last. But according
to expected goals, Charlie Austin was expected to score 17 goals, so despite his eye for goal
that season he did not outperform his xG (transfermarkt). In fact, out of QPR’s top five goal
scorers that season, none outperformed their expected goals by more than half a goal. This
meant that QPR’s goal scorers gave the team no overperformance, and as a team QPR were expected
to score more goals than they actually did. They were the opposite of Swansea’s
strikers: QPR’s goal scorers had a perfect alignment of negative outcomes to random events.
Furthermore, QPR conceded many of their goals that year, 17 to be exact, in the final 15 minutes
of games (transfermarkt). This can reflect multiple aspects of a team. One could be poor fitness
or fatigue of players. When players are tired they are more susceptible to defensive mistakes,
which can lead to more goals; fatigue and mistakes are not taken into account in xGA. Another
potential reason why QPR conceded late goals is poor coaching. QPR had a coaching change in the
middle of the season; usually when teams are performing poorly or underperforming their
standards there is a change in the head coaching position.
The combination of having poor finishers, fatigued players, and bad coaching meant that
QPR’s expected finishing position was not reflected in their actual finishing position. QPR and
Swansea are two examples of teams which are outliers in the expected table model, but most of
the teams are not outliers and their actual finishing position is similar to their expected finishing
position.
But not all teams were predicted incorrectly by the expected table. Many of them, as
mentioned earlier, fell into a similar or the same position as their actual finishing position.
Stoke City is a great example: their actual finishing position was 9th while their expected
finishing position was 8th. The only reason they finished higher was that Manchester United
finished below them in the expected table. The expected model predicted Stoke City to have one
more goal for and one more against, one of the closest estimates in this model for xG and xGA.
Furthermore, Stoke City’s top three goal scorers were all within a goal of their xG that season.
This means they neither overperformed nor underperformed; they performed to the standard that
xG expected of them. Moreover, of the 15 games Stoke City won, 12 were games they
were expected to win, an 80% accuracy rate on the games Stoke City won. The
accuracy for Stoke City’s games did not end there: of the 15 games they lost, all 15 were losses
according to xG and xGA. In other words, the expected model correctly calculated, from the
goal scoring opportunities both teams had, that Stoke City should have lost all of the games
they did.
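
The match-level accuracy quoted for Stoke City (12 of 15 wins and 15 of 15 losses) can be computed for any team with a short routine. The Python sketch below is illustrative only: the function name agreement_by_outcome and the data layout are hypothetical, the three wins that were not expected wins are entered as expected draws as an assumption, and Stoke City's drawn games are omitted because the text does not report how they were classified.

```python
def agreement_by_outcome(matches):
    """Share of games where expected points equal actual points, per outcome.

    matches: list of (actual_points, expected_points) pairs, each 0, 1 or 3.
    Returns a dict mapping each actual outcome to its agreement rate.
    """
    counts = {}
    for actual, expected in matches:
        total, agree = counts.get(actual, (0, 0))
        counts[actual] = (total + 1, agree + (1 if actual == expected else 0))
    return {outcome: agree / total for outcome, (total, agree) in counts.items()}


# Wins: 12 expected wins plus 3 entered as expected draws (assumption).
# Losses: all 15 were also expected losses, as stated in the text.
stoke_like = [(3, 3)] * 12 + [(3, 1)] * 3 + [(0, 0)] * 15

rates = agreement_by_outcome(stoke_like)
for outcome, rate in rates.items():
    print(outcome, f"{rate:.0%}")
```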
Stoke City are not the only team who were accurately expected to finish in a similar
position to their actual finishing position: Crystal Palace, Arsenal, Everton, Sunderland, and
nine other teams were in the same boat as Stoke City. Most of the games they won were won
because they generated better goal scoring opportunities than the opposing team, and most
of the games these teams lost were lost because they created fewer goal scoring opportunities
than the opposing team, which was accurately represented by xG and xGA.
LIMITATIONS AND FUTURE WORK
For many of the teams in the Premier League in the 2014-15 season, xG and xGA were
accurately reflected in their actual results. But in terms of predicting the finishing position
of a team, this model had a 65% accuracy rate of teams finishing within 2 positions of their
actual finishing position. How one judges this number is subjective. On the one hand, of the
other statistics in soccer used almost as commonly as xG and xGA, such as possession, shots on
target, fouls, or pass completion percentage, none would predict the finishing
position of the whole league better than xG and xGA. On the other hand, an argument can be made
that predicting 65% of the finishing positions is not a high number and that the other 7 teams
were very inaccurately predicted. What do those 7 teams say about expected goals?
For starters, there is a lot more that goes into a soccer game or season than just the goal
scoring opportunities of the home and away team. xG focuses solely on shots, without
accounting for build-up play, momentum, possession, or defensive contributions, and all of these
aspects can largely affect the outcome of a match. For example, if a team has really good
build-up play and the player decides to pass the ball instead of shooting it, even though he
might be in a great goal scoring position, the xG value for that scenario will be 0.
The formula is a bit rudimentary, but with this comes the fact that it leaves no room for
bias and less unwanted variation. Furthermore, there is little more at my disposal that I could
have added to strengthen this formula. Using Partida’s (2021) formula of xG-xGA to predict a
winner from each game leaves out information about the game, some of which reflects
shortcomings of the formula and some shortcomings of expected goals in general. For
instance, as mentioned by Pardo (2020), expected goals favor the home team, meaning they are a
lot more accurate for home teams than away teams. My model treated every game as if it
was being played on a neutral site. Future research should include a model which adjusts
expected goals depending on whether a team is playing at home or away.
Soccer is low scoring and one of the most volatile sports, as one mistake can cause a lot
of change in a game. Expected goals also do not account for much of this variability: a shot
with low xG can still result in a goal because of a goalkeeper error, a deflection or the
exceptional skill of a shooter. This goes hand in hand with xG not accounting for individual
errors.
The last limitation of xG is that every shot in a given situation is valued the same,
regardless of the skill of the player taking it. This is mostly due to the relatively small
amount of data xG analysts have on players’ shooting ability. A striker might shoot once or
twice a game, and sometimes he may not shoot at all; over the course of a season that is not as
many shots as a data collector would want in order to draw
conclusions on a player’s shooting ability. Furthermore, other players on the field such as
midfielders or defenders shoot even less, so their observations are even more limited. For those
reasons xG metrics use an “average” shooter skill to calculate the expected goals of every shot.
Future research I would wish to see or do could include assigning a specific xG value
to every player. This may not be possible for every player, as data collection would be
limited for players who do not shoot a lot, but for strikers and players who shoot consistently
I do not see why this would be an issue to calculate. Assigning player-specific xG values to
shots would likely increase the accuracy of xG in games, which would then predict finishing
positions more accurately.
Other future work I wish to do one day is observing teams who have over or under
performed in a season and seeing how they do over the course of the next five seasons or so.
Expected goals tend to be more accurate over time, as xG analysts are able to collect more data
and make more accurate measurements in a game. Moreover, over and under
performers tend to be outliers in the data, so in the long run the teams who should be winning
games, in theory, will win the games they deserve.
CONCLUSION
In the world of soccer, expected goals have become an essential analytical tool for
analysts, pundits, and coaches. They have allowed us to view the game more critically than we
did before. Work had been done previously on various fields related to my topic, but my aim was
to look at the first year xG and xGA were used and test their accuracy in predicting finishing
position in one of the most uncertain and competitive leagues in the world, in the 2014-15
Premier League season. To my knowledge, the accuracy of xG in predicting finishing position had
not been studied previously, which made me even more excited to conduct this research. With
previous literature I was able to use a formula which was as unbiased and straightforward as
possible. I wanted to analyze the accuracy of expected goals with no conflicting variables, in
order to get raw evidence on its accuracy. The results were fairly promising: 13 out of the 20
teams finished within 2 positions of their actual finishing position. Teams such as Stoke City,
Arsenal, Everton and more were accurately predicted. For the other 7 teams there was a lot of
variability, which showed the limitations of xG and what xG was not accounting for. Teams such
as Swansea and QPR had either massive over performers or under performers in their teams, which
xG could not accurately capture.
This semester I have had the privilege to work on a topic that I am very passionate about
while applying analytics to the sport of soccer. With the tremendous help of my teacher I have
been guided on the right path in order to refine my work. But it has left me eager to learn and
research more. As I mentioned in my future work, I would like to see what xG looks like
if it were adjusted for the shooter’s skill, and whether it matters if a team is home or away.
As the metrics of expected goals grow, I feel privileged to have added research to the
analytics world of soccer and hope that future research continues.
Image 1
Figure 1
Figure 2
Figure 3
Table 1
Table 2
BIBLIOGRAPHY
● Bransen, L., & Davis, J. (2021, May). Women’s football analyzed: Interpretable expected
goals models for women. Proceedings of the AI for Sports Analytics (AISA) Workshop at
IJCAI (Vol. 2021).
● Eggels, H., van Elk, R., & Pechenizkiy, M. (2016). Expected goals in soccer: Explaining
match results using predictive analytics. In The machine learning and data mining for
sports analytics workshop (Vol. 16).
● Fairchild, A., Pelechrinis, K., & Kokkodis, M. (2018). Spatial analysis of shots in MLS:
A model for expected goals and fractal dimensionality. 165-174.
● Goddard, J. (2005). Regression models for forecasting goals and match results in
association football. International Journal of Forecasting, 21(2), 331-340.
● Hewitt, J. H., & Karakuş, O. (2023). A machine learning approach for player and position
adjusted expected goals in football (soccer). Franklin Open, 4, 100034.
● Madrero Pardo, P. (2020). Creating a model for expected goals in football using
qualitative player information (Master's thesis, Universitat Politècnica de Catalunya).
● Partida, A., Martinez, A., Durrer, C., Gutierrez, O., & Posta, F. (2021). Modeling of
Football Match Outcomes with Expected Goals Statistic. Journal of Sports Research,
10(1).
● Yorke, J. (2022, November 8). Premier League 2014-15 Stat Round-Up: Goalscorers,
disappearing shots and more. StatsBomb | Data Champions.
● Rathke, A. (2017). An examination of expected goals and shot efficiency in soccer.
Journal of Human Sport and Exercise, 12(2), 514-529.
● Premier League expected goal data 2014-2020. (2021, June 15). Kaggle.
● "Premier League – Handbook Season 2014/15" (PDF). Premier League. Archived from
the original (PDF) on 20 August 2014. Retrieved 2 February 2015.
● "Barclays Premier League Statistics – 2014–15". ESPN FC. Entertainment and Sports
Programming Network (ESPN).
● Brechot, M., & Flepp, R. (2018). Dealing with randomness in match outcomes: how to
rethink performance evaluation and decision-making in European club football.
● "Statistical Leaders – Clean Sheets". NBC Sports. Archived from the original on 15 June
2013.
● Tiippana, T. (2020). How accurately does the expected goals model reflect goalscoring
and success in football? (Bachelor's thesis).
● ESPN scoring STATS, 2014-15 Premier League Season
● Barclays Premier League football scores & results, 2014-15.
● Mead, J., O’Hare, A., & McMenemy, P. (2023). Expected goals in football: Improving
model performance and demonstrating value. PLoS ONE, 18(4), e0282295.
● Stats, Goals, Records, Assists, Cups and more | FBref.com. (n.d.). FBref.com.
● EPL xG Table and Scorers for the 2014/2015 season | Understat.com. (n.d.). Understat.
● Premier League football news, fixtures, scores & results.