The Accuracy of Expected Goals in The Premier League


Skidmore College

Creative Matter

Economics Student Theses and Capstone Projects

Economics

Spring 5-5-2024

The accuracy of expected goals in the Premier League


Max Mian
[email protected]

Follow this and additional works at: https://creativematter.skidmore.edu/econ_studt_schol

Part of the Economics Commons

Recommended Citation
Mian, Max, "The accuracy of expected goals in the Premier League" (2024). Economics Student Theses
and Capstone Projects. 164.
https://creativematter.skidmore.edu/econ_studt_schol/164

This Thesis is brought to you for free and open access by the Economics at Creative Matter. It has been accepted
for inclusion in Economics Student Theses and Capstone Projects by an authorized administrator of Creative
Matter. For more information, please contact [email protected].
Max Mian

Professor Das

Thesis

29/3/2024

ROUGH DRAFT: THE ACCURACY OF EXPECTED GOALS IN THE

PREMIER LEAGUE

For my topic I will be analyzing the accuracy of expected goals on the 2014-2015

Premier League season. Expected goals is a statistical metric used in soccer to quantify the

likelihood of a particular scoring opportunity resulting in a goal. It assesses the quality of a

goal-scoring opportunity based on various factors: distance of shot, angle, location, or players in

the way. My main research question is: How accurately do expected goals predict a team's

finishing position in the 2014-15 Premier League season? Teams, coaches, and analysts have become

very dependent on the expected goal tool.

The goal of this metric is to provide a deeper evaluation of a player and team’s

performance beyond just the number of goals scored. By assigning each shot a probability, a team

can better understand whether it is creating high-quality chances, getting unlucky, or

benefiting from luck. This analytical tool has recently become very popular, as the final score of

a game does not always reflect the chances a team had.

In my thesis I am going to explore the limitations of xG to see how accurately it reflects

the actual score and whether it is a dependable analytics tool for the sport. Overall, if we

were to take expected goals over a whole season for all teams and games, how accurately would

they describe the games and finishing positions of teams in the Premier League in the 2014-15

season? These are the main questions I aim to answer in my thesis.

Researching this topic is relevant because the Premier League is the most watched soccer

league in the world and generates hundreds of millions in revenue. Finishing in a higher

position at the end of the season earns a team more revenue, so an analytics tool that

gives a team an advantage over its rivals can be very beneficial. Furthermore, analytics

in soccer is a rapidly growing area of research. Being able to add to the literature is

exciting for people, like myself, who are passionate about the sport and the science behind it.

To show how expected goals look visually, I have provided Image 1, which depicts one game

in terms of expected goals. The image shows a game which I will

be using in my research, Stoke City against Liverpool; the score of the game was 6 to 1.

According to the result of the game one would think that Stoke City was a much better team and

created many more goal scoring opportunities than Liverpool. But if we look at the result based

on expected goals the score would be Stoke City 2 Liverpool 1. This is a much smaller difference

in score than the actual score; what this means is, according to expected goals, Stoke City created

slightly better chances in this game over Liverpool, and if we just looked at the goal scoring

chances they created the score should have been 2 to 1.

For the time being, games and seasons are not determined by the score

that expected goals give an individual game. But in the future this might change. Pundits,

coaches, and analysts depend heavily on the expected goals a team created during a game. In

other words, if a team tied a game 1 to 1 but according to expected goals should have won

the game 5 to 1, coaches and teams will continue to play the same system and formation,

trusting that in the long run the chances they create (or their high expected goals) in future games

will translate to actual goals. Coaches will depend on certain playstyles and formations based on

their xG output.

The Premier League is one of the most profitable sports leagues in the world, where the

smallest of details make such a huge difference for a team’s finishing position. Having any

further insight on an opposing team or your own team is highly valued. Last year the Premier

League released a statistic showing that scoring one Premier League goal cost, on

average, 3 million US dollars (EPL). This included summer signings, renovations of facilities,

and sales of players. This shows how costly it is to operate a team in the Premier League.

Expected goals have a unique and important value that adds to innovation in soccer.

Expected goals offer a more objective evaluation of a team’s performance. Instead of just looking

at goals scored, which are more heavily influenced by luck, expected goals represent the quality of

chances created, which helps assess how a team played. Through tactical analysis, coaches and

analysts can determine how effective different tactics and strategies were based on

their chance creation, in other words, the expected goals for and against in a game. They do so by

understanding where and when scoring opportunities are being created and the likelihood of

each chance being converted into a goal.

Expected goals also help determine the individual performance of players. Expected goals can

be used to determine if certain players are overperforming or underperforming. For example, a

striker might have 10 goals from five expected goals, which means they are overperforming

their xG by five goals, which is very impressive. Compare this to a striker on the same team who has

11 goals from 16 xG; this other striker is underperforming by five goals. Expected goals would help

a team see on a quantifiable level who is playing better and which striker would benefit the

team's attacking play.
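
The over- and underperformance comparison above comes down to goals minus xG. A minimal Python sketch of that arithmetic follows; the striker labels and the function name `xg_difference` are hypothetical, and only the goal and xG figures come from the example in the text.

```python
# Hypothetical labels for the two strikers in the example above;
# only the goal and xG figures are taken from the text.
strikers = {
    "Striker A": {"goals": 10, "xg": 5.0},
    "Striker B": {"goals": 11, "xg": 16.0},
}


def xg_difference(goals, xg):
    """Goals minus expected goals: positive means overperforming xG."""
    return goals - xg


differences = {
    name: xg_difference(stats["goals"], stats["xg"])
    for name, stats in strikers.items()
}

for name, diff in differences.items():
    if diff > 0:
        label = "overperforming"
    elif diff < 0:
        label = "underperforming"
    else:
        label = "on par"
    print(f"{name}: {diff:+.1f} goals versus xG ({label})")
```

Although the second striker has more goals, the sign of the difference shows that the first striker is the one converting chances at a better rate than expected.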

Expected goal metrics can also be used in the world of soccer for scouting and

recruitment. Using the same steps as mentioned previously, teams can identify talented players

by observing whether they are overperforming their expected goals, that is, by looking at their ability to

create and convert scoring chances. Especially if a team plays a similar system, expected goals

provide a more comprehensive picture of the potential value a player could bring to the team if

they were signed. Expected goals are now being used as one of the leading tools to identify

talented players.

But the value of expected goals stretches further than coaches, staff, and

bookmakers. xG also increases fans’ engagement with and understanding of a game, team, season, or player.

As xG provides deeper understanding than just the final scoreline, fans can use it as another

layer of discussion and analysis of the game. For many this increases the overall understanding

of a game, and for some more passionate fans it can increase the enjoyment of the game. The

rise of expected goals in the world of soccer has increased understanding of the game so

much that teams and coaches have come to depend heavily on it to assess the outcome of a game

or season. In this thesis I will analyze how important expected goals are. Using the 2014-15

Premier League season, I will determine how effective xG is at predicting teams’

finishing positions.

Literature Review

Because of the relevance and rise of expected goals, I have been fortunate to read many

recently written papers related to this topic. Because the topic is so new,

many researchers are able to test different models, since there is not a perfect model just yet and

models are constantly being updated to be more accurate. For my literature review I will be explaining

papers in the order in which I read them. This makes the most sense because the

first couple of papers explain the basic studies behind expected goals, while the ones I read

later develop more intricate models of expected goals.

The first paper I want to take a look at was written by Alex Rathke and published in the Journal of

Human Sport and Exercise at the University of Alicante. This was one of the first papers to study

expected goals, a metric that has been around for roughly a decade. The goal of this paper

was to analyze what factors were associated with determining expected goals. The factors analyzed

were distance of shot and angle of shot. It is important to consider that expected goals

have evolved a lot since then, which means more factors are now under consideration, but for this

paper those were the two main factors. The paper found that lower- and higher-ranked teams over- or

underachieved their expected goals compared to mid-table teams. Angle and shot location, as

expected, had a major effect on calculating xG.

Rathke believed that showing players their xG could help them in attacking phases by

showing them from where and how to strike certain shots, or help defenders see how they should be

positioned to prevent these shots. I found it interesting how intricate and detailed

coaching can become thanks to expected goals. This paper is interesting because it breaks down the

accuracy of xG per team and player across different European professional leagues, and it is also

one of the first papers to research xG in such a detailed manner (Rathke 2017).

The next paper I looked at covered a subject I had not read much on:

expected goals in women's soccer. In theory, the results should be similar, but I found it

interesting that after reading through so much literature this was the first paper I found on the women's game.

The women’s game has made huge advances in recent years. When the first official

FIFA Women’s World Cup took place in 1991, it featured matches that lasted only 80 minutes

and the final was not even shown on TV. Just 28 years later, more than 1 billion viewers watched

the 2019 World Cup final between the Netherlands and the USA. Concurrently, financial investment

in the women’s club realm has increased the number of players who are able to play

professionally.

This research takes a first in-depth analytical look, using machine learning, at the technical

data that is now being collected from professional women’s matches. Specifically, the authors focus on

shots, as the fundamental objective of football is to score more goals than your opponent, and as

Johan Cruijff famously said: “you can’t score if you don’t shoot.” A natural way to analyze

shots and shot behavior is through the lens of the well-known expected goals (xG) metric, which

gives the probability that a shot will yield a goal. Most of this paper consists of graphs and tables

which show the different xG across leagues; during my next presentation I will make sure to

show them. In short, this paper performed an extensive analysis of women’s

football shots.

They identified interesting observations such as the fact that women tend to shoot from

different locations than men, have a higher shot conversion rate and their goals are differently

distributed across the season. They trained six different xG models on different data sets with

different machine learning algorithms and found that, in general, models from one gender are

applicable to shots from the other. However, when inspecting the models and shots, some

interesting differences arose in terms of what features are important and how the models value

certain types of shots. This paper is unique because I have found almost nothing else that has been

written on expected goals in the women's game (Bransen 2021).

One paper published in the journal of sports research fell perfectly in line as a continuation

of the previous paper and aims to answer some of my thesis questions.

This paper, Partida 2021, is one of the most important papers for continuing my research.

The goal of the paper was to examine the predictive capabilities of expected goals across the top

leagues in Europe. The authors did this by collecting the expected goals of 310 games from the

German league, Spanish league, and Italian league. They were able to collect data from games

that are publicly available, which is very helpful for me as my research will benefit from public

data. They collected their expected goals from understat.com and from there created statistical

models. This was a pleasant coincidence since the data I am looking at is from the same website,

which means this paper can be very useful to my work. Something a little different that they did

was take data from a betting website, as they were also comparing xG to the betting probabilities

of game outcomes.

They then created three different models based on the expected goals and then compared

the models. Two well-established probability models used binomial deviance, squared error and

probability in betting. One of the models took xG and subtracted xGA to determine how

successful teams are. This model is the most unbiased, as it only explains the difference in goal

scoring opportunities between the home team and the opposing team, and from that concludes which team

had the better goal scoring opportunities and hence should be the winner. In my research I will be using

the same equation. The best model suggests that expected goals are most accurate for teams with

home advantage. Two of the other models were profitable under very specific betting

conditions. One limitation they found is that their models and expected goals did not include

factors such as a team's defensive prowess (Partida 2021).

They concluded that expected goals can provide meaningful insight, but that was about it.

For the betting side of their research they concluded that the predictive capabilities of their models

were strong enough to be profitable under certain conditions. A further analysis of profitability

in football betting could involve investigating the accuracy necessary to be profitable

when betting on all of the games, as well as the use of parlays (betting multiple games at once) as a

betting strategy. To have the most accurate numbers possible, the researchers looked at the

confidence interval, bias, variance, and irreducible error.

This research focused on four different aspects of the game (and expected goals), which

made the paper a bit confusing, since it covers so many areas.

For each of its chapters a different algorithm was created, which

meant the models produced different outcomes. But although the outcomes differed, the chapters all

agreed that one limitation was the quality of each individual goal scoring opportunity, meaning

the variance between goal scoring opportunities was really high. They suggested that future work

should try to find a way to reduce this variance. Since expected goals assume the ‘average’

shooter, we cannot determine the quality of the shooter. Fortunately, there is work which has been

done on this exact limitation, which I will talk about later in this paper (Partida 2021).

The next paper I read looked at the difference in expected goals across four

levels of German football: Bundesliga, Regionalliga, U19, and U17. The authors used the same model

that Brentford uses, a Premier League team which was one of the first to use xG in their analysis

of games. They collected data from two seasons from all four tiers of football.

For their model they used a regression model of the kind widely used when looking at

chance creation and scoring.

The researchers found that players at higher levels tend to take riskier shots and aim for the

top corners, even though the chance of missing is higher. They also found that the contribution of distance from goal

varied significantly between the four leagues. Interestingly, they found that as

goalkeepers get older they have a harder time saving shots further away from their body, such as

bottom corner shots. This is interesting because goalkeeper quality has not been mentioned

before in xG, and going forward it might be a topic that is spoken about more. What I also found

interesting was that the xG was mostly the same across leagues. This makes sense, as the level

across all four tiers is even relative to the level they play at. But the researchers also found that in

the highest tier, the Bundesliga, players took fewer shots from outside the box. This may support

a hypothesis I mentioned before: because of all this data, players might be told, instead of

going for the goal of the year and shooting from 40 yards out, to shoot from closer, as the

probability of a goal is higher. But at the same time this takes away from the essence of the game.

The next couple of papers look at the specifics of how xG is actually calculated,

in contrast to the past two papers, which build models to test whether expected

goals, or the models surrounding them, are even accurate.

The next paper comes from the analytics department at PSV

Eindhoven, a Dutch first-division team that is regularly considered one of the best clubs in

the Netherlands. These next couple of papers will show what more specialized xG looks like. The

authors begin by explaining how analytics in soccer was directly influenced by

baseball and the Moneyball era of the Oakland A’s, a team which looked at players through a

different lens and tried to see when and where a player is most efficient, instead of judging him

as simply good or bad.

This paper creates algorithms to see whether the relationship between high value chances and

expected goals is accurate. The study suggests that the skill of the player should be considered in

expected goals, because the variance of shots is larger when the skill of the player is lower and

vice versa (Eggels 2016).

The following paper studies something along the same lines but does so in the

top American professional soccer league, MLS. Published in the Journal of Sports Analytics, the

paper looks at the accuracy of expected goals in MLS,

and nothing jumps out as uncommon or weird. But what

makes this study interesting is that the authors add to the current xG model by including the rotation of the

ball, which may seem like a very minor detail but does have an effect on a shot and the quality of a

shot. In this study they aimed to investigate two issues. The first is building a model which

estimates the probability of a shot leading to a goal (this is where they add ball rotation). The

second is delving into possible relationships between a team’s efficiency and their actual

expected goals.

In the study they quantified the probability of a chance being a goal (with their own xG

model). They would group shots by their predicted probability of being a goal and then compare

that probability with the fraction of shots from each group that ended up being goals. They then

applied this model to teams and specific players, which allowed them to draw conclusions about the

offense and defense of MLS teams (Fairchild 2018).
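
The check described above, comparing the predicted probability of a group of shots with the fraction of that group that were scored, can be sketched in Python as follows. This is only an illustration under stated assumptions: the shot data are made up, the function name `calibration_table` is hypothetical, and the equal-width probability bins are my own choice, since the text does not say how the paper grouped its shots.

```python
from collections import defaultdict

# Hypothetical shots: (predicted goal probability, 1 if scored else 0).
shots = [
    (0.05, 0), (0.08, 0), (0.03, 0), (0.07, 1),
    (0.32, 0), (0.35, 1), (0.38, 0),
    (0.72, 1), (0.78, 1), (0.75, 0),
]


def calibration_table(shots, n_bins=10):
    """Group shots into equal-width probability bins and compare the mean
    predicted probability with the observed fraction of goals per bin."""
    bins = defaultdict(list)
    for prob, scored in shots:
        # A probability of exactly 1.0 is placed in the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((prob, scored))
    table = {}
    for index in sorted(bins):
        group = bins[index]
        mean_predicted = sum(p for p, _ in group) / len(group)
        observed = sum(s for _, s in group) / len(group)
        table[index] = (len(group), mean_predicted, observed)
    return table


table = calibration_table(shots)
for index, (count, mean_predicted, observed) in table.items():
    print(f"bin {index}: {count} shots, "
          f"predicted {mean_predicted:.2f}, observed {observed:.2f}")
```

If a model is accurate, the predicted and observed values in each bin should be close once the bins contain enough shots; with only a handful of shots per bin, as here, large gaps are to be expected.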

Interestingly enough, this past paper was not able to create a model or function which

showed what tailoring xG should look like, but in the next paper the authors do exactly that.

The next academic paper I found was written by scholars in the computer science department at

the University of Cardiff. They took the current model of xG and decided to

adjust it for players’ positions and skill, a limitation which I have mentioned before. Their aim

was to evaluate models based on individual players, which would be useful in showing the

quality of players and their finishing. One thing I found useful about this paper is that it talked a bit

about how expected goals have evolved, even introducing what coaches thought of xG, which

can be very useful for context. They also acquired their data from a publicly available data set,

StatsBomb. This paper went really in-depth, as it took shots taken by players and accounted for

their dominant foot, keeper distance from center of goal, number of opponents within a 5 yard

radius, and much more.

This paper also took the xG of multiple analytics companies to

gather as much data as possible to determine which players are the best finishers. When looking

at players who surpassed their xG based on their model, one player was an obvious outlier: Lionel

Messi, recognized as one of the greats in the sport. They found that from an xG of 342 he scored

415 goals, surpassing his xG by 22%. The next closest player in their model was Luis Suarez, another

great finisher, who surpassed his xG by 15%. Something interesting that they did was to

make a Messi-adjusted xG model: in essence, what each player's xG would be if every player were Messi.

They created a whole table of players, but one player that stood out to me was Neymar, who scored

52 goals from a proposed xG of 58, so he was underperforming his xG. But once they

adjusted for Messi’s model, that is, if Neymar were Messi, he would have an xG of 66. I found it

interesting how detailed they were, to the point they could adjust for other players, which was

their main contribution to the existing literature.

Hewitt also mentioned that in the long run expected goals usually reflect accurately how

a player should be performing, meaning that if a player overperformed their expected goals one year, they are

unlikely to keep doing so, with the exception of Lionel Messi. It is interesting, but

makes sense, to read that expected goals are more accurate when there are more observations,

because a player might be getting fortunate one season or might seem to be operating well in a

system, but sooner or later opposing teams will modify their tactics to neutralize such players.

This paper is really important because it shows what xG should look like for specific

players. Although they are only doing so for one player, Messi, it is interesting to see what other

players’ xG would look like if they were Messi. With that being said, hopefully in the future they

can make more models for specific players, because as of now we have an average xG which

just assumes the ‘average’ player (Hewitt 2023).

Another paper I read which had similar intentions was written at the University of

Barcelona in collaboration with the soccer team F.C. Barcelona. The aim of

this project was to create a basic xG model and then build a second one with more information

about the shooter, so it is more tailored to the player, although the methodologies were a bit different.

The authors acquired their data through OPTA, which is not publicly available.

The first model was built without specific information about a player, so each shot

was treated as if an average player took it. In the paper they explain their results

through tables and formulas. The second model, however, used qualitative information to

build xG data on the players they decided to pick out. To do this they decided to use the ratings

of a video game, FIFA, to give the players a score. Although FIFA is a complex video game

which has stats for every player, I would have expected the researchers to use a different

method to rate the players out of 100.

Similar to the past study, they found that for certain players a player-specific xG

model was more accurate than the average one that is used for

everyone. Some advice that they gave for future work was to start tracking everyone’s shot output

at the beginning of the season, so that over the course of a couple of seasons a huge amount of

data is compiled, which can then be applied to give every player their own rating and expected goals.

The table that they showed, where each player has their own model, proved to be far more

accurate than the generic one that is used in regular games (Madrero 2020).

In conclusion, reading this related literature has helped me figure out what the gaps

are. For example, there is a lot of literature on shots and chance creation with many different

models, but little to no literature on the accuracy of expected goals in the Premier League and for

home and away teams. Many of the authors of these pieces concluded that what would make

expected goals more accurate is having specialized models for each player. The last two papers I

wrote about covered this in detail, and I believe more future work needs to be

done in this area to make expected goals as accurate as possible. With that being said, I am glad

there are some gaps in the literature which I am going to write about, with the goal of reaching

conclusions and extending this work.

ANALYTICAL FRAMEWORK

Having data is obviously make or break for my thesis. Luckily, all of my data is publicly

available at statsbomb.com, opta.com, and other websites with downloadable csv data. The data I

have was all collected from a dataset on Kaggle, which holds all of the expected goal data

that I need. It has every expected goal and expected goal against result from the 2014-2015

Premier League season. Expected goal data is widely available, and is similar across sources since the

methodology used to collect it is almost identical. The methodology was created and shared by

StatsBomb, an analytics company that provides data for professional teams across many

leagues.

My dataset contains all of the data needed for the 2014-2015 Premier League season.

Reading left to right, as shown in Figure 1, the first columns are xG and xGA; these are the two most

important variables for my project, as they are my independent variables. The

next three columns cover goals scored and goals scored against, which are not directly relevant to my thesis

but could potentially be important for describing the error, so I have kept them. Next

is the result of the game, either a win, loss, or tie; regardless of the xG and xGA, this is the

actual result. This is important to have because when comparing it to my expected result I will

see if they are the same or different. The following column is the points a team gained from that

game: if they won they got three points, if they tied they got one, and if they lost they

got none.

The next couple of columns describe the number of passes that a team had as well as the shots

that they took and gave up. Similar to the previous columns of actual goals scored or scored

against, these are not directly relevant to my paper, but in the long run they could help explain my

standard error if there is a lot of variance in my results. The last column is the team name. In

Figure 1 the only team name for the first 38 rows is Arsenal because, in Excel, I organized the data

by team name, and those first 38 rows are all of Arsenal’s games that season, meaning all of

the columns to the left are the statistics for Arsenal’s season. Unfortunately, my data is a

lot longer than Figure 1, so I am not able to show how extensive it is, as there are 722 more rows

of data. But Figure 1 is an example of what one team's data for the season looks like, meaning

there are 19 more teams that are not shown, but their results (or rows) would be structured the

same way.

When teams play each other there can be three outcomes: a win, a tie, or a loss. Wins

account for 3 points, ties 1 point, and losses 0 points. Over the course of a season all these points

are added up and depending on how many points a team has they can become champions of the

league, qualify for European competitions, or get relegated. To be the champion a team must be in

first place with the most points out of the 20 teams; if two or more teams have the same number

of points, the team with the highest goal differential, meaning goals scored minus goals conceded,

will be crowned champion of the Premier League. The top five teams qualify for European

competition for the following season under the same rules: they must be the five highest teams

in points and, if tied with other teams, goal differential applies. For relegation, the

bottom three teams with the fewest points are relegated to the second

division in England, the Championship; again, if tied on points with another team, the team with

the lower goal differential finishes in the lower position.
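
The ranking rule described above (points first, then goal differential when teams are level) can be sketched in Python. The team names and season totals below are made up for illustration, the function name `rank_table` is hypothetical, and any tie-breakers beyond goal differential are left out of this sketch.

```python
# Hypothetical season totals: (team, points, goals scored, goals conceded).
totals = [
    ("Team A", 70, 60, 40),
    ("Team B", 75, 65, 35),
    ("Team C", 70, 55, 30),
    ("Team D", 30, 28, 60),
]


def rank_table(totals):
    """Sort by points, then by goal differential (scored minus conceded),
    both from highest to lowest."""
    return sorted(
        totals,
        key=lambda row: (row[1], row[2] - row[3]),
        reverse=True,
    )


table = rank_table(totals)
positions = {team: place for place, (team, *_rest) in enumerate(table, start=1)}

for team, place in positions.items():
    print(place, team)
```

Team A and Team C are level on 70 points, so Team C finishes higher because its goal differential (+25) is better than Team A's (+20).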

In my paper I am researching whether expected goals accurately predict the

finishing position of teams in the Premier League. Therefore, my ‘Y’ variable will be the

expected finishing position of a team; this makes the most sense as finishing positions in the

Premier League are all based on the number of goals you score and the number of goals scored

against you. For this project I am testing whether or not expected goals are an accurate tool to

measure goals. My two ‘X’ variables will be expected goals for and expected goals against (also

known as xG and xGA). Both of these variables are independent and are not affected by finishing

position; rather, they affect the finishing position. Partida 2021 had the same equation in their

research: of their three different models, one looked at what the result of a game would be if

they subtracted xGA from xG to find the expected result of a game. The average team model had

the same relationship of independent variables that I have included in my paper, where you

take the xG minus the xGA.

The difference between Partida 2021's work and mine is that they were not looking at games

over the course of a whole season; I will be looking at the whole 2014/15 season and drawing

conclusions from that. Furthermore, the research they did was in Spain, Germany, and Italy,

while I will be looking at England, debatably the most competitive league. In the five years

before Partida 2021's work there were only 2 different league champions in Italy, 2 in Spain, and 1

in Germany, compared to England, which had 5 different title winners in the five years

to 2021. The Premier League is one of the most competitive leagues, which makes it

entertaining but also high in variance; therefore I am expecting my results to be volatile.

However, I cannot simply combine all the xG and xGA a team has had over the course of a

season to predict their expected finishing position. To make this more accurate I have to

take the xG and subtract the xGA for every game to get an ‘expected result’ for each game.

Depending on whether the expected result gives me a win, loss, or tie, I then combine the

results from a 38-game season to give a team an expected finishing position. In the Premier

League, teams play 38 games in a season because there are 20 teams: 19 games at home and 19

games away from home.
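
The season arithmetic above can be checked with a few lines of Python. The variable names are my own, and the row count assumes one row per team per game, which matches the 38 Arsenal rows plus 722 further rows described for the dataset.

```python
teams = 20

# Each team plays every other team twice: once at home and once away.
games_per_team = 2 * (teams - 1)

# One row per team per game, so every match appears twice in the data.
rows_in_dataset = teams * games_per_team
matches = rows_in_dataset // 2

print(games_per_team, rows_in_dataset, matches)
```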

Initially I believed I would need a regression, but since the xG and xGA of a game are not

consistent independent variables, all I need is to subtract xGA from xG for every game every

team played that season to acquire an expected result. To do this, I needed help from STATA and

had to clean up my data in Excel. Many hours were spent cleaning my data, as when I initially received the

data the games were organized from most recent game to least recent. What I believed made

most sense was organizing every team’s results alphabetically from A to Z, meaning the first 38

rows would be Arsenal’s game results and the next 38 rows would be Aston Villa’s results, as

they are next in line alphabetically, and so on.

I imported all of my data, as shown in Figure 1. This had all of the xG and xGA of every

team’s game in one place. Next, what I had to do was create a new variable which would show

the net xG of each game, so xG-xGA. As shown in Figure 2 I created the variable ‘netxg’. This

took the net expected goals of every game and gave it whatever value xG-xGA equaled. But the

net expected goals of a game was not enough to draw a conclusion so I needed the net expected

goals of a game to give me an expected result. For example, in the first game Arsenal played,

the game ended with 2 xG for Arsenal and 0 xGA, meaning Arsenal would acquire a win from

this game, because a net xG of 2 - 0 = 2 is positive. As mentioned earlier, a win is worth 3

points, so Arsenal would receive 3 expected points from this game. With this logic I am

following the same points system that leagues use in actual play, except that it is applied to

expected results rather than actual results.

Therefore I needed to create another variable, which would stand for the expected result.

This would be my dependent variable as it would be dependent on the xG and xGA of a game.

The new variable was named ‘exresult’, as shown in Figure 2. I coded it so that every net xG of 1

or greater gives ‘exresult’ a value of 3, in other words 3 points. Every net xG

equal to 0 gives a value of 1, since in a tie the net xG is zero regardless of the

xG and xGA. Lastly, every net xG value less than 0 (-1, -2, -3, and so on) gives

‘exresult’ a value of 0, because a team with a negative net xG

would have lost the game according to expected goals and therefore gets no points. These

expected results are applied to every game for every team.
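
The coding rule above can be sketched in a few lines of Python. This is only an illustration, not the STATA code used for the thesis: the function name `expected_result` is hypothetical, and the sketch assumes xG and xGA are recorded as whole numbers, as in the examples in the text, so that a net xG of "1 or greater" is the same as any positive net xG.

```python
def expected_result(xg: int, xga: int) -> int:
    """Return expected points (3, 1, or 0) for one game.

    Assumes whole-number xG and xGA, as in the examples in the text.
    """
    netxg = xg - xga   # the 'netxg' variable: xG minus xGA
    if netxg >= 1:     # positive net xG: expected win, 3 points
        return 3
    if netxg == 0:     # equal xG and xGA: expected tie, 1 point
        return 1
    return 0           # negative net xG: expected loss, 0 points


# The two Arsenal examples given in the text (2 xG vs 0 xGA, and 2 xG vs 1 xGA).
print(expected_result(2, 0))
print(expected_result(2, 1))
```

Both Arsenal examples have a positive net xG, so each is coded as an expected win.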

Adding up all the expected results for every team in STATA will give me a point total for

every team. Given all the points for every team I will be able to rank the teams based on their

expected points over the course of a 38-game season.
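
The totaling and ranking step might look like the following sketch. The function `expected_table`, the alphabetical tie-break, and the sample rows are all assumptions made for illustration; the thesis does not say how teams level on expected points were ordered, and the rows below are not real match data.

```python
from collections import defaultdict


def expected_table(games):
    """Build a ranked expected table.

    games: iterable of (team, xg, xga) with whole-number xG values,
    one row per team per game.
    """
    totals = defaultdict(int)
    for team, xg, xga in games:
        net = xg - xga
        totals[team] += 3 if net > 0 else 1 if net == 0 else 0
    # Highest expected points first; ties broken alphabetically (an assumption).
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [(pos, team, pts) for pos, (team, pts) in enumerate(ranked, start=1)]


# Hypothetical rows, two games per team, for illustration only.
sample_games = [
    ("Arsenal", 2, 1), ("Arsenal", 1, 1),
    ("Aston Villa", 0, 2), ("Aston Villa", 1, 1),
    ("Chelsea", 3, 1), ("Chelsea", 2, 0),
]
for row in expected_table(sample_games):
    print(row)
```

With a full season of data, each team would contribute 38 rows and the output would be the 20-row expected table.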

Initially I wanted to include a subsection in my thesis examining whether net expected goals

are more accurate for home or away teams. I believed this would be a great idea as I thought

there would be a huge disparity between home and away games. Home games in soccer

massively favor the home team; therefore, I would assume home teams have more accurate

expected goals since they are creating better quality chances. Away teams, on the other hand,

would have a lower expected goals accuracy as they might be creating lower quality chances and

getting more ‘luck’. One study, Partida et al. (2021), found that expected goals and expected

goals against favored the home team. Their reasoning was similar, also noting that teams at

home press higher, so they win the ball higher up the field and their shots come from closer

range than those of away teams. If I had continued my research and compared home and away

games, I would have overlapped with work that has already been done. Hopefully this topic can

be researched further in future studies.

My xG-xGA formula also carries an element of

error, which will likely account for why teams end up in different positions. This error

captures the variance or irregularities, if there are any, that the formula does not explain, and it

is why some teams land in very different spots compared to their actual finishing position.

What makes this research so interesting is that some teams might be in very different

rankings, and for very different reasons, since each team is unique. These will be discussed a

little later in my discussion of results and the potential shortcomings of this study.

Discussion of Results

For my results I have found some interesting outcomes from my formula. As mentioned

earlier, my formula of xG-xGA was run on every team's game in the Premier League that season.

For every game I got an expected result of either 0 points, 1 point, or 3 points. I then created 20

new columns with every team’s expected points, as shown in Figure 3. The teams’ expected

points were organized in alphabetical order, from left to right, so every column has the

expected points a team collected in each game of the 2014-15 season. As shown in the first

row and first column, “arsenal_pts”, Arsenal played against Crystal Palace, where the xG for

Arsenal was 2, while the xG for Crystal Palace was 1. In other words the xGA for Arsenal was 1.

Therefore 2-1=1, meaning according to expected goals they won the game, so Arsenal acquired 3

expected points. This process was done for every team's game in this season. After this was

done, the expected points for a team in a season were totaled.

For instance, Arsenal’s expected points for the season totaled 68, Aston Villa’s totaled

26, and so on. Once all the teams’ expected points were totaled, the teams were ranked from

most points to fewest, as shown in Table 1. Almost all of the work

I have done this semester went into producing Table 1, but the best part was

putting it all together. I placed my expected finishing position table side by side with

the actual finishing position table of the 2014-2015 Premier League season and compared them; this

is shown in Table 2.

Table 2 is broken up into 5 different columns. The first column has the actual finishing

position of each team, ordered from most points to fewest, in the 2014-15 season. The second column has the

number of points each team earned that season. The third column is the result of my

formula: the expected finishing position of each team. As in the first column, every team is

ranked from highest to lowest, this time based on the expected points it totaled in the 2014-15 season.

The fourth column has the expected points each team had that season, which led it to

finish higher or lower in the expected table. The last column is the difference between the actual finishing

position and the expected finishing position, which shows the reader how many

spots apart each team finished in the two tables.
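
As a rough illustration of the last column, the sketch below computes the position difference for the six teams whose actual and expected positions are stated in the text. The sign convention (actual minus expected) and the names `POSITIONS` and `position_differences` are assumptions; Table 2 itself may present the difference differently.

```python
# (team, actual position, expected position), taken from positions stated in the text.
POSITIONS = [
    ("Chelsea", 1, 3),
    ("Manchester City", 2, 1),
    ("Swansea", 8, 18),
    ("Stoke City", 9, 8),
    ("Newcastle", 15, 20),
    ("QPR", 20, 13),
]


def position_differences(rows):
    """Return {team: actual - expected}.

    Assumed sign convention: a positive value means the team placed
    higher (better) in the expected table than in the actual table.
    """
    return {team: actual - expected for team, actual, expected in rows}


diffs = position_differences(POSITIONS)
for team, diff in diffs.items():
    print(f"{team}: {diff:+d}")
```

Under this convention Swansea comes out at -10 (ten places lower in the expected table) and QPR at +7 (seven places higher).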

Looking at the results, the first part of the table I would like to analyze is the points versus the

expected points. At first glance, the expected points at the top of the table are much lower than

the actual points. The winner of the expected table, Manchester City,

had 69 expected points, compared to the actual winner of the Premier League, Chelsea, who had 87

points to win the league that year. There are many reasons why this happened. The first

is that my formula produced a lot more draws than the actual season. This means

points were split between teams, 1 point each, instead of one team taking all 3. For example, Chelsea,

who had 87 points in their actual season, dropped to 66 expected points, a 21-point drop. In their actual

season they had a total of 9 draws, compared to 16 draws in their expected season. This means

that in 7 games they went from potentially having 3 points to having 1, which lowered their

points and raised the points of the teams they tied with in the expected table. A team in a

similar situation was Manchester United, who had 70 points in their actual season. In

the expected table their points almost halved to 37, according to the formula. Manchester

United tied 10 times in their actual season, while in the expected season they tied 17 games,

7 more than in reality. More will be explained about Manchester United and their finishing

position later in this paper, but this increase in draws makes a lot of sense. Manchester United

are a team that tend to score very late in games, as well as from chances that usually do not go

in; they are a team that benefited from random events.
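
The arithmetic behind these two examples can be checked with a short sketch. The figures are the ones stated in the text; the function name `gaps` and the field `max_drop_from_draws` are hypothetical, and the latter rests on the assumption that every extra expected draw replaced an actual win (3 points becoming 1).

```python
# (actual points, expected points, actual draws, expected draws), as stated in the text.
SEASON_TOTALS = {
    "Chelsea": (87, 66, 9, 16),
    "Manchester United": (70, 37, 10, 17),
}


def gaps(totals):
    """Compute the point drop and the extra expected draws for each team."""
    out = {}
    for team, (pts, xpts, draws, xdraws) in totals.items():
        extra_draws = xdraws - draws
        out[team] = {
            "point_drop": pts - xpts,
            "extra_draws": extra_draws,
            # Assumption: each extra draw replaced an actual win (3 points -> 1 point).
            "max_drop_from_draws": 2 * extra_draws,
        }
    return out


result = gaps(SEASON_TOTALS)
for team, row in result.items():
    print(team, row)
```

Under that assumption the 7 extra draws account for at most 14 of the points each team dropped, so the rest of the gap would have to come from actual wins or draws that the formula codes as losses.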

The rise in the amount of draws can also be seen in the difference in points between the

first place team and the last place team, in both tables. The difference in points between Chelsea

and QPR in the actual table is much higher than in the expected table between Manchester City

and Newcastle. In other words, this means that the expected table shows that teams are closer in

competition than what is reflected in the actual table. This makes a lot of sense as in real soccer

games a small margin of error can cost a team games or a season. Expected goals do not

account for defensive errors or mistakes. For instance, suppose a player shoots the

ball from 40 yards out and it slips through the goalkeeper’s hands and into the net, although

the goalkeeper should be making that save. The scoreboard will say one to zero, but the

expected goals tally might count that shot as 0.001 expected goals. The game could end up

being one to one on expected goals while the actual score is one to zero. The scoreline might

not fairly reflect how the game played out, as both teams had equal scoring opportunities, but the

game would, in theory, be decided by the goalkeeper’s mistake, something which

expected goals do not account for. The scoring team would take the 3 points in

the actual game despite equal scoring opportunities. Because errors in a game are not

accounted for, expected goals predict teams to be more equal in skill and goal scoring

opportunities.

What this means is that the variance in points can be attributed to

error, but each team’s error is unique. Take Newcastle, for example: they

finished 15th in the real table with 39 points. In the expected table they finished dead last with 17

points, a drop of 22 points, into relegation. But how on earth could this happen to a team?

Newcastle that season had a prolific striker, Papiss Demba Cisse, who scored 12 goals and

assisted 1 (Premier League) in only 22 games. His expected goals for that season were 8, which

means he outperformed his expected goals by a whole 4 goals. Although 4 goals might sound

like a small figure, it is not at all, especially for a striker who plays for a team that sits in the

bottom third of the table. For teams that sit so low in the table, even just one goal

can be the margin needed to win a game, which can keep them from getting relegated

that season. According to expected goals, Newcastle were also predicted to have conceded 11

more goals than they did. For these reasons they dropped to last in the expected table, instead of

finishing 15th as in the actual table.

The next part of the table I would like to look at is the actual standings of the

2014-15 season versus the expected standings from my model. Comparing them one by one, there

were only 4 out of 20 teams in the same position in both the actual table and the

expected table, an accuracy of 20%. Looking at this percentage alone, anyone

would believe expected goals do not hold much accuracy. But it is not as plain or simple as

looking at the standings with no context. There is also the fact that if one team is not in its actual

place, at least one other team is automatically displaced as well.

Looking at it from a different perspective, 13 out of 20 teams were within 3 spots of their

actual finishing position, meaning that if the league went by the expected table, most teams would not

have a significantly different season. The only teams which would have a significantly different

outcome are Chelsea and Manchester City, as Manchester City would be crowned champions in

the expected table, while Chelsea would drop from being champions to finishing

third. Even then, Manchester City moved up only one position and Chelsea finished third with 3

points fewer than the champion of the expected table. 65% of teams being within three spots of

their actual finishing position is a promising result. Especially when looking at the teams

which had the biggest change in position, one could argue that the expected table has accurately

described how teams should have finished. Observing the teams that had the largest

change in position, it is no surprise that they are in the bottom half of the table, in other words, the

teams that are not as good.
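
The two accuracy figures in this section (exact matches and teams within a few places) are the same calculation with a different tolerance. A minimal sketch follows, using only the six teams whose positions are stated in the text, so its output does not reproduce the 20% and 65% figures, which come from the full 20-team Table 2. The names `SIX_TEAMS` and `share_within` are hypothetical.

```python
# (team, actual position, expected position) for the six teams named in the text.
SIX_TEAMS = [
    ("Chelsea", 1, 3),
    ("Manchester City", 2, 1),
    ("Swansea", 8, 18),
    ("Stoke City", 9, 8),
    ("Newcastle", 15, 20),
    ("QPR", 20, 13),
]


def share_within(rows, k):
    """Fraction of teams whose expected position is within k places of the actual one."""
    hits = sum(1 for _, actual, expected in rows if abs(actual - expected) <= k)
    return hits / len(rows)


print(share_within(SIX_TEAMS, 0))  # exact matches among these six teams
print(share_within(SIX_TEAMS, 3))  # within three places among these six teams

# The league-wide figures reported in the text, as plain ratios.
print(4 / 20, 13 / 20)
```

Setting `k` to 0 gives the strict "same position" measure, while a larger `k` gives the more forgiving measure used in the paragraph above.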

Take Swansea as an example: in the actual table they finished 8th, while in the

expected table they were predicted to finish 10 spots lower at 18th, with 33 points fewer than in the

actual table. Of all the teams, this was the one with the biggest change in position. This

change would mean they would be relegated and playing in the second division of

England. In the expected table they were predicted to have scored five goals fewer and

conceded seven more over the course of the season. Although over 38 games

five goals for and seven against do not sound like many, many soccer games are won

by the slightest of margins. Of the 16 games Swansea won this season, 11 were

won by a one-goal difference: 1-0, 2-1, 3-2, and so on. Swansea took winning by the

slightest of margins literally. Of the 11 games they won by a one-goal margin, 9 were games that,

according to the expected model, Swansea either lost or tied. This is not even considering

the games they tied or won by more than one goal. This shows that Swansea were very fortunate

to win many of these games and to finish in the position that they did in the actual table. The

expected table placed them a lot lower because, based on their scoring chances and the scoring

chances of opposing teams, they should have done much worse, but in the 2014-15 season one could say

they were on the receiving end of luck.
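
A count like "9 of the 11 one-goal wins were not expected wins" could be produced along these lines. The function name `one_goal_wins_not_expected` and the sample rows are hypothetical and are not Swansea's real results; the sketch again assumes whole-number xG values.

```python
def one_goal_wins_not_expected(games):
    """Count one-goal wins, and how many of them the xG model did not code as wins.

    games: iterable of (goals_for, goals_against, xg, xga).
    Returns (number of one-goal wins, number of those with net xG <= 0).
    """
    one_goal_wins = [g for g in games if g[0] - g[1] == 1]
    not_expected = [g for g in one_goal_wins if g[2] - g[3] <= 0]
    return len(one_goal_wins), len(not_expected)


# Hypothetical games for illustration: (goals for, goals against, xG, xGA).
sample = [
    (1, 0, 1, 1),  # one-goal win, expected tie
    (2, 1, 0, 2),  # one-goal win, expected loss
    (3, 2, 2, 1),  # one-goal win, expected win
    (2, 0, 1, 1),  # two-goal win, not counted
    (1, 1, 1, 0),  # draw, not counted
]
print(one_goal_wins_not_expected(sample))
```

Applied to a team's 38 games, the second number divided by the first gives the share of narrow wins that the expected model would have taken away.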

This luck can be explained through statistics rather than anecdotes. What

I mean by this is that this season Swansea had two important factors which allowed them to

finish higher than expected. The first was the goalkeeper: Swansea had Lukasz Fabianski

playing in net, who was voted the second best keeper in the Premier League that season (Sky).

This speaks volumes about his performances that season. Usually the goalkeeper of the

title-winning team, in this case Chelsea, is named best goalkeeper of the season, and the second best would

usually come from a team which finished within the top 5 positions. Therefore, with Swansea finishing

8th that year, it goes to show the number of goals he prevented opposing teams from scoring. This is

one of the main reasons Swansea’s xGA is so much higher than the actual goals they conceded:

they had a goalkeeper who was overperforming.

Another reason why Swansea finished so high that season was the overperformance of

their goalscorers: Wilfried Bony, Ki Sung-yueng, and Gylfi Sigurdsson. Bony scored 9 goals

that year, overperforming his expected goals by 2 (FBref). Ki scored 8 goals and

overperformed his expected goal tally by 4, massively impressive as it is double the expected amount.

Lastly, Sigurdsson scored 7 goals that year, overperforming his expected goal tally by 2

goals. Most teams do not have their top goal scorer outperforming his xG, while Swansea had

their top three goal scorers overperforming. Although these do not seem like very high numbers, as

mentioned earlier, soccer is a low scoring game where the slightest of margins can go a long

way. Swansea had a perfect sequence of events where their goalkeeper and strikers were playing

in the best form of their careers all at the same time.
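
Overperformance here is simply goals scored minus expected goals. The sketch below uses the goal tallies from the text; the xG values are not reported separately in the text and are implied here as goals minus the stated overperformance, which is an assumption, as are the names `SCORERS` and `overperformance`.

```python
# (goals scored, implied xG). The xG values are implied from the text as
# goals minus the stated overperformance, not separately reported figures.
SCORERS = {
    "Bony": (9, 7),
    "Ki": (8, 4),
    "Sigurdsson": (7, 5),
}


def overperformance(scorers):
    """Return {player: goals - xG}; positive means more goals than expected."""
    return {name: goals - xg for name, (goals, xg) in scorers.items()}


over = overperformance(SCORERS)
print(over)
print("combined:", sum(over.values()))
```

Summed over the three players, the implied overperformance is 8 goals, which is a large swing for a team whose expected table position depended on margins of a goal or less.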

Teams such as Swansea benefit from random events in one season, but in

the long run expected goals and expected goals against will be accurately reflected in their actual

position, as mentioned by Hewitt and Karakuş (2023). This claim is supported by the fact that two seasons later

Swansea fell to 15th, and the season after that they were relegated, finishing in

18th place (EPL). Although my model was very far off in predicting Swansea’s finishing

position for 2014-15, this drop in the following seasons suggests that in

the 2014-15 season Swansea were an outlier, as that season they had a near perfect alignment of

players overperforming. Although this is not directly part of my research, it would be interesting

for future research to take teams which outperform their expected points in one season and see

whether their actual finishing position worsens over the following years, and why.

One team which did not benefit from random events was Queens Park

Rangers, known as QPR. QPR finished in last place in the actual table, which meant they were

relegated to the second division. But according to the expected table, QPR should have

finished seven places higher in 13th place, based on the goal scoring chances created by them and

their opponents. Over the course of the season QPR were expected to have scored 4 goals more and

conceded 8 goals fewer than they did in their actual season. Combined with the expected

results of the other teams, this would have placed them 13th and

kept them from relegation. QPR were massively unfortunate in the 2014-15 season: of the 24 games they

lost, in 8 they had either a better or the same xG as the opposing team. This is what

made them finish so much higher in the expected table. In these 8 of the 24 games they lost,

they created goal scoring chances at least as good as their opponents’.

QPR were a match for their opponents in these games but ended up being unfortunate.

Similar to Swansea, QPR’s fortune has its own explanation. QPR that

season had a prolific striker, Charlie Austin, who tallied a total of 17 goals (EPL).

These are impressive scoring statistics, especially for a team that finished last. But according

to expected goals, Charlie Austin was expected to score 17 goals, so despite his eye for goal that

season he did not exceed his expected tally (Transfermarkt). In fact, of QPR’s top five goal

scorers that season, none outperformed their expected goals by more than half a goal. This meant that

QPR’s goal scorers gave the team none of the boost from finishing above expectation that Swansea’s

strikers provided; QPR had a perfect alignment of negative outcomes to random events.

Furthermore, QPR conceded 17 of their goals that year in the final 15 minutes

of games (Transfermarkt). This can reflect multiple aspects of a team. One could be poor fitness or

fatigue. When players are tired they are more susceptible to defensive mistakes, which

can lead to more goals, and fatigue and mistakes are not taken into account in xGA. Another

potential reason why QPR conceded late goals is poor coaching. QPR had a coaching change in the

middle of the season; usually when teams are performing poorly or below their

standards there is a change in the head coaching position.

The combination of having poor finishers, fatigued players, and bad coaching meant that

QPR’s expected finishing position was not reflected in their actual finishing position. QPR and

Swansea are two examples of teams which are outliers in the expected table model, but most of

the teams are not outliers and their actual finishing position is similar to their expected finishing

position.

But not all teams were predicted incorrectly by the expected table; many of them, as

mentioned earlier, fell into a similar or the same position as their actual finishing position. Stoke

City is a great example: their actual finishing position was 9th while their expected finishing

position was 8th. The only reason they finished higher was that Manchester United finished

below them in the expected table. The expected model predicted Stoke City to have one more

goal for and one more against, one of the closest estimates in this model for xG and xGA. Furthermore,

Stoke City’s top three goal scorers were all within a goal of their xG that season. This means

they neither overperformed nor underperformed; they performed to the standard that xG

expected of them. Moreover, of the 15 games Stoke City won, 12 were games they

were expected to win, an 80% accuracy rate on the games Stoke City won. The

accuracy did not end there: of the 15 games they lost, all 15 were losses

according to xG and xGA. In other words, the expected model accurately calculated, from the

goal scoring opportunities both teams had, that Stoke City should have lost all of the games they

did.
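
The per-outcome accuracy quoted for Stoke City (12 of 15 wins, 15 of 15 losses) is an agreement rate between actual and expected points. A minimal sketch, in which `agreement_rate` is a hypothetical helper and the two lists are constructed only to match the counts in the text; how the three unexpected wins were coded (expected ties or losses) is not stated, so treating them as expected ties is an assumption.

```python
def agreement_rate(actual_points, expected_points, outcome):
    """Share of games with the given actual outcome (3, 1, or 0 points)
    where the expected result was the same."""
    pairs = [(a, e) for a, e in zip(actual_points, expected_points) if a == outcome]
    if not pairs:
        return None
    return sum(1 for a, e in pairs if a == e) / len(pairs)


# Constructed to match the counts in the text: 15 wins (12 expected), 15 losses (all expected).
actual = [3] * 15 + [0] * 15
expected = [3] * 12 + [1] * 3 + [0] * 15  # the 3 unexpected wins coded as ties: an assumption

print(agreement_rate(actual, expected, 3))  # wins
print(agreement_rate(actual, expected, 0))  # losses
```

Running the same function with `outcome=1` on a full season would show how often actual draws were also expected draws, which the text does not report.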

Stoke City are not the only team who were accurately expected to finish in a similar

position to their actual finishing position: Crystal Palace, Arsenal, Everton, Sunderland, and nine

other teams were in the same boat. Most of the games they won were won

because they generated better goal scoring opportunities than the opposing team, just as most

of the games they lost were lost because they created fewer goal scoring opportunities

than the opposing team, which was accurately represented by xG and xGA.

LIMITATIONS AND FUTURE WORK

For many of the teams in the 2014-15 Premier League season, xG and xGA were

accurately represented in their actual results. But in terms of predicting the finishing position of a

team, this model had a 65% accuracy rate of teams finishing within 2 positions of their actual

finishing position. Conclusions about this number are subjective. On one hand, of the other

statistics in soccer used almost as commonly as xG and xGA, such as possession, shots on

target, fouls, or pass completion percentage, arguably none would predict the finishing

position of the whole league better than xG and xGA. On the other hand, an argument can be made that

predicting 65% of the finishing positions is not a high number and that the other 7 teams were

very inaccurately predicted. What do those 7 teams say about expected goals?

For starters there is a lot more that goes into a soccer game or season than just goal

scoring opportunities by the home and away team. xG solely focuses on shots without

accounting for build-up play, momentum, possession, or defensive contributions and all of these

aspects can largely affect the outcome of a match. For example, if a team has really good build-up

play and the player decides to pass the ball instead of shooting, even though he might be in

a great goal scoring position, the xG value for that scenario will be 0.

The formula is a bit rudimentary, but this also means that it leaves little room for

bias and less unwanted variation. Furthermore, there is little more at my disposal that I could

have added to strengthen this formula. Using the xG-xGA formula of Partida et al. (2021) to predict a

winner from each game leaves out information about the game; some of these gaps are

shortcomings of the formula, while others are shortcomings of expected goals in general. For

instance, as mentioned by Madrero Pardo (2020), expected goals favor the home team, meaning they are a

lot more accurate for home teams than away teams. My model treated every game as if it

were being played at a neutral site. Future research should include a model which adjusts expected

goals for whether a team is playing at home or away.

Soccer is low scoring and one of the most volatile sports, as one mistake can cause a lot

of change in a game. Expected goals do not account for much of this variability: a shot

with low xG can still result in a goal because of goalkeeper error, a deflection, or the exceptional skill

of a shooter. This goes hand in hand with xG not accounting for individual errors.

The last limitation of xG is that a given shot is valued the same regardless of the player’s

skill. This is mostly due to the relatively small amount of data xG analysts have on players’ shooting

ability. A striker might shoot once or twice a game, and sometimes not at all; over

the course of a season that is not as many shots as a data collector would want in order to draw

conclusions about a player’s shooting ability. Furthermore, other players on the field, such as

midfielders or defenders, shoot even less, so their observations are even more limited. For those

reasons xG metrics use an “average” shooter skill to calculate the expected goals of every shot.

Some future research I would wish to see or do could include assigning a specific xG value

to every player. This may not be possible for every player, as data collection would be

limited for players who do not shoot a lot, but for strikers and players who shoot consistently I

do not see why this would be an issue to calculate. Assigning player-specific xG values to shots

could increase the accuracy of xG in games, which could in turn predict finishing positions

more accurately.

Some other future work I wish to do one day is observing teams who have overperformed or

underperformed in a season and seeing how they do over the course of the next five seasons or so.

Expected goals tend to be more accurate over time as xG analysts are able to collect more data

and make more accurate measurements in a game. Moreover, overperformers and

underperformers tend to be outliers in the data, so in the long run the teams who should be winning

games, in theory, will win the games they deserve.

CONCLUSION

In the world of soccer, expected goals have become an essential analytical tool for

analysts, pundits, and coaches. They have allowed us to view the game more critically than we did

before. Work had been done previously in various fields related to my topic, but my aim was to

look at the first year xG and xGA were used and test their accuracy in predicting finishing position in one of

the most uncertain and competitive leagues in the world, the Premier League, in its 2014-15 season.

To my knowledge, the accuracy of xG in predicting finishing position had not been tested previously, which made

me even more excited to conduct this research. With previous literature I was able to use a

formula which was as unbiased and straightforward as possible. I wanted to analyze the accuracy

of expected goals with no conflicting variables, in order to get raw evidence on its accuracy. The

results were fairly promising: 13 out of the 20 teams finished within 2 positions of their

actual finishing position. Teams such as Stoke City, Arsenal, Everton and more were accurately

predicted. For the other 7 teams there was a lot of variability, which showed the limitations of xG

and the factors xG was not accounting for. Teams such as Swansea and QPR had either massive

overperformers or underperformers in their teams, which xG could not accurately capture.

This semester I have had the privilege to work on a topic that I am very passionate about

while including analytics in the sport of soccer. With the tremendous help of my teacher I have

been guided on the right path in order to perfect my work. But it has left me willing to learn and

research more. As I mentioned in some of my future work, I would like to see what xG looks like

if it was adjusted for the shooter's skill or if it matters if a team is home or away. As the metrics

of expected goals grow I feel privileged to have added research to the analytics world of soccer

and hope that future research continues.

Image 1

Figure 1

Figure 2

Figure 3

Table 1

Table 2

BIBLIOGRAPHY

● Bransen, L., & Davis, J. (2021, May). Women’s football analyzed: Interpretable expected

goals models for women. Proceedings of the AI for Sports Analytics (AISA) Workshop at

IJCAI (Vol. 2021).

● Eggels, H., van Elk, R., & Pechenizkiy, M. (2016). Expected goals in soccer: Explaining

match results using predictive analytics. In The machine learning and data mining for

sports analytics workshop (Vol. 16).

● Fairchild, Alexander, Pelechrinis, Konstantinos, and Kokkodis, Marios. ‘Spatial Analysis

of Shots in MLS: A Model for Expected Goals and Fractal Dimensionality’. 1 Jan. 2018 :

165 – 174.

● Goddard, J. (2005). Regression models for forecasting goals and match results in

association football. International Journal of forecasting, 21(2), 331-340.

● Hewitt, J. H., & Karakuş, O. (2023). A machine learning approach for player and position

adjusted expected goals in football (soccer). Franklin Open, 4, 100034.

● Madrero Pardo, P. (2020). Creating a model for expected goals in football using

qualitative player information (Master's thesis, Universitat Politècnica de Catalunya).

● Partida, A., Martinez, A., Durrer, C., Gutierrez, O., & Posta, F. (2021). Modeling of

Football Match Outcomes with Expected Goals Statistic. Journal of Sports Research,

10(1).

● Yorke, J. (2022, November 8). Premier League 2014-15 Stat Round-Up: Goalscorers,

disappearing shots and more. StatsBomb | Data Champions.

● Rathke, A. (2017). An examination of expected goals and shot efficiency in soccer.

Journal of Human Sport and Exercise, 12(2), 514-529.

● Premier League expected goal data 2014-2020. (2021, June 15). Kaggle.

● "Premier League – Handbook Season 2014/15" (PDF). Premier League. Archived from

the original (PDF) on 20 August 2014. Retrieved 2 February 2015.

● "Barclays Premier League Statistics – 2014–15". ESPN FC. Entertainment and Sports

Programming Network (ESPN).

● Brechot, M., & Flepp, R. (2018). Dealing with randomness in match outcomes: how to

rethink performance evaluation and decision-making in European club football.

● "Statistical Leaders – Clean Sheets". NBC Sports. Archived from the original on 15 June

2013.

● Tiippana, T. (2020). How accurately does the expected goals model reflect goalscoring

and success in football? (Bachelor's thesis).

● ESPN scoring STATS, 2014-15 Premier League Season

● Barclays Premier League football scores & results, 2014-15.

● Mead, J., O’Hare, A., & McMenemy, P. (2023). Expected goals in football: Improving

model performance and demonstrating value. Plos one, 18(4), e0282295.

● Stats, Goals, Records, Assists, Cups and more | FBref.com. (n.d.). FBref.com.

● EPL xG Table and Scorers for the 2014/2015 season | Understat.com. (n.d.). Understat.

● Premier League football news, fixtures, scores & results.
