The English Premier League is well known for being not only one of the most popular professional sports leagues in the world, but also one of the toughest competitions to predict. The first purpose of this research was to verify the consistency between goal scoring in the English Premier League and the Poisson process; specifically, the relationships between the number of goals scored in a match and the Poisson distribution, the time between goals throughout the course of a season and the exponential distribution, and the time location of goals during football games and the continuous uniform distribution. We found that the Poisson process and the three probability distributions accurately describe Premier League goal scoring. In addition, Poisson regression was used to predict outcomes for a Premier League season, using different sets of season data and a large number of simulations. We examined and compared various soccer metrics from our simulation results, including an English club’s chances of being the champions, finishing in the top four and bottom three, and relegation points.

Association football, also commonly known as soccer in America, is undoubtedly the most widely-played sport in the world. Often referred to as the “king of sport,” football can be played almost anywhere, from grass fields to indoor gyms, streets, parks, or beaches, due to the simplicity in its principal rules and essential equipment. Europe is known to be the birthplace of modern football [

The EPL was founded in 1992, and over the last three decades, we have witnessed numerous memorable matches and countless outstanding performances by clubs and their players. The EPL is currently a competition of twenty English football clubs. At the end of each season, the bottom three teams get relegated to the second-highest division of English football, in exchange for three promoted teams. A Premier League season usually takes place from mid-August to mid-May. Each team gets to play every other team twice, once at home and once on the road, hence there are a total of thirty-eight fixtures in a season for each team [

The most important aspect of the game of football is indisputably scoring goals. Despite the significance of other factors like ball possession or disciplined defending, we have to admit that the main reason we pay to watch soccer is to see the ball being put in the back of the net. The rule is very simple: in order to win, you must score more than your opponent. In the Premier League, each match lasts ninety minutes (plus stoppage time) and consists of two 45-minute halves. Each team gets one of three results after each match: a win, a draw, or a loss. If there is a draw, the two clubs receive a point apiece; otherwise, the winner is rewarded with three points and the losing team gets zero points. Thus the club with the most points at the end of the season will have their hands on the exquisite EPL trophy, and the total points also determine the fates of teams in the relegation zone [

In this paper, we attempt to use statistical methods to model and predict goal scoring and match results in the Premier League. We will first determine whether notable aspects of goal scoring, namely, the number of goals scored, the time between goals, and time location of goals in a match, fit the characteristics of a Poisson process. We will then use Poisson regression to predict what would happen in the 2018–19 EPL season, for instance, which clubs are more likely to win the title or get relegated, using different subsets of data from prior seasons. The paper is outlined as follows: We first introduce the data that we used for our analyses in Section

The first dataset for our investigation simply consists of the final scores of all Premier League games from its inaugural competition, the 1992–93 season, to the last fixture of the 2018–19 season. The main attributes of this dataset are the season, the home and away teams, and the number of goals scored by each team. We rely on Football-Data.co.uk’s data [

To obtain the data for the first two season subsets, we simply filter the initial table, keeping only the seasons that belong to the relevant year ranges. For the weighted simulation method, our weight allocation approach is very simple: the weight of a season equals the number of times the data for that season is duplicated. We decided that the five seasons before 2018–19 matter most. Thus, every season from 1992–93 to 2012–13 is given weight 1, and the weight then increases by 1 for each of 2013–14, 2014–15, and 2015–16. For the two most recent seasons, we multiply the weight by 2. Our weight values are depicted in Figure

Weight values across seasons from 1992–93 to 2017–18 for predicting 2018–19 season outcomes.
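The duplication scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the match-record layout are hypothetical, season labels use a plain hyphen, and the weights of 8 and 16 for the two most recent seasons reflect one reading of "multiply the weight by 2" (successive doubling from 4). The figure, not this sketch, is the authority on the actual values.

```python
def season_labels(first_start=1992, last_start=2017):
    """Labels such as '1992-93' for every season from 1992-93 to 2017-18."""
    return [f"{y}-{(y + 1) % 100:02d}" for y in range(first_start, last_start + 1)]


def assumed_weights():
    """Season weights under our reading of the text (an assumption)."""
    weights = {label: 1 for label in season_labels()}
    # The weight increases by 1 for each of these three seasons.
    weights.update({"2013-14": 2, "2014-15": 3, "2015-16": 4})
    # ASSUMPTION: the doubling is applied successively (4 -> 8 -> 16).
    weights.update({"2016-17": 8, "2017-18": 16})
    return weights


def duplicate_by_weight(matches, weights):
    """Repeat every match record as many times as its season's weight."""
    weighted = []
    for match in matches:
        weighted.extend([match] * weights[match["season"]])
    return weighted


# Hypothetical match records, used only to show the mechanics.
example_matches = [
    {"season": "2012-13", "home": "A", "away": "B", "home_goals": 1, "away_goals": 0},
    {"season": "2015-16", "home": "B", "away": "A", "home_goals": 2, "away_goals": 2},
]
weighted_matches = duplicate_by_weight(example_matches, assumed_weights())
```

Fitting a model to the duplicated rows is equivalent to giving each season an integer frequency weight, which is why the weight can be read as a number of copies.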

The Poisson process [

Our second goal of this research is to use the method of Poisson regression to predict the outcomes for EPL matches. Poisson regression is a member of a broad class of models known as the Generalized Linear Models (GLM) [

1. A random component, indicating the conditional distribution of the response variable

2. A linear predictor (

3. A canonical link function

Poisson regression models are generalized linear models with the natural logarithm as the link function. They are used when the response is a count, which is appropriate for our case since our response variable is the number of goals scored. The model assumes that the observed outcome variable follows a Poisson distribution and fits the logarithm of its mean parameter to a linear combination of explanatory variables. The general form of a Poisson regression model is
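For reference, the standard textbook form of a Poisson regression with a log link is given below. The symbols are our assumption, since the notation is not recoverable from this excerpt: $Y$ is the count response, $\mu$ its conditional mean, $x_1, \dots, x_p$ the explanatory variables, and $\beta_0, \dots, \beta_p$ the coefficients.

```latex
Y \mid x_1, \dots, x_p \sim \mathrm{Poisson}(\mu),
\qquad
\log(\mu) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p
\quad\Longleftrightarrow\quad
\mu = \exp\left(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p\right)
```

Because the predictors act on $\log(\mu)$, each coefficient has a multiplicative effect on the expected count: a one-unit increase in $x_j$ multiplies $\mu$ by $e^{\beta_j}$.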

After that, we ran a large number of simulations to obtain hypothetical 2018–19 season results, and then analyzed and compared the results for each of the three subsets of seasons mentioned in the previous section. For each subset of data, we performed 10,000 simulations. In each simulation, we randomly generated the final score of every team matchup from the clubs’ average scoring rates, which we obtained by fitting the Poisson regression models; each draw returns a random integer for a team’s number of goals scored. In addition, the number of points for every match outcome was calculated from the teams’ numbers of goals scored (see Table

Simulation table of 2018–19 EPL matches. The two columns

HomeTeam | HomeRate | AwayTeam | AwayRate | HomeScore | AwayScore | HomePoints | AwayPoints |

Newcastle | 1.667 | Arsenal | 1.510 | 1 | 3 | 0 | 3 |

Bournemouth | 1.474 | Southampton | 0.989 | 0 | 0 | 1 | 1 |

West Ham | 1.429 | Brighton | 0.526 | 1 | 0 | 3 | 0 |

Fulham | 1.413 | Tottenham | 1.256 | 2 | 3 | 0 | 3 |

Cardiff | 1.053 | Leicester | 1.148 | 0 | 2 | 0 | 3 |
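A single row of this table can be produced along the following lines. This is a minimal sketch, not the code used for the paper: the function names are hypothetical, the two scores are drawn independently (an assumption), and the Poisson sampler is Knuth's multiplication method, chosen only because Python's standard library has no built-in Poisson generator. The rates are the Newcastle and Arsenal values from the first row.

```python
import math
import random


def poisson_draw(rate, rng):
    """Draw one Poisson(rate) integer with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def match_points(home_goals, away_goals):
    """Return (home points, away points): 3 for a win, 1 for a draw, 0 for a loss."""
    if home_goals > away_goals:
        return 3, 0
    if home_goals < away_goals:
        return 0, 3
    return 1, 1


def simulate_match(home_team, home_rate, away_team, away_rate, rng):
    """Simulate one match; the two scores are assumed independent."""
    home_score = poisson_draw(home_rate, rng)
    away_score = poisson_draw(away_rate, rng)
    home_points, away_points = match_points(home_score, away_score)
    return {
        "HomeTeam": home_team, "HomeRate": home_rate,
        "AwayTeam": away_team, "AwayRate": away_rate,
        "HomeScore": home_score, "AwayScore": away_score,
        "HomePoints": home_points, "AwayPoints": away_points,
    }


rng = random.Random(2018)  # fixed seed so the sketch is reproducible
row = simulate_match("Newcastle", 1.667, "Arsenal", 1.510, rng)
```

Repeating `simulate_match` over all fixtures gives one simulated season, and repeating that 10,000 times gives the sample from which the chances reported later are estimated.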

Final standings for a simulated 2018–19 season. The teams are ranked by total points, followed by goal differential (goals scored minus goals conceded).

Rank | Team | Played | Points | GD |

1 | Man United | 38 | 77 | 30 |

2 | Chelsea | 38 | 73 | 27 |

3 | Liverpool | 38 | 69 | 20 |

4 | Arsenal | 38 | 68 | 26 |

5 | Leicester | 38 | 62 | 9 |

6 | Man City | 38 | 61 | 10 |

7 | Newcastle | 38 | 56 | 5 |

8 | Everton | 38 | 56 | 2 |

9 | Bournemouth | 38 | 55 | −1 |

10 | Southampton | 38 | 53 | 0 |

11 | West Ham | 38 | 53 | −3 |

12 | Tottenham | 38 | 51 | −2 |

13 | Wolves | 38 | 48 | −5 |

14 | Burnley | 38 | 48 | −6 |

15 | Cardiff | 38 | 47 | −10 |

16 | Fulham | 38 | 40 | −9 |

17 | Huddersfield | 38 | 39 | −33 |

18 | Watford | 38 | 38 | −14 |

19 | Crystal Palace | 38 | 36 | −19 |

20 | Brighton | 38 | 24 | −27 |
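The ranking rule in the caption (points first, then goal differential) can be sketched as follows. The function name and the tuple layout are hypothetical, and ties that remain after goal differential are simply left in input order, which is our assumption since the caption names no further tiebreaker. The five example results are the simulated matches from the simulation table.

```python
def final_standings(matches):
    """Rank teams by total points, then by goal differential.

    `matches` is a list of (home_team, away_team, home_goals, away_goals).
    """
    table = {}
    for home, away, home_goals, away_goals in matches:
        for team in (home, away):
            table.setdefault(team, {"team": team, "played": 0, "points": 0, "gd": 0})
        if home_goals > away_goals:
            home_pts, away_pts = 3, 0
        elif home_goals < away_goals:
            home_pts, away_pts = 0, 3
        else:
            home_pts, away_pts = 1, 1
        for team, pts, diff in (
            (home, home_pts, home_goals - away_goals),
            (away, away_pts, away_goals - home_goals),
        ):
            table[team]["played"] += 1
            table[team]["points"] += pts
            table[team]["gd"] += diff
    # sorted() is stable, so remaining ties keep their order of first appearance.
    ranked = sorted(table.values(), key=lambda r: (-r["points"], -r["gd"]))
    for rank, row in enumerate(ranked, start=1):
        row["rank"] = rank
    return ranked


example_results = [
    ("Newcastle", "Arsenal", 1, 3),
    ("Bournemouth", "Southampton", 0, 0),
    ("West Ham", "Brighton", 1, 0),
    ("Fulham", "Tottenham", 2, 3),
    ("Cardiff", "Leicester", 0, 2),
]
standings = final_standings(example_results)
```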

For our first analysis of the relationship between the number of goals scored and the Poisson distribution, we used Manchester United (MU) as our case study. Our question was: “Does MU’s distribution of the number of goals scored follow a Poisson distribution?” Table

Descriptive statistics of Manchester United’s number of goals scored.

min | Q1 | median | Q3 | max | mean | sd | n |

0 | 1 | 2 | 3 | 9 | 1.916 | 1.405 | 1038 |

Histogram of Manchester United’s goals scored.

Observed and expected frequencies of the number of matches for each goal value, alongside their Poisson probabilities.

Goals | Probability | Observed | Expected |

0 | 0.147 | 158 | 153 |

1 | 0.282 | 286 | 293 |

2 | 0.270 | 282 | 280 |

3 | 0.173 | 178 | 180 |

4 or more | 0.128 | 134 | 133 |
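The probability column above can be reproduced from the Poisson probability mass function with the sample mean 1.916 as the rate. The sketch below also computes a chi-square goodness-of-fit statistic as one possible summary of the agreement; that choice is ours, since this excerpt does not say which test was applied. Expected counts computed from unrounded probabilities can differ from the table by about one match.

```python
import math


def poisson_pmf(k, rate):
    """P(X = k) for X ~ Poisson(rate)."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def binned_probabilities(rate, max_single=3):
    """Probabilities for 0, 1, ..., max_single goals, plus a final 'or more' bin."""
    probs = [poisson_pmf(k, rate) for k in range(max_single + 1)]
    probs.append(1.0 - sum(probs))
    return probs


def chi_square_statistic(observed, probs):
    """Sum of (observed - expected)^2 / expected over all bins."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))


observed = [158, 286, 282, 178, 134]   # matches with 0, 1, 2, 3, and 4 or more goals
probs = binned_probabilities(1.916)    # rate = MU's mean goals per match
expected = [sum(observed) * p for p in probs]
statistic = chi_square_statistic(observed, probs)
```

A small statistic relative to a chi-square reference distribution would be consistent with the close match between the observed and expected columns.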

Side-by-side bar graph comparing the observed and expected matches.

We were also interested in verifying the connections between the time between goals in a season and the exponential distribution, and the re-scaled goal scoring minutes in a match and the standard uniform distribution. We continued to use Manchester United to investigate these topics and explore their goal scoring time data described in Section

Cumulative distribution curves of time between goals and the exponential distribution.

Cumulative distribution curves of the re-scaled goal scoring minutes and the standard uniform distribution.
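The comparison behind these two figures can be illustrated on synthetic data. The sketch below does not use the Manchester United goal times. It simulates a homogeneous Poisson process over a run of concatenated 90-minute matches (stoppage time is ignored), then measures the largest gap between each empirical cumulative distribution and its theoretical counterpart. The function names, the rate of 1.916 goals per 90 minutes, and the use of a Kolmogorov-Smirnov style distance are our assumptions.

```python
import math
import random


def ks_distance(sample, cdf):
    """Largest vertical gap between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n)
        for i, x in enumerate(xs)
    )


def simulate_goal_times(rate_per_minute, n_matches, rng, match_length=90.0):
    """Return (times between goals, goal minutes re-scaled to [0, 1))."""
    total_minutes = n_matches * match_length
    gaps, rescaled = [], []
    clock = 0.0
    while True:
        gap = rng.expovariate(rate_per_minute)  # exponential waiting time
        if clock + gap >= total_minutes:
            break
        clock += gap
        gaps.append(gap)
        # Minute within the current match, divided by the match length.
        rescaled.append((clock % match_length) / match_length)
    return gaps, rescaled


rate = 1.916 / 90.0  # goals per minute (illustrative value)
rng = random.Random(2019)
gaps, rescaled = simulate_goal_times(rate, 1038, rng)

gap_distance = ks_distance(gaps, lambda x: 1.0 - math.exp(-rate * x))
uniform_distance = ks_distance(rescaled, lambda u: u)
```

With real data, the same two distances would quantify how closely the curves in the figures overlap.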

In this section, we discuss the results of our Poisson regression models and simulations described in Section

We first look at the chances of finishing first in the EPL table at the end of the 2018–19 season for the “Big 6” in English football, which includes Manchester United, Liverpool, Arsenal, Chelsea, Manchester City, and Tottenham (see Table

Chances (in %) of winning the 2018–19 Premier League title for the Big 6.

Team | All Seasons | 2010s | Assign Weight |

Arsenal | 19.68 | 15.05 | 14.07 |

Chelsea | 14.28 | 11.70 | 9.03 |

Liverpool | 12.50 | 10.96 | 17.61 |

Man City | 7.00 | 41.71 | 38.09 |

Man United | 36.99 | 10.47 | 10.53 |

Tottenham | 3.91 | 7.38 | 8.34 |

Chances (in %) of getting relegated after the 2018–19 season for Premier League teams.

Team | All Seasons | 2010s | Assign Weight |

Huddersfield | 69.15 | 72.91 | 71.03 |

Cardiff | 49.99 | 54.38 | 53.47 |

Brighton | 41.65 | 45.03 | 44.03 |

Burnley | 31.60 | 43.38 | 33.34 |

Watford | 26.28 | 17.24 | 15.82 |

Wolves | 22.04 | 14.22 | 22.32 |

Crystal Palace | 17.54 | 14.42 | 10.65 |

Fulham | 9.87 | 7.91 | 11.74 |

West Ham | 7.46 | 6.73 | 6.39 |

Southampton | 6.49 | 5.23 | 11.65 |

Leicester | 5.53 | 1.75 | 2.39 |

Bournemouth | 4.76 | 5.46 | 5.80 |

Everton | 3.48 | 2.17 | 3.63 |

Newcastle | 2.40 | 8.78 | 7.41 |

Tottenham | 1.02 | 0.16 | 0.08 |

Man City | 0.38 | 0 | 0 |

Chelsea | 0.15 | 0.09 | 0.13 |

Arsenal | 0.10 | 0.04 | 0.06 |

Liverpool | 0.10 | 0.05 | 0.02 |

Man United | 0.01 | 0.05 | 0.04 |

Next, we investigate the likelihood of getting relegated after the 2018–19 season for EPL clubs. The relegation zone, the last three places in the final rankings, is where no Premier League team wants to finish, because the bottom three clubs are relegated to the second-highest division of English football after each season. The results from our three subsets of season data (see Table

On a related note, the 40-point safety rule [

40-point safety rule comparison between the three subsets of season data. For each simulation method, we tallied the number of distinct simulated seasons in which at least one team was relegated with 40 or more points, as well as the total number of such teams.

Subset | Seasons | Teams |

All Seasons | 3434 | 4424 |

2010s | 2066 | 2470 |

Assign Weight | 2346 | 2889 |
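The two tallies in this table can be computed as sketched below, assuming each simulated season is available as a list of (team, points) pairs ordered from first to last place. The function name and data layout are hypothetical. The first example season uses the bottom three of the simulated standings shown earlier; the second is invented purely to exercise the counting.

```python
def forty_point_tally(seasons, threshold=40, relegated_places=3):
    """Count seasons and teams with a relegated side on at least `threshold` points.

    Each season is a list of (team, points) ordered from first to last place.
    Returns (number of seasons affected, total number of teams affected).
    """
    seasons_hit = 0
    teams_hit = 0
    for standings in seasons:
        bottom = standings[-relegated_places:]
        unlucky = [team for team, points in bottom if points >= threshold]
        teams_hit += len(unlucky)
        if unlucky:
            seasons_hit += 1
    return seasons_hit, teams_hit


# Only the lower end of each table matters for this tally.
season_a = [("Fulham", 40), ("Huddersfield", 39), ("Watford", 38),
            ("Crystal Palace", 36), ("Brighton", 24)]
# Hypothetical season in which two relegated teams reach 40 points.
season_b = [("Team P", 43), ("Team Q", 42), ("Team R", 41),
            ("Team S", 40), ("Team T", 31)]

seasons_count, teams_count = forty_point_tally([season_a, season_b])
```

A season contributes at most one to the first count but up to three to the second, which is why the Teams column is always at least as large as the Seasons column.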

Overall, we have found that Premier League goal scoring fits the characteristics of a Poisson process. Our first result was that a Poisson distribution can be used to predict the number of matches with each number of goals scored. Additionally, the time between each individual goal in a season can be described by an exponential distribution. We also have evidence that the goal scoring times within a match, after re-scaling, are uniformly distributed. We then used different sets of data prior to the 2018–19 Premier League season, namely, data from all previous seasons, data from only the 2010s, and data from all previous seasons with more weight assigned to recent competitions, to predict what would happen in the 2018–19 season. We obtained each team’s goal scoring rate at home and away from home through Poisson regression, and then performed simulations using those rate parameters. Team metrics such as how many points each team earned and what place each team finished were tracked across the simulations, and we then used those variables to analyze and compare our models for the different season data subsets.

In the future, there are additional topics we could explore, including:

1. Besides the number of goals scored, there are many other factors that can be used to determine outcomes of football matches. In future research, we could use various factors to predict goal scoring and find out whether they are as helpful as using just the number of goals. We could look into variables that are likely to contribute to the outcomes of Premier League football matches such as clean sheets, possession time, pass accuracy, shots on target, and numerous other soccer statistics. On top of that, we could compare models with different predictors and evaluate them to find out which set of variables best predicts league outcomes, and then use them to simulate and predict match results.

2. In football and many other sports, team performance tends to vary throughout a season and across seasons. Some Premier League teams tend to get hot in the early months, some clubs reach their peak during the middle period of the season, and a few others are more likely to do better in the season’s home stretch. Winning and losing streaks are also important factors in sports, as some clubs are streaky, while others tend to be more consistent. Thus, in future research, we could use the results of EPL teams from earlier games within the season, and maybe find a way to emphasize winning and losing streaks, to predict the outcome of later matches. As a follow-up, we could investigate a model’s performance throughout the season. Some models may work better and predict more accurate results at certain times in the year than others.

3. In addition to predicting match results, another popular application of statistical modeling in sports analytics is determining betting odds. We could use the probabilities from our Poisson regression models and simulations to calculate the odds of possible game outcomes for different team matchups. We could also look into and compare different types of bets, such as over/under, money line, or point spread wagers, to determine whether it is a good idea to bet on a match and, if so, how much profit we could expect.

This work was completed as the author’s senior honors thesis, in partial fulfillment of the requirements for earning Departmental Honors in Mathematics at Wittenberg University in Springfield, Ohio. The author would like to express his special gratitude and thanks to his advisor, Professor Douglas M. Andrews, for his many ideas, suggestions, and guidance throughout the research process. The author would also like to thank the Department of Mathematics and Computer Science at Wittenberg University for providing him with valuable knowledge during his undergraduate career, and for giving him the opportunity to participate in the Departmental Honors Program.

All of the materials related to this research are available on GitHub at