Predicting a Player's Performance
Trying to build a model that predicts a player's performance from one season to the next, with an emphasis on those who move leagues.
Introduction
After writing about Brighton’s recruitment and thinking about projecting player performance to assess potential value for money buys, I wanted to try and build a model to predict a player’s performance from one season to the next, accounting for differences in the league and club.
It's a broad problem with plenty of possible approaches, so I kept it simple: I only tried to predict a player's non-penalty goals from one season to the next.
The Data
There was a trade-off between depth and breadth when deciding what data to use. More detailed data would likely have produced a better model, but part of the fun was seeing how players from lots of leagues performed elsewhere, particularly the J League and Belgian league after writing about Brighton. So, I went for breadth. I used data from these leagues (names as they appear in FiveThirtyEight's data; I'll probably refer to some of them differently at points):
Spanish Primera Division
Barclays Premier League
German Bundesliga
Italy Serie A
French Ligue 1
Dutch Eredivisie
Portuguese Liga
Turkish Turkcell Super Lig
English League Championship
Belgian Jupiler League
Swiss Raiffeisen Super League
Argentina Primera Division
Austrian T-Mobile Bundesliga
Spanish Segunda Division
Danish SAS-Ligaen
Scottish Premiership
Japanese J League
German 2. Bundesliga
Swedish Allsvenskan
French Ligue 2
Italy Serie B
Norwegian Tippeligaen
It meant the only data available was goals, shots, shots on target and FiveThirtyEight’s SPI ratings. There was no xG or other more detailed data. But there was something nice about the minimalism of this, like the old-school Poisson prediction models. It also meant there was more data to work with, which was helpful considering the low number of moves between leagues.
What started as ~77k rows became ~13k after getting rid of those who didn’t have enough minutes across multiple seasons. The data isn’t the cleanest. I’ve likely lost some players through error or lazy joining.
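The cleaning described above is roughly: keep player-seasons above a minutes threshold, then pair each season with the player's previous one. A minimal sketch of that shape (the field names, toy rows and the 900-minute cutoff are my illustration, not the actual pipeline or threshold):

```python
# Toy player-season rows; real data would come from FiveThirtyEight/FBref.
rows = [
    {"player": "A", "season": 2020, "minutes": 2500, "npg": 10},
    {"player": "A", "season": 2021, "minutes": 2200, "npg": 8},
    {"player": "B", "season": 2020, "minutes": 400, "npg": 1},   # too few minutes
    {"player": "C", "season": 2021, "minutes": 1900, "npg": 5},  # no prior season
]

MIN_MINUTES = 900  # illustrative cutoff, not the post's actual value

# Keep only qualifying seasons, indexed by (player, season)
by_key = {(r["player"], r["season"]): r for r in rows if r["minutes"] >= MIN_MINUTES}

# Pair each season with the same player's previous qualifying season
pairs = [
    (prev, cur)
    for (player, season), cur in by_key.items()
    if (prev := by_key.get((player, season - 1))) is not None
]
```

This kind of self-join is also where players silently disappear: one misspelt name or a season key mismatch and the pair never forms, which matches the "error or lazy joining" losses mentioned above.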
How to model it?
Writing the idea for this kind of model was easier than building it. A lot of subtleties started to stand out. You have to consider a player's performance, their team's performance and strength, and their prospective team's performance and strength. Which, again, doesn't sound that bad. But thinking about how best to model it led to a lot of trial and error and even more strange results, like Dominic Solanke, Adam Armstrong and Aleksandar Mitrović sharing the Premier League golden boot.
The three main approaches I tried were:
Predict the total number of goals if a player played 1800 minutes. Then use per 90 values to compare to reality
Predict with a classification algorithm, then use the underlying probability figures to predict an ExpG value
Predict the goals per 90
None were particularly convincing. I liked the idea of using a classification algorithm with the underlying probabilities because the original classification tended to predict higher values. The linear model tended to be lower than reality (max value per 90 was 0.84), while the classification algorithm predicted players to score ~20+ goals across a season (this was before doing the 1800-minute idea). However, it added another step of computation, and after going through the extra step, the values were even lower than the linear model.
With it being quicker to fit and predicting continuous values, the good ol’ linear regression seemed best, as did predicting per 90 values. I don’t know if it is the best. I’m not even sure how best to test such a widescale model, especially because it shouldn’t be that accurate. If it accounts for every outlier or surprise, it’s likely a bad model. But I wouldn’t go so far as to say this is a good model.
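The settled-on approach can be sketched as an ordinary least squares fit of next-season npG per 90 on the previous season's value. This is a single-feature toy version with made-up numbers; the real model would also carry team and league strength features:

```python
def fit_ols(xs, ys):
    """Closed-form simple linear regression: y ≈ a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = my - b * mx
    return a, b

# Previous-season npG per 90 -> next-season npG per 90 (made-up values)
prev_season = [0.10, 0.30, 0.50, 0.70]
next_season = [0.08, 0.22, 0.35, 0.50]

a, b = fit_ols(prev_season, next_season)

def predict(x):
    return a + b * x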
My main (okay, only) way of assessing the entire performance was to see the R2 score (not sure how to superscript on here). And look through lists of players, but that isn’t a good method. You see that the model predicted a couple of values close to reality and feel like you’re on the cusp of something great. Then you see the model also predicted a player from a Championship side score more for an ~average Premier League side, or Erling Haaland’s numbers look human, and feel it’s all been a waste of time. I did later use RMSE to assess some results, but that wasn’t until I opted for the linear model.
It’s also difficult because the data isn’t too predictable. Across the entire dataset (after cleaning), the R2 score for a player’s npG per 90 from the previous season to the next was 0.371. The R2 score for a player’s predicted npG per 90 and actual npG per 90 was 0.613. It makes it seem like there’s more correlation between prediction and reality than reality and reality.
I did cheat a bit. I used the SPI values from the current season in predictions. It doesn’t seem a massive crime since it’s hard to imagine the SPI value from the close of the previous season would differ ~that much from the opening or average of the next season. Using averages and the current season was more convenient to program. It also makes it easier to project prospective moves using the current season’s data.
Just get to the results already…
Since I said I’m not great at assessing the model, I’m also not great at reviewing it or writing about it. It seems best to cherry-pick some elements and hope they’re interesting.
So, across all the seasons, who did it predict would have the most productive season in front of goal? These are the top ten predicted npG per 90 values. npG per 90 (t-1) is the player's value from the season before and offers some context to the prediction. (Sorry for the horrible formatting; I'm not sure how best to make readable tables on here.)
(EDIT: I've just seen that these tables look horrible on mobile. I'll try and fix it soon. Requesting desktop site made it better for me but that isn't ideal.)
# - Player - Club - Season - (p)npG per 90 - npG per 90 - npG per 90 (t-1)
1. Kylian Mbappé - PSG - 2020/21 - 0.84 - 0.80 - 1.07
2. Lionel Messi - Barcelona - 2019/20 - 0.77 - 0.63 - 1.06
3. Lionel Messi - Barcelona - 2018/19 - 0.74 - 1.06 - 0.96
4. Luis Muriel - Atalanta - 2021/22 - 0.74 - 0.41 - 1.25
5. Robert Lewandowski - Bayern Munich - 2021/22 - 0.73 - 0.92 - 1.21
6. Cristiano Ronaldo - Juventus - 2018/19 - 0.70 - 0.54 - 0.91
7. Kylian Mbappé - PSG - 2019/20 - 0.69 - 1.07 - 1.23
8. Robert Lewandowski - Barcelona - 2022/23 - 0.68 - 0.73 - 0.92
9. Kylian Mbappé - PSG - 2021/22 - 0.68 - 0.71 - 0.80
10. Robert Lewandowski - Bayern Munich - 2018/19 - 0.66 - 0.58 - 0.95
It feels acceptable, but you can see how low the predicted values are compared to some of the previous season values. Even more annoying, the low values stop you from deluding yourself into believing things like the model predicted Lewandowski wouldn't be as productive at Barcelona or Ronaldo wouldn't be as productive at Juventus.
Premier League Imports
Seeing how the model predicts player performance across leagues was more of a priority than predicting season to season, so I thought it'd be worth checking how well the model predicts player performance in the Premier League. There are players I was hoping to see included but aren't because of missing data (which I may have to check in the future), but the top five leagues with the lowest RMSE and >= 10 moves are:
Portuguese Liga (13 moves — 0.08 RMSE)
English League Championship (158 moves — 0.09 RMSE)
Italy Serie A (30 moves — 0.09 RMSE)
French Ligue 1 (41 moves — 0.10 RMSE)
German Bundesliga (32 moves — 0.15 RMSE)
The Championship includes promoted players, so it can likely get a good score by predicting close to zero for every goalkeeper and defender. In general, from looking at other leagues, those with the fewest productive players performed better.
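The league-by-league figures come from grouping moves by source league and computing RMSE per group, roughly like this (toy moves and values, not the real data):

```python
import math
from collections import defaultdict

# (source league, predicted npG per 90, actual npG per 90) — illustrative
moves = [
    ("English League Championship", 0.05, 0.00),
    ("English League Championship", 0.30, 0.25),
    ("Italy Serie A", 0.40, 0.55),
    ("Italy Serie A", 0.10, 0.05),
]

# Collect squared errors per source league
sq_errors = defaultdict(list)
for league, predicted, actual in moves:
    sq_errors[league].append((predicted - actual) ** 2)

# RMSE = sqrt(mean squared error) per group
rmse = {lg: math.sqrt(sum(e) / len(e)) for lg, e in sq_errors.items()}
```

You can see the low-scorer effect directly here: a league whose moves are mostly near-zero scorers piles up tiny squared errors, which keeps its RMSE small regardless of how good the model really is.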
The Portuguese league coming out on top was interesting. Looking through the moves, the model handled them well. If a January signing played more for their previous club than their new club, they’re included twice (I took the data with the most minutes if there was more than one record for a season), which applies to Bruno Fernandes and Luis Díaz.
Darwin Núñez, Díaz and Fernandes are the main reason the model does well with the Portuguese league, but there are some caveats. The model predicted Núñez would manage 0.48 npG per 90, and he did. But he also underperformed his xG. Had Núñez performed in line with his xG, this prediction wouldn’t have looked so good.
Fernandes has two records in the data. His 2019/20 Manchester United prediction is based on his 2018/19 Sporting data, while his 2020/21 Manchester United is based on his 2019/20 Sporting data — rather than his 2019/20 United data. When he joined in January, the model predicted he’d fall from his 2018/19 rate of 0.43 per 90 to 0.33 per 90. In reality, he managed 0.30 per 90. The next season the model predicted 0.33 per 90 again, but his real production fell to 0.26 per 90.
Díaz faced a similar problem following the January move. The model predicted 0.28 npG per 90 for his first half-season, where he managed 0.38 per 90. But across his first season, the model predicted 0.42 per 90 (following 0.84 per 90 in the Portuguese league), and he managed 0.36 per 90.
This is a tiny sample, and it’s mostly luck the predictions have been in a similar ballpark, but it still feels encouraging. The model also predicted Matheus Nunes wouldn’t be as productive for Wolves, going from 0.11 per 90 → 0.05 per 90. In reality, he managed 0.04 per 90.
Pedro Porro is the only player who performed much better than the prediction, contributing 0.24 npG per 90 when he was only predicted 0.12 per 90.
Moving on to the Championship, despite it having the 2nd best RMSE, there were some odd values. The model predicted Callum Robinson would have a better goalscoring rate for a newly promoted West Brom side than he did in the Championship. It also predicted Saïd Benrahma would have a higher goalscoring rate for West Ham than Brentford. Below are a couple of highlights:
Overall, it’s not bad. Benrahma and Armstrong are two standouts for players who didn’t come close to expectation. Armstrong was coming off the back of a great goalscoring season, so his numbers being inflated isn’t too surprising. From what I’ve seen, he’s the main player to move from the Championship after a great season and not deliver on what was expected. Comparable examples are Aleksandar Mitrović for Fulham’s first promotion (0.55 t-1, predicted 0.34, actual 0.13) and Dominic Solanke for Bournmouth’s recent promotion (0.55 t-1, predicted 0.31, actual 0.19).
A lot of the big performers performed in line with expectation. Below are the top 20 for their previous season's output in the Championship (Season is the year the season started).
Player Season Actual Predicted t-1
Mitrović 2022 0.45 0.41 0.87
Billy Sharp 2019 0.26 0.32 0.66
Adam Armstrong 2021 0.13 0.40 0.60
Tammy Abraham 2019 0.61 0.36 0.57
Jarrod Bowen 2020 0.28 0.36 0.55
Mitrović 2020 0.13 0.34 0.55
Teemu Pukki 2021 0.22 0.29 0.55
Dominic Solanke 2022 0.19 0.31 0.55
Ollie Watkins 2020 0.35 0.30 0.54
Neal Maupay 2019 0.29 0.34 0.53
Che Adams 2019 0.33 0.30 0.51
Oliver McBurnie 2019 0.26 0.32 0.50
Kieran Dowell 2021 0.10 0.27 0.48
Jarrod Bowen 2019 0.10 0.26 0.41
Grady Diangana 2020 0.07 0.17 0.39
Emi Buendía 2021 0.19 0.21 0.38
Harry Wilson 2019 0.38 0.31 0.37
Harvey Barnes 2019 0.26 0.29 0.37
Patrick Bamford 2020 0.44 0.34 0.37
Jacob Murphy 2020 0.11 0.20 0.35
A lot seem close enough to their prediction to be content with. Ollie Watkins, Neal Maupay, Emi Buendía and Che Adams are four signings that moved from the Championship to the Premier League (rather than part of a promoted team) and performed largely in line with the model’s prediction. Given that Bowen was a January signing, falling prey to the problems mentioned earlier, he’s a bit of a mixed bag. His first half-season was underwhelming, but his first full season was closer to expectation. But, after scoring the winner in a European final, West Ham fans will likely not care about an underwhelming first half-season.
There is a concern, since the model tends to be lower than reality, that moving to a tougher league is more favourable for results, explaining why the Championship → Premier League move scores well. But the RMSE was the same for Premier League → Championship too. This move may perform better due to the sample size. Since so many players play in both due to promotions and relegations, the model has more data points between the two leagues. More of the best RMSE scores come from a first division and second division, rather than international moves.
It makes it more interesting that Brighton have made quite a few of their more expensive signings from the Championship. Initially I thought it was because the Championship may compare to some top-division leagues in quality while offering better value, but it may also be because their internal models have more confidence projecting player performance given the higher number of data points, perhaps limiting risk when making a first-team signing.
The Italian Serie A having such a good score with over double the moves of the Portuguese league was surprising. Looking at the figures, it may have been because only 5 of those moves were players with > 0.3 npG per 90 in their final Serie A season. Since the model tends to be lower than reality and there’s likely to be less error predicting values < ~0.1, it seems likely that the lower scoring a sample is for npG, the better the RMSE.
Despite that, the Serie A players with > 0.3 npG per 90 performed mostly in line with their Premier League prediction.
Player Squad Season Actual Predicted t-1
Cristiano Ronaldo Manchester Utd 2021 0.55 0.54 0.74
Gianluca Scamacca West Ham 2022 0.29 0.39 0.63
Romelu Lukaku Chelsea 2021 0.40 0.38 0.56
Gonzalo Higuaín Chelsea 2018 0.41 0.44 0.48
Felipe Anderson West Ham 2018 0.27 0.16 0.31
Again, it may benefit from going to a perceived tougher league, plus having some older players in there, which will likely predict a lower figure. But it’s almost scary how accurate the figures are for Ronaldo, Lukaku and Higuaín. While I don’t think this model is accurate enough for someone to use, if you saw those figures before signing Lukaku or Ronaldo, would you still have invested all the money? A player can have more to their game than goals, but you’d imagine there was better value out there for each side.
Felipe Anderson overperformed the model's prediction, but Scamacca is the more interesting case. The model predicted a decline from his Serie A figures, and he still didn't meet it. It's the kind of situation where a confidence interval would be useful. If you're West Ham, investing €36m in a new 'ready to explode' striker, how confident would you have to be that he'd produce x npG per 90 to deem it worthwhile? A range of figures with probabilities for projected goals would likely be more useful than a point estimate: you could then say he's worth x amount if there's a > y% chance he provides > z goals.
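One cheap way to get that kind of range is to treat the point prediction as the centre of a normal distribution, reusing the move's RMSE (~0.09 for Serie A → Premier League) as its spread, then ask how likely the player is to clear a target rate. Both the normality assumption and using RMSE as sigma are simplifications I'm introducing here, not anything from the model:

```python
import math

def prob_at_least(point_pred, target, sigma):
    """P(actual >= target) under a Normal(point_pred, sigma) assumption."""
    z = (target - point_pred) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))  # survival function of the normal

# Scamacca-style question: predicted 0.39 npG per 90 — how likely is >= 0.30?
p = prob_at_least(0.39, 0.30, 0.09)  # ~0.84
```

A club could then frame the fee decision as "pay x if prob_at_least(prediction, z, sigma) > y", rather than trusting a single number.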
Tangent about Lukaku
Remember way back when I said:
While I don’t think this model is accurate enough for someone to use, if you saw those figures before signing Lukaku or Ronaldo, would you still have invested all the money? A player can have more to their game than goals, but you’d imagine there was better value out there for each side.
Well, I thought it’d be fun to test it. I used the model to project everyone’s npG per 90 if they played for Chelsea in the 2021/22 season. And to be fair to Chelsea, there wasn’t much choice out there. It’s tough to do this in hindsight, having two more years of football since, but not many names jump off the page as obvious alternatives — particularly if we include the fact Chelsea likely wanted Lukaku due to his link-up play/ability to play into his feet.
To try and minimise the names, I only looked at those from the top five leagues, plus the Championship and Portuguese league (due to the impressive RMSE figures). 48 players had higher projected figures than Lukaku, but not all are realistic (Adam Armstrong will haunt my nightmares).
Two players on the list were Olivier Giroud and Tammy Abraham. Maybe Chelsea would have been better off keeping those two around than splashing the cash on Lukaku? They both had better projected figures and weirdly performed in line with those projections even at their new clubs (Giroud was projected 0.46 for Chelsea and scored 0.43 for Milan. Abraham was projected 0.43 for Chelsea and scored 0.41 for Roma).
Lukaku’s stock always seems to rise during international tournaments, so signing him after EURO 2020(/1) may not have been ideal. Chelsea had just won the Champions League and likely wanted to invest to push them towards the title, but the fee seems inflated. Unless Chelsea were happy because Lukaku was their ideal striker and they had the money. Thinking about it, Lukaku almost feels like an amalgamation of Chelsea’s three central options at the time (Giroud, Abraham and Kai Havertz). The physicality of Giroud with the ability to play into his feet with his back to goal + the off-the-ball play/movement of Havertz + the ability to get into good goalscoring positions like Abraham. But was anyone else available?
The model projected Alexander Isak would score 0.47 npG per 90 for Chelsea (which is weirdly his exact npG per 90 for Newcastle this season). He had two productive seasons in front of goal for Sociedad and was only 21, but he didn't have the same strong figures for xAG or receiving progressive passes as Lukaku. Being a different profile and still young, it's likely he wasn't the right fit, and too high a risk, for Chelsea at the time.
If we’re stricter and say a player must have played >1800 minutes in the previous season since they would be a regular starter for Chelsea, there were 28 players with higher projections than Lukaku. There were still some surprising Premier League players: Dominic Calvert-Lewin, Christian Benteke, Michail Antonio and Danny Ings. These players all had strong figures at the time, so it shouldn’t be too surprising they’re projected to have good figures for a stronger side, but it does feel strange to imagine Chelsea unveiling Christian Benteke as their new starting forward. Dominic Calvert-Lewin is the youngest and was coming off the back of a good season with Everton, but Michael Antonio is probably my favourite name listed.
If Antonio was a couple of years younger or had a better record with injuries, he could have been a fun risk to take for the right fee — replacing Giroud as the older forward option while keeping Abraham. At his best, he has the ability to move into channels, receive the ball to feet and enhance the game of supporting players (like Jarrod Bowen at West Ham and what would have been Mason Mount, Timo Werner and Kai Havertz at Chelsea?). Looking at their comparison page for the 2020/21 season on FBref, they had some similar figures — not to mention Antonio did for a weaker side in a tougher league (by SPI ratings), and there’s a chance of his projections being more stable given Premier League → Premier League has more data points. Some of their key figures are below:
Metric Lukaku Antonio
npxG per 90 0.57 0.59
xAG per 90 0.22 0.14
npG per 90 0.56 0.46
A per 90 0.34 0.23
Prog passes per 90 2.56 1.64
Prog carries per 90 2.41 2.69
Rec prog passes per 90 9.22 8.08
I can’t check everyone’s data in the code (well, I could scrape more and join it since it’s all FBref, but I’m not going to), but from the few names I’ve looked at manually, Antonio probably profiles the closest. I’m not saying Chelsea should have signed Antonio instead of Lukaku, but Antonio’s contribution at West Ham can sometimes seem a bit underrated, perhaps because of his injury record.
Two other names with strong projected figures but perhaps a drawback or two that stop them from being ideal for Chelsea:
André Silva - 0.49 (p)npG per 90. I haven’t seen loads of Silva, but he’s always struck me as someone who is a good penalty box striker rather than an all-rounder. But his numbers are strong all around from 2020/21. Maybe too much risk involved given Chelsea’s history of Bundesliga attackers. If they added a fourth for a decent fee and they didn’t take off…
Gerard Moreno - 0.40 (p)npG per 90. Has had good figures for years, but has always seemed like someone who wouldn’t do well with the physicality of the Premier League. Seems like he’d be useful dropping into space, but less so with his back to goal and the ball to feet.
So, I have some sympathy for Chelsea. There wasn’t a wealth of options that fit the Lukaku profile. It might be boring, or perhaps not something that would have excited Chelsea fans at the time, but sticking with Giroud and Abraham and investing the money elsewhere is probably my favourite option — even if their previous season numbers did come from a small number of minutes.
J League Exports
After the success of Celtic’s J League signings, plus Kaoru Mitoma’s form in the Premier League, the J League could be growing as a potential shopping destination for European clubs.
The J League’s average SPI is ~35. It’s ~the same as the French Ligue 2, the German 2. Bundesliga and the Swedish Allsvenskan. Some European leagues with ~45 SPI are the Swedish Allsvenskan, the Swiss Raiffeisen Super League, the Turkish Turkcell Super Lig, the Dutch Eredivisie and the Austrian T-Mobile Bundesliga. These leagues have also had players move from the J League in the data. The sample sizes are tiny, and some players are duplicated or pull data from the season before their final one due to how I’ve formatted the data and the J League running in one calendar year, but there are some encouraging results.
J League → Belgium
Across the ten moves of J League players to the Belgian league, the RMSE was just 0.061. Only three of those players could be called productive in front of goal, which makes a good RMSE easier to achieve, but all three performed in line with expectation.
Ayase Ueda moved from Kashima Antlers to Cercle Brugge. In his final J League season in the data, he scored 0.67 npG per 90. The model predicted he’d score 0.49 per 90 in Belgium, but he managed 0.53 per 90.
Daichi Hayashi moved from Sagan Tosu to Sint-Truiden. In his final J League season in the data, he scored 0.52 npG per 90. The model predicted he’d score 0.39 per 90 in Belgium, but he scored 0.35 per 90.
Taichi Hara moved from FC Tokyo to Sint-Truiden. In his final J League season in the data, he scored 0.42 npG per 90. The model predicted he’d score 0.35 per 90 in Belgium, which wasn’t far off his 0.35 per 90.
It’s an interesting avenue to explore for the Belgian league. The average SPI is higher than the J League, but not massively. You may not expect players to match their J League performance, but you might not expect them to lag far behind. Considering Ayase Ueda cost €1.30m, contributing ~0.5 npG per 90 seems a bargain. Both Victor Osimhen and Leandro Trossard left the Belgian league for > €16m after having similar figures — although Osimhen was a few years younger and Trossard was a winger. But if he can continue after his promising start, Cercle Brugge have a good value for money buy.
J League → Scotland (Celtic)
Ange Postecoglou wasn’t all Celtic found in the J League. The new Spurs manager also brought some J League players to the Scottish Premiership. The players in the data all look like great signings, and, considering the similar SPI between the two leagues, it’s interesting no one tapped into the market before Celtic — unless Postecoglou’s knowledge/experience of the league helped reduce the risk involved.
I should point out the Scottish Premiership had the worst RMSE for J League exports, but mostly because Kyogo Furuhashi performed much better than projected. But I have a thought on this.
Since the league is similar and Celtic are a dominant side, I guess that the high-performers fell victim to the model underestimating the upper boundary of goalscorers. I wonder if it’d be better to create the model with only attacking players or only players above x goals per 90. There’s a worry that all the defensive players are warping or weighing down the model.
The model predicted Reo Hatate and Daizen Maeda would perform ~the same as they did in the J League. Considering the similar league SPI and Reo Hatate’s similar team strength (team SPI / league SPI), plus Maeda having a comparable team strength for the one data point, that shouldn’t be surprising. But Kyogo Furuhashi moved from a team with lower relative team strength and, despite scoring 0.72 npG per 90, was projected 0.58 per 90. He smashed this with 1.05 per 90. While you wouldn’t predict he’d score that many, considering other factors, if the model predicted around the same as his J League figures, you’d consider that fairer.
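For concreteness, the relative team strength mentioned here is just team SPI divided by league-average SPI. The SPI values below are illustrative, not FiveThirtyEight's actual figures:

```python
def relative_strength(team_spi, league_avg_spi):
    """How dominant a team is within its own league."""
    return team_spi / league_avg_spi

# A dominant side in a mid-level league can look far stronger relative to its
# competition than its raw SPI suggests — the Furuhashi situation in miniature:
dominant_side = relative_strength(60.0, 40.0)    # 1.5x its league average
mid_table_side = relative_strength(45.0, 40.0)   # ~1.13x
```

A player moving from a mid-table side to a dominant one keeps the league difficulty roughly constant but gains a much stronger supporting team, which is the kind of jump the point prediction seems to underweight.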
The TransferMarkt value of the three combined has ~tripled compared to what Celtic paid for the trio. They may be older than Celtic's previous big exports, but if they perform for Celtic, there's probably a higher chance of another European club taking notice and offering Celtic the chance to make a profit. It's not a bad model for Celtic to try and sign players from comparable quality, but perhaps more obscure, leagues, profiting and reinvesting — while still keeping the team top of the Scottish Premiership.
J League → Portugal, plus Kaoru Mitoma and Brighton
Not everyone has had joy signing players from the J League. The Portuguese league had the 2nd worst RMSE after the Scottish Premiership, but not because players exceeded their projected figures. However, the model predicted some players to perform better in Portugal than in Japan, which needs investigating. The more productive players in front of goal were ~half as productive in Portugal, but if teams know this in advance, there may still be an opportunity to find value.
Finally, we can’t talk about J League exports without mentioning Kaoru Mitoma. Due to how I’ve formatted the data, it predicts Mitoma straight from the J League → Brighton, rather than including his loan spell in Belgium (since they’re both classed as 2021 season and his minutes in Japan > his minutes in Belgium). I’d be curious to know how the projections change if the model considered his figures in Belgium. However, even using just his final J League season, the model projected 0.34 npG per 90 in the Premier League, not too far from his actual 0.27 npG per 90. It makes for an interesting comparison.
In my piece about Brighton, I said it was interesting Undav and Mitoma stayed at the club following a season on loan, considering the contract situations of Maupay and Trossard. Going into the season, the model predicted 0.47 npG per 90 for Undav after his season in Belgium and 0.34 npG per 90 for Mitoma after his final J League season. Maupay managed 0.28 npG per 90 in his previous season, while Trossard managed 0.22 per 90. Trossard outperformed his projection (0.48 vs 0.24 per 90), while Maupay left for Everton. The model even predicted Welbeck would score 0.33 npG per 90, a small decline from last season’s 0.37 per 90 but slightly more than his actual 0.29 per 90.
From here, you can almost see Brighton’s thinking and situation. Welbeck’s actual and predicted performances are better than Maupay’s, while Undav’s predicted figures are great. He may not have played many minutes, and with more game time, I imagine they might be lower, but it seems worth the chance of letting Maupay go and reinvesting while keeping Mitoma as a Trossard understudy, ready to take over should Trossard go.
As a bit of fun, I projected how players would perform for Brighton next season to see what the model thinks of João Pedro. It projected 0.24 npG per 90, which isn’t great but is ~the same as what Maupay averaged for Brighton. Brighton will likely hope the increased production under Roberto de Zerbi + his age will push that a bit higher. Most options that were projected to be more productive were also unrealistic or contained more risk (due to the league they’re from). It makes you think a ~75% confidence of scoring 0.24 npG per 90 might be better than a ~40-50% confidence of scoring 0.30 npG per 90.
Conclusion
Cherry-picking examples, as I have, it’s easy to focus on the projections that were good or presented an opportunity to build a narrative. There were plenty of weird projections, but there was often a reason behind the weirdness. If a player has a great season for an average side, the model will likely project them to have a great season, especially for a stronger side.
There’s also the problem that the model is somewhat conservative with its predictions. It’s rare it will predict a significant drop-off for a player. Most changes seem small and tethered to their previous season’s production. Considering how little relationship there was between a player’s goalscoring from one season to the next, there’s an argument that the model expects players to be more consistent than they are in reality.
Some quick thoughts I’ve had on the model, with an eye to improving and expanding on it (but not including using different technologies/modelling techniques):
Introduce a threshold for goals. It benefits the performance measures to have so many low-scoring players (predicting a player who rarely scores to rarely score = low RMSE). Since we’re only looking at goals, players who have no hand in goalscoring likely aren’t helpful to predictions. In the graph for the Championship → Premier League players, you can see a glut of players < 0.1 on each axis. It’s hard to imagine these points are informative for the model.
Consider position. It was only late in messing around with the model that I realised I had FBref’s position column. It could be useful to include this when predicting goals — and ties in with the above point.
Consider how best to use the data. Going from ~77k rows to ~13k (when using two seasons' data) or ~25k (when using one season) seems a steep drop. Considering the difference in when seasons run and how that impacts the data, plus how the system duplicates January signings, it seems like there's a better way of handling the data than what I'm currently doing.
Expand to other metrics. This has been fun but it’d be interesting to predict more, although the problem is the availability of data for more detailed metrics.
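The first two ideas amount to a filter before fitting, something like this (the threshold, position labels and field names are all placeholders, not values from the post):

```python
# Toy player-season rows with FBref-style position groups
rows = [
    {"player": "GK", "pos": "GK", "npg_per90": 0.00},
    {"player": "CB", "pos": "DF", "npg_per90": 0.04},
    {"player": "AM", "pos": "MF", "npg_per90": 0.25},
    {"player": "ST", "pos": "FW", "npg_per90": 0.55},
]

THRESHOLD = 0.10          # minimum npG per 90 to keep (illustrative)
ATTACKING = {"MF", "FW"}  # position groups assumed to carry goal signal

# Drop player-seasons with no goalscoring signal before fitting,
# so near-zero scorers stop dominating the loss and flattering the RMSE
train = [
    r for r in rows
    if r["npg_per90"] >= THRESHOLD and r["pos"] in ATTACKING
]
```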
Overall, it’s been fun. While it’s simple and basic, hacking something like this together using publicly available data and gaining some insight or narratives makes it seem like the fun and insight would only grow with more (or more detailed) data.
(Footnote: I used a different model for this section. To increase the sample size, I only used the previous season's data rather than the previous two, which doubled the number of data points.)