Belgium will win the World Cup in Qatar according to a predictive model

Intro

It is no longer news that I am passionate about both football and artificial intelligence. With the World Cup coming up, it was obvious that I was going to put the machine to work to see which team has the best chance of being the next winner.

Although I had already carried out an experiment for the qualifiers, this case called for a different approach. For the qualifiers, I treated the problem as a simulation, sampling from the matches already played between the two teams involved (e.g. results between Uruguay and Paraguay in Asunción). For the World Cup it was not so simple, because there is much less history of matches between the same teams.

Data modeling

For this reason I went for a more traditional machine learning approach: find a way to encode each national team and the match itself as an embedding, and then train on the history of all international football matches played since 1872.

To turn each team into a vector representation, I added FIFA ranking data to this match history. To transform a team into a list of numbers at a given date I used:

  • the points they had at that date,
  • the position,
  • whether they had improved or worsened with respect to the previous measurement,
  • the average, maximum, minimum and standard deviation of points:
    • last year,
    • of the last two years and
    • of the last three years,
  • how they performed in the last 5 games (won, lost, drawn),
  • the confederation,
  • whether they were the reigning world champion at that time (because of the champion’s curse).
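
As a rough illustration of the encoding listed above, the sketch below turns a team's ranking history into a flat list of numbers. The `RankingSnapshot` record, the `team_features` function and the field order are hypothetical, since the original code is not shown here, and the confederation (which would typically be one-hot encoded) is left out to keep it short.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from statistics import mean, pstdev


@dataclass
class RankingSnapshot:
    """One FIFA ranking entry for a team (hypothetical record layout)."""
    when: date
    points: float
    position: int


def window_stats(history, as_of, years):
    """Mean, max, min and standard deviation of points over the last `years` years."""
    start = as_of - timedelta(days=365 * years)
    pts = [s.points for s in history if start <= s.when <= as_of]
    if not pts:
        return [0.0, 0.0, 0.0, 0.0]
    return [mean(pts), max(pts), min(pts), pstdev(pts)]


def team_features(history, as_of, last5, is_champion):
    """Encode a team at date `as_of` as a flat list of numbers.

    history: list of RankingSnapshot, in any order
    last5: results of the last 5 games as "W", "L" or "D"
    """
    past = sorted((s for s in history if s.when <= as_of), key=lambda s: s.when)
    current = past[-1]
    previous = past[-2] if len(past) > 1 else current
    # Positive when the team climbed in the ranking since the previous measurement.
    trend = previous.position - current.position
    features = [current.points, float(current.position), float(trend)]
    for years in (1, 2, 3):
        features += window_stats(past, as_of, years)
    features += [float(last5.count(r)) for r in ("W", "L", "D")]
    features.append(1.0 if is_champion else 0.0)
    return features
```

A match can then be represented by concatenating the vectors of the two teams, which is one simple way to feed both to a classifier.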

The advantage of transforming the teams in this way is that the identity of the competing teams no longer matters: for the algorithm they are just two vectors.

The problem could have been addressed in different ways. For my purposes it was not necessary to predict the exact goals of each team, so I approached it as a binary classification problem where the algorithm tries to predict whether the home team will win. I then ask the algorithm for the probability of that happening, and if it falls within a window around 50% I call the match a draw.

A curiosity is that the algorithm learned that the home team has an advantage over the visitor. For example, when asked for the probability that Bolivia, playing at home, beats Uruguay, it gives (say) 75%; however, if we ask for the probability that Uruguay, playing at home, beats Bolivia, it gives 95%.

As you can see, these probabilities don’t add up to 1, which seems pretty crazy, but once you think about it, it makes sense: each one refers to a different match, played at a different venue.

Since home advantage carries far less weight in the World Cup, I had to adapt the model so that it “averages and scales” the probabilities obtained in one direction and the other.
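
The decision rule can be sketched in a few lines. The width of the window is not given in the text, so `draw_margin` below is a made-up value, and `predict_outcome` is a hypothetical name.

```python
def predict_outcome(p_home_win, draw_margin=0.05):
    """Map the classifier's home-win probability to "home", "draw" or "away".

    draw_margin is a hypothetical half-width of the window around 50%;
    the value actually used is not stated in the article.
    """
    if abs(p_home_win - 0.5) <= draw_margin:
        return "draw"
    return "home" if p_home_win > 0.5 else "away"
```

A wider margin produces more predicted draws, so it is a parameter worth tuning against past tournaments.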
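
The text does not spell out the formula behind this adjustment. One plausible reading, shown here purely as an assumption, is to take the two directional predictions and rescale them so that they sum to 1; the function name `neutral_probabilities` is hypothetical, and draws are ignored in this simplification.

```python
def neutral_probabilities(p_a_home, p_b_home):
    """Combine the two directional predictions into neutral-venue odds.

    p_a_home: model's probability that A wins when A is listed as the home team
    p_b_home: model's probability that B wins when B is listed as the home team
    Returns (P(A wins), P(B wins)), rescaled so the two values sum to 1.
    """
    total = p_a_home + p_b_home
    if total == 0:
        # No information either way: fall back to a coin flip.
        return 0.5, 0.5
    return p_a_home / total, p_b_home / total
```

With the Bolivia and Uruguay figures quoted earlier (75% and 95%), this gives roughly 44% for Bolivia and 56% for Uruguay.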

Models and evaluation with Russia World Cup

I trained various models: LogisticRegression, KNeighborsClassifier, DecisionTreeClassifier, LinearSVC, GaussianNB, RandomForestClassifier, and GradientBoostingClassifier. The one that performed best was the last.

After the hyperparameter tuning came the acid test: evaluating the model on the World Cup in Russia to get a clear idea of how good it is at predicting World Cup matches. In this case, the model got 78.12% of the match outcomes right (50 games out of 64).

This result is quite good compared with a completely nonsensical model that, whatever the match, always predicts that the visitor will win: it would get 26 of 64 games right, that is, 40.63%.
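
The comparison boils down to an accuracy computation against a constant baseline. A minimal sketch, assuming outcomes are stored as the strings "home", "draw" and "away" (a hypothetical encoding):

```python
def accuracy(predicted, actual):
    """Fraction of matches whose predicted outcome equals the real one."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


def always_away(matches):
    """Baseline that ignores its input and predicts an away win every time."""
    return ["away"] * len(matches)
```

With 64 matches, 50 hits correspond to 50 / 64 = 0.78125 and 26 hits to 26 / 64 = 0.40625, which are the two percentages quoted above.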

In addition, we must add that in that World Cup there were quite unlikely results:

Predictions Qatar World Cup

Once trained, the model is capable of returning the probability of any match. For the group stage, the scenario is as follows: Uruguay has a 76% chance to beat Korea and a 79% chance to beat Ghana; and, conversely, Portugal has a 64% chance of beating Uruguay. Let’s see the odds of all the matches of the group stage of the World Cup:

In this way, the most likely outcome is that we qualify second in the group, winning two games and losing one, and that we then face Brazil (who win all three of their group games). Let’s see then what happens in the round of 16, quarter-finals, semi-finals and final:

As we can see, Brazil would beat us and we would be left out of the World Cup. Brazil’s probability of victory over Uruguay is 79%.

So as not to be so sad, and since I had a model capable of giving the probability of any possible match, I simulated playing the World Cup 10,000 times to see whether Uruguay had even a remote chance of becoming champion.

Simulations

After running the 10,000 simulations, Belgium came out champion 1,628 times. They are followed by Brazil, Argentina, France, England, Spain, the Netherlands, Portugal, Denmark, Germany, Mexico, the USA and Uruguay. Uruguay won 88 of the 10,000 simulations, which is less than 1%. It is hard but not impossible.

Of the 32 teams that will play the tournament, Uruguay ranks thirteenth in chances of winning it.
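
A Monte Carlo run of this kind can be sketched as follows. This is a simplified illustration that covers only a single-elimination bracket (no group stage); `win_prob` is a hypothetical stand-in for the trained model, and the strengths in the toy example are made up.

```python
import random
from collections import Counter


def play_knockout(teams, win_prob, rng):
    """Play one single-elimination bracket and return the champion.

    teams: list whose length is a power of two, in bracket order
    win_prob(a, b): probability that a beats b (hypothetical stand-in for the model)
    """
    alive = list(teams)
    while len(alive) > 1:
        next_round = []
        for a, b in zip(alive[::2], alive[1::2]):
            next_round.append(a if rng.random() < win_prob(a, b) else b)
        alive = next_round
    return alive[0]


def simulate(teams, win_prob, n_runs=10_000, seed=0):
    """Count how often each team wins the bracket over n_runs simulations."""
    rng = random.Random(seed)
    return Counter(play_knockout(teams, win_prob, rng) for _ in range(n_runs))


# Toy example with made-up strengths, not the article's model.
strength = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}
titles = simulate(
    list(strength),
    lambda a, b: strength[a] / (strength[a] + strength[b]),
    n_runs=1000,
)
```

Dividing each count by the number of runs gives an estimate of each team's probability of winning the title.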

Analysis of the Uruguay Group

If we focus on the group stage, Uruguay could finish with any number of points between 0 and 9 except 8: with three games worth 0, 1 or 3 points each, it is simply impossible to finish with 8, hahahaha.
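
The claim about 8 points is easy to check by enumerating every combination of results over the three group games:

```python
from itertools import product

# Points awarded for a loss, a draw and a win.
POINTS = (0, 1, 3)

# Every combination of results over the three group matches.
totals = sorted({sum(combo) for combo in product(POINTS, repeat=3)})
print(totals)  # [0, 1, 2, 3, 4, 5, 6, 7, 9]
```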

We see that Uruguay will most likely finish with 6 points, something that is not surprising. What is quite motivating is that the second most likely thing is that we will end up with 9 points!

If we translate this into the position in which we finish the group, we see close to a 40% chance of finishing first, a little more than 45% of finishing second, close to 10% of finishing third and less than 5% of finishing fourth.

Adding the probabilities of finishing first or second, we see that the probability that Uruguay qualifies is more than 80%.

In other words, to sum up, the probability of finishing first in the group is 55.72% for Portugal, 39.05% for Uruguay, 4.51% for Korea and 0.72% for Ghana.

The best thing that can happen to us

As a fun fact, in one of the runs Uruguay reached the round of 16 with only 2 points! How is that possible?

If Portugal beats everyone (Uruguay by a narrow margin, and Korea and Ghana by a hatful of goals) and the rest of the games end in a draw, we could go through on goal difference with just two points!!!
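
To see how this scenario works out, here is a small sketch that builds the group table from a list of results. The scorelines are illustrative only, and the ranking uses just points and goal difference, which is a simplification of the real tie-breaking rules.

```python
from collections import defaultdict


def standings(results):
    """Rank teams by points, then by goal difference.

    results: list of (home, home_goals, away, away_goals) tuples
    """
    points = defaultdict(int)
    goal_diff = defaultdict(int)
    for home, hg, away, ag in results:
        goal_diff[home] += hg - ag
        goal_diff[away] += ag - hg
        if hg > ag:
            points[home] += 3
            points[away] += 0  # make sure the loser appears in the table
        elif hg < ag:
            points[away] += 3
            points[home] += 0
        else:
            points[home] += 1
            points[away] += 1
    table = sorted(points, key=lambda t: (points[t], goal_diff[t]), reverse=True)
    return [(t, points[t], goal_diff[t]) for t in table]


# Illustrative scorelines only: Portugal wins all three, the rest are draws.
scenario = [
    ("Portugal", 1, "Uruguay", 0),
    ("Portugal", 3, "Korea", 0),
    ("Portugal", 3, "Ghana", 0),
    ("Uruguay", 0, "Korea", 0),
    ("Uruguay", 0, "Ghana", 0),
    ("Korea", 0, "Ghana", 0),
]
for row in standings(scenario):
    print(row)
```

In this example the three teams behind Portugal all finish on 2 points, and Uruguay takes second place because its goal difference (-1) is better than that of Korea and Ghana (-3 each).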

Round of 16

Everything indicates that we are going to get through the group stage. Who would we run into?

Most likely we will come across Brazil (47%), followed by Switzerland (28%), Serbia (6%) and Cameroon (5%).

The most likely outcome is that we lose in the round of 16: a 64% chance!

But if for whatever reason we do not cross paths with Brazil, our chances of reaching the quarterfinals increase to 59.50%.

Analysis of Uruguay's path

So, knowing that Uruguay will most likely go out in the round of 16, what are the chances of their run ending at each stage?

As we had already seen, the most likely outcome is that we go out in the round of 16 (54.68%), followed by the quarterfinals (20.19%), the group stage (14.72%) and the semi-finals (6.99%); we finish runner-up in 2.54% of the simulations and champion in 0.88%.

If we reach the final (3.42%), who would be our opponent?
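
These figures are consistent with the ones quoted in the other sections, as a quick check shows. The dictionary keys are just labels chosen for this sketch; the percentages are the ones given above.

```python
# Stage at which Uruguay's run ends, in percent of simulations (from the article).
exit_stage = {
    "group": 14.72,
    "round_of_16": 54.68,
    "quarter_final": 20.19,
    "semi_final": 6.99,
    "runner_up": 2.54,
    "champion": 0.88,
}

total = sum(exit_stage.values())
pass_group = total - exit_stage["group"]  # chance of reaching the round of 16
reach_final = exit_stage["runner_up"] + exit_stage["champion"]
# Chance of losing in the round of 16, given that Uruguay got there.
lose_r16_given_reached = exit_stage["round_of_16"] / pass_group

print(round(total, 2), round(pass_group, 2), round(reach_final, 2))  # 100.0 85.28 3.42
print(round(100 * lose_r16_given_reached, 1))  # 64.1
```

So the chance of getting out of the group is a bit over 85%, the chance of reaching the final is 3.42%, and the 64% chance of losing in the round of 16 is conditional on having reached it.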

Fun facts

Easiest path

Of the 10,000 simulations, there was a route that was really easy for us! Switzerland, Japan, Senegal and USA

Hardest path

Of the 10,000 simulations, there was a route that was really difficult for us, but we ended up being crowned champion anyway! Brazil, Spain, Argentina, France.

Summary

Things look pretty tough for La Celeste, but hope is the last thing to be lost. Let’s go for that 0.88% chance of becoming champions!!!

Lottery

Inspired by this idea, we decided to prepare the launch of a special lottery in which data science professionals and students can participate by putting together their own automatic prediction model about the World Cup in Qatar and, once the tournament is over, the one who has been most accurate will win.

Héctor Cotelo
Data & Analytics Consultant