End-to-end data project demonstrating skills across Data Analytics, Data Engineering, and Data Science
This project builds a probabilistic system to estimate FIFA World Cup winners using historical international match data, time-aware feature engineering, interpretable machine learning, and Monte Carlo tournament simulations.
The goal is not just to predict a single winner, but to quantify uncertainty, simulate thousands of tournament outcomes, and present results through clean analytics and visualizations.
- ✅ End-to-end pipeline: raw data → features → model → simulation → insights
- ✅ Time-aware feature engineering (no data leakage)
- ✅ Interpretable model with coefficient analysis
- ✅ Monte Carlo tournament simulation (10,000+ runs)
Can we estimate the probability of each national team winning the FIFA World Cup using historical match performance while accounting for uncertainty and randomness inherent in tournament play?
Instead of producing a single deterministic prediction, this project:
- Estimates match-level win probabilities
- Propagates uncertainty through full tournament simulations
- Produces championship probability distributions
world-cup-outcome-modeling/
│
├── data/
│ ├── raw/ # Original Kaggle dataset
│ ├── processed/ # Cleaned data
│ └── features/ # Engineered features
│
├── scripts/ # Data engineering pipeline
│ ├── ingest.py
│ ├── feature_engineering.py
│ └── run_pipeline.py
│
├── notebooks/
│ ├── 01_eda.ipynb
│ ├── 02_modeling.ipynb
│ └── 03_simulation.ipynb
│
├── models/
│ ├── soccer_model.pkl
│ └── soccer_scaler.pkl
│
├── report/ (In Progress)
│ └── final_report.pdf
│
└── README.md
- Dataset: International Football Results (Kaggle / OpenFootball)
- Coverage: 1872 – 2024
- Filtering: Matches from 2010 onward to reflect modern football
- Match date
- Home team / Away team
- Scores
- Tournament type
- Neutral venue flag
This project is designed so beginners can follow the full data flow.
- Load raw CSV data
- Parse dates
- Filter invalid or missing records
- Convert matches into long format (one row per team per match)
- Standardize columns: `team`, `opponent`, `team_score`, `win`
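The ingestion steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of `scripts/ingest.py`: the input column names (`date`, `home_team`, `away_team`, `home_score`, `away_score`), the extra `opponent_score` column, and the toy rows are all assumptions made for the example.

```python
import pandas as pd

def to_long_format(matches: pd.DataFrame) -> pd.DataFrame:
    """Turn a wide home/away table into one row per team per match."""
    # View each match from the home side...
    home = matches.rename(columns={
        "home_team": "team", "away_team": "opponent",
        "home_score": "team_score", "away_score": "opponent_score",
    })
    # ...and again from the away side.
    away = matches.rename(columns={
        "away_team": "team", "home_team": "opponent",
        "away_score": "team_score", "home_score": "opponent_score",
    })
    long = pd.concat([home, away], ignore_index=True)
    # Binary target: a draw counts as "not a win" for both sides.
    long["win"] = (long["team_score"] > long["opponent_score"]).astype(int)
    return long.sort_values(["team", "date"]).reset_index(drop=True)

# Toy rows (made up); the last one has a bad date and a missing score.
raw = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-05", "not-a-date"],
    "home_team": ["A", "C", "B"],
    "away_team": ["B", "A", "C"],
    "home_score": [2, 0, 1],
    "away_score": [1, 0, None],
})
raw["date"] = pd.to_datetime(raw["date"], errors="coerce")  # unparseable dates become NaT
clean = raw.dropna(subset=["date", "home_score", "away_score"])
long_df = to_long_format(clean)
print(long_df[["date", "team", "opponent", "team_score", "win"]])
```

Each valid match yields two rows, so later per-team rolling statistics can be computed with a single `groupby("team")`.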
Rolling features are computed per team, using only past matches.
Key features:
- avg_goals_last_5
- avg_goals_conceded_last_5
- win_rate_last_5
- avg_goal_diff_last_5
- days_since_last_match
📌 Design Decision: All rolling features are shifted by 1 match to prevent data leakage.
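One way to implement the shift-by-one rule is sketched below. It assumes a long-format frame with `team`, `date`, `team_score`, `opponent_score`, and `win` columns (`opponent_score` is an assumption; it is not in the standardized column list), and it may differ from what `scripts/feature_engineering.py` actually does.

```python
import pandas as pd

WINDOW = 5  # matches the "_last_5" suffix in the feature names

def add_rolling_features(long_df: pd.DataFrame) -> pd.DataFrame:
    """Add per-team rolling features that only look at earlier matches."""
    df = long_df.sort_values(["team", "date"]).reset_index(drop=True)
    df["goal_diff"] = df["team_score"] - df["opponent_score"]

    def past_mean(col: str) -> pd.Series:
        # shift(1) removes the current match from its own window,
        # so a row never sees its own result.
        return df.groupby("team")[col].transform(
            lambda s: s.shift(1).rolling(WINDOW, min_periods=1).mean()
        )

    df["avg_goals_last_5"] = past_mean("team_score")
    df["avg_goals_conceded_last_5"] = past_mean("opponent_score")
    df["win_rate_last_5"] = past_mean("win")
    df["avg_goal_diff_last_5"] = past_mean("goal_diff")
    # Rest days: gap to the same team's previous match.
    df["days_since_last_match"] = df.groupby("team")["date"].diff().dt.days
    return df

# Toy history for a single hypothetical team.
toy = pd.DataFrame({
    "team": ["A", "A", "A"],
    "date": pd.to_datetime(["2020-01-01", "2020-01-11", "2020-01-21"]),
    "team_score": [1, 2, 3],
    "opponent_score": [0, 2, 1],
    "win": [1, 0, 1],
})
feats = add_rolling_features(toy)
print(feats[["date", "avg_goals_last_5", "win_rate_last_5", "days_since_last_match"]])
```

A team's first match has no history, so its rolling features are `NaN`; such rows need to be dropped or imputed before modeling.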
Why logistic regression?
- Interpretable coefficients
- Probabilistic output
- Stable and explainable
- Common in quantitative modeling
- Binary match outcome (win / loss)
Instead of raw features, we model feature differences between two teams:
X = team_A_features − team_B_features
This allows the model to learn relative strength, not absolute values.
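Written out, a logistic regression on feature differences takes the form below. The symbols are introduced here for illustration: $x_A$ and $x_B$ are the two teams' feature vectors, $w$ the coefficient vector, and $b$ the intercept.

$$
P(\text{A beats B}) = \sigma\!\left(w^\top (x_A - x_B) + b\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

Swapping the two teams flips the sign of the difference. Because $\sigma(-z) = 1 - \sigma(z)$, the two probabilities sum to exactly 1 only when $b = 0$; whether the project fits an intercept is not stated here.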
- ROC-AUC used for probability quality
- Coefficient inspection for interpretability
- Sanity checks on feature signs
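The evaluation steps above might look like the following sketch. It uses synthetic data and made-up "true" weights purely for illustration; the chronological 80/20 split and the use of `StandardScaler` are assumptions, not a description of `02_modeling.ipynb`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

feature_names = ["avg_goal_diff_last_5", "avg_goals_last_5", "win_rate_last_5"]

# Synthetic feature differences (team A minus team B), ordered by match date.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, 0.5, 0.2])  # made-up weights used only to generate labels
y = (rng.random(1000) < 1.0 / (1.0 + np.exp(-(X @ true_w)))).astype(int)

# Time-aware split: fit on earlier matches, evaluate on later ones.
split = 800
scaler = StandardScaler().fit(X[:split])
model = LogisticRegression().fit(scaler.transform(X[:split]), y[:split])

# ROC-AUC on held-out matches measures how well probabilities rank outcomes.
probs = model.predict_proba(scaler.transform(X[split:]))[:, 1]
auc = roc_auc_score(y[split:], probs)
print(f"held-out ROC-AUC: {auc:.3f}")

# Sign check: better recent form should raise the win probability.
coefs = dict(zip(feature_names, model.coef_[0]))
for name, value in coefs.items():
    print(f"{name:>22}: {value:+.3f}")
```

Fitting the scaler on the training slice only keeps test-period statistics out of the model, in the same spirit as the shifted rolling features.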
| Feature | Coefficient |
|---|---|
| avg_goal_diff_last_5 | 0.143740 |
| avg_goals_last_5 | 0.093878 |
| win_rate_last_5 | 0.028106 |
📌 Insight: Goal differential is a stronger signal than raw win rate.
Single predictions hide uncertainty. Simulation allows us to:
- Estimate full probability distributions
- Measure variance and stability
- Analyze upsets
- Sample match outcomes using model probabilities
- Simulate knockout rounds
- Repeat 10,000+ times
- Count tournament winners
- Champion probability per team
- Cumulative probability curves
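The simulation loop described above (sample each match, advance winners, repeat, count champions) can be sketched with the standard library alone. The bracket layout, the `strength` ratings, and `toy_win_prob` are hypothetical stand-ins for the fitted model's match probabilities, and the sketch assumes the number of teams is a power of two.

```python
import random
from collections import Counter

def simulate_knockout(teams, win_prob, rng):
    """Play one single-elimination bracket and return the champion."""
    alive = list(teams)
    while len(alive) > 1:
        next_round = []
        # Adjacent teams in the list meet each other in this round.
        for a, b in zip(alive[0::2], alive[1::2]):
            # Sample the match: a advances with probability win_prob(a, b).
            next_round.append(a if rng.random() < win_prob(a, b) else b)
        alive = next_round
    return alive[0]

def champion_probabilities(teams, win_prob, n_runs=10_000, seed=0):
    """Estimate each team's title probability by repeated simulation."""
    rng = random.Random(seed)
    counts = Counter(simulate_knockout(teams, win_prob, rng) for _ in range(n_runs))
    return {team: counts[team] / n_runs for team in teams}

# Hypothetical strength ratings standing in for model output.
strength = {"A": 2.0, "B": 1.0, "C": 1.0, "D": 0.5}

def toy_win_prob(a, b):
    return strength[a] / (strength[a] + strength[b])

probs = champion_probabilities(list(strength), toy_win_prob)
for team, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{team}: {p:.3f}")
```

Sorting these probabilities in descending order and taking a running sum gives the data for a cumulative probability curve.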
The key plots included in the project make these probabilistic results interpretable and actionable.
| Rank | Team | Probability | Win % |
|---|---|---|---|
| 1 | 🇪🇸 Spain | 0.1457 | 14.57% |
| 2 | England | 0.1203 | 12.03% |
| 3 | 🇧🇪 Belgium | 0.0874 | 8.74% |
| 4 | 🇮🇹 Italy | 0.0781 | 7.81% |
| 5 | 🇨🇴 Colombia | 0.0632 | 6.32% |
| 6 | 🇭🇷 Croatia | 0.0623 | 6.23% |
| 7 | 🇲🇦 Morocco | 0.0606 | 6.06% |
| 8 | 🇫🇷 France | 0.0602 | 6.02% |
| 9 | 🇦🇷 Argentina | 0.0585 | 5.85% |
| 10 | 🇳🇱 Netherlands | 0.0550 | 5.50% |
| 11 | 🇧🇷 Brazil | 0.0532 | 5.32% |
| 12 | 🇩🇪 Germany | 0.0434 | 4.34% |
| 13 | 🇺🇾 Uruguay | 0.0386 | 3.86% |
| 14 | 🇵🇹 Portugal | 0.0338 | 3.38% |
| 15 | 🇺🇸 United States | 0.0286 | 2.86% |
| 16 | 🇲🇽 Mexico | 0.0111 | 1.11% |
📌 Results represent model-implied odds, not guarantees.
- No player-level data
- No group-stage modeling
- Static team strength assumption
- Neutral venue treated uniformly
- Elo or Bayesian team strength updates
- Group-stage simulation
- Player-level metrics
- Live data ingestion via API
- Automated retraining pipeline
pip install -r requirements.txt
python scripts/run_pipeline.py
jupyter notebook notebooks/

📫 Questions or suggestions welcome!

