Data Sources & Libraries for Football Analytics
Overview
Football analytics data comes in two main forms: event data (discrete on-ball actions with coordinates) and tracking data (continuous positions of all 22 players and the ball). The availability, cost, and granularity of these data types shape what analyses are possible.
Event Data
What It Contains
Timestamped records of on-ball actions: passes, shots, tackles, fouls, dribbles, etc. Each event has x,y coordinates on the pitch, plus metadata (body part, outcome, assist type, etc.).
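
To make the shape of an event record concrete, here is a minimal sketch of how one might be modelled in Python. The class and field names (`Event`, `event_type`, `metadata`, and so on) are illustrative assumptions, not any provider's actual schema, and the three sample rows are made up.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Event:
    """One on-ball action. Field names are hypothetical, not a provider schema."""
    timestamp: float               # seconds since the start of the period
    period: int                    # 1 = first half, 2 = second half
    team: str
    player: str
    event_type: str                # e.g. "pass", "shot", "tackle"
    x: float                       # pitch coordinates in the provider's own units
    y: float
    outcome: Optional[str] = None
    metadata: dict = field(default_factory=dict)  # body part, assist type, ...


# Made-up sample rows for illustration only
events = [
    Event(12.4, 1, "Home", "Player A", "pass", 35.0, 40.0, "complete",
          {"body_part": "right_foot"}),
    Event(15.1, 1, "Home", "Player B", "shot", 102.0, 38.0, "saved",
          {"body_part": "head"}),
    Event(47.9, 1, "Away", "Player C", "tackle", 60.0, 22.0, "won"),
]


def events_of_type(events, event_type):
    """Return the events whose type matches, preserving order."""
    return [e for e in events if e.event_type == event_type]


shots = events_of_type(events, "shot")
```

Most event-data analyses start from this kind of filter (all shots, all passes by one team) before aggregating by player, zone, or match.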
Providers
StatsBomb
- Founded by Ted Knutson in 2017
- Widely regarded as offering some of the most detailed publicly available event data
- Unique features: freeze-frame data (positions of all visible players at the moment of each shot), detailed event taxonomy
- Open Data: free dataset covering select competitions (Men's & Women's World Cups, select La Liga and NWSL seasons, etc.) — GitHub
- Commercial data covers 100+ competitions worldwide
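
As a rough illustration of how shot freeze frames can be used, the sketch below walks a tiny in-memory sample shaped like the open-data event JSON. The keys used (`type`, `location`, `shot`, `freeze_frame`, `teammate`) reflect my understanding of that layout and should be checked against the published specification. The sample values are made up, and `opponents_closer_to_goal` is a hypothetical helper that assumes the shooting team attacks toward larger x.

```python
import json

# Made-up sample in the assumed open-data layout (not real match data)
SAMPLE_EVENTS = json.loads("""
[
  {"type": {"name": "Pass"}, "location": [60.0, 40.0]},
  {"type": {"name": "Shot"}, "location": [108.0, 36.0],
   "shot": {"outcome": {"name": "Saved"},
            "freeze_frame": [
              {"location": [118.5, 40.0], "teammate": false},
              {"location": [112.0, 37.0], "teammate": false},
              {"location": [104.0, 50.0], "teammate": true}
            ]}}
]
""")


def shots_with_freeze_frames(events):
    """Yield (shot_location, opponent_locations, teammate_locations) per shot."""
    for ev in events:
        if ev.get("type", {}).get("name") != "Shot":
            continue
        frame = ev.get("shot", {}).get("freeze_frame")
        if not frame:
            continue  # some shots may carry no freeze frame
        opponents = [p["location"] for p in frame if not p["teammate"]]
        teammates = [p["location"] for p in frame if p["teammate"]]
        yield ev["location"], opponents, teammates


def opponents_closer_to_goal(shot_location, opponents):
    """Count opponents with a larger x than the shooter (assumed attack direction)."""
    return sum(1 for x, _ in opponents if x > shot_location[0])


summary = [
    (loc, opponents_closer_to_goal(loc, opp), len(mates))
    for loc, opp, mates in shots_with_freeze_frames(SAMPLE_EVENTS)
]
```

Counts like this (defenders between the shooter and the goal, teammates nearby) are the kind of context that freeze frames add on top of the bare shot location.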
Opta (Stats Perform)
- One of the oldest football data companies, founded 1996
- Widely used by media (Premier League broadcasts, FBref, WhoScored)
- Comprehensive event coverage, though without the per-shot freeze frames that StatsBomb provides
- Commercial only (no free tier)
Wyscout
- Scouting-focused platform combining video and event data
- Used extensively by clubs for recruitment
- Pappalardo dataset: about 1,900 matches covering the 2017/18 season of 5 top European leagues, plus the 2018 World Cup and Euro 2016, released for academic research. Available on Figshare. Published by Pappalardo et al. in Scientific Data (2019), a Nature Portfolio journal.
- Commercial platform for broader access
InStat
- Popular in Eastern Europe, Scandinavia, and lower-tier leagues
- Combines video analysis with event data
- More affordable entry point than StatsBomb/Opta for smaller clubs
Free/Public Sources
| Source | Coverage | Access |
|---|---|---|
| StatsBomb Open Data | Select competitions | GitHub |
| Wyscout/Pappalardo | 5 leagues (2017/18), World Cup 2018, Euro 2016 | Figshare |
| FBref | Major leagues (Opta data since 2022; StatsBomb xG before that) | Web (fbref.com) |
| Understat | Top 5 European leagues plus the Russian Premier League | Web (understat.com) |
| Transfermarkt | Market values, squad info | Web |
Tracking Data
What It Contains
Continuous x,y (sometimes z) positions of all 22 players and the ball, typically sampled at 25 frames per second. Enables analysis of off-ball movement, pressing, space creation, speed, and pitch control.
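
As a small example of what per-frame positions make possible, the sketch below estimates a player's speed from consecutive frames. It assumes positions are already in metres and uses the 25 Hz rate quoted above. The function name and the sample track are hypothetical.

```python
import math

FRAME_RATE_HZ = 25  # sampling rate quoted above; individual providers may differ


def speeds_m_per_s(positions, frame_rate_hz=FRAME_RATE_HZ):
    """Frame-to-frame speed for one player.

    positions: list of (x, y) tuples in metres, one entry per frame.
    Returns one speed per pair of consecutive frames.
    """
    dt = 1.0 / frame_rate_hz
    return [
        math.hypot(x1 - x0, y1 - y0) / dt
        for (x0, y0), (x1, y1) in zip(positions, positions[1:])
    ]


# Made-up track: moves 0.2 m, then 0.3 m, then stands still
track = [(50.0, 34.0), (50.2, 34.0), (50.5, 34.0), (50.5, 34.0)]
speeds = speeds_m_per_s(track)  # roughly 5.0, 7.5 and 0.0 m/s
```

Raw frame-to-frame differences are noisy on real data, so speeds are usually smoothed (for example with a moving average) before being used for sprint or pressing metrics.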
Providers
Second Spectrum (Genius Sports)
- Optical tracking via stadium cameras
- Official tracking provider for the Premier League, MLS, and others
- Founded by Rajiv Maheswaran and Yu-Han Chang
Hawk-Eye (Sony)
- Known for ball-tracking in tennis and cricket
- Provides tracking for some football leagues
- Supplies goal-line technology in the Premier League and camera tracking for semi-automated offside technology in competitions such as the 2022 FIFA World Cup
SkillCorner
- Uses broadcast video (no stadium cameras needed) to generate tracking data
- Makes tracking data accessible for leagues without optical infrastructure
- Growing provider, recently partnered with multiple competitions
Metrica Sports
- Provides tracking data and analysis tools
- Released a free sample dataset for research — one of the few publicly available tracking datasets
Availability
Tracking data is far less accessible than event data. Most is proprietary and expensive. Public options are largely limited to small samples, such as Metrica's sample data and SkillCorner's open dataset, and to synthetic datasets used in academic papers.
Python Libraries
| Library | Maintainer | Purpose |
|---|---|---|
| statsbombpy | StatsBomb | Python wrapper for StatsBomb open data API |
| socceraction | KU Leuven | SPADL format, VAEP, xT computation |
| mplsoccer | Community | Pitch plotting, heat maps, shot maps, Voronoi, pass sonars |
| kloppy | Community | Normalizes event/tracking data from multiple providers into a common format |
| floodlight | Community | Analysis framework for tracking data (kinematic features, synchronization) |
| codeball | Metrica Sports | Data-driven tactical and video analysis of tracking and event data |
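
To show why a normalization layer is useful, here is a minimal sketch of mapping two provider-style coordinate grids onto one metric pitch. This is not kloppy's API. The grid sizes (a 120 x 80 StatsBomb-style grid and a 0 to 100 percentage Wyscout-style grid), the 105 x 68 m target pitch, and the names `PitchGrid` and `to_metres` are all assumptions for illustration. It also ignores differences in origin and axis direction, which a real conversion must handle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PitchGrid:
    """Extent of a provider's coordinate grid, origin assumed at one corner."""
    x_max: float
    y_max: float


# Assumed grid sizes, for illustration only
STATSBOMB_STYLE = PitchGrid(x_max=120.0, y_max=80.0)
WYSCOUT_STYLE = PitchGrid(x_max=100.0, y_max=100.0)

# Assumed target pitch dimensions in metres
PITCH_LENGTH_M = 105.0
PITCH_WIDTH_M = 68.0


def to_metres(x, y, grid):
    """Rescale a provider coordinate linearly onto the metric pitch."""
    return (x / grid.x_max * PITCH_LENGTH_M, y / grid.y_max * PITCH_WIDTH_M)


# The centre spot in each grid should land on the same metric point
centre_a = to_metres(60.0, 40.0, STATSBOMB_STYLE)
centre_b = to_metres(50.0, 50.0, WYSCOUT_STYLE)
```

Once every source is expressed in the same units, models and plots written for one provider can be reused on another, which is the practical benefit of a common format.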
Key People in Football Data
- Ted Knutson — founded StatsBomb, major figure in making analytics accessible to clubs
- Luca Pappalardo — released the Wyscout public dataset for research
- William Spearman — pitch control models, previously at Hudl, later at Liverpool FC
- Javier Fernández — EPV (Expected Possession Value), previously at FC Barcelona
- Luke Bornn — EPV co-author, statistics professor, previously at Sacramento Kings and AS Roma
- David Sumpter — author of Soccermatics, Friends of Tracking YouTube series
- Laurie Shaw — Friends of Tracking contributor, open-source pitch control implementation
- Karun Singh — created xT framework
Tags: #football #analytics #data #libraries #providers
