This project analyzes Toronto Transit Commission (TTC) bus and subway delay incidents from January 2022 to January 2026. It covers the full pipeline from cleaning multi-year source files with inconsistent schemas, through exploratory analysis and weather integration, to an XGBoost classifier that predicts whether an incident will cause a significant delay (5 minutes or more). Results are delivered as interactive Power BI dashboards.
Each row in the data represents a single reported delay incident, not a trip. Because the total number of trips operated is not recorded, the analysis is scoped to the population of reported incidents. Notably, incidents logged with zero delay (a passenger assist resolved before the vehicle was held, for example) are valid records that inflate incident counts and deflate average delay; this is handled throughout.
Data source: Toronto Open Data — TTC Delay Data · Weather: Open-Meteo Archive API
- Consolidate eight source files (XLSX and CSV) with inconsistent schemas into two clean, analysis-ready datasets.
- Explore delay patterns across time, location, cause, and weather.
- Train a leak-free XGBoost classifier to predict significant delays (≥ 5 min).
- Translate the findings into interactive Power BI dashboards for both modes.
- Source: TTC Open Data — bus and subway delay records, January 2022 to January 2026.
- Scope: 97,502 raw subway rows and 243,594 raw bus rows, cleaned to 94,626 and 240,614 respectively.
- Granularity: one row per reported delay incident.
The 2022–2024 data arrived as XLSX files and the 2025+ data as CSV, with mismatched column schemas (e.g. bus route stored as an integer Route in one and a string Line in the other). The main issues resolved were:
- Schema unification: extracted route numbers via regex, renamed columns to a common schema, and merged official delay-code lookups into plain-text descriptions.
- Non-service records: removed garage, training, and internal runs (bus), the discontinued Line 3 / SRT (subway), and non-passenger maintenance locations.
- Cross-dataset contamination: removed subway line identifiers that leaked into the bus file, and flagged subway-only codes appearing in bus data as "Unknown".
- Inconsistent categories: normalized 40+ variant spellings of the four subway lines (e.g. "YUS", "B/D", "YU / BD") to canonical values, and corrected delay-code typos.
- Encoding repair: fixed a Windows-1252 double-encoding artefact in the bus code descriptions.
- Missing values: recovered 206 null subway `Line` values by mapping each station to its most common line.
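As an illustration of the encoding repair step listed above, here is a minimal sketch. It assumes the artefact is the common case of UTF-8 bytes having been decoded as Windows-1252 once; the function name `repair_mojibake` and the sample string are hypothetical and are not taken from the project notebooks.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of UTF-8 bytes having been decoded as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not double-encoded (or is only partly damaged),
        # so return it unchanged rather than guess.
        return text


# Build a damaged sample the same way the artefact arises (illustrative text).
clean = "Operator didn\u2019t report"
broken = clean.encode("utf-8").decode("cp1252")
repaired = repair_mojibake(broken)
```

Strings that were never damaged pass through unchanged, so under these assumptions the function can be applied to a whole description column without first detecting which rows are affected.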
Delay descriptions were grouped into broad categories (Mechanical, Operations, Security, etc.) for both modes.
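A keyword-based grouping of this kind might look like the sketch below. The keyword lists, the `Other` fallback, and the example descriptions are all hypothetical; the actual rules live in the notebooks and are likely more extensive.

```python
# Hypothetical keyword lists; the first matching category wins.
CATEGORY_KEYWORDS = {
    "Mechanical": ("mechanical", "brake", "door"),
    "Security": ("security", "police", "assault"),
    "Operations": ("operator", "schedule", "diversion"),
}


def categorize(description: str) -> str:
    """Map a free-text delay description to a broad category."""
    text = description.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "Other"
```

Because the first match wins, a description that mentions keywords from two categories is assigned by dictionary order, which is one way edge-case descriptions can end up in the wrong group.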
Three leak-free temporal features were built after sorting by timestamp, so each value uses only prior data: Previous_Delay, a 5-incident rolling mean (Rolling_Delay_5), and an expanding historical average per location and hour (Hist_Avg_Station_Hour for subway, Hist_Avg_Route_Hour for bus). Hourly weather (temperature, snowfall, precipitation) was joined from the Open-Meteo API.
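The three features can be sketched in pandas as follows. This is a sketch under assumptions, not the notebook code: the column names `Timestamp`, `Station`, and `Min Delay` are assumed, and the lag and rolling features are computed over the whole sorted series here, whereas the project may compute them per station or per route. The key point is the `shift(1)` before every aggregation, which keeps the current row's delay out of its own features.

```python
import pandas as pd


def add_temporal_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add lag, rolling, and expanding features that use only earlier rows."""
    df = df.sort_values("Timestamp").reset_index(drop=True)
    df["Hour"] = df["Timestamp"].dt.hour

    # Delay of the previous incident; the first row has no history (NaN).
    prior = df["Min Delay"].shift(1)
    df["Previous_Delay"] = prior

    # Mean of up to five preceding incidents, excluding the current one.
    df["Rolling_Delay_5"] = prior.rolling(window=5, min_periods=1).mean()

    # Expanding mean of earlier delays at the same station and hour.
    df["Hist_Avg_Station_Hour"] = df.groupby(["Station", "Hour"])[
        "Min Delay"
    ].transform(lambda s: s.shift(1).expanding().mean())
    return df


# Tiny illustrative sample (made-up values).
sample = pd.DataFrame(
    {
        "Timestamp": pd.to_datetime(
            [
                "2024-01-01 08:00",
                "2024-01-01 08:10",
                "2024-01-01 08:20",
                "2024-01-01 08:30",
            ]
        ),
        "Station": ["A", "A", "B", "A"],
        "Min Delay": [4, 0, 10, 6],
    }
)
features = add_temporal_features(sample)
```

Rows with no prior history are left as NaN, which XGBoost can handle natively as missing values.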
- Target: binary `Delayed` (1 if delay ≥ 5 min).
- Outlier handling: extreme delays beyond a 3×IQR fence were removed from training only (16 min subway, 56 min bus); all rows are retained in the dashboard export and flagged via `Is_Outlier`.
- Split: chronological 60/25/15 train/validation/test split, never random, because the lag features would otherwise leak future information.
- Model: XGBoost classifier with `scale_pos_weight` to handle class imbalance.
- Leakage control: `Min Gap` (gap to the following vehicle) was excluded as it approximates the target and would not be available at prediction time.
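The split, outlier fence, and class weight described above could be wired together roughly as in the sketch below. Several details are assumptions: the column names `Timestamp` and `Min Delay`, the fence being the upper bound Q3 + 3×IQR computed on the training portion, and `scale_pos_weight` being the negative-to-positive ratio of the filtered training set (the usual convention for that XGBoost parameter). The data is synthetic.

```python
import numpy as np
import pandas as pd


def chronological_split(df, time_col="Timestamp", train=0.60, val=0.25):
    """Split by time order: earliest rows train, latest rows test."""
    df = df.sort_values(time_col).reset_index(drop=True)
    n = len(df)
    i, j = round(n * train), round(n * (train + val))
    return df.iloc[:i], df.iloc[i:j], df.iloc[j:]


def upper_fence(delays, k=3.0):
    """Upper outlier fence: Q3 + k * IQR."""
    q1, q3 = np.percentile(delays, [25, 75])
    return q3 + k * (q3 - q1)


# Synthetic incidents, one per day (made-up delays in minutes).
delays = [0, 0, 0, 0, 2, 3, 4, 5, 6, 8, 10, 90, 1, 7, 0, 12, 3, 0, 9, 5]
data = pd.DataFrame(
    {
        "Timestamp": pd.date_range("2024-01-01", periods=len(delays), freq="D"),
        "Min Delay": delays,
    }
)
data["Delayed"] = (data["Min Delay"] >= 5).astype(int)

train_raw, val, test = chronological_split(data)

# Drop extreme delays from the training rows only; val/test stay untouched.
fence = upper_fence(train_raw["Min Delay"])
train = train_raw[train_raw["Min Delay"] <= fence]

# Ratio of negatives to positives, to be passed as scale_pos_weight.
n_pos = int(train["Delayed"].sum())
n_neg = len(train) - n_pos
scale_pos_weight = n_neg / n_pos
```

The resulting value would then be passed to the classifier, for example `xgboost.XGBClassifier(scale_pos_weight=scale_pos_weight)`, and the model fitted on `train` with `val` used for early stopping or threshold selection.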
| | Subway | Bus |
|---|---|---|
| Cleaned incidents | 94,626 | 240,614 |
| Zero-delay rate | 62.3% | 8.5% |
| Median delay | 0 min | 11 min |
| Mean delay | 3.0 min | 20.8 min |
| Snow-day delay increase | +478% | +141% |
| Model ROC-AUC | 0.794 | 0.606 |
- Subway and bus delays behave very differently. 62.3% of subway incidents cause no measurable delay (versus 8.5% for bus), so subway data is dominated by logged-but-harmless events while bus incidents are mostly genuine delays.
- The worst subway delays are not at rush hour. Average delay peaks at the 4–6 AM restart window after overnight maintenance, when the timetable has no recovery buffer.
- Cause is concentrated. Bus Operations and Mechanical failures together drive over 55% of incidents; the top 3 bus incident types account for 52% of all delay-minutes.
- Weather has a large effect. Heavy-snow days raise total daily delay by 141% for bus and 478% for subway.
- Subway delays are more predictable. The subway model (0.79 ROC-AUC) clearly outperforms the bus model (0.61), indicating subway delays carry more structure tied to specific stations and codes.
Detailed EDA, full ranking tables, and all visualizations are in subway.ipynb and bus.ipynb.
- The data records only reported delay incidents, not total service, so no rate of delayed trips can be computed.
- The bus model's modest performance (0.61 ROC-AUC) shows route-level features alone are weak predictors; richer signals such as real-time traffic would be needed.
- Categorical groupings rely on keyword rules and may misclassify some edge-case descriptions.
- Weather is matched at the city level, not per station or route.
| Dashboard | Link |
|---|---|
| Bus Delay Dashboard | View Dashboard |
| Subway Delay Dashboard | View Dashboard |
Python (pandas, NumPy), Scikit-learn, XGBoost, Matplotlib, Seaborn, Open-Meteo API, Power BI (DAX).

