TTC Bus & Subway Delay Analysis (2022–2026)

Overview

This project analyzes Toronto Transit Commission (TTC) bus and subway delay incidents from January 2022 to January 2026. It covers the full pipeline from cleaning multi-year source files with inconsistent schemas, through exploratory analysis and weather integration, to an XGBoost classifier that predicts whether an incident will cause a significant delay (5 minutes or more). Results are delivered as interactive Power BI dashboards.

Each row in the data represents a single reported delay incident, not a trip. Because the total number of trips operated is not recorded, the analysis is scoped to the population of reported incidents. Incidents logged with zero delay (for example, a passenger assist resolved before the vehicle was held) are valid records, but they inflate incident counts and pull the average delay down; the analysis accounts for this throughout.
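
As a rough illustration of the zero-delay effect, the sketch below compares summary statistics with and without zero-delay incidents. The delay values are made up for the example and are not taken from the TTC data; how the notebooks actually report these figures is not shown here.

```python
from statistics import mean, median

# Hypothetical delays in minutes (illustrative only, not TTC data).
delays = [0, 0, 0, 0, 0, 4, 6, 10]

# Incidents that actually held a vehicle.
nonzero = [d for d in delays if d > 0]

summary = {
    "incidents": len(delays),
    "zero_delay_rate": (len(delays) - len(nonzero)) / len(delays),
    "mean_all": mean(delays),          # pulled down by the zeros
    "mean_nonzero": mean(nonzero),     # average among real delays
    "median_all": median(delays),      # zero whenever most incidents are zero
}
print(summary)
```

Reporting the zero-delay rate next to both means makes it clear which population a given average describes.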

Data source: Toronto Open Data — TTC Delay Data · Weather: Open-Meteo Archive API

See Power BI Dashboards

Objectives

Consolidate eight source files (XLSX and CSV) with inconsistent schemas into two clean, analysis-ready datasets.
Explore delay patterns across time, location, cause, and weather.
Train a leak-free XGBoost classifier to predict significant delays (≥ 5 min).
Translate the findings into interactive Power BI dashboards for both modes.

Methodology

Data

Source: TTC Open Data — bus and subway delay records, January 2022 to January 2026.
Scope: 97,502 raw subway rows and 243,594 raw bus rows, cleaned to 94,626 and 240,614 respectively.
Granularity: one row per reported delay incident.

Data Cleaning

The 2022–2024 data arrived as XLSX files and the 2025+ data as CSV, with mismatched column schemas (e.g. bus route stored as an integer Route in one and a string Line in the other). The main issues resolved were:

Schema unification: extracted route numbers via regex, renamed columns to a common schema, and merged official delay-code lookups into plain-text descriptions.
Non-service records: removed garage, training, and internal runs (bus), the discontinued Line 3 / SRT (subway), and non-passenger maintenance locations.
Cross-dataset contamination: removed subway line identifiers that leaked into the bus file, and flagged subway-only codes appearing in bus data as "Unknown".
Inconsistent categories: normalized 40+ variant spellings of the four subway lines (e.g. "YUS", "B/D", "YU / BD") to canonical values, and corrected delay-code typos.
Encoding repair: fixed a Windows-1252 double-encoding artefact in the bus code descriptions.
Missing values: recovered 206 null subway Line values by mapping each station to its most common line.
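
A few of the cleaning steps above can be sketched as follows. This is a minimal illustration, not the notebooks' actual code: the helper names, the alias table, the canonical values "YU" and "BD", and the sample rows are all assumptions, and the mojibake helper only reverses a single round of UTF-8 text misread as Windows-1252.

```python
import re
import pandas as pd

# Hypothetical alias table; the real project maps 40+ variants.
LINE_ALIASES = {"YUS": "YU", "YU": "YU", "B/D": "BD", "BD": "BD"}

def extract_route(value):
    """Return the leading route number from 36, '36' or '36 FINCH WEST'."""
    match = re.match(r"\s*(\d+)", str(value))
    return int(match.group(1)) if match else None

def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # already clean, or not repairable this way

# Illustrative subway rows with variant spellings and missing lines.
subway = pd.DataFrame({
    "Station": ["KIPLING", "KIPLING", "KIPLING", "FINCH", "FINCH"],
    "Line": ["B/D", "BD", None, "YUS", None],
})

# Normalize spellings; values missing from the table become NaN.
subway["Line"] = subway["Line"].map(LINE_ALIASES)

# Most common line per station, computed from rows where Line is known.
station_mode = (
    subway.dropna(subset=["Line"])
    .groupby("Station")["Line"]
    .agg(lambda s: s.mode().iloc[0])
)

# Fill only the missing lines from the station lookup.
subway["Line"] = subway["Line"].fillna(subway["Station"].map(station_mode))
print(subway)
```

Stations that never appear with a known line would stay null under this approach and would need a manual lookup.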

Delay descriptions were grouped into broad categories (Mechanical, Operations, Security, etc.) for both modes.
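
The grouping can be pictured as an ordered list of keyword rules. The sketch below is an assumption about the general shape only: the keywords, the rule order, and the "Other" fallback are hypothetical, and the actual rules live in the notebooks.

```python
# Hypothetical keyword rules; the first matching rule wins, so order matters.
CATEGORY_RULES = [
    ("Mechanical", ("mechanical", "brake", "door")),
    ("Security", ("security", "police", "assault")),
    ("Operations", ("operator", "diversion", "schedule")),
]

def categorize(description):
    """Map a plain-text delay description to a broad category."""
    text = str(description).lower()
    for category, keywords in CATEGORY_RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return "Other"  # assumed fallback for descriptions no rule matches

print(categorize("Door Problems - Faulty Equipment"))
print(categorize("Operator Not Available"))
```

Substring matching is simple but can misfire on descriptions that mention several causes, which is why rule order needs care.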

Feature Engineering

Three leak-free temporal features were built after sorting by timestamp, so each value uses only prior data: Previous_Delay, a 5-incident rolling mean (Rolling_Delay_5), and an expanding historical average per location and hour (Hist_Avg_Station_Hour for subway, Hist_Avg_Route_Hour for bus). Hourly weather (temperature, snowfall, precipitation) was joined from the Open-Meteo API.
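
One way to build features like these in pandas is to shift the delay column by one row before aggregating, so a row never sees its own value. The sketch below uses made-up rows; the column names Timestamp, Min_Delay, Hour, and Hour_Start, the weather column names, and the choice to compute Previous_Delay over the whole incident stream rather than per location are assumptions.

```python
import pandas as pd

# Illustrative incidents (not TTC data).
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2024-01-01 08:05", "2024-01-01 08:40",
        "2024-01-02 08:10", "2024-01-02 08:30", "2024-01-03 08:20",
    ]),
    "Station": ["FINCH", "KIPLING", "FINCH", "KIPLING", "FINCH"],
    "Min_Delay": [4, 0, 10, 6, 2],
})
df = df.sort_values("Timestamp").reset_index(drop=True)
df["Hour"] = df["Timestamp"].dt.hour

# shift(1) moves each delay one row later, so row i only sees rows before it.
prior = df["Min_Delay"].shift(1)
df["Previous_Delay"] = prior
df["Rolling_Delay_5"] = prior.rolling(window=5, min_periods=1).mean()

# Expanding mean of earlier delays at the same station and hour.
df["Hist_Avg_Station_Hour"] = (
    df.groupby(["Station", "Hour"])["Min_Delay"]
    .transform(lambda s: s.shift(1).expanding().mean())
)

# Hypothetical hourly weather table, joined on the incident's hour.
weather = pd.DataFrame({
    "Hour_Start": pd.to_datetime(
        ["2024-01-01 08:00", "2024-01-02 08:00", "2024-01-03 08:00"]
    ),
    "temperature": [-3.0, -8.5, 1.0],
    "snowfall": [0.0, 1.2, 0.0],
})
df["Hour_Start"] = df["Timestamp"].dt.floor("h")
df = df.merge(weather, on="Hour_Start", how="left")
print(df)
```

The first incident overall, and the first at each station and hour, has no history and is left as NaN; XGBoost can handle missing values natively, though an explicit fill is another option.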

Modelling

Target: binary Delayed (1 if delay ≥ 5 min).
Outlier handling: delays above a 3×IQR upper fence (16 min for subway, 56 min for bus) were removed from the training data only; all rows are retained in the dashboard export and flagged via Is_Outlier.
Split: chronological 60/25/15 train/validation/test split, never random, because the lag features would otherwise leak future information.
Model: XGBoost classifier with scale_pos_weight to handle class imbalance.
Leakage control: Min Gap (gap to the following vehicle) was excluded as it approximates the target and would not be available at prediction time.
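
The target, split, fence, and class weight described above could be prepared roughly as follows. This is a sketch on synthetic data: the column names Timestamp and Min_Delay are assumptions, the fence is taken here as Q3 + 3×IQR computed on the training rows, and the negative-to-positive ratio is a common heuristic for scale_pos_weight rather than a value stated in this project.

```python
import pandas as pd

# Synthetic hourly incidents with delays cycling through 0..11 minutes.
n = 200
df = pd.DataFrame({
    "Timestamp": pd.date_range("2024-01-01", periods=n, freq="h"),
    "Min_Delay": [i % 12 for i in range(n)],
})
df.loc[n // 2, "Min_Delay"] = 500  # inject one extreme delay
df["Delayed"] = (df["Min_Delay"] >= 5).astype(int)
df = df.sort_values("Timestamp").reset_index(drop=True)

# Chronological 60/25/15 split: no shuffling, later rows never precede earlier ones.
train_end = n * 60 // 100
val_end = n * 85 // 100
train = df.iloc[:train_end]
val = df.iloc[train_end:val_end]
test = df.iloc[val_end:]

# Upper fence at Q3 + 3 * IQR, estimated from the training rows only.
q1, q3 = train["Min_Delay"].quantile([0.25, 0.75])
upper_fence = q3 + 3 * (q3 - q1)

# Flag outliers everywhere, but drop them only from the data used for fitting.
df["Is_Outlier"] = df["Min_Delay"] > upper_fence
train_fit = train[train["Min_Delay"] <= upper_fence]

# Ratio of negative to positive rows, a common choice for scale_pos_weight.
n_pos = int(train_fit["Delayed"].sum())
n_neg = len(train_fit) - n_pos
scale_pos_weight = n_neg / n_pos
print(len(train_fit), len(val), len(test), upper_fence, scale_pos_weight)
```

The resulting value would then be passed to the classifier, for example `XGBClassifier(scale_pos_weight=scale_pos_weight)`, with Min Gap left out of the feature columns.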

Key Findings

Metric	Subway	Bus
Cleaned incidents	94,626	240,614
Zero-delay rate	62.3%	8.5%
Median delay	0 min	11 min
Mean delay	3.0 min	20.8 min
Snow-day delay increase	+478%	+141%
Model ROC-AUC	0.794	0.606

Subway and bus delays behave very differently. 62.3% of subway incidents cause no measurable delay (versus 8.5% for bus), so subway data is dominated by logged-but-harmless events while bus incidents are mostly genuine delays.
The worst subway delays are not at rush hour. Average delay peaks at the 4–6 AM restart window after overnight maintenance, when the timetable has no recovery buffer.
Cause is concentrated. For bus, the Operations and Mechanical categories together account for over 55% of incidents, and the top 3 bus incident types account for 52% of all delay-minutes.
Weather has a large effect. Heavy-snow days raise total daily delay by 141% for bus and 478% for subway.
Subway delays are more predictable. The subway model (0.79 ROC-AUC) clearly outperforms the bus model (0.61), indicating subway delays carry more structure tied to specific stations and codes.

Detailed EDA, full ranking tables, and all visualizations are in subway.ipynb and bus.ipynb.

Limitations

The data records only reported delay incidents, not total service, so no rate of delayed trips can be computed.
The bus model's modest performance (0.61 ROC-AUC) suggests that route-level features alone are weak predictors; richer signals such as real-time traffic would likely be needed.
Categorical groupings rely on keyword rules and may misclassify some edge-case descriptions.
Weather is matched at the city level, not per station or route.

Power BI Dashboards

Dashboard	Link
Bus Delay Dashboard	View Dashboard
Subway Delay Dashboard	View Dashboard

Tools Used

Python (pandas, NumPy), Scikit-learn, XGBoost, Matplotlib, Seaborn, Open-Meteo API, Power BI (DAX).
