jnqu/ttc-delay-analysis

TTC Bus & Subway Delay Analysis (2022–2026)

Overview

This project analyzes Toronto Transit Commission (TTC) bus and subway delay incidents from January 2022 to January 2026. It covers the full pipeline from cleaning multi-year source files with inconsistent schemas, through exploratory analysis and weather integration, to an XGBoost classifier that predicts whether an incident will cause a significant delay (5 minutes or more). Results are delivered as interactive Power BI dashboards.

Each row in the data represents a single reported delay incident, not a trip. Because the total number of trips operated is not recorded, the analysis is scoped to the population of reported incidents. Incidents logged with zero delay (for example, a passenger assist resolved before the vehicle was held) are valid records, but they inflate incident counts and pull the average delay down; the analysis accounts for this throughout.

Data source: Toronto Open Data — TTC Delay Data · Weather: Open-Meteo Archive API

See Power BI Dashboards

Objectives

  1. Consolidate eight source files (XLSX and CSV) with inconsistent schemas into two clean, analysis-ready datasets.
  2. Explore delay patterns across time, location, cause, and weather.
  3. Train a leak-free XGBoost classifier to predict significant delays (≥ 5 min).
  4. Translate the findings into interactive Power BI dashboards for both modes.

Methodology

Data

  • Source: TTC Open Data — bus and subway delay records, January 2022 to January 2026.
  • Scope: 97,502 raw subway rows and 243,594 raw bus rows, cleaned to 94,626 and 240,614 respectively.
  • Granularity: one row per reported delay incident.

Data Cleaning

The 2022–2024 data arrived as XLSX files and the 2025+ data as CSV, with mismatched column schemas (e.g. bus route stored as an integer Route in one and a string Line in the other). The main issues resolved were:

  • Schema unification: extracted route numbers via regex, renamed columns to a common schema, and merged official delay-code lookups into plain-text descriptions.
  • Non-service records: removed garage, training, and internal runs (bus), the discontinued Line 3 / SRT (subway), and non-passenger maintenance locations.
  • Cross-dataset contamination: removed subway line identifiers that leaked into the bus file, and flagged subway-only codes appearing in bus data as "Unknown".
  • Inconsistent categories: normalized 40+ variant spellings of the four subway lines (e.g. "YUS", "B/D", "YU / BD") to canonical values, and corrected delay-code typos.
  • Encoding repair: fixed a Windows-1252 double-encoding artefact in the bus code descriptions.
  • Missing values: recovered 206 null subway Line values by mapping each station to its most common line.
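The category normalization and null recovery steps could look roughly like the sketch below. It is illustrative only: the column names `Station` and `Line`, the function `recover_line`, and the three-entry `LINE_VARIANTS` map (including its canonical values) are assumptions, not the project's actual code, which handles 40+ variants.

```python
import pandas as pd

# Hypothetical variant map; canonical values here are assumed for illustration.
LINE_VARIANTS = {"YUS": "YU", "B/D": "BD", "YU / BD": "YU/BD"}


def recover_line(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize line spellings, then fill null lines from each station's most common line."""
    out = df.copy()
    # Trim and upper-case before mapping known variants to canonical values.
    out["Line"] = out["Line"].str.strip().str.upper().replace(LINE_VARIANTS)
    # Most common (modal) line per station, computed from non-null rows only.
    mode_by_station = (
        out.dropna(subset=["Line"])
        .groupby("Station")["Line"]
        .agg(lambda s: s.mode().iloc[0])
    )
    # Fill only the missing values; stations with no known line stay null.
    out["Line"] = out["Line"].fillna(out["Station"].map(mode_by_station))
    return out
```

Ties between equally common lines are broken by sort order here, which is an arbitrary choice for the sketch.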

Delay descriptions were grouped into broad categories (Mechanical, Operations, Security, etc.) for both modes.
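A keyword-rule grouping of this kind can be sketched as follows. The keywords, the rule order, the `"Other"` fallback, and the function name `categorize` are all hypothetical; the notebooks hold the real rules.

```python
# Hypothetical keyword rules, checked in order; the first match wins.
CATEGORY_KEYWORDS = [
    ("Mechanical", ("mechanical", "door", "brake")),
    ("Security", ("security", "police", "assault")),
    ("Operations", ("operator", "diversion", "schedule")),
]


def categorize(description: str) -> str:
    """Map a plain-text delay description to a broad category."""
    text = description.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return category
    return "Other"  # assumed fallback for descriptions no rule matches
```

Because the first matching rule wins, a description containing keywords from two categories is assigned by rule order, which is one way edge cases end up misclassified.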

Feature Engineering

Three leak-free temporal features were built after sorting by timestamp, so each value uses only prior data: Previous_Delay, a 5-incident rolling mean (Rolling_Delay_5), and an expanding historical average per location and hour (Hist_Avg_Station_Hour for subway, Hist_Avg_Route_Hour for bus). Hourly weather (temperature, snowfall, precipitation) was joined from the Open-Meteo API.
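As a rough illustration of the "prior data only" idea, the sketch below shifts each series by one row before aggregating, so the current incident never contributes to its own features. The input column names (`Timestamp`, `Delay`, `Station`), the function name, and the choice to compute `Previous_Delay` and `Rolling_Delay_5` over the whole sorted stream rather than per station are assumptions.

```python
import pandas as pd


def add_lag_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add lag features that use only incidents earlier in time."""
    out = df.sort_values("Timestamp").reset_index(drop=True)
    # shift(1) moves every value down one row, excluding the current incident.
    prior = out["Delay"].shift(1)
    out["Previous_Delay"] = prior
    out["Rolling_Delay_5"] = prior.rolling(5, min_periods=1).mean()
    # Expanding mean of earlier delays at the same station and hour of day.
    out["Hour"] = out["Timestamp"].dt.hour
    out["Hist_Avg_Station_Hour"] = out.groupby(["Station", "Hour"])["Delay"].transform(
        lambda s: s.shift(1).expanding().mean()
    )
    return out
```

The first incident overall, and the first incident in each station-hour group, get NaN because no history exists yet; XGBoost can handle missing values natively, though how the project treats them is not stated here.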

Modelling

  • Target: binary Delayed (1 if delay ≥ 5 min).
  • Outlier handling: delays beyond a 3×IQR upper fence (16 min for subway, 56 min for bus) were removed from the training data only; all rows are retained in the dashboard export and flagged via Is_Outlier.
  • Split: chronological 60/25/15 train/validation/test split, never random, because the lag features would otherwise leak future information.
  • Model: XGBoost classifier with scale_pos_weight to handle class imbalance.
  • Leakage control: Min Gap (gap to the following vehicle) was excluded as it approximates the target and would not be available at prediction time.
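The split, outlier fence, and class weighting described above can be sketched in plain Python. The row layout (dicts with `timestamp` and `delay` keys), the function names, and the choice to compute the fence on the training slice with inclusive quartiles are assumptions. The resulting ratio is the kind of value that would be passed as XGBoost's `scale_pos_weight`.

```python
from statistics import quantiles

THRESHOLD_MIN = 5  # a delay of 5 minutes or more counts as "Delayed"


def chronological_split(rows):
    """Split rows 60/25/15 by time order, never at random."""
    ordered = sorted(rows, key=lambda r: r["timestamp"])
    n = len(ordered)
    train_end, val_end = n * 60 // 100, n * 85 // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


def upper_fence(delays, k=3):
    """Upper outlier fence: Q3 + k * IQR."""
    q1, _, q3 = quantiles(delays, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def prepare_training(train_rows):
    """Drop extreme delays from training rows and compute the class-weight ratio."""
    fence = upper_fence([r["delay"] for r in train_rows])
    kept = [r for r in train_rows if r["delay"] <= fence]
    labels = [int(r["delay"] >= THRESHOLD_MIN) for r in kept]
    positives = sum(labels)
    negatives = len(labels) - positives
    # Ratio of negative to positive examples, guarded against division by zero.
    weight = negatives / positives if positives else 1.0
    return kept, labels, weight
```

Only the training slice is filtered, so the validation and test sets still contain extreme delays and reflect the data the model would see in use.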

Key Findings

| Metric | Subway | Bus |
| --- | --- | --- |
| Cleaned incidents | 94,626 | 240,614 |
| Zero-delay rate | 62.3% | 8.5% |
| Median delay | 0 min | 11 min |
| Mean delay | 3.0 min | 20.8 min |
| Snow-day delay increase | +478% | +141% |
| Model ROC-AUC | 0.794 | 0.606 |

  1. Subway and bus delays behave very differently. 62.3% of subway incidents cause no measurable delay (versus 8.5% for bus), so subway data is dominated by logged-but-harmless events while bus incidents are mostly genuine delays.
  2. The worst subway delays are not at rush hour. Average delay peaks at the 4–6 AM restart window after overnight maintenance, when the timetable has no recovery buffer.
  3. Cause is concentrated. Bus Operations and Mechanical failures together drive over 55% of incidents; the top 3 bus incident types account for 52% of all delay-minutes.
  4. Weather has a large effect. Heavy-snow days raise total daily delay by +141% (bus) and +478% (subway).
  5. Subway delays are more predictable. The subway model (0.79 ROC-AUC) clearly outperforms the bus model (0.61), indicating subway delays carry more structure tied to specific stations and codes.
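To see how strongly zero-delay records pull the averages down, the mean over incidents with a non-zero delay can be backed out from the reported figures: zero rows add nothing to the total, so the non-zero mean is the overall mean divided by the non-zero share. The sketch below is derived arithmetic from the rounded figures above, not a result taken from the notebooks, and the function name is hypothetical.

```python
def mean_excluding_zeros(mean_all: float, zero_rate: float) -> float:
    """Mean delay among incidents with delay > 0, from the overall mean and zero share."""
    return mean_all / (1.0 - zero_rate)


# Inputs are the rounded mean delay and zero-delay rate reported for each mode.
subway_nonzero = mean_excluding_zeros(3.0, 0.623)
bus_nonzero = mean_excluding_zeros(20.8, 0.085)
print(round(subway_nonzero, 1), round(bus_nonzero, 1))
```

On these rounded inputs the subway mean rises from 3.0 to about 8 minutes once zero-delay records are excluded, while the bus mean moves only from 20.8 to about 22.7 minutes.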

Detailed EDA, full ranking tables, and all visualizations are in subway.ipynb and bus.ipynb.

Figures: Subway Hour x Weekday Heatmap; Bus Model Confusion Matrix and Feature Importance.

Limitations

  • The data records only reported delay incidents, not total service, so no rate of delayed trips can be computed.
  • The bus model's modest performance (0.61 ROC-AUC) shows route-level features alone are weak predictors; richer signals such as real-time traffic would be needed.
  • Categorical groupings rely on keyword rules and may misclassify some edge-case descriptions.
  • Weather is matched at the city level, not per station or route.

Power BI Dashboards

| Dashboard | Link |
| --- | --- |
| Bus Delay Dashboard | View Dashboard |
| Subway Delay Dashboard | View Dashboard |

Tools Used

Python (pandas, NumPy), Scikit-learn, XGBoost, Matplotlib, Seaborn, Open-Meteo API, Power BI (DAX).
