Skip to content

ref: split HangTracker into three-tier architecture#8179

Open
philprime wants to merge 36 commits into
mainfrom
ref/hang-tracker-per-observer-threshold
Open

ref: split HangTracker into three-tier architecture#8179
philprime wants to merge 36 commits into
mainfrom
ref/hang-tracker-per-observer-threshold

Conversation

@philprime

@philprime philprime commented Jun 24, 2026

Copy link
Copy Markdown
Member

Should be merged after #8214 to adopt the new mutex type for observers.

Summary

Splits the existing DefaultHangTracker into a three-tier architecture that separates runloop polling from hang classification. This prepares the foundation for upcoming app hang V3 work (stacktrace sampling, flamegraph construction) while cleaning up the integration layer. Also adds a develop-docs/TERMINOLOGY.md glossary defining Apple and SDK terms for hang detection.

Motivation

The existing DefaultHangTracker does two things at once: it polls the main runloop at around 25 ms intervals via a CFRunLoopObserver + DispatchSemaphore pattern, and it broadcasts raw delay callbacks to all observers. Consumers like WatchdogTerminationTrackingIntegration then have to manually filter by their own duration threshold (guard interval > timeoutInterval), which scatters hang-classification logic across consumers.

The upcoming app hang V3 integration needs to sample stacktraces at the high-frequency 25 ms polling rate during a hang, then emit accumulated data only when the hang ends. This requires a middle layer that subscribes to the raw polling and handles debouncing, thresholding, and in the future stacktrace accumulation. None of that belongs in the low-level runloop observer or the integration itself.

Architecture

Three layers, each with a single responsibility:

  1. RunLoopDelayTracker (renamed from DefaultHangTracker). The low-level runloop observer. Polls at around 25 ms (1.5 times the expected frame duration). Notifies observers on every tick while the runloop is blocked (ongoing=true), then once when it unblocks (ongoing=false). No threshold logic, no hang classification.

  2. AppHangTracker (new middle layer). Wraps a RunLoopDelayTracker and adds per-observer thresholds. Each observer registers with addObserver(threshold:handler:). The tracker notifies an observer exactly once when the accumulated delay first exceeds that observer's threshold, then once when the hang ends. Uses hasBeenNotified state per observer to prevent duplicate notifications within a single hang. Designed to later support stacktrace sampling on every raw 25 ms tick (out of scope for this PR).

  3. Integrations (consumers). For example, WatchdogTerminationTrackingIntegration now calls appHangTracker.addObserver(threshold: timeoutInterval) instead of manually guarding interval > timeoutInterval. The threshold filtering is the AppHangTracker's job.

Changes

  • Sources/Swift/RunLoopDelayTracker.swift renamed from HangTracker.swift. Types: DefaultHangTracker to DefaultRunLoopDelayTracker, protocol HangTracker to RunLoopDelayTracker. Behavior unchanged, only names and doc comments updated.
  • Sources/Swift/AppHangTracker.swift is a new file. DefaultAppHangTracker wraps RunLoopDelayTracker, manages per-observer ObserverEntry structs with threshold and hasBeenNotified state. AppHangTrackerProvider protocol for dependency injection.
  • Sources/Swift/SentryDependencyContainer.swift now has an appHangTracker lazy var that creates a DefaultRunLoopDelayTracker wrapped in a DefaultAppHangTracker.
  • SentryWatchdogTerminationTrackingIntegration.swift replaced addOngoingHangObserver and manual threshold guard with appHangTracker.addObserver(threshold:handler:).
  • Tests: HangTrackerTests.swift renamed to RunLoopDelayTrackerTests.swift with renamed types. New AppHangTrackerTests.swift with 9 tests for the middle layer covering threshold filtering, deduplication, consecutive hangs, and multiple observers with different thresholds. MockAppHangTracker in watchdog tests updated to mirror threshold logic.
  • develop-docs/TERMINOLOGY.md new glossary defining Apple terms (hang, hitch, watchdog termination) and SDK terms (runloop delay, polling interval, app hang, ANR) with references to Apple documentation.

Test plan

  • RunLoopDelayTrackerTests pass (9 tests, renamed, behavior unchanged)
  • AppHangTrackerTests pass (9 new tests for threshold edge cases, dedup, consecutive hangs)
  • SentryWatchdogTerminationIntegrationTests pass (17 tests including 3 threshold tests)
  • make build-ios passes
  • make format clean

#skip-changelog

Introduce a layered design for hang detection:

- RunLoopDelayTracker (renamed from DefaultHangTracker): raw
  ~25ms runloop polling via CFRunLoopObserver + semaphore
- HangTracker (new middle layer): debounces delays into hangs
  with per-observer thresholds, notifies once on threshold
  crossing and once on hang end
- WatchdogTerminationTrackingIntegration: now uses
  HangTracker.addObserver(threshold:) instead of manually
  filtering by duration
Comment thread Sources/Swift/HangTracker.swift
Gives the middle layer a distinct name so git sees the
original HangTracker.swift → RunLoopDelayTracker.swift as
a clean rename instead of a full rewrite.
@philprime philprime added the ready-to-merge Use this label to trigger all PR workflows label Jun 24, 2026
@sentry

sentry Bot commented Jun 24, 2026

Copy link
Copy Markdown

📲 Install Builds

iOS

🔗 App Name App ID Version Configuration
SDK-Size io.sentry.sample.SDK-Size 9.19.0 (1) Release

⚙️ sentry-cocoa Build Distribution Settings

@philprime

Copy link
Copy Markdown
Member Author

@sentry review

@philprime philprime marked this pull request as ready for review June 24, 2026 16:17
@philprime philprime self-assigned this Jun 24, 2026
@philprime philprime enabled auto-merge (squash) June 24, 2026 16:19
Comment thread Sources/Swift/AppHangTracker.swift Outdated
@github-actions

github-actions Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Performance metrics 🚀

  Plain With Sentry Diff
Startup time 1228.47 ms 1262.62 ms 34.16 ms
Size 24.14 KiB 1.22 MiB 1.20 MiB

Baseline results on branch: main

Startup times

Revision Plain With Sentry Diff
75eb224 1215.57 ms 1244.74 ms 29.17 ms
4c437ea 1241.63 ms 1266.20 ms 24.57 ms
c5c7a29 1223.14 ms 1258.55 ms 35.41 ms
ce1ec47 1212.60 ms 1247.61 ms 35.02 ms
7b6ea60 1230.86 ms 1250.91 ms 20.05 ms
1f2abc8 1209.18 ms 1252.50 ms 43.32 ms
92fada5 1217.60 ms 1252.89 ms 35.30 ms
5804f33 1218.24 ms 1247.67 ms 29.42 ms
3efa7b5 1226.55 ms 1260.66 ms 34.11 ms
72ea638 1224.78 ms 1258.96 ms 34.18 ms

App size

Revision Plain With Sentry Diff
75eb224 24.14 KiB 1.16 MiB 1.13 MiB
4c437ea 24.14 KiB 1.18 MiB 1.15 MiB
c5c7a29 24.14 KiB 1.15 MiB 1.13 MiB
ce1ec47 24.14 KiB 1.15 MiB 1.13 MiB
7b6ea60 24.14 KiB 1.18 MiB 1.15 MiB
1f2abc8 24.14 KiB 1.22 MiB 1.20 MiB
92fada5 24.14 KiB 1.17 MiB 1.15 MiB
5804f33 24.14 KiB 1.18 MiB 1.16 MiB
3efa7b5 24.14 KiB 1.22 MiB 1.19 MiB
72ea638 24.14 KiB 1.18 MiB 1.15 MiB

Previous results on branch: ref/hang-tracker-per-observer-threshold

Startup times

Revision Plain With Sentry Diff
48cfd9f 1222.32 ms 1251.65 ms 29.33 ms
437a74e 1221.90 ms 1251.82 ms 29.93 ms
8a4624c 1223.62 ms 1247.70 ms 24.08 ms
c0f787f 1231.45 ms 1266.51 ms 35.06 ms
843e9e7 1225.23 ms 1267.65 ms 42.42 ms

App size

Revision Plain With Sentry Diff
48cfd9f 24.14 KiB 1.22 MiB 1.20 MiB
437a74e 24.14 KiB 1.22 MiB 1.20 MiB
8a4624c 24.14 KiB 1.22 MiB 1.20 MiB
c0f787f 24.14 KiB 1.21 MiB 1.19 MiB
843e9e7 24.14 KiB 1.22 MiB 1.20 MiB

@jamieQ jamieQ left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some drive-by commentary for your consideration :)

Comment thread Sources/Swift/AppHangTracker.swift Outdated
Comment thread Sources/Swift/AppHangTracker.swift Outdated
Comment thread Sources/Swift/AppHangTracker.swift Outdated
Backport of Synchronization.Mutex API (iOS 18+) using
os_unfair_lock for older deployment targets.
Document SentryMutex as preferred locking primitive for new
Swift code. Expose isSessionPaused as computed property for
test access after State struct migration.
Move sentrycrashbic callback registration back inside the
mutex to prevent a race between start() and stop(), matching
the original behavior.
Register callbacks outside the lock since
sentrycrashbic_registerAddedCallback synchronously
replays loaded images, each re-entering the lock.
Unregistering stays inside the lock as it does not
replay.
The threshold check used > which meant a delay exactly equal
to the configured threshold would not trigger a notification.
AppHangTracker: mid-hang observer add/remove, re-add during
hang, double-remove, stop/restart lifecycle, duration assertions.

RunLoopDelayTracker: partial observer removal, observer added
during active hang, double-remove same token.
Replace NSRecursiveLock with SentryMutex<T> for both trackers.
The lock is non-recursive but safe here because handlers are
called from the background queue while add/remove are
main-queue-only, so re-entrance cannot occur.
@philprime philprime changed the base branch from main to ref/sentry-mutex June 25, 2026 15:11
@philprime philprime marked this pull request as ready for review June 25, 2026 15:16
@philprime philprime marked this pull request as draft June 25, 2026 15:32
The threshold check should use strict greater-than (>) to match
the watchdog termination tracking behavior. A delay exactly equal
to the threshold should not trigger a notification.
Replace MockAppHangTracker (which duplicated threshold logic) with
a MockRunLoopDelayTracker injected into the real
SentryDefaultAppHangTracker. This ensures tests exercise the actual
production comparison logic instead of a mock copy.
Base automatically changed from ref/sentry-mutex to main June 26, 2026 07:57
Replace standalone SentryRunLoopDelayTrackerDependencies protocol with
a typealias composing DateProviderProvider & ApplicationProvider.
Update all tests to use MockDependencies instead of the removed
dateProvider: init parameter.
Add getKeyWindow() to SentryApplication protocol and UIApplication
extension. Rename RunLoopObserver to SentryRunLoopObserver. Update
SentryDependencyContainer to pass self as dependencies. Fix whitespace
and comment alignment in SentryApplication.swift.
Remove redundant Sentry prefix, now SentryDefaultRunLoopDelayTracker.
@philprime philprime marked this pull request as ready for review June 26, 2026 09:08
Comment on lines +96 to +106
if delay.duration > entry.threshold && !entry.hasBeenNotified {
entries[token]?.hasBeenNotified = true
entry.handler(.init(duration: delay.duration, state: .started))
}
} else if entry.hasBeenNotified {
entries[token]?.hasBeenNotified = false
entry.handler(.init(duration: delay.duration, state: .ended))
}
}
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The processDelay function calls observer handlers while holding a non-reentrant lock, creating a fragile design that is susceptible to deadlocks if handlers are modified in the future.
Severity: MEDIUM

Suggested Fix

To prevent potential deadlocks, avoid calling observer handlers while holding the lock. Consider collecting the handlers to be called, releasing the lock, and then executing the handlers. This removes the risk of a handler causing a re-entrant lock acquisition.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: Sources/Swift/Integrations/AppHangTracking/SentryAppHangTracker.swift#L92-L106

Potential issue: In `SentryAppHangTracker`, the `processDelay` function calls observer
handlers while holding a non-reentrant `SentryMutex`. The underlying `os_unfair_lock`
will deadlock or panic if a handler attempts to re-acquire the lock by calling
`addObserver` or `removeObserver`. While the current handler implementation does not do
this, this design is fragile. A future change to an observer's handler, or the addition
of a new handler that calls back into the tracker, could easily introduce a deadlock.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 0126e8f. Configure here.

Comment on lines +78 to +81
let (removed, isEmpty) = observers.withLock {
($0.removeValue(forKey: token), $0.isEmpty)
}
_ = removed

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let (removed, isEmpty) = observers.withLock {
($0.removeValue(forKey: token), $0.isEmpty)
}
_ = removed
let (_, isEmpty) = observers.withLock {
($0.removeValue(forKey: token), $0.isEmpty)
}

Comment on lines +141 to +144
let (removed, isEmpty) = observers.withLock {
($0.removeValue(forKey: token), $0.isEmpty)
}
_ = removed

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let (removed, isEmpty) = observers.withLock {
($0.removeValue(forKey: token), $0.isEmpty)
}
_ = removed
let (_, isEmpty) = observers.withLock {
($0.removeValue(forKey: token), $0.isEmpty)
}

Comment on lines +215 to +217
observers.values.forEach { observer in
observer(.init(duration: duration, isOngoing: false))
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be invoking these callbacks under the lock?

var semaphoreSuccess = false
while !semaphoreSuccess {
let timeout = DispatchTime.now() + DispatchTimeInterval.milliseconds(Int(pollingInterval * 1_000))
let result = semaphore.wait(timeout: timeout)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this logic didn't really change with this PR, but I wonder if we can come up with a better approach than blocking an entire thread with a semaphore. E.g. have we considered something in DispatchTimerSource API? I'm not personally super familiar with it, but essentially it seems like we just want something to notify us on a non-main thread after a certain amount of time has elapsed, which I think that API is designed to do. Seems a waste to burn a thread just blocking like this if there's some way to avoid it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-to-merge Use this label to trigger all PR workflows skip-changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants