xcode api diff noise cleaner

xcode-api-diff-noise-cleaner Training Log

Session: 2026-06-23 — Classification gaps from real b1/b2 triage

Trainer: SkillTrainer | Skill: xcode-api-diff-noise-cleaner | Trigger: Enhance the skill using real-world usage captured in Copilot session e6ac98f0-f638-4b03-9e41-c86d6fddbe83 (xcode27.0 b1 + b2 diff triage; commits a70297e99, 4b64f77d8).

Source evidence

The session triaged 288 b1 diffs (cleaned 63 noise, kept 225 actionable), then b2, under an explicit "understand everything / rubber-duck with Opus 4.8 + GPT-5.5 / triple-check" mandate. The agent's own turn-5 methodology writeup + the commit messages surfaced classification rules and techniques that the skill did not document.

Assessment

Issues found (ranked):

❌ Availability/deprecation changes absent from the Actionable list. The skill never mentioned API_AVAILABLE / API_DEPRECATED / NS_AVAILABLE* / __attribute__((availability)). A change to an API's availability/deprecation is real binding work but was unlisted.
❌ "Macro renames = noise" had no contrapositive. A #define whose expansion changes availability/deprecation IS public API. Real case: ExposureNotification redefined EN_API_AVAILABLE from API_AVAILABLE(ios(12.5)) → API_DEPRECATED("No longer supported.", ios(12.5,27.0)), deprecating ~75 public symbols framework-wide. The old "macros are noise" bullet actively invites dismissing this at scan speed.
⚠️ No "read the line in real header context" guidance. A +/- line can sit inside a block comment, a HeaderDoc tag, or a backslash-continued macro, and the hunk alone does not always show which. The session's core method was reconstructing context from the installed SDK header.
⚠️ Missing noise categories the session actually cleaned: version/build #define bumps, copyright-year rewrites, and brand-new import-only / boilerplate-only headers.
💡 Borderline techniques uncaptured: prior-season precedent, validating the rule against a prior season's decisions, rubber-ducking borderline "nothing" calls with a 2nd model.
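
To make the macro-redefinition trap concrete, here is a minimal sketch of a pre-filter that flags #define edits whose old or new expansion carries an availability marker. The function names and the token list are illustrative assumptions, not part of the skill or of any Apple tooling. A hit is a prompt to open the real header, not a verdict.

```python
import re

# Markers treated as availability/deprecation-bearing (taken from the list above).
# Substring matching means NS_AVAILABLE also catches NS_AVAILABLE_IOS and friends.
AVAILABILITY_TOKENS = (
    "API_AVAILABLE",
    "API_DEPRECATED",
    "NS_AVAILABLE",
    "__attribute__((availability",
)

# Matches an added/removed "#define NAME expansion" or "#define NAME(args) expansion".
DEFINE_RE = re.compile(r"^[+-]\s*#\s*define\s+(\w+)(?:\([^)]*\))?\s*(.*)$")


def macro_expansions(diff_text):
    """Map macro name -> {'-': old expansion, '+': new expansion}."""
    found = {}
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # unified-diff file headers, not content lines
        match = DEFINE_RE.match(line)
        if match:
            found.setdefault(match.group(1), {})[line[0]] = match.group(2).strip()
    return found


def availability_macro_changes(diff_text):
    """Return names of macros whose changed expansion mentions an availability marker."""
    flagged = []
    for name, sides in macro_expansions(diff_text).items():
        old, new = sides.get("-", ""), sides.get("+", "")
        if old == new:
            continue  # e.g. a bare guard macro that was only added or removed
        if any(token in old or token in new for token in AVAILABILITY_TOKENS):
            flagged.append(name)
    return flagged
```

Fed the ExposureNotification edit described above, availability_macro_changes returns ["EN_API_AVAILABLE"]. An internal guard rename (different macro names on the - and + sides, both with empty expansions) returns an empty list.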

Eval design note (pilot → sharper)

The first pilot eval (6 isolated diffs) leaked the answer: it hinted "~75 public symbols" and included the explanatory /// comment. All 3 models scored 6/6 on the OLD skill, so the eval was non-discriminating (see skill-trainer "Eval tests recall instead of behavior"). It was redesigned as a non-leading, scan-speed behavioral eval (4 macro/attribute diffs, no hints) that mirrors the real at-scale failure mode.

Cycle 1: Classification guidance (availability/deprecation + macro nuance + noise categories)

Hypothesis: Adding availability/deprecation as an Actionable category, the macro-redefinition-trap callout, and explicit version-bump / copyright / boilerplate-header noise bullets will (a) prevent false negatives on availability-bearing macro edits and (b) prevent false positives on version-bump macros — without over-flagging genuine macro noise.
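
The version-bump and copyright noise categories named in the hypothesis could be expressed as checks like the following. This is a sketch under stated assumptions: the inputs are a removed/added line pair with the diff's +/- prefix already stripped, and the "name contains VERSION or BUILD" heuristic is an assumption of this example, not a rule quoted from the skill.

```python
import re

DEFINE_VALUE_RE = re.compile(r"^#\s*define\s+(\w+)\s+(\S+)\s*$")
NUMERIC_RE = re.compile(r"^\d+(?:\.\d+)*$")
# Assumption: version/build macros are recognisable by name.
VERSION_NAME_RE = re.compile(r"VERSION|BUILD", re.IGNORECASE)
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def is_version_define_bump(removed, added):
    """True when both lines define the same version-like macro and only its number differs."""
    old = DEFINE_VALUE_RE.match(removed.strip())
    new = DEFINE_VALUE_RE.match(added.strip())
    if not (old and new) or old.group(1) != new.group(1):
        return False
    if not VERSION_NAME_RE.search(old.group(1)):
        return False  # a numeric constant with another name may be real API
    return (
        old.group(2) != new.group(2)
        and bool(NUMERIC_RE.match(old.group(2)))
        and bool(NUMERIC_RE.match(new.group(2)))
    )


def is_copyright_year_rewrite(removed, added):
    """True when a copyright line differs only in its year(s)."""
    if "copyright" not in removed.lower() or removed == added:
        return False
    return YEAR_RE.sub("YEAR", removed) == YEAR_RE.sub("YEAR", added)
```

With a hypothetical macro name, is_version_define_bump("#define SK_VERSION 27000000", "#define SK_VERSION 27000100") is True, while the same numeric change on a macro named kMaxItems is False and stays with the reviewer.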

Edit: .agents/skills/xcode-api-diff-noise-cleaner/SKILL.md (Classifying Diffs section, ~+33 lines): real-context callout; Actionable availability/deprecation block + macro-redefinition-trap; Category-1 copyright bullet; Category-2 macro contrapositive + version-bump bullet; Category-3 boilerplate-header bullet; new "Borderline Calls" subsection (precedent / validate-against-prior / rubber-duck).

Sharp eval — A=ExposureNotification macro→deprecated (ACT), B=internal guard rename (NOTH), C=SpriteKit version bump (NOTH), D=property gains API_AVAILABLE (ACT):

Model	Before	After	Δ
claude-sonnet-4.6	4/4 (A✓ B✓ C✓ D✓)	4/4	0
gpt-5.5	3/4 (C✗ — version bump marked ACTIONABLE)	4/4 (C fixed → NOTHING)	+1
gemini-3.1-pro-preview	4/4	4/4	0
Total	11/12	12/12	+1

GPT-5.5 before: "C changes a public version constant macro, so actionable." → after: "C: NOTHING — version define bump only." Causally traceable to the new version-bump bullet.

Outcome: ✅ +1, no regressions (A/B/D unchanged & correct on all 3 models). Decision: kept.
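
As a quick arithmetic check on the table, the totals and the "no regressions" claim can be recomputed from the per-model scores. The scores below are transcribed from the table above; the helper names are illustrative.

```python
# (before, after) correct verdicts out of 4 diffs, transcribed from the table.
scores = {
    "claude-sonnet-4.6": (4, 4),
    "gpt-5.5": (3, 4),
    "gemini-3.1-pro-preview": (4, 4),
}


def totals(scores):
    """Sum the before and after columns across models."""
    before = sum(b for b, _ in scores.values())
    after = sum(a for _, a in scores.values())
    return before, after


def regressions(scores):
    """Models whose score dropped after the edit."""
    return [model for model, (b, a) in scores.items() if a < b]


before, after = totals(scores)
print(f"{before}/12 -> {after}/12, delta {after - before:+d}, regressions: {regressions(scores)}")
```

Per-model totals alone could hide a swap (one diff fixed, another broken), which is why the per-diff check that A/B/D stayed correct is recorded in the outcome as well.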

Patterns Learned

Honest finding: the headline ExposureNotification macro-redef trap (item A) was already classified correctly by all 3 frontier models even on the OLD skill. That part of the enhancement is institutional-knowledge capture + consistency-at-scale + protection for faster/weaker models — not a measured capability fix. The measured improvement was the version-#define-bump noise category.
Leading evals hide reality. Embedding "this affects ~75 symbols" turned a behavior test into a giveaway. Strip context that pre-judges the verdict; test at the speed/conditions of real use.
False positives matter too. The gap that actually moved a model was over-flagging (GPT marking a version bump actionable), not the under-flagging trap we set out to fix.

Open Items

The real-context-reconstruction note, boilerplate-header category, and Borderline-Calls techniques are additive guidance not directly exercised by the 4-diff eval (hard to test in single-diff classification). Low regression risk; candidate for a future at-scale Arena eval over a whole beta's diff set.
Skill change is local/uncommitted — dotnet/macios.wiki is shared (the source session deliberately did not push). Commit/push left to the user.
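
For a future at-scale eval, the real-context reconstruction in the first open item could be exercised with a helper along these lines. It is a rough sketch, not the session's actual method: it labels each line of an installed header so a diff line can be looked up by line number. It does not handle string literals or preprocessor conditionals, and the function name and labels are hypothetical.

```python
def line_contexts(header_text):
    """Label each line 'comment', 'macro-continuation', or 'code'.

    A line is 'comment' when its first non-blank character is inside a
    comment; 'macro-continuation' when the previous non-comment line ended
    with a backslash.
    """
    contexts = []
    in_block = False   # inside /* ... */ at the current scan position
    continued = False  # previous non-comment line ended with a backslash
    for line in header_text.splitlines():
        starts_in_block = in_block
        i = 0
        while i < len(line):
            pair = line[i:i + 2]
            if in_block:
                if pair == "*/":
                    in_block = False
                    i += 2
                    continue
            elif pair == "/*":
                in_block = True
                i += 2
                continue
            elif pair == "//":
                break  # rest of the line is a line comment
            i += 1
        if starts_in_block or line.lstrip().startswith(("/*", "//")):
            kind = "comment"
        elif continued:
            kind = "macro-continuation"
        else:
            kind = "code"
        contexts.append(kind)
        continued = kind != "comment" and line.rstrip().endswith("\\")
    return contexts
```

A +/- line that mentions API_AVAILABLE but maps to 'comment' is likely documentation churn, while the same text on a 'macro-continuation' line belongs to a #define whose expansion may be public API.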
