
xcode api diff noise cleaner

Alex Soto edited this page Jun 23, 2026 · 1 revision

xcode-api-diff-noise-cleaner Training Log

Session: 2026-06-23 — Classification gaps from real b1/b2 triage

Trainer: SkillTrainer | Skill: xcode-api-diff-noise-cleaner | Trigger: Enhance the skill using real-world usage captured in Copilot session e6ac98f0-f638-4b03-9e41-c86d6fddbe83 (xcode27.0 b1 + b2 diff triage; commits a70297e99, 4b64f77d8).

Source evidence

The session triaged 288 b1 diffs (cleaned 63 noise, kept 225 actionable), then b2, under an explicit "understand everything / rubber-duck with Opus 4.8 + GPT-5.5 / triple-check" mandate. The agent's own turn-5 methodology writeup + the commit messages surfaced classification rules and techniques that the skill did not document.

Assessment

Issues found (ranked):

  1. Availability/deprecation changes absent from the Actionable list. The skill never mentioned API_AVAILABLE / API_DEPRECATED / NS_AVAILABLE* / __attribute__((availability)). A change to an API's availability/deprecation is real binding work but was unlisted.
  2. "Macro renames = noise" had no contrapositive. A #define whose expansion changes availability/deprecation IS public API. Real case: ExposureNotification redefined EN_API_AVAILABLE from API_AVAILABLE(ios(12.5)) → API_DEPRECATED("No longer supported.", ios(12.5,27.0)), deprecating ~75 public symbols framework-wide. The old "macros are noise" bullet actively invites dismissing this at scan speed.
  3. ⚠️ No "read the line in real header context" guidance. A +/- line can sit inside a block comment / HeaderDoc tag / \-continued macro; you can't always tell from the hunk. The session's core method was reconstructing context from the installed SDK header.
  4. ⚠️ Missing noise categories the session actually cleaned: version/build #define bumps, copyright-year rewrites, and brand-new import-only / boilerplate-only headers.
  5. 💡 Borderline techniques uncaptured: prior-season precedent, validating the rule against a prior season's decisions, rubber-ducking borderline "nothing" calls with a 2nd model.
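
As an illustration of issue 3, here is a minimal sketch of how a changed line's surrounding header context could be checked mechanically. It is not the method the session used (the session read the installed SDK header directly). The function names `in_block_comment` and `in_macro_continuation` and the sample header are hypothetical, and the scanner ignores string literals for brevity.

```python
def in_block_comment(lines, index):
    """Return True if lines[index] begins inside a /* ... */ comment.

    Simplified scanner: tracks /* */ and // only, ignores string literals.
    """
    inside = False
    for line in lines[:index]:
        pos = 0
        while pos < len(line):
            if inside:
                end = line.find("*/", pos)
                if end == -1:
                    break  # comment continues onto the next line
                inside = False
                pos = end + 2
            else:
                line_comment = line.find("//", pos)
                start = line.find("/*", pos)
                # No block comment opens, or a // comment hides it.
                if start == -1 or (line_comment != -1 and line_comment < start):
                    break
                inside = True
                pos = start + 2
    return inside


def in_macro_continuation(lines, index):
    """Return True if lines[index] continues a backslash-continued directive."""
    first = index
    while first > 0 and lines[first - 1].rstrip().endswith("\\"):
        first -= 1
    return first < index and lines[first].lstrip().startswith("#")


# Hypothetical header excerpt, for illustration only.
header = [
    "/*!",
    " @method oldName:",
    " */",
    "- (void)newName:(id)arg;",
    "#define WRAP(x) \\",
    "    do_thing(x)",
]

comment_line = in_block_comment(header, 1)         # True: HeaderDoc text
declaration_line = in_block_comment(header, 3)     # False: real declaration
macro_body_line = in_macro_continuation(header, 5)  # True: body of WRAP
```

A changed line that lands inside a comment is documentation churn, while the same text outside one may be a real declaration change, which is why the hunk alone is not always enough.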

Eval design note (pilot → sharper)

The first pilot eval (6 isolated diffs) leaked the answer: it hinted "~75 public symbols" and included the explanatory /// comment. All 3 models scored 6/6 on the OLD skill, so the eval was non-discriminating (see skill-trainer "Eval tests recall instead of behavior"). It was redesigned as a non-leading, scan-speed behavioral eval (4 macro/attribute diffs, no hints) that mirrors the real at-scale failure mode.

Cycle 1: Classification guidance (availability/deprecation + macro nuance + noise categories)

Hypothesis: Adding availability/deprecation as an Actionable category, the macro-redefinition-trap callout, and explicit version-bump / copyright / boilerplate-header noise bullets will (a) prevent false negatives on availability-bearing macro edits and (b) prevent false positives on version-bump macros — without over-flagging genuine macro noise.

Edit: .agents/skills/xcode-api-diff-noise-cleaner/SKILL.md (Classifying Diffs section, ~+33 lines): real-context callout; Actionable availability/deprecation block + macro-redefinition-trap; Category-1 copyright bullet; Category-2 macro contrapositive + version-bump bullet; Category-3 boilerplate-header bullet; new "Borderline Calls" subsection (precedent / validate-against-prior / rubber-duck).
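
The macro rules in this cycle can be sketched as a small classifier. This is a rough illustration under stated assumptions, not the skill's actual logic (the skill is prose guidance for a model): `classify_define_change` is a hypothetical helper, the regexes are simplified, and the `SK_VERSION` values are made up. Only the EN_API_AVAILABLE pair comes from the log.

```python
import re

# Availability/deprecation markers named in the assessment above.
AVAILABILITY = re.compile(
    r"API_AVAILABLE|API_DEPRECATED|NS_AVAILABLE\w*"
    r"|__attribute__\s*\(\(\s*availability"
)
# name, optional function-like parameter list, then the expansion.
DEFINE = re.compile(r"^\s*#\s*define\s+(\w+)(\([^)]*\))?\s*(.*)$")
# A bare numeric or dotted version value (simplifying assumption).
VERSION_VALUE = re.compile(r"^\d+(\.\d+)*$")


def classify_define_change(old_line, new_line):
    """Classify a changed #define as ACTIONABLE, NOTHING, or REVIEW."""
    old, new = DEFINE.match(old_line), DEFINE.match(new_line)
    if not old or not new:
        return "REVIEW"
    old_name, old_body = old.group(1), old.group(3).strip()
    new_name, new_body = new.group(1), new.group(3).strip()

    # Macro-redefinition trap: the expansion carries availability and changed.
    touches_availability = (
        AVAILABILITY.search(old_body) or AVAILABILITY.search(new_body)
    )
    if touches_availability and old_body != new_body:
        return "ACTIONABLE"

    # Version/build bump: same macro name, both values are plain versions.
    if (
        old_name == new_name
        and VERSION_VALUE.match(old_body)
        and VERSION_VALUE.match(new_body)
    ):
        return "NOTHING"

    # Anything else needs a human (or model) look.
    return "REVIEW"


exposure = classify_define_change(
    "#define EN_API_AVAILABLE API_AVAILABLE(ios(12.5))",
    '#define EN_API_AVAILABLE API_DEPRECATED("No longer supported.", ios(12.5,27.0))',
)  # "ACTIONABLE"

# Hypothetical version macro and values.
bump = classify_define_change("#define SK_VERSION 1000", "#define SK_VERSION 1001")
# "NOTHING"
```

The availability check runs first on purpose: a macro that is both renamed and re-expanded into a deprecation should never fall through to a noise verdict.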

Sharp eval (expected verdicts: ACT = ACTIONABLE, NOTH = NOTHING): A = ExposureNotification macro→deprecated (ACT), B = internal guard rename (NOTH), C = SpriteKit version bump (NOTH), D = property gains API_AVAILABLE (ACT):

| Model | Before | After | Δ |
|---|---|---|---|
| claude-sonnet-4.6 | 4/4 (A✓ B✓ C✓ D✓) | 4/4 | 0 |
| gpt-5.5 | 3/4 (C✗: version bump marked ACTIONABLE) | 4/4 (C fixed → NOTHING) | +1 |
| gemini-3.1-pro-preview | 4/4 | 4/4 | 0 |
| Total | 11/12 | 12/12 | +1 |
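
The totals and the no-regression check can be reproduced from the per-diff results above. The sketch below is only a bookkeeping illustration: the helper names `tally` and `regressions` are hypothetical, and the per-diff marks for models reported simply as 4/4 are assumed to be all correct.

```python
# Per-diff correctness (True = matched the expected verdict).
before = {
    "claude-sonnet-4.6": {"A": True, "B": True, "C": True, "D": True},
    "gpt-5.5": {"A": True, "B": True, "C": False, "D": True},
    "gemini-3.1-pro-preview": {"A": True, "B": True, "C": True, "D": True},
}
after = {
    model: {diff: True for diff in results} for model, results in before.items()
}


def tally(results):
    """Return (correct, total) across all models and diffs."""
    correct = sum(sum(per_diff.values()) for per_diff in results.values())
    total = sum(len(per_diff) for per_diff in results.values())
    return correct, total


def regressions(old, new):
    """List (model, diff) pairs that were correct before and wrong after."""
    return [
        (model, diff)
        for model in old
        for diff in old[model]
        if old[model][diff] and not new[model][diff]
    ]


before_score = tally(before)               # (11, 12)
after_score = tally(after)                 # (12, 12)
delta = after_score[0] - before_score[0]   # 1
regressed = regressions(before, after)     # []
```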

GPT-5.5 before: "C changes a public version constant macro, so actionable." → after: "C: NOTHING — version define bump only." Causally traceable to the new version-bump bullet.

Outcome: ✅ +1, no regressions (A/B/D unchanged & correct on all 3 models). Decision: kept.

Patterns Learned

  • Honest finding: the headline ExposureNotification macro-redef trap (item A) was already classified correctly by all 3 frontier models even on the OLD skill. That part of the enhancement is institutional-knowledge capture + consistency-at-scale + protection for faster/weaker models — not a measured capability fix. The measured improvement was the version-#define-bump noise category.
  • Leading evals hide reality. Embedding "this affects ~75 symbols" turned a behavior test into a giveaway. Strip context that pre-judges the verdict; test at the speed/conditions of real use.
  • False positives matter too. The gap that actually moved a model was over-flagging (GPT marking a version bump actionable), not the under-flagging trap we set out to fix.

Open Items

  • The real-context-reconstruction note, boilerplate-header category, and Borderline-Calls techniques are additive guidance not directly exercised by the 4-diff eval (hard to test in single-diff classification). Low regression risk; candidate for a future at-scale Arena eval over a whole beta's diff set.
  • Skill change is local/uncommitted — dotnet/macios.wiki is shared (the source session deliberately did not push). Commit/push left to the user.
