xcode api diff noise cleaner
Trainer: SkillTrainer | Skill: xcode-api-diff-noise-cleaner | Trigger: Enhance the skill
using real-world usage captured in Copilot session e6ac98f0-f638-4b03-9e41-c86d6fddbe83
(xcode27.0 b1 + b2 diff triage; commits a70297e99, 4b64f77d8).
The session triaged 288 b1 diffs (cleaned 63 noise, kept 225 actionable), then b2, under an explicit "understand everything / rubber-duck with Opus 4.8 + GPT-5.5 / triple-check" mandate. The agent's own turn-5 methodology writeup + the commit messages surfaced classification rules and techniques that the skill did not document.
Issues found (ranked):
- ❌ Availability/deprecation changes absent from the Actionable list. The skill never mentioned `API_AVAILABLE`/`API_DEPRECATED`/`NS_AVAILABLE*`/`__attribute__((availability))`. A change to an API's availability/deprecation is real binding work but was unlisted.
- ❌ "Macro renames = noise" had no contrapositive. A `#define` whose expansion changes availability/deprecation IS public API. Real case: `ExposureNotification` redefined `EN_API_AVAILABLE` from `API_AVAILABLE(ios(12.5))` → `API_DEPRECATED("No longer supported.", ios(12.5,27.0))`, deprecating ~75 public symbols framework-wide. The old "macros are noise" bullet actively invites dismissing this at scan speed.
- ⚠️ No "read the line in real header context" guidance. A `+`/`-` line can sit inside a block comment / HeaderDoc tag / `\`-continued macro; you can't always tell from the hunk. The session's core method was reconstructing context from the installed SDK header.
- ⚠️ Missing noise categories the session actually cleaned: version/build `#define` bumps, copyright-year rewrites, and brand-new import-only / boilerplate-only headers.
- 💡 Borderline techniques uncaptured: prior-season precedent, validating the rule against a prior season's decisions, rubber-ducking borderline "nothing" calls with a 2nd model.
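The macro-redefinition trap described above can be illustrated with a small sketch. This is not the skill's implementation; it is a hypothetical helper (`classify_define_change` and the `REVIEW` verdict are names made up here) that assumes the `-`/`+` diff prefixes are already stripped and only knows the availability spellings named in the issue list.

```python
import re

# Availability spellings named in the issue list above; not an exhaustive set.
AVAILABILITY = re.compile(
    r"\b(?:API_AVAILABLE|API_DEPRECATED|NS_AVAILABLE\w*)\b"
    r"|__attribute__\s*\(\(\s*availability"
)
# Matches '#define NAME body' and '#define NAME(args) body'.
DEFINE = re.compile(r"^\s*#\s*define\s+(\w+)(?:\([^)]*\))?\s*(.*)$")


def expansion(line):
    """Return the macro body of a #define line, or None if it is not one."""
    m = DEFINE.match(line)
    return m.group(2).strip() if m else None


def classify_define_change(removed, added):
    """Classify one -/+ pair of #define lines (diff prefixes already stripped)."""
    old, new = expansion(removed), expansion(added)
    if old is None or new is None:
        return "REVIEW"  # not a plain #define pair; needs a human look
    touches_availability = AVAILABILITY.search(old) or AVAILABILITY.search(new)
    if old != new and touches_availability:
        # The expansion changes what every user of the macro declares.
        return "ACTIONABLE"
    return "NOTHING"


# The ExposureNotification case from the issue list.
print(classify_define_change(
    '#define EN_API_AVAILABLE API_AVAILABLE(ios(12.5))',
    '#define EN_API_AVAILABLE API_DEPRECATED("No longer supported.", ios(12.5,27.0))',
))
# A hypothetical include-guard rename with an empty expansion.
print(classify_define_change("#define FOO_INTERNAL_H", "#define FOO_INTERNAL_H_"))
```

The first call prints `ACTIONABLE` and the second `NOTHING`. The point of the sketch is that the verdict depends on the expansion, not on the fact that the changed line is a `#define`.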
First pilot eval (6 isolated diffs) leaked the answer (hinted "~75 public symbols" + the explanatory
/// comment) — all 3 models scored 6/6 on the OLD skill, i.e. non-discriminating
(see skill-trainer "Eval tests recall instead of behavior"). Redesigned a non-leading, scan-speed
behavioral eval (4 macro/attribute diffs, no hints) that mirrors the real at-scale failure mode.
Hypothesis: Adding availability/deprecation as an Actionable category, the macro-redefinition-trap callout, and explicit version-bump / copyright / boilerplate-header noise bullets will (a) prevent false negatives on availability-bearing macro edits and (b) prevent false positives on version-bump macros — without over-flagging genuine macro noise.
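As a rough illustration of the two noise categories the hypothesis names, the sketch below detects a copyright-year rewrite and a version/build `#define` bump on a single `-`/`+` pair. The helper names, the `VERSION|BUILD` name hint, and the sample values are assumptions made for this example, not rules taken from the skill; diff prefixes are assumed to be stripped already.

```python
import re

COPYRIGHT = re.compile(r"copyright", re.IGNORECASE)
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
NUMERIC_DEFINE = re.compile(r"^\s*#\s*define\s+(\w+)\s+(\d+(?:\.\d+)*)\s*$")
# Assumption: version/build macros carry one of these words in their name.
VERSION_HINT = re.compile(r"VERSION|BUILD", re.IGNORECASE)


def is_copyright_year_rewrite(removed, added):
    """True when both lines are copyright lines that differ only in the year(s)."""
    if not (COPYRIGHT.search(removed) and COPYRIGHT.search(added)):
        return False
    # Mask every year; if the lines are then identical, only years changed.
    return removed != added and YEAR.sub("YEAR", removed) == YEAR.sub("YEAR", added)


def is_version_define_bump(removed, added):
    """True when the same version-like macro only changes its numeric value."""
    old, new = NUMERIC_DEFINE.match(removed), NUMERIC_DEFINE.match(added)
    if not (old and new):
        return False
    same_name = old.group(1) == new.group(1)
    return (
        same_name
        and bool(VERSION_HINT.search(old.group(1)))
        and old.group(2) != new.group(2)
    )


# Hypothetical sample lines; the numeric values are illustrative only.
print(is_copyright_year_rewrite(
    "// Copyright (c) 2019-2025 Apple Inc. All rights reserved.",
    "// Copyright (c) 2019-2026 Apple Inc. All rights reserved.",
))
print(is_version_define_bump("#define SK_VERSION 26000000", "#define SK_VERSION 27000000"))
print(is_version_define_bump("#define MAX_ITEMS 4", "#define MAX_ITEMS 8"))
```

The last call returns `False` on purpose: a numeric change to a macro without a version-like name may be a public constant and should not be auto-classified as noise.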
Edit: .agents/skills/xcode-api-diff-noise-cleaner/SKILL.md (Classifying Diffs section, ~+33 lines):
real-context callout; Actionable availability/deprecation block + macro-redefinition-trap; Category-1
copyright bullet; Category-2 macro contrapositive + version-bump bullet; Category-3 boilerplate-header
bullet; new "Borderline Calls" subsection (precedent / validate-against-prior / rubber-duck).
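The real-context callout could be sketched roughly as follows: given the full text of the installed header and a 1-based line number, report whether that line starts inside an open block comment or continues a `\`-continued line. This is a deliberately naive scanner written for illustration (it ignores string literals and `//` comments), the function name is hypothetical, and the sample header layout is invented.

```python
def line_start_context(header_text, line_no):
    """Return the state in effect at the START of a 1-based header line.

    One of: 'inside_block_comment', 'macro_continuation', 'top_level'.
    Naive: string literals and '//' comments are not handled.
    """
    in_comment = False
    prev_continued = False
    for idx, line in enumerate(header_text.splitlines(), start=1):
        if idx == line_no:
            if in_comment:
                return "inside_block_comment"
            return "macro_continuation" if prev_continued else "top_level"
        # Advance the block-comment state across this line.
        i = 0
        while True:
            token = "*/" if in_comment else "/*"
            pos = line.find(token, i)
            if pos == -1:
                break
            in_comment = not in_comment
            i = pos + 2
        prev_continued = line.rstrip().endswith("\\")
    raise ValueError(f"line {line_no} is past the end of the header")


# Hypothetical header excerpt; only the macro name comes from the PR text.
header = "\n".join([
    "/**",
    " * @abstract Old text.",
    " */",
    "#define EN_API_AVAILABLE \\",
    "    API_AVAILABLE(ios(12.5))",
    "@interface Foo : NSObject",
])
for n in (2, 5, 6):
    print(n, line_start_context(header, n))
```

Line 2 is reported as `inside_block_comment`, line 5 as `macro_continuation`, and line 6 as `top_level`. Neither of the first two is visible from a hunk that shows only the changed line.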
Sharp eval — A=ExposureNotification macro→deprecated (ACT), B=internal guard rename (NOTH), C=SpriteKit version bump (NOTH), D=property gains API_AVAILABLE (ACT):
| Model | Before | After | Δ |
|---|---|---|---|
| claude-sonnet-4.6 | 4/4 (A✓ B✓ C✓ D✓) | 4/4 | 0 |
| gpt-5.5 | 3/4 (C✗ — version bump marked ACTIONABLE) | 4/4 (C fixed → NOTHING) | +1 |
| gemini-3.1-pro-preview | 4/4 | 4/4 | 0 |
| Total | 11/12 | 12/12 | +1 |
GPT-5.5 before: "C changes a public version constant macro, so actionable." → after: "C: NOTHING — version define bump only." Causally traceable to the new version-bump bullet.
Outcome: ✅ +1, no regressions (A/B/D unchanged & correct on all 3 models). Decision: kept.
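For reference, the totals and the "no regressions" claim can be re-derived from the per-item results in the table. The snippet below is only a tally of the numbers reported above; the data layout (sets of missed items per model) is a convenience chosen for this example.

```python
ITEMS = "ABCD"

# Items each model got wrong, as reported in the table above.
missed_before = {
    "claude-sonnet-4.6": set(),
    "gpt-5.5": {"C"},
    "gemini-3.1-pro-preview": set(),
}
missed_after = {model: set() for model in missed_before}


def total_correct(missed):
    """Sum of correct verdicts across all models."""
    return sum(len(ITEMS) - len(items) for items in missed.values())


possible = len(ITEMS) * len(missed_before)
before, after = total_correct(missed_before), total_correct(missed_after)

# A regression is an item that was right before and wrong after.
regressions = [
    (model, item)
    for model in missed_before
    for item in sorted(missed_after[model] - missed_before[model])
]

print(f"before {before}/{possible}, after {after}/{possible}, delta {after - before:+d}")
print("regressions:", regressions)
```

This prints `before 11/12, after 12/12, delta +1` and an empty regression list, matching the table.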
- Honest finding: the headline ExposureNotification macro-redef trap (item A) was already classified correctly by all 3 frontier models even on the OLD skill. That part of the enhancement is institutional-knowledge capture + consistency-at-scale + protection for faster/weaker models, not a measured capability fix. The measured improvement was the version-`#define`-bump noise category.
- Leading evals hide reality. Embedding "this affects ~75 symbols" turned a behavior test into a giveaway. Strip context that pre-judges the verdict; test at the speed/conditions of real use.
- False positives matter too. The gap that actually moved a model was over-flagging (GPT marking a version bump actionable), not the under-flagging trap we set out to fix.
- The real-context-reconstruction note, boilerplate-header category, and Borderline-Calls techniques are additive guidance not directly exercised by the 4-diff eval (hard to test in single-diff classification). Low regression risk; candidate for a future at-scale Arena eval over a whole beta's diff set.
- Skill change is local/uncommitted; `dotnet/macios.wiki` is shared (the source session deliberately did not push). Commit/push left to the user.