From 14cf43bbd4f1133581f1d0fcb4363b2f29e08844 Mon Sep 17 00:00:00 2001 From: Louise Lau Date: Thu, 2 Jul 2026 08:50:32 +0800 Subject: [PATCH] =?UTF-8?q?feat(codegraph):=20helper=20extraction=20v2=20?= =?UTF-8?q?=E2=80=94=20parameterize=20proposals=20+=20external=20OSS=20cas?= =?UTF-8?q?e=20studies?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit v2 turns near-dupe DETECTION into a concrete CONSOLIDATION proposal, and proves the whole pipeline on three real open-source repos. - proposeParameterizedHelper (PURE): aligns cluster bodies token-by-token (literals kept, own name neutralized), turns each varying position into a parameter (co-varying positions collapse into one), names params from context (`kind: "min"` → param `kind`), and emits the reference body with substitutions + the exact call each original becomes. Conservative: declines with the reason when bodies don't align or >4 parts vary; mixed structural variants keep the LARGEST aligned subset and report the rest as dropped. - find_similar_helpers appends proposals for the top clusters (still detection-only — no edits). - Nested-member fix: an inner function no longer clusters with its own parent (caught live on axios: buildPath inside formDataToJSON). - docs/ADOPTION.md external case studies (real tool output, commits pinned, 2026-07-02): zod 912f0f5 → 38 clusters / ~671 lines (7-copy positive/negative family + the real parameterize proposal); hono b20d422 → importPublicKey/importPrivateKey ~95% + dup getQueryString; axios e435384 → setFormDataHeaders byte-near-identical across two files (manually verified). With the honest caveats (detection-only; duplication can be deliberate). --- docs/ADOPTION.md | 34 ++++++ src/codegraph/helper-extract.ts | 119 +++++++++++++++++++- src/tools/codegraph/find-similar-helpers.ts | 19 +++- test/helper-extract.test.ts | 76 ++++++++++++- 4 files changed, 244 insertions(+), 4 deletions(-) diff --git a/docs/ADOPTION.md b/docs/ADOPTION.md index f753b03..c69f69f 100644 --- a/docs/ADOPTION.md +++ b/docs/ADOPTION.md @@ -46,6 +46,40 @@ The #64 row is the one to internalize: **blocked runs are the product working**, A 3am "improvement" that changes behavior is worse than no improvement, so maintain proves safety or declines — and the receipt records which one happened. +## External case studies — three popular OSS repos + +We pointed the read-only detection (`find_similar_helpers`, zero setup — no install, no config) +at fresh clones of three widely-used projects (2026-07-02, shallow clones at the commits shown). +These are actual tool outputs, reproducible with the commands below: + +```bash +git clone --depth 1 https://github.com/colinhacks/zod && cd zod +qodex # → find_similar_helpers path="packages/zod/src" +``` + +| Repo (commit) | Scanned | Found | +|---|---|---| +| **zod** `912f0f5` (`packages/zod/src`) | 116 files, 1,103 functions | **38 near-dupe clusters, ~671 collapsible lines.** Headline: a **7-copy family** — `positive`/`negative`/`nonpositive`/`nonnegative` across ZodNumber+ZodBigInt (~94% similar), each differing only in `kind:`/`inclusive:` | +| **hono** `b20d422` (`src`) | 186 files, 255 functions | 2 clusters: `importPublicKey`/`importPrivateKey` (~95%, ~26 lines) and a duplicated `getQueryString` in the aws-lambda handler (~92%, ~21 lines) | +| **axios** `e435384` (`lib`) | 66 files, 152 functions | 1 cluster: `setFormDataHeaders` **byte-near-identical** in `helpers/resolveConfig.js` and `adapters/http.js` (verified by eye: the only difference is a `\|\| {}` guard) — a textbook `consolidate-dupes` candidate | + +And because detection now includes the **parameterize proposal** (v2), the zod family comes back +not as "these look similar" but as the concrete consolidation: + +``` +Proposed shared helper `extractedHelper(kind, inclusive)` — the bodies differ ONLY in these 2 spot(s): + positive(…) → extractedHelper("min", false) + nonpositive(…) → extractedHelper("max", true) + nonnegative(…) → extractedHelper("min", true) + (not covered — different structure: the ZodNumber variants, whose value is 0 vs BigInt(0)) +``` + +Two honest caveats. First, these are *detection* results — read-only, nothing was changed in any +repo. Second, duplication in mature libraries is sometimes deliberate (readability, hot-path +inlining, API symmetry): the tool's job is to surface the opportunity with proof; whether to take +it stays a maintainer's call. That division of labor — *mechanical evidence, human judgment* — is +the same verify-or-block philosophy the rest of maintain runs on. + ## Recommended rollout ladder Adopt one rung at a time; each rung is strictly riskier than the previous. Stay on a rung until diff --git a/src/codegraph/helper-extract.ts b/src/codegraph/helper-extract.ts index a9bebe5..4da85fb 100644 --- a/src/codegraph/helper-extract.ts +++ b/src/codegraph/helper-extract.ts @@ -140,7 +140,19 @@ export function findSimilarHelpers( } const clusters: HelperCluster[] = []; - for (const { members: idxs } of seeded) { + for (const { members: rawIdxs } of seeded) { + // Drop members NESTED inside another member (same file, contained line range): a inner helper + // shares most of its parent's tokens, so parent+child cluster together — but "extract the child + // from its parent" is not a duplicate-consolidation opportunity. Keep the outermost. + const idxs = rawIdxs.filter(i => { + const a = prepared[i]!.u; + return !rawIdxs.some(j => { + if (j === i) return false; + const b = prepared[j]!.u; + return a.file === b.file && b.startLine <= a.startLine && a.endLine <= b.endLine + && (b.endLine - b.startLine) > (a.endLine - a.startLine); + }); + }); if (idxs.length < 2) continue; let sum = 0, cnt = 0; for (let a = 0; a < idxs.length; a++) { @@ -166,6 +178,111 @@ export function findSimilarHelpers( return clusters.sort((a, b) => b.estLinesSaved - a.estLinesSaved || b.avgSimilarity - a.avgSimilarity); } +// ————————————————————————————— v2: parameterized-consolidation proposal ————————————————————————————— + +export interface ParamProposal { + ok: boolean; + reason?: string; + helperName: string; + /** One entry per parameter the shared helper needs; values are per-member (same order as input). */ + params: { name: string; values: string[] }[]; + /** The reference member's body with the varying tokens replaced by the parameter names. */ + sketch: string; + /** How each original function maps to a call of the shared helper. */ + calls: { member: string; args: string[] }[]; + /** Members left out because their body doesn't align with the majority (e.g. a BigInt variant). */ + dropped?: string[]; +} + +interface RawTok { text: string; start: number } + +/** Code tokens WITH literals kept and source offsets (for substitution in the sketch). PURE. */ +function rawTokens(body: string): RawTok[] { + const re = /"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'|`(?:[^`\\]|\\.)*`|\d+(?:\.\d+)?|[A-Za-z_$][\w$]*|[^\sA-Za-z0-9_$]/g; + const out: RawTok[] = []; + let m: RegExpExecArray | null; + while ((m = re.exec(body))) out.push({ text: m[0], start: m.index }); + return out; +} + +/** + * Given a cluster of near-duplicate functions, propose ONE parameterized helper: align the bodies + * token-by-token, turn each varying position into a parameter (positions that always vary together + * collapse into one), and emit the reference body with those tokens substituted — plus the exact + * call each original becomes. Mechanical and conservative: if the bodies don't align structurally + * (different token counts) or vary in too many places (>4 params), it declines with the reason + * rather than proposing something wrong. PURE. + */ +export function proposeParameterizedHelper(members: FunctionUnit[], helperName = 'extractedHelper'): ParamProposal { + const decline = (reason: string): ParamProposal => ({ ok: false, reason, helperName, params: [], sketch: '', calls: [] }); + if (members.length < 2) return decline('need at least two functions'); + + // Neutralize each function's own name so it never reads as a varying token. + let bodies = members.map(m => m.body.replace(new RegExp(`\\b${m.name.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')}\\b`, 'g'), helperName)); + let toks = bodies.map(b => rawTokens(b)); + let dropped: string[] | undefined; + if (toks.some(t => t.length !== toks[0]!.length)) { + // Mixed structural variants in one cluster (e.g. a Number family + a BigInt family): keep the + // LARGEST same-token-count subset and report the rest as dropped, instead of declining outright. + const byCount = new Map(); + toks.forEach((t, i) => { const k = t.length; if (!byCount.has(k)) byCount.set(k, []); byCount.get(k)!.push(i); }); + const best = [...byCount.values()].sort((a, b) => b.length - a.length)[0]!; + if (best.length < 2) return decline('bodies do not align structurally (token counts differ) — too divergent to parameterize mechanically'); + dropped = members.filter((_, i) => !best.includes(i)).map(m => m.name); + members = best.map(i => members[i]!); + bodies = best.map(i => bodies[i]!); + toks = best.map(i => toks[i]!); + } + + // Positions where members disagree, grouped by their value-vector (co-varying positions = one param). + const n = toks[0]!.length; + const groups = new Map(); + for (let p = 0; p < n; p++) { + const values = toks.map(t => t[p]!.text); + if (values.every(v => v === values[0])) continue; + const key = JSON.stringify(values); + if (!groups.has(key)) groups.set(key, []); + groups.get(key)!.push(p); + } + if (groups.size === 0) return decline('bodies are identical — use consolidate-dupes, no parameter needed'); + if (groups.size > 4) return decline(`${groups.size} independently-varying parts — too many to parameterize cleanly`); + + // Name each parameter from context when possible (`kind: "min"` → param `kind`), else p1/p2…. + const used = new Set(); + const params: { name: string; values: string[]; positions: number[] }[] = []; + for (const [key, positions] of groups) { + const first = positions[0]!; + const ref = toks[0]!; + let name = ref[first - 1]?.text === ':' && /^[A-Za-z_$][\w$]*$/.test(ref[first - 2]?.text ?? '') ? ref[first - 2]!.text : `p${params.length + 1}`; + while (used.has(name)) name = `${name}_`; + used.add(name); + params.push({ name, values: JSON.parse(key), positions }); + } + + // Sketch: reference body with each varying token replaced by its parameter name (right-to-left). + let sketch = bodies[0]!; + const subs = params.flatMap(p => p.positions.map(pos => ({ pos, name: p.name }))) + .sort((a, b) => toks[0]![b.pos]!.start - toks[0]![a.pos]!.start); + for (const s of subs) { + const t = toks[0]![s.pos]!; + sketch = sketch.slice(0, t.start) + s.name + sketch.slice(t.start + t.text.length); + } + const calls = members.map((m, i) => ({ member: m.name, args: params.map(p => p.values[i]!) })); + return { ok: true, helperName, params: params.map(({ name, values }) => ({ name, values })), sketch, calls, dropped }; +} + +/** Render a proposal as a compact, reviewable block. PURE. */ +export function formatParamProposal(pr: ParamProposal): string { + if (!pr.ok) return `No mechanical parameterization: ${pr.reason}`; + const lines = [ + `Proposed shared helper \`${pr.helperName}(${pr.params.map(p => p.name).join(', ')})\` — the bodies differ ONLY in these ${pr.params.length} spot(s):`, + ]; + for (const c of pr.calls) lines.push(` ${c.member}(…) → ${pr.helperName}(${c.args.join(', ')})`); + if (pr.dropped?.length) lines.push(` (not covered — different structure: ${pr.dropped.join(', ')})`); + lines.push('', '```', pr.sketch.split('\n').slice(0, 16).join('\n'), '```'); + return lines.join('\n'); +} + /** A concise, agent-readable report of extraction opportunities. PURE. */ export function formatHelperClusters(clusters: HelperCluster[]): string { if (!clusters.length) return 'No near-duplicate helper clusters found — nothing worth extracting.'; diff --git a/src/tools/codegraph/find-similar-helpers.ts b/src/tools/codegraph/find-similar-helpers.ts index c417dd8..9b5f909 100644 --- a/src/tools/codegraph/find-similar-helpers.ts +++ b/src/tools/codegraph/find-similar-helpers.ts @@ -14,7 +14,7 @@ import { promises as fs } from 'fs'; import * as path from 'path'; import { Tool, type ToolContext, type ToolResult } from '../base.js'; import { extractSymbols } from '../../codegraph/extractor.js'; -import { findSimilarHelpers, formatHelperClusters, type FunctionUnit } from '../../codegraph/helper-extract.js'; +import { findSimilarHelpers, formatHelperClusters, proposeParameterizedHelper, formatParamProposal, type FunctionUnit } from '../../codegraph/helper-extract.js'; const Args = z.object({ path: z.string().optional().describe('Restrict the scan to this subdirectory (relative to cwd). Default: whole project.'), @@ -78,9 +78,24 @@ export class FindSimilarHelpersTool extends Tool> { const clusters = findSimilarHelpers(units, { minSim: args.min_similarity }) .filter(c => args.include_exact || !c.exact); + // v2: for the top clusters, try a mechanical PARAMETERIZE proposal — the concrete shared + // helper + the exact call each original becomes. Declines (with the reason) when bodies + // don't align, so a proposal only appears when the consolidation is genuinely mechanical. + const proposals: string[] = []; + for (const c of clusters.slice(0, 2)) { + const bodies = c.members + .map(m => units.find(u => u.name === m.name && u.file === m.file && u.startLine === m.startLine)) + .filter((u): u is FunctionUnit => !!u) + .slice(0, 6); + if (bodies.length < 2) continue; + const pr = proposeParameterizedHelper(bodies, c.suggestedName); + if (pr.ok) proposals.push(`For cluster \`${c.suggestedName}\`:\n${formatParamProposal(pr)}`); + } + const proposalBlock = proposals.length ? `\n\n${proposals.join('\n\n')}` : ''; + const header = `Scanned ${scanned} file(s), ${units.length} function(s).`; return { - content: `${header}\n\n${formatHelperClusters(clusters)}`, + content: `${header}\n\n${formatHelperClusters(clusters)}${proposalBlock}`, metadata: { filesScanned: scanned, functions: units.length, diff --git a/test/helper-extract.test.ts b/test/helper-extract.test.ts index ff54a3a..3566a11 100644 --- a/test/helper-extract.test.ts +++ b/test/helper-extract.test.ts @@ -1,5 +1,5 @@ import { describe, it, expect } from 'vitest'; -import { normalizeBody, codeTokens, findSimilarHelpers, formatHelperClusters, type FunctionUnit } from '../src/codegraph/helper-extract.ts'; +import { normalizeBody, codeTokens, findSimilarHelpers, formatHelperClusters, proposeParameterizedHelper, formatParamProposal, type FunctionUnit } from '../src/codegraph/helper-extract.ts'; // Two helpers copy-pasted then tweaked: different name, different constant — NEAR, not exact. const fmtUSD = `function formatUSD(n) { @@ -91,6 +91,16 @@ describe('findSimilarHelpers', () => { expect(findSimilarHelpers(distinct, { minTokens: 8, minSim: 0.88 })).toHaveLength(0); }); + it('drops a member NESTED inside another member (inner helper vs its parent is not a dupe)', () => { + // Real case from axios: buildPath is defined INSIDE formDataToJSON — they share most tokens. + const parent = { name: 'outer', file: 'x.ts', startLine: 10, endLine: 60, body: fmtUSD.replace('formatUSD', 'outer') }; + const child = { name: 'inner', file: 'x.ts', startLine: 15, endLine: 40, body: fmtUSD.replace('formatUSD', 'inner') }; + expect(findSimilarHelpers([parent, child], { minTokens: 10 })).toHaveLength(0); // cluster collapses to 1 → dropped + // …but a genuine same-file, NON-overlapping pair still clusters. + const sibling = { name: 'sib', file: 'x.ts', startLine: 70, endLine: 90, body: fmtUSD.replace('formatUSD', 'sib') }; + expect(findSimilarHelpers([parent, sibling], { minTokens: 10 })).toHaveLength(1); + }); + it('formats a readable report with the extract-with-review caveat', () => { const report = formatHelperClusters(findSimilarHelpers(units, { minTokens: 10 })); expect(report).toMatch(/near-duplicate helper cluster/); @@ -99,3 +109,67 @@ describe('findSimilarHelpers', () => { expect(formatHelperClusters([])).toMatch(/No near-duplicate/); }); }); + +describe('proposeParameterizedHelper (v2 — parameterize the near-dupes)', () => { + // The real zod shape: same body, differing only in kind ("min"/"max") and inclusive (false/true). + const check = (name: string, kind: string, inclusive: string) => ({ + name, file: 'types.ts', startLine: 1, endLine: 8, + body: `${name}(message) {\n return this._addCheck({\n kind: ${kind},\n value: 0,\n inclusive: ${inclusive},\n message: toString(message),\n });\n}`, + }); + const members = [check('positive', '"min"', 'false'), check('negative', '"max"', 'false'), check('nonnegative', '"min"', 'true')]; + + it('turns each varying token position into a parameter, named from its context', () => { + const pr = proposeParameterizedHelper(members, 'addRangeCheck'); + expect(pr.ok).toBe(true); + expect(pr.params.map(p => p.name).sort()).toEqual(['inclusive', 'kind']); // named from `kind:` / `inclusive:` + const kind = pr.params.find(p => p.name === 'kind')!; + expect(kind.values).toEqual(['"min"', '"max"', '"min"']); + expect(pr.sketch).toContain('kind: kind'); // varying token → param in the sketch + expect(pr.sketch).toContain('inclusive: inclusive'); + expect(pr.sketch).toContain('addRangeCheck(message)'); // own name neutralized to helper name + expect(pr.calls[1]).toEqual({ member: 'negative', args: ['"max"', 'false'] }); + }); + + it('co-varying positions collapse into ONE parameter', () => { + const two = [ + { name: 'a', file: 'f.ts', startLine: 1, endLine: 3, body: 'a() { log("x"); send("x"); }' }, + { name: 'b', file: 'f.ts', startLine: 5, endLine: 7, body: 'b() { log("y"); send("y"); }' }, + ]; + const pr = proposeParameterizedHelper(two); + expect(pr.ok).toBe(true); + expect(pr.params).toHaveLength(1); // "x"/"y" varies in two places, same vector → one param + expect(pr.calls[0]!.args).toEqual(['"x"']); + }); + + it('declines honestly when structures differ or too many parts vary', () => { + const divergent = [ + { name: 'a', file: 'f.ts', startLine: 1, endLine: 3, body: 'a() { return 1; }' }, + { name: 'b', file: 'f.ts', startLine: 5, endLine: 9, body: 'b() { for (const x of xs) go(x); return 2; }' }, + ]; + expect(proposeParameterizedHelper(divergent).ok).toBe(false); + expect(proposeParameterizedHelper(divergent).reason).toMatch(/token counts differ/); + const identical = [ + { name: 'a', file: 'f.ts', startLine: 1, endLine: 3, body: 'a() { return 1; }' }, + { name: 'b', file: 'g.ts', startLine: 1, endLine: 3, body: 'b() { return 1; }' }, + ]; + expect(proposeParameterizedHelper(identical).reason).toMatch(/consolidate-dupes/); + }); + + it('mixed structural variants: parameterizes the LARGEST aligned subset and reports the rest as dropped', () => { + // The real zod shape: a Number family plus a BigInt variant whose body has extra tokens. + const bigint = { name: 'positiveBig', file: 'types.ts', startLine: 20, endLine: 27, + body: 'positiveBig(message) {\n return this._addCheck({\n kind: "min",\n value: BigInt(0),\n inclusive: false,\n message: toString(message),\n });\n}' }; + const pr = proposeParameterizedHelper([...members, bigint], 'addRangeCheck'); + expect(pr.ok).toBe(true); + expect(pr.calls.map(c => c.member)).toEqual(['positive', 'negative', 'nonnegative']); + expect(pr.dropped).toEqual(['positiveBig']); + expect(formatParamProposal(pr)).toContain('not covered — different structure: positiveBig'); + }); + + it('formats a reviewable block with the call mapping', () => { + const out = formatParamProposal(proposeParameterizedHelper(members, 'addRangeCheck')); + expect(out).toContain('addRangeCheck(kind, inclusive)'); + expect(out).toContain('negative(…) → addRangeCheck("max", false)'); + expect(out).toContain('```'); + }); +});