Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
34 changes: 34 additions & 0 deletions docs/ADOPTION.md
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,40 @@ The #64 row is the one to internalize: **blocked runs are the product working**,
A 3am "improvement" that changes behavior is worse than no improvement, so maintain proves
safety or declines — and the receipt records which one happened.

## External case studies — three popular OSS repos

We pointed the read-only detection (`find_similar_helpers`, zero setup — no install, no config)
at fresh clones of three widely-used projects (2026-07-02, shallow clones at the commits shown).
These are actual tool outputs, reproducible with the commands below:

```bash
git clone --depth 1 https://github.com/colinhacks/zod && cd zod
qodex # → find_similar_helpers path="packages/zod/src"
```

| Repo (commit) | Scanned | Found |
|---|---|---|
| **zod** `912f0f5` (`packages/zod/src`) | 116 files, 1,103 functions | **38 near-dupe clusters, ~671 collapsible lines.** Headline: a **7-copy family** — `positive`/`negative`/`nonpositive`/`nonnegative` across ZodNumber+ZodBigInt (~94% similar), each differing only in `kind:`/`inclusive:` |
| **hono** `b20d422` (`src`) | 186 files, 255 functions | 2 clusters: `importPublicKey`/`importPrivateKey` (~95%, ~26 lines) and a duplicated `getQueryString` in the aws-lambda handler (~92%, ~21 lines) |
| **axios** `e435384` (`lib`) | 66 files, 152 functions | 1 cluster: `setFormDataHeaders` **byte-near-identical** in `helpers/resolveConfig.js` and `adapters/http.js` (verified by eye: the only difference is a `\|\| {}` guard) — a textbook `consolidate-dupes` candidate |

And because detection now includes the **parameterize proposal** (v2), the zod family comes back
not as "these look similar" but as the concrete consolidation:

```
Proposed shared helper `extractedHelper(kind, inclusive)` — the bodies differ ONLY in these 2 spot(s):
positive(…) → extractedHelper("min", false)
nonpositive(…) → extractedHelper("max", true)
nonnegative(…) → extractedHelper("min", true)
(not covered — different structure: the ZodNumber variants, whose value is 0 vs BigInt(0))
```

Two honest caveats. First, these are *detection* results — read-only, nothing was changed in any
repo. Second, duplication in mature libraries is sometimes deliberate (readability, hot-path
inlining, API symmetry): the tool's job is to surface the opportunity with proof; whether to take
it stays a maintainer's call. That division of labor — *mechanical evidence, human judgment* — is
the same verify-or-block philosophy the rest of maintain runs on.

## Recommended rollout ladder

Adopt one rung at a time; each rung is strictly riskier than the previous. Stay on a rung until
Expand Down
119 changes: 118 additions & 1 deletion src/codegraph/helper-extract.ts
Original file line number Diff line number Diff line change
Expand Up @@ -140,7 +140,19 @@ export function findSimilarHelpers(
}

const clusters: HelperCluster[] = [];
for (const { members: idxs } of seeded) {
for (const { members: rawIdxs } of seeded) {
// Drop members NESTED inside another member (same file, contained line range): a inner helper
// shares most of its parent's tokens, so parent+child cluster together — but "extract the child
// from its parent" is not a duplicate-consolidation opportunity. Keep the outermost.
const idxs = rawIdxs.filter(i => {
const a = prepared[i]!.u;
return !rawIdxs.some(j => {
if (j === i) return false;
const b = prepared[j]!.u;
return a.file === b.file && b.startLine <= a.startLine && a.endLine <= b.endLine
&& (b.endLine - b.startLine) > (a.endLine - a.startLine);
});
});
if (idxs.length < 2) continue;
let sum = 0, cnt = 0;
for (let a = 0; a < idxs.length; a++) {
Expand All @@ -166,6 +178,111 @@ export function findSimilarHelpers(
return clusters.sort((a, b) => b.estLinesSaved - a.estLinesSaved || b.avgSimilarity - a.avgSimilarity);
}

// ————————————————————————————— v2: parameterized-consolidation proposal —————————————————————————————

export interface ParamProposal {
ok: boolean;
reason?: string;
helperName: string;
/** One entry per parameter the shared helper needs; values are per-member (same order as input). */
params: { name: string; values: string[] }[];
/** The reference member's body with the varying tokens replaced by the parameter names. */
sketch: string;
/** How each original function maps to a call of the shared helper. */
calls: { member: string; args: string[] }[];
/** Members left out because their body doesn't align with the majority (e.g. a BigInt variant). */
dropped?: string[];
}

interface RawTok { text: string; start: number }

/** Code tokens WITH literals kept and source offsets (for substitution in the sketch). PURE. */
function rawTokens(body: string): RawTok[] {
const re = /"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'|`(?:[^`\\]|\\.)*`|\d+(?:\.\d+)?|[A-Za-z_$][\w$]*|[^\sA-Za-z0-9_$]/g;
const out: RawTok[] = [];
let m: RegExpExecArray | null;
while ((m = re.exec(body))) out.push({ text: m[0], start: m.index });
return out;
}

/**
* Given a cluster of near-duplicate functions, propose ONE parameterized helper: align the bodies
* token-by-token, turn each varying position into a parameter (positions that always vary together
* collapse into one), and emit the reference body with those tokens substituted — plus the exact
* call each original becomes. Mechanical and conservative: if the bodies don't align structurally
* (different token counts) or vary in too many places (>4 params), it declines with the reason
* rather than proposing something wrong. PURE.
*/
export function proposeParameterizedHelper(members: FunctionUnit[], helperName = 'extractedHelper'): ParamProposal {
const decline = (reason: string): ParamProposal => ({ ok: false, reason, helperName, params: [], sketch: '', calls: [] });
if (members.length < 2) return decline('need at least two functions');

// Neutralize each function's own name so it never reads as a varying token.
let bodies = members.map(m => m.body.replace(new RegExp(`\\b${m.name.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')}\\b`, 'g'), helperName));
let toks = bodies.map(b => rawTokens(b));
let dropped: string[] | undefined;
if (toks.some(t => t.length !== toks[0]!.length)) {
// Mixed structural variants in one cluster (e.g. a Number family + a BigInt family): keep the
// LARGEST same-token-count subset and report the rest as dropped, instead of declining outright.
const byCount = new Map<number, number[]>();
toks.forEach((t, i) => { const k = t.length; if (!byCount.has(k)) byCount.set(k, []); byCount.get(k)!.push(i); });
const best = [...byCount.values()].sort((a, b) => b.length - a.length)[0]!;
if (best.length < 2) return decline('bodies do not align structurally (token counts differ) — too divergent to parameterize mechanically');
dropped = members.filter((_, i) => !best.includes(i)).map(m => m.name);
members = best.map(i => members[i]!);
bodies = best.map(i => bodies[i]!);
toks = best.map(i => toks[i]!);
}

// Positions where members disagree, grouped by their value-vector (co-varying positions = one param).
const n = toks[0]!.length;
const groups = new Map<string, number[]>();
for (let p = 0; p < n; p++) {
const values = toks.map(t => t[p]!.text);
if (values.every(v => v === values[0])) continue;
const key = JSON.stringify(values);
if (!groups.has(key)) groups.set(key, []);
groups.get(key)!.push(p);
}
if (groups.size === 0) return decline('bodies are identical — use consolidate-dupes, no parameter needed');
if (groups.size > 4) return decline(`${groups.size} independently-varying parts — too many to parameterize cleanly`);

// Name each parameter from context when possible (`kind: "min"` → param `kind`), else p1/p2….
const used = new Set<string>();
const params: { name: string; values: string[]; positions: number[] }[] = [];
for (const [key, positions] of groups) {
const first = positions[0]!;
const ref = toks[0]!;
let name = ref[first - 1]?.text === ':' && /^[A-Za-z_$][\w$]*$/.test(ref[first - 2]?.text ?? '') ? ref[first - 2]!.text : `p${params.length + 1}`;
while (used.has(name)) name = `${name}_`;
used.add(name);
params.push({ name, values: JSON.parse(key), positions });
}

// Sketch: reference body with each varying token replaced by its parameter name (right-to-left).
let sketch = bodies[0]!;
const subs = params.flatMap(p => p.positions.map(pos => ({ pos, name: p.name })))
.sort((a, b) => toks[0]![b.pos]!.start - toks[0]![a.pos]!.start);
for (const s of subs) {
const t = toks[0]![s.pos]!;
sketch = sketch.slice(0, t.start) + s.name + sketch.slice(t.start + t.text.length);
}
const calls = members.map((m, i) => ({ member: m.name, args: params.map(p => p.values[i]!) }));
return { ok: true, helperName, params: params.map(({ name, values }) => ({ name, values })), sketch, calls, dropped };
}

/** Render a proposal as a compact, reviewable block. PURE. */
export function formatParamProposal(pr: ParamProposal): string {
if (!pr.ok) return `No mechanical parameterization: ${pr.reason}`;
const lines = [
`Proposed shared helper \`${pr.helperName}(${pr.params.map(p => p.name).join(', ')})\` — the bodies differ ONLY in these ${pr.params.length} spot(s):`,
];
for (const c of pr.calls) lines.push(` ${c.member}(…) → ${pr.helperName}(${c.args.join(', ')})`);
if (pr.dropped?.length) lines.push(` (not covered — different structure: ${pr.dropped.join(', ')})`);
lines.push('', '```', pr.sketch.split('\n').slice(0, 16).join('\n'), '```');
return lines.join('\n');
}

/** A concise, agent-readable report of extraction opportunities. PURE. */
export function formatHelperClusters(clusters: HelperCluster[]): string {
if (!clusters.length) return 'No near-duplicate helper clusters found — nothing worth extracting.';
Expand Down
19 changes: 17 additions & 2 deletions src/tools/codegraph/find-similar-helpers.ts
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ import { promises as fs } from 'fs';
import * as path from 'path';
import { Tool, type ToolContext, type ToolResult } from '../base.js';
import { extractSymbols } from '../../codegraph/extractor.js';
import { findSimilarHelpers, formatHelperClusters, type FunctionUnit } from '../../codegraph/helper-extract.js';
import { findSimilarHelpers, formatHelperClusters, proposeParameterizedHelper, formatParamProposal, type FunctionUnit } from '../../codegraph/helper-extract.js';

const Args = z.object({
path: z.string().optional().describe('Restrict the scan to this subdirectory (relative to cwd). Default: whole project.'),
Expand Down Expand Up @@ -78,9 +78,24 @@ export class FindSimilarHelpersTool extends Tool<z.infer<typeof Args>> {
const clusters = findSimilarHelpers(units, { minSim: args.min_similarity })
.filter(c => args.include_exact || !c.exact);

// v2: for the top clusters, try a mechanical PARAMETERIZE proposal — the concrete shared
// helper + the exact call each original becomes. Declines (with the reason) when bodies
// don't align, so a proposal only appears when the consolidation is genuinely mechanical.
const proposals: string[] = [];
for (const c of clusters.slice(0, 2)) {
const bodies = c.members
.map(m => units.find(u => u.name === m.name && u.file === m.file && u.startLine === m.startLine))
.filter((u): u is FunctionUnit => !!u)
.slice(0, 6);
if (bodies.length < 2) continue;
const pr = proposeParameterizedHelper(bodies, c.suggestedName);
if (pr.ok) proposals.push(`For cluster \`${c.suggestedName}\`:\n${formatParamProposal(pr)}`);
}
const proposalBlock = proposals.length ? `\n\n${proposals.join('\n\n')}` : '';

const header = `Scanned ${scanned} file(s), ${units.length} function(s).`;
return {
content: `${header}\n\n${formatHelperClusters(clusters)}`,
content: `${header}\n\n${formatHelperClusters(clusters)}${proposalBlock}`,
metadata: {
filesScanned: scanned,
functions: units.length,
Expand Down
76 changes: 75 additions & 1 deletion test/helper-extract.test.ts
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
import { describe, it, expect } from 'vitest';
import { normalizeBody, codeTokens, findSimilarHelpers, formatHelperClusters, type FunctionUnit } from '../src/codegraph/helper-extract.ts';
import { normalizeBody, codeTokens, findSimilarHelpers, formatHelperClusters, proposeParameterizedHelper, formatParamProposal, type FunctionUnit } from '../src/codegraph/helper-extract.ts';

// Two helpers copy-pasted then tweaked: different name, different constant — NEAR, not exact.
const fmtUSD = `function formatUSD(n) {
Expand Down Expand Up @@ -91,6 +91,16 @@ describe('findSimilarHelpers', () => {
expect(findSimilarHelpers(distinct, { minTokens: 8, minSim: 0.88 })).toHaveLength(0);
});

it('drops a member NESTED inside another member (inner helper vs its parent is not a dupe)', () => {
// Real case from axios: buildPath is defined INSIDE formDataToJSON — they share most tokens.
const parent = { name: 'outer', file: 'x.ts', startLine: 10, endLine: 60, body: fmtUSD.replace('formatUSD', 'outer') };
const child = { name: 'inner', file: 'x.ts', startLine: 15, endLine: 40, body: fmtUSD.replace('formatUSD', 'inner') };
expect(findSimilarHelpers([parent, child], { minTokens: 10 })).toHaveLength(0); // cluster collapses to 1 → dropped
// …but a genuine same-file, NON-overlapping pair still clusters.
const sibling = { name: 'sib', file: 'x.ts', startLine: 70, endLine: 90, body: fmtUSD.replace('formatUSD', 'sib') };
expect(findSimilarHelpers([parent, sibling], { minTokens: 10 })).toHaveLength(1);
});

it('formats a readable report with the extract-with-review caveat', () => {
const report = formatHelperClusters(findSimilarHelpers(units, { minTokens: 10 }));
expect(report).toMatch(/near-duplicate helper cluster/);
Expand All @@ -99,3 +109,67 @@ describe('findSimilarHelpers', () => {
expect(formatHelperClusters([])).toMatch(/No near-duplicate/);
});
});

describe('proposeParameterizedHelper (v2 — parameterize the near-dupes)', () => {
// The real zod shape: same body, differing only in kind ("min"/"max") and inclusive (false/true).
const check = (name: string, kind: string, inclusive: string) => ({
name, file: 'types.ts', startLine: 1, endLine: 8,
body: `${name}(message) {\n return this._addCheck({\n kind: ${kind},\n value: 0,\n inclusive: ${inclusive},\n message: toString(message),\n });\n}`,
});
const members = [check('positive', '"min"', 'false'), check('negative', '"max"', 'false'), check('nonnegative', '"min"', 'true')];

it('turns each varying token position into a parameter, named from its context', () => {
const pr = proposeParameterizedHelper(members, 'addRangeCheck');
expect(pr.ok).toBe(true);
expect(pr.params.map(p => p.name).sort()).toEqual(['inclusive', 'kind']); // named from `kind:` / `inclusive:`
const kind = pr.params.find(p => p.name === 'kind')!;
expect(kind.values).toEqual(['"min"', '"max"', '"min"']);
expect(pr.sketch).toContain('kind: kind'); // varying token → param in the sketch
expect(pr.sketch).toContain('inclusive: inclusive');
expect(pr.sketch).toContain('addRangeCheck(message)'); // own name neutralized to helper name
expect(pr.calls[1]).toEqual({ member: 'negative', args: ['"max"', 'false'] });
});

it('co-varying positions collapse into ONE parameter', () => {
const two = [
{ name: 'a', file: 'f.ts', startLine: 1, endLine: 3, body: 'a() { log("x"); send("x"); }' },
{ name: 'b', file: 'f.ts', startLine: 5, endLine: 7, body: 'b() { log("y"); send("y"); }' },
];
const pr = proposeParameterizedHelper(two);
expect(pr.ok).toBe(true);
expect(pr.params).toHaveLength(1); // "x"/"y" varies in two places, same vector → one param
expect(pr.calls[0]!.args).toEqual(['"x"']);
});

it('declines honestly when structures differ or too many parts vary', () => {
const divergent = [
{ name: 'a', file: 'f.ts', startLine: 1, endLine: 3, body: 'a() { return 1; }' },
{ name: 'b', file: 'f.ts', startLine: 5, endLine: 9, body: 'b() { for (const x of xs) go(x); return 2; }' },
];
expect(proposeParameterizedHelper(divergent).ok).toBe(false);
expect(proposeParameterizedHelper(divergent).reason).toMatch(/token counts differ/);
const identical = [
{ name: 'a', file: 'f.ts', startLine: 1, endLine: 3, body: 'a() { return 1; }' },
{ name: 'b', file: 'g.ts', startLine: 1, endLine: 3, body: 'b() { return 1; }' },
];
expect(proposeParameterizedHelper(identical).reason).toMatch(/consolidate-dupes/);
});

it('mixed structural variants: parameterizes the LARGEST aligned subset and reports the rest as dropped', () => {
// The real zod shape: a Number family plus a BigInt variant whose body has extra tokens.
const bigint = { name: 'positiveBig', file: 'types.ts', startLine: 20, endLine: 27,
body: 'positiveBig(message) {\n return this._addCheck({\n kind: "min",\n value: BigInt(0),\n inclusive: false,\n message: toString(message),\n });\n}' };
const pr = proposeParameterizedHelper([...members, bigint], 'addRangeCheck');
expect(pr.ok).toBe(true);
expect(pr.calls.map(c => c.member)).toEqual(['positive', 'negative', 'nonnegative']);
expect(pr.dropped).toEqual(['positiveBig']);
expect(formatParamProposal(pr)).toContain('not covered — different structure: positiveBig');
});

it('formats a reviewable block with the call mapping', () => {
const out = formatParamProposal(proposeParameterizedHelper(members, 'addRangeCheck'));
expect(out).toContain('addRangeCheck(kind, inclusive)');
expect(out).toContain('negative(…) → addRangeCheck("max", false)');
expect(out).toContain('```');
});
});