agentlas.cloud · Agentlas Agent Lab · agent_spec_package_architect
A planning-specialist agent that writes the AI-native spec package — the cross-linked design-document set an AI builder reads to take a product from blank slate to working software without losing the thread.
AI builders fail end-to-end software work less often because the model is weak at one function, and more often because the plan they were handed was written for a human reader, not for a token-budgeted, statelessly-resumed agent. A narrative PRD, a slide-deck architecture, and ticket prose embed implicit knowledge, lack stable anchors for cross-reference, and leave acceptance criteria un-formalized. Handed that, an AI builder hallucinates the gaps, drops requirements between phases, re-litigates settled decisions, and "breaks" — it stops producing coherent forward progress.
Spec Package Architect proposes a different planning deliverable: not a document,
but a graph of small, cross-linked Markdown artifacts engineered for agentic
consumption. A fixed universal core — constitution, spec (in EARS notation),
design, data-model, api-contract, tasks (as a DAG), test-plan,
decision records, an agent operating manual, and a live memory file — is bound
together by a strict ID-and-anchor protocol so that every requirement traces
forward to the exact API operation, data entity, task, and test that satisfy it,
and backward again. Thin domain overlays (SaaS, games, mobile, systems) attach to
the core without forking it. Phase gates enforce that the builder may not write
code before requirements are clarified, nor declare a task done before its tests
pass.
The thesis is that continuity is engineered, not assumed: progressive disclosure (load small, retrieve just-in-time), external memory that survives a context reset, and machine-checkable contracts (EARS, OpenAPI, JSON Schema, an acyclic task DAG) are what let one resettable agent behave as if it had read and remembered everything. This repository is the research foundation and the template package for that agent.
AI-native specification, spec-driven development, agentic software engineering, EARS requirements, traceability matrix, context engineering, context rot, architecture decision records, task DAG, domain overlays, design-document package, meta-agent planning.
What documents, in what structure and with what links, must a planning agent produce so that a separate AI builder can take a web app, SaaS platform, game, mobile app, or system from idea to finished software without omitting a requirement and without breaking continuity across the many sessions and context resets the build will span?
This repository contributes six concrete artifacts:
- A document taxonomy (docs/taxonomy.md): the four-layer file set (operating manual, spec spine, decisions/history, domain overlays), with the rationale for each file's existence and the argument that the set is minimally sufficient.
- A cross-linking topology (docs/cross-linking-topology.md): the ID protocol, anchor convention, cross-reference syntax, and the directed dependency graph with its cardinal "no-orphan" invariants.
- A sequencing-and-gates model (docs/sequencing-and-gates.md): the phase pipeline and the definition-of-ready / definition-of-done gates, split into machine-enforced vs advisory.
- A continuity-engineering protocol (docs/continuity-engineering.md): progressive disclosure, the external memory file, JSON-for-state, compaction timing, and the exact resume protocol.
- A ready-to-fill template package (templates/): the actual Markdown files an agent emits, with worked mini-examples and a fill order.
- A reusable agent contract (agent.md) for the Spec Package Architect, plus a four-axis evaluation framework (docs/evaluation.md).
The dominant AI-agent pattern still treats the plan as a prose document and memory as a conversation property. Both break on real builds. The convergent evidence from production spec-driven systems — GitHub Spec Kit, Amazon Kiro, BMAD-METHOD — and from multi-agent code-generation research — MetaGPT, ChatDev, AgileCoder, AgentCoder — is that a small, fixed core of cross-linked artifacts outperforms ad-hoc prompting. MetaGPT reports 85.9% / 87.7% Pass@1 on HumanEval / MBPP from mandating standardized intermediate artifacts; AgentCoder reaches 96.3% / 91.8% with a dedicated Test-Designer role; self-planning before code yields up to +25.4% Pass@1 over direct generation.
The deeper reason these work is context rot. Chroma's 2025 report evaluated 18 frontier models and found performance "grows increasingly unreliable as input length grows." A human-style PRD forces the builder to hold everything at once — exactly the regime where it degrades. The spec package inverts the cost: hold one node and its edges, retrieve neighbors just-in-time, reconstruct global state from a generated matrix on demand.
This is not a new idea invented here; it is a convergence observed across vendors with no shared incentive, distilled into a single reusable package and a single agent that produces it.
Layer 0  Agent operating manual   constitution.md · AGENTS.md / CLAUDE.md
Layer 1  Specification spine      spec.md (EARS) · design.md · data-model.md ·
                                  api-contract.md · ui-spec.md · tasks.md (DAG) ·
                                  test-plan.md · glossary.md
Layer 2  Decisions & history      docs/decisions/NNNN-*.md (ADRs) ·
                                  CHANGELOG.md (agentic memory) · traceability.md
Layer 3  Domain overlays          saas/ · games/ · mobile/ · systems/
Every file carries YAML frontmatter (id, type, status, depends_on, satisfies, linked_decisions) and is wired to its neighbors by the anchor protocol. The full taxonomy and the reason each file is separate is in docs/taxonomy.md.
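The frontmatter contract above can be checked mechanically. The following is a minimal sketch, assuming the frontmatter uses only flat `key: value` lines and inline lists; the parser, the helper names, and the example values are illustrative and are not taken from docs/frontmatter-schema.md. A real check would likely use a full YAML parser.

```python
# Keys named in the README; everything else here is a hypothetical sketch.
REQUIRED_KEYS = ("id", "type", "status", "depends_on", "satisfies", "linked_decisions")


def parse_frontmatter(text: str) -> dict:
    """Parse a leading '---' delimited block of flat `key: value` lines.

    Handles only scalars and inline lists such as `[REQ-007, REQ-008]`.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not `key: value`
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            meta[key.strip()] = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        else:
            meta[key.strip()] = value
    return meta


def missing_keys(meta: dict) -> list:
    """Return the required keys that the frontmatter does not declare."""
    return [k for k in REQUIRED_KEYS if k not in meta]


# Hypothetical file content: `linked_decisions` is deliberately left out.
EXAMPLE = """---
id: design
type: design
status: draft
depends_on: [spec]
satisfies: [REQ-007]
---
# Design
"""

meta = parse_frontmatter(EXAMPLE)
print(missing_keys(meta))  # ['linked_decisions']
```

A check like this is cheap enough to run on every file at each gate, so a missing manifest key is caught before any link that relies on it is followed.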
A single requirement is the worked example of the whole design:
### REQ-007: Account lockout after repeated failures {#req-007}
**EARS:** IF the failed-attempt counter for an account exceeds 5 within 15 minutes,
THEN THE SYSTEM SHALL lock the account for 30 minutes and emit `account.locked`.
**Acceptance:** AC-007.1 invalid creds → 401 · AC-007.2 6 fails/15min → lock
**Links:** design.md#cmp-003 · api-contract.md#op-post-login ·
data-model.md#entity-user · tasks.md#t-018 · docs/decisions/0007-*.md

From that one node the builder can reach the component, the endpoint, the entity, the task, the test, and the decision, each a single hop on a stable path. That walkability, rebuilt into a bidirectional traceability matrix at every gate, is what the user means by "organic, unbroken connection from start to finish."
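To make the single-hop claim concrete, here is a small sketch that pulls the `file.md#anchor` links out of a requirement node and reports any that do not resolve. The regular expressions, the function names, and the anchor inventory are assumptions made for illustration; the actual protocol is the one defined in docs/cross-linking-topology.md.

```python
import re

# Hypothetical patterns for `path.md#anchor` links and `{#anchor}` definitions.
LINK_RE = re.compile(r"([\w./-]+\.md)#([\w-]+)")
ANCHOR_RE = re.compile(r"\{#([\w-]+)\}")


def extract_links(node_text: str) -> list:
    """Return (path, anchor) pairs for every anchored link in the node."""
    return LINK_RE.findall(node_text)


def collect_anchors(file_text: str) -> set:
    """Return every anchor a file defines with the `{#anchor}` convention."""
    return set(ANCHOR_RE.findall(file_text))


def dangling_links(node_text: str, anchors_by_file: dict) -> list:
    """Return links whose target anchor is not defined in the target file."""
    return [
        (path, anchor)
        for path, anchor in extract_links(node_text)
        if anchor not in anchors_by_file.get(path, set())
    ]


REQ_007 = """### REQ-007: Account lockout after repeated failures {#req-007}
**Links:** design.md#cmp-003 · api-contract.md#op-post-login ·
data-model.md#entity-user · tasks.md#t-018 · docs/decisions/0007-*.md
"""

# Assumed anchor inventory; tasks.md is missing t-018 to show a failure.
ANCHORS = {
    "design.md": {"cmp-003"},
    "api-contract.md": {"op-post-login"},
    "data-model.md": {"entity-user"},
    "tasks.md": set(),
}

print(dangling_links(REQ_007, ANCHORS))  # [('tasks.md', 't-018')]
```

The decision-record link carries no anchor, so this sketch ignores it; a fuller check would also resolve the file glob.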
Three mechanisms, detailed in docs/continuity-engineering.md:
| Mechanism | What it defeats |
|---|---|
| Progressive disclosure: small files, just-in-time retrieval | Holding the whole spec, degrading on length |
| External memory: CHANGELOG.md with a "failed approaches & why" ledger | Re-trying dead ends after a context reset |
| Checkpointed DAG: tasks.json status, resume to first ready task | Losing track of "where was I" |
Anthropic's prescription, adopted verbatim by this package: "ASSUME INTERRUPTION: your context window might be reset at any moment, so you risk losing any progress that is not recorded in your memory directory."
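The "resume to first ready task" step can be sketched as follows. The tasks.json shape used here (`id`, `status`, `depends_on`, and the status values `done` and `pending`) is an assumption for illustration; the real schema and resume protocol are described in docs/continuity-engineering.md.

```python
import json

# Hypothetical tasks.json content; field names and statuses are assumed.
TASKS_JSON = """
[
  {"id": "T-017", "status": "done",    "depends_on": []},
  {"id": "T-018", "status": "pending", "depends_on": ["T-017"]},
  {"id": "T-019", "status": "pending", "depends_on": ["T-018"]}
]
"""


def first_ready_task(tasks: list):
    """Return the id of the first pending task whose dependencies are all done."""
    done = {t["id"] for t in tasks if t["status"] == "done"}
    for task in tasks:
        if task["status"] == "pending" and all(dep in done for dep in task["depends_on"]):
            return task["id"]
    return None  # nothing is ready: either finished or blocked


tasks = json.loads(TASKS_JSON)
print(first_ready_task(tasks))  # T-018
```

Because the answer is derived only from the file on disk, a freshly reset agent reaches the same task a long-running one would.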
The builder is not trusted to remember the order; phase gates enforce it. Definition-of-ready before moving downstream; definition-of-done before declaring complete. Critical invariants are machine-enforced (hooks / CI, 100% compliance) rather than left to prose (70–90% compliance) — the gap is the justification.
P0 Bootstrap → P1 Specify → P2 Design → P3 Plan→Tasks → P4 Implement → P5 Verify
G0             G1           G2          G3              G4             G5
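One way to enforce the ordering rather than trust it is a tiny gate ledger, sketched below. The phase and gate names come from the pipeline above; the function names and the idea of storing passed gates as a set are assumptions for illustration.

```python
# Phase/gate pairs from the pipeline; the enforcement logic is a sketch.
PIPELINE = [
    ("P0 Bootstrap", "G0"),
    ("P1 Specify", "G1"),
    ("P2 Design", "G2"),
    ("P3 Plan→Tasks", "G3"),
    ("P4 Implement", "G4"),
    ("P5 Verify", "G5"),
]


def current_phase(passed_gates: set):
    """Return the earliest phase whose gate has not passed, or None if all have."""
    for phase, gate in PIPELINE:
        if gate not in passed_gates:
            return phase
    return None


def may_start(phase: str, passed_gates: set) -> bool:
    """A phase may start only when every earlier gate has passed."""
    for name, gate in PIPELINE:
        if name == phase:
            return True
        if gate not in passed_gates:
            return False
    raise ValueError(f"unknown phase: {phase}")


print(may_start("P4 Implement", {"G0", "G1", "G2"}))  # False: G3 has not passed
print(current_phase({"G0", "G1"}))                    # P2 Design
```

Wired into a hook or CI job, a check of this kind is what moves an invariant from the advisory column to the machine-enforced one.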
One universal core plus thin overlays — not a package per domain. The evidence (BMAD expansion-packs, Spec Kit presets, the fact that a generic spine makes the builder hallucinate a tenant model) is in docs/domain-overlays.md. Overlays ship their own domain gates (e.g. SaaS: every mutation carries an idempotency key; every privileged action writes to the audit log).
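As an illustration of an overlay gate, the sketch below checks the SaaS rule that every mutation carries an idempotency key, against an OpenAPI-style dictionary. The header name `Idempotency-Key`, the function name, and the sample contract are assumptions; the package may express this gate differently.

```python
# HTTP methods treated as mutations in this sketch.
MUTATING_METHODS = {"post", "put", "patch", "delete"}


def mutations_missing_idempotency_key(openapi: dict, header: str = "Idempotency-Key") -> list:
    """Return 'METHOD /path' for each mutating operation lacking the header parameter."""
    missing = []
    for path, operations in openapi.get("paths", {}).items():
        for method, operation in operations.items():
            if method.lower() not in MUTATING_METHODS:
                continue  # reads and path-level keys are out of scope
            declared = any(
                p.get("in") == "header" and p.get("name", "").lower() == header.lower()
                for p in operation.get("parameters", [])
            )
            if not declared:
                missing.append(f"{method.upper()} {path}")
    return missing


# Hypothetical contract fragment: POST /login omits the header.
CONTRACT = {
    "paths": {
        "/login": {"post": {"operationId": "op-post-login", "parameters": []}},
        "/accounts": {
            "get": {"operationId": "op-list-accounts"},
            "post": {
                "operationId": "op-create-account",
                "parameters": [{"name": "Idempotency-Key", "in": "header"}],
            },
        },
    }
}

print(mutations_missing_idempotency_key(CONTRACT))  # ['POST /login']
```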
agent_spec_package_architect/
  README.md                        this research note + practical guide
  agent.md                         the Spec Package Architect agent contract
  docs/
    taxonomy.md                    the four-layer document set
    cross-linking-topology.md      ID/anchor protocol + dependency DAG
    sequencing-and-gates.md        phases + definition-of-ready/done gates
    frontmatter-schema.md          YAML manifest schema
    continuity-engineering.md      context-rot mitigation + resume protocol
    domain-overlays.md             universal core + domain packs
    evaluation.md                  four-axis benchmark + ablations
    repo-decisions.md              public-safe decision log
    research-log.md                dated research notes
  templates/                       the ready-to-fill spec package
    README.md                      package map + fill order
    constitution.md  spec.md  design.md  data-model.md  api-contract.md
    ui-spec.md  tasks.md  test-plan.md  glossary.md  traceability.md
    changelog.md  adr/0000-adr-template.md
    overlays/                      saas/ games/ mobile/ systems/
  assets/
    spec-package-topology.svg      the cross-link figure
    agentlas-agent-lab-banner.svg  shared Agentlas banner
  scripts/
    public_safety_check.sh         public-data hygiene check
The root filename memory.md is reserved for public agent memory; private
scratch context belongs outside Git or under .memory.local/.
- Bootstrap. Copy `templates/` into the target project's `spec/` folder. Fill `constitution.md` (the non-negotiables). Approve it (gate G0).
- Specify. Write `spec.md` in EARS; run the clarify loop until no `[NEEDS CLARIFICATION]` remains (gate G1).
- Design. Produce `design.md`, `data-model.md`, `api-contract.md`, and ADRs (gate G2).
- Plan. Generate `tasks.md` as an acyclic DAG and the traceability matrix (gate G3).
- Implement & verify. Builder agent picks the next ready task, writes tests first, implements, ticks acceptance criteria, appends `changelog.md` (gate G4), then closes with 100% traceability (gate G5).
The fill order, with per-file checklists, is in templates/README.md.
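Two of the gates in the steps above lend themselves to direct checks: G1 (no clarification markers left, requirements phrased in EARS) and G3 (the task graph is acyclic). The sketch below assumes Python 3.9+ for `graphlib`; the function names, the keyword-based EARS classifier, and the sample task ids are illustrative, not the package's own tooling.

```python
from graphlib import CycleError, TopologicalSorter

CLARIFY_MARKER = "[NEEDS CLARIFICATION]"

# Leading keywords of the EARS patterns; a rough heuristic, not a full grammar.
EARS_KEYWORDS = (
    ("when ", "event-driven"),
    ("while ", "state-driven"),
    ("where ", "optional feature"),
    ("if ", "unwanted behaviour"),
)


def g1_clarified(spec_text: str) -> bool:
    """G1 (part): true when no clarification marker remains in the spec."""
    return CLARIFY_MARKER not in spec_text


def ears_pattern(requirement: str):
    """Classify a requirement by its leading EARS keyword; None if 'shall' is absent."""
    text = requirement.strip().lower()
    if " shall " not in text:
        return None
    for keyword, name in EARS_KEYWORDS:
        if text.startswith(keyword):
            return name
    return "ubiquitous"


def g3_task_order(depends_on: dict):
    """G3 (part): return a valid execution order, or None if the graph has a cycle."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError:
        return None


req = ("IF the failed-attempt counter for an account exceeds 5 within 15 minutes, "
       "THEN THE SYSTEM SHALL lock the account for 30 minutes and emit `account.locked`.")
print(ears_pattern(req))  # unwanted behaviour

# Hypothetical task ids; each maps to the tasks it depends on.
print(g3_task_order({"T-001": [], "T-002": ["T-001"], "T-003": ["T-002"]}))
# ['T-001', 'T-002', 'T-003']
print(g3_task_order({"T-001": ["T-002"], "T-002": ["T-001"]}))  # None
```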
This repository is the research and template foundation for an Agentlas
planning-specialist agent built via /meta-agent. agent.md is the
agent contract the meta-agent consumes: the role, the operating loop (mapped to
the phase pipeline), the memory rules, the gates, and the done criteria. The
research docs are the agent's reference library; the templates are what it emits.
Four axes — completeness (% REQ whose acceptance tests pass), continuity (survival across forced context resets), traceability (bidirectional REQ ↔ test ↔ code), and cost-to-correctness (tokens / Pass@1) — over a domain-stratified workload set, with ablations isolating EARS, the memory file, ADRs, overlays, and gate enforcement. Full method and thresholds in docs/evaluation.md.
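Two of the axes reduce to simple set arithmetic. The sketch below computes completeness as defined above and a REQ ↔ test orphan report as a partial stand-in for traceability (the test ↔ code leg is omitted). The data shapes, ids, and function names are assumptions; docs/evaluation.md holds the actual method and thresholds.

```python
def completeness(req_tests: dict, test_results: dict) -> float:
    """Percent of requirements that have tests and whose tests all pass."""
    if not req_tests:
        return 0.0
    passing = sum(
        1
        for tests in req_tests.values()
        if tests and all(test_results.get(t, False) for t in tests)
    )
    return 100.0 * passing / len(req_tests)


def orphans(req_tests: dict, all_tests: set):
    """Return (requirements with no test, tests that trace to no requirement)."""
    untested = sorted(req for req, tests in req_tests.items() if not tests)
    referenced = {t for tests in req_tests.values() for t in tests}
    untraced = sorted(all_tests - referenced)
    return untested, untraced


# Hypothetical workload: four requirements, one stray test.
REQ_TESTS = {
    "REQ-001": ["AC-001.1"],
    "REQ-007": ["AC-007.1", "AC-007.2"],
    "REQ-009": [],
    "REQ-010": ["AC-010.1"],
}
RESULTS = {"AC-001.1": True, "AC-007.1": True, "AC-007.2": False,
           "AC-010.1": True, "AC-099.1": True}

print(completeness(REQ_TESTS, RESULTS))  # 50.0
print(orphans(REQ_TESTS, set(RESULTS)))  # (['REQ-009'], ['AC-099.1'])
```

Counting a requirement with no tests as incomplete is a deliberate choice in this sketch: it keeps an untested requirement from inflating the score.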
This is a research scaffold, not a validated claim. Known risks:
- No RCT of the whole package. The strongest controlled evidence is at the prompt-pattern level (self-planning +25.4%; AgileCoder +5.58/+6.33; SpecFix +4.3%); the end-to-end effect rests on convergence of practice plus ablations.
- Function-level benchmark numbers do not transfer. MetaGPT's 85.9% HumanEval collapses to 0.0536 project-level Req.acc on E2EDev — treat function-level Pass@1 as a ceiling, not a floor.
- Spec-driven processes can re-introduce waterfall rigidity (Gojko Adzic's warning). Mitigation: keep the spec lean and living while keeping the constitution firm — the inverse of the classical pattern.
- Context-rot evidence is vendor/blog-published (Chroma, Anthropic), corroborated by peer-reviewed RULER and NoLiMa; specific token thresholds should be re-measured per shipped model.
- EARS suits functional requirements but is awkward for aesthetic / UX / exploratory work; hybrid spec styles are realistic.
- The package is biased toward greenfield. Brownfield modernization needs an added reverse-engineering (code → spec) phase not covered here in depth.
| Stage | Deliverable | Advance when |
|---|---|---|
| 1 | Universal core templates + gates | CRUD web app built end-to-end, ≥90% traceability, zero hand-edits |
| 2 | Domain expansion packs (SaaS, game, mobile, systems) | Each pack raises completeness ≥15 pts vs core-only |
| 3 | Continuity instrumentation (mandatory memory, compaction, clarify gate) | Agent resumes after context loss, finishes with ≤10% extra tokens, ≥90% of trials |
| 4 | Evaluation harness + published benchmark | Four-axis benchmark on stratified workloads is public |
- GitHub. "Spec Kit: spec-driven development." https://github.com/github/spec-kit
- Amazon. "Kiro: spec-driven AI IDE." https://kiro.dev
- BMAD-METHOD (Breakthrough Method for Agile AI-Driven Development). https://github.com/bmad-code-org/BMAD-METHOD
- Hong, S. et al. "MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework." ICLR 2024. arXiv:2308.00352. https://arxiv.org/abs/2308.00352
- Qian, C. et al. "ChatDev: Communicative Agents for Software Development." ACL 2024. arXiv:2307.07924. https://arxiv.org/abs/2307.07924
- Nguyen, M. H. et al. "AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology." FORGE 2025. arXiv:2406.11912. https://arxiv.org/abs/2406.11912
- Huang, D. et al. "AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation." arXiv:2312.13010. https://arxiv.org/abs/2312.13010
- Jiang, X. et al. "Self-planning Code Generation with Large Language Models." TOSEM 2024. arXiv:2303.06689. https://arxiv.org/abs/2303.06689
- Mavin, A. et al. "Easy Approach to Requirements Syntax (EARS)." IEEE RE'09. https://ieeexplore.ieee.org/document/5328509
- Hong, K., Troynikov, A., Huber, J. "Context Rot: How Increasing Input Tokens Impacts LLM Performance." Chroma, 2025. https://research.trychroma.com/context-rot
- Anthropic. "Effective context engineering for AI agents." 2025. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- Anthropic. "Building effective agents / agent harness & memory guidance." 2024–2025. https://www.anthropic.com/engineering
- Hsieh, C-P. et al. "RULER: What's the Real Context Size of Your Long-Context Language Models?" arXiv:2404.06654. https://arxiv.org/abs/2404.06654
- Modarressi, A. et al. "NoLiMa: Long-Context Evaluation Beyond Literal Matching." arXiv:2502.05167. https://arxiv.org/abs/2502.05167
- OpenAI. "AGENTS.md: a simple, open format for guiding coding agents." https://agents.md
- "MADR: Markdown Architectural Decision Records," v4.0. https://adr.github.io/madr/
- Fan, Z. et al. "SpecFix: Repairing Ambiguous Requirements for Code Generation." arXiv:2505.07270. https://arxiv.org/abs/2505.07270
- Li, J. et al. "DevBench: A Comprehensive Benchmark for Software Development." arXiv:2403.08604. https://arxiv.org/abs/2403.08604
- "E2EDev: End-to-End Project-Level Code Generation Benchmark." arXiv:2510.14509. https://arxiv.org/abs/2510.14509
- Procida, D. "Diátaxis: a systematic framework for technical documentation." https://diataxis.fr
MIT. Part of the Agentlas Agent Lab public research program.