Spec Package Architect

agentlas.cloud · Agentlas Agent Lab · agent_spec_package_architect

A planning-specialist agent that writes the AI-native spec package — the cross-linked design-document set an AI builder reads to take a product from blank slate to working software without losing the thread.

[Figure: spec package cross-link topology (assets/spec-package-topology.svg)]

Abstract

AI builders fail at end-to-end software work less often because the model is weak at any one function and more often because the plan they were handed was written for a human reader, not for a token-budgeted, statelessly resumed agent. A narrative PRD, a slide-deck architecture, and ticket prose embed implicit knowledge, lack stable anchors for cross-reference, and leave acceptance criteria unformalized. Handed that, an AI builder hallucinates the gaps, drops requirements between phases, re-litigates settled decisions, and "breaks": it stops producing coherent forward progress.

Spec Package Architect proposes a different planning deliverable: not a document, but a graph of small, cross-linked Markdown artifacts engineered for agentic consumption. A fixed universal core — constitution, spec (in EARS notation), design, data-model, api-contract, tasks (as a DAG), test-plan, decision records, an agent operating manual, and a live memory file — is bound together by a strict ID-and-anchor protocol so that every requirement traces forward to the exact API operation, data entity, task, and test that satisfy it, and backward again. Thin domain overlays (SaaS, games, mobile, systems) attach to the core without forking it. Phase gates enforce that the builder may not write code before requirements are clarified, nor declare a task done before its tests pass.

The thesis is that continuity is engineered, not assumed: progressive disclosure (load small, retrieve just-in-time), external memory that survives a context reset, and machine-checkable contracts (EARS, OpenAPI, JSON Schema, an acyclic task DAG) are what let one resettable agent behave as if it had read and remembered everything. This repository is the research foundation and the template package for that agent.

Keywords

AI-native specification, spec-driven development, agentic software engineering, EARS requirements, traceability matrix, context engineering, context rot, architecture decision records, task DAG, domain overlays, design-document package, meta-agent planning.

Research Question

What documents, in what structure and with what links, must a planning agent produce so that a separate AI builder can take a web app, SaaS platform, game, mobile app, or system from idea to finished software without omitting a requirement and without breaking continuity across the many sessions and context resets the build will span?

Contributions

This repository contributes six concrete artifacts:

  1. A document taxonomy (docs/taxonomy.md): the four-layer file set (operating manual, spec spine, decisions/history, domain overlays), with the rationale for why each file exists and why the set is minimally sufficient.
  2. A cross-linking topology (docs/cross-linking-topology.md): the ID protocol, anchor convention, cross-reference syntax, and the directed dependency graph with its cardinal "no-orphan" invariants.
  3. A sequencing-and-gates model (docs/sequencing-and-gates.md): the phase pipeline and the definition-of-ready / definition-of-done gates, split into machine-enforced vs advisory.
  4. A continuity-engineering protocol (docs/continuity-engineering.md): progressive disclosure, the external memory file, JSON-for-state, compaction timing, and the exact resume protocol.
  5. A ready-to-fill template package (templates/): the actual Markdown files an agent emits, with worked mini-examples and a fill order.
  6. A reusable agent contract (agent.md) for the Spec Package Architect, plus a four-axis evaluation framework (docs/evaluation.md).

Motivation

The dominant AI-agent pattern still treats the plan as a prose document and memory as a conversation property. Both break on real builds. The convergent evidence from production spec-driven systems (GitHub Spec Kit, Amazon Kiro, BMAD-METHOD) and from multi-agent code-generation research (MetaGPT, ChatDev, AgileCoder, AgentCoder) is that a small, fixed core of cross-linked artifacts outperforms ad-hoc prompting. MetaGPT reports 85.9% / 87.7% Pass@1 on HumanEval / MBPP from mandating standardized intermediate artifacts; AgentCoder reaches 96.3% / 91.8% with a dedicated Test-Designer role; self-planning before code yields up to a 25.4% relative improvement in Pass@1 over direct generation.

The deeper reason these work is context rot. Chroma's 2025 report evaluated 18 frontier models and found performance "grows increasingly unreliable as input length grows." A human-style PRD forces the builder to hold everything at once — exactly the regime where it degrades. The spec package inverts the cost: hold one node and its edges, retrieve neighbors just-in-time, reconstruct global state from a generated matrix on demand.

This is not a new idea invented here; it is a convergence observed across vendors with no shared incentive, distilled into a single reusable package and a single agent that produces it.

The package at a glance

Layer 0  Agent operating manual    constitution.md · AGENTS.md / CLAUDE.md
Layer 1  Specification spine        spec.md (EARS) · design.md · data-model.md ·
                                    api-contract.md · ui-spec.md · tasks.md (DAG) ·
                                    test-plan.md · glossary.md
Layer 2  Decisions & history        docs/decisions/NNNN-*.md (ADRs) ·
                                    CHANGELOG.md (agentic memory) · traceability.md
Layer 3  Domain overlays            saas/ · games/ · mobile/ · systems/

Every file carries YAML frontmatter (id, type, status, depends_on, satisfies, linked_decisions) and is wired to its neighbors by the anchor protocol. The full taxonomy and the reason each file is separate is in docs/taxonomy.md.
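As a rough illustration of the frontmatter rule above, the following minimal sketch checks that a file declares all six keys. The parser handles only flat `key: value` lines and `[a, b]` lists, which is an assumption made to keep the example dependency-free; the authoritative schema lives in docs/frontmatter-schema.md, and the sample values (`spec`, `draft`, `ADR-0007`) are hypothetical.

```python
# The six frontmatter keys named in the README.
REQUIRED_KEYS = {"id", "type", "status", "depends_on", "satisfies", "linked_decisions"}

# Hypothetical example file; the values are illustrative only.
EXAMPLE = """---
id: spec
type: spec
status: draft
depends_on: [constitution]
satisfies: []
linked_decisions: [ADR-0007]
---
# Spec
"""


def parse_frontmatter(text: str) -> dict:
    """Parse a leading '---' delimited block of flat `key: value` lines.

    Values written as `[a, b]` become lists; everything else stays a string.
    This is a tiny subset of YAML, not a real YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta  # closing delimiter reached
        if ":" not in line:
            continue
        key, _, raw = line.partition(":")
        raw = raw.strip()
        if raw.startswith("[") and raw.endswith("]"):
            meta[key.strip()] = [v.strip() for v in raw[1:-1].split(",") if v.strip()]
        else:
            meta[key.strip()] = raw
    return {}  # no closing delimiter: treat as missing frontmatter


def missing_keys(text: str) -> list:
    """Return the required keys a file fails to declare, sorted by name."""
    return sorted(REQUIRED_KEYS - set(parse_frontmatter(text)))
```

A check like this could run in CI so that a file without a complete manifest never reaches the builder.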

The link that prevents "breaking"

A single requirement is the worked example of the whole design:

### REQ-007: Account lockout after repeated failures {#req-007}
**EARS:** IF the failed-attempt counter for an account exceeds 5 within 15 minutes,
THEN THE SYSTEM SHALL lock the account for 30 minutes and emit `account.locked`.
**Acceptance:** AC-007.1 invalid creds → 401 · AC-007.2 6 fails/15min → lock
**Links:** design.md#cmp-003 · api-contract.md#op-post-login ·
          data-model.md#entity-user · tasks.md#t-018 · docs/decisions/0007-*.md

From that one node the builder can reach the component, the endpoint, the entity, the task, the test, and the decision, each a single hop on a stable path. That walkability, rebuilt into a bidirectional traceability matrix at every gate, is what "organic, unbroken connection from start to finish" means in practice.
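The single-hop walk can be sketched as a link extractor plus a dangling-anchor check. This is an illustrative sketch, not the repository's tooling: the regular expression, the function names, and the `anchors_by_file` index are assumptions, and only `file.md#anchor` links are followed (the glob-style ADR link carries no anchor and is skipped).

```python
import re

# The REQ-007 heading and links from the example above, reproduced as a string.
REQ_NODE = """### REQ-007: Account lockout after repeated failures {#req-007}
**Links:** design.md#cmp-003 · api-contract.md#op-post-login ·
          data-model.md#entity-user · tasks.md#t-018 · docs/decisions/0007-*.md
"""

# Matches `path/to/file.md#anchor`; an assumed reading of the anchor protocol.
LINK_RE = re.compile(r"([\w./-]+\.md)#([\w-]+)")


def link_targets(markdown: str) -> list:
    """Return (file, anchor) pairs for every anchored cross-reference."""
    return LINK_RE.findall(markdown)


def broken_links(markdown: str, anchors_by_file: dict) -> list:
    """Return links whose anchor is not defined in the target file."""
    return [
        (path, anchor)
        for path, anchor in link_targets(markdown)
        if anchor not in anchors_by_file.get(path, set())
    ]


# Hypothetical index of anchors that exist in each file.
KNOWN_ANCHORS = {
    "design.md": {"cmp-003"},
    "api-contract.md": {"op-post-login"},
    "data-model.md": {"entity-user"},
    "tasks.md": set(),  # t-018 has not been written yet
}
```

With this index, `broken_links(REQ_NODE, KNOWN_ANCHORS)` reports the `tasks.md#t-018` link as dangling, which is the kind of finding a "no-orphan" gate would block on.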

How continuity is engineered

Three mechanisms, detailed in docs/continuity-engineering.md:

| Mechanism | What it defeats |
| --- | --- |
| Progressive disclosure: small files, just-in-time retrieval | Holding the whole spec, degrading on length |
| External memory: CHANGELOG.md with a "failed approaches & why" ledger | Re-trying dead ends after a context reset |
| Checkpointed DAG: tasks.json status, resume to first ready task | Losing track of "where was I" |

Anthropic's prescription, adopted verbatim by this package: "ASSUME INTERRUPTION: your context window might be reset at any moment, so you risk losing any progress that is not recorded in your memory directory."
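The "resume to first ready task" step might look like the sketch below. The `tasks.json` layout (a list of objects with `id`, `status`, and `depends_on`) and the status values `done` / `todo` are assumptions for illustration; the actual resume protocol is specified in docs/continuity-engineering.md.

```python
import json

# Hypothetical tasks.json content; the field names are assumed, not taken from the repo.
TASKS_JSON = """[
  {"id": "T-017", "status": "done", "depends_on": []},
  {"id": "T-018", "status": "todo", "depends_on": ["T-017"]},
  {"id": "T-019", "status": "todo", "depends_on": ["T-018"]}
]"""


def next_ready_task(tasks: list):
    """Return the id of the first unfinished task whose prerequisites are all done.

    Returns None when nothing is ready (everything is done or blocked).
    """
    done = {t["id"] for t in tasks if t["status"] == "done"}
    for task in tasks:
        if task["status"] != "done" and all(dep in done for dep in task["depends_on"]):
            return task["id"]
    return None


# After a context reset the agent rereads the checkpoint instead of relying on memory.
resume_at = next_ready_task(json.loads(TASKS_JSON))
```

Because the status lives on disk rather than in the conversation, the answer to "where was I" is the same before and after a reset.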

How steps are enforced

The builder is not trusted to remember the order; phase gates enforce it. Definition-of-ready before moving downstream; definition-of-done before declaring complete. Critical invariants are machine-enforced (hooks / CI, 100% compliance) rather than left to prose (70–90% compliance) — the gap is the justification.

P0 Bootstrap → P1 Specify → P2 Design → P3 Plan→Tasks → P4 Implement → P5 Verify
     G0           G1            G2            G3              G4            G5
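One way to read the pipeline above is as a strict sequence in which a phase opens only once every earlier gate has passed. The sketch below encodes that reading; the function names and the set-of-passed-gates representation are assumptions, and the real gate criteria are in docs/sequencing-and-gates.md.

```python
PHASES = ["P0 Bootstrap", "P1 Specify", "P2 Design", "P3 Plan→Tasks", "P4 Implement", "P5 Verify"]
GATES = ["G0", "G1", "G2", "G3", "G4", "G5"]  # GATES[i] closes PHASES[i]


def may_enter(phase_index: int, passed_gates: set) -> bool:
    """A phase may start only if the gates of all earlier phases have passed."""
    return all(gate in passed_gates for gate in GATES[:phase_index])


def current_phase(passed_gates: set):
    """Return the first phase whose own gate has not passed, or None when all have."""
    for phase, gate in zip(PHASES, GATES):
        if gate not in passed_gates:
            return phase
    return None
```

Under this model a builder holding only G0 and G2 is still in "P1 Specify": a later gate cannot compensate for a skipped earlier one.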

Domain coverage

One universal core plus thin overlays — not a package per domain. The evidence (BMAD expansion-packs, Spec Kit presets, the fact that a generic spine makes the builder hallucinate a tenant model) is in docs/domain-overlays.md. Overlays ship their own domain gates (e.g. SaaS: every mutation carries an idempotency key; every privileged action writes to the audit log).
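As a hedged sketch of a machine-checkable domain gate, the function below scans an OpenAPI-style document (already loaded as a dict) and lists mutating operations that do not declare an idempotency-key header. The header name `Idempotency-Key`, the choice of mutating methods, and the sample contract are assumptions; the overlay's real gate definition is in docs/domain-overlays.md.

```python
MUTATING_METHODS = {"post", "put", "patch", "delete"}


def mutations_missing_idempotency_key(openapi: dict, header: str = "Idempotency-Key") -> list:
    """Return 'METHOD /path' for each mutating operation lacking the header parameter."""
    missing = []
    for path, operations in openapi.get("paths", {}).items():
        for method, operation in operations.items():
            if method.lower() not in MUTATING_METHODS:
                continue  # skip reads and non-operation keys such as 'parameters'
            params = operation.get("parameters", [])
            declared = any(
                p.get("in") == "header" and p.get("name") == header for p in params
            )
            if not declared:
                missing.append(f"{method.upper()} {path}")
    return sorted(missing)


# Hypothetical contract fragment used only to exercise the check.
CONTRACT = {
    "paths": {
        "/login": {
            "post": {"parameters": [{"name": "Idempotency-Key", "in": "header"}]}
        },
        "/accounts/{id}": {"get": {}, "delete": {}},
    }
}
```

Here the `DELETE /accounts/{id}` operation would fail the gate, while the read-only `GET` is ignored.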

Repository structure

agent_spec_package_architect/
  README.md                         this research note + practical guide
  agent.md                          the Spec Package Architect agent contract
  docs/
    taxonomy.md                     the four-layer document set
    cross-linking-topology.md       ID/anchor protocol + dependency DAG
    sequencing-and-gates.md         phases + definition-of-ready/done gates
    frontmatter-schema.md           YAML manifest schema
    continuity-engineering.md       context-rot mitigation + resume protocol
    domain-overlays.md              universal core + domain packs
    evaluation.md                   four-axis benchmark + ablations
    repo-decisions.md               public-safe decision log
    research-log.md                 dated research notes
  templates/                        the ready-to-fill spec package
    README.md                       package map + fill order
    constitution.md  spec.md  design.md  data-model.md  api-contract.md
    ui-spec.md  tasks.md  test-plan.md  glossary.md  traceability.md
    changelog.md  adr/0000-adr-template.md
    overlays/ saas/ games/ mobile/ systems/
  assets/
    spec-package-topology.svg       the cross-link figure
    agentlas-agent-lab-banner.svg   shared Agentlas banner
  scripts/
    public_safety_check.sh          public-data hygiene check

The root filename memory.md is reserved for public agent memory; private scratch context belongs outside Git or under .memory.local/.

How to use

  1. Bootstrap. Copy templates/ into the target project's spec/ folder. Fill constitution.md — the non-negotiables. Approve it (gate G0).
  2. Specify. Write spec.md in EARS; run the clarify loop until no [NEEDS CLARIFICATION] remains (gate G1).
  3. Design. Produce design.md, data-model.md, api-contract.md, and ADRs (gate G2).
  4. Plan. Generate tasks.md as an acyclic DAG and the traceability matrix (gate G3).
  5. Implement & verify. Builder agent picks the next ready task, writes tests first, implements, ticks acceptance criteria, appends changelog.md (gate G4), then closes with 100% traceability (gate G5).

The fill order, with per-file checklists, is in templates/README.md.
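Two of the steps above lend themselves to mechanical checks: the clarify loop of step 2 and the acyclicity requirement of step 4. The sketch below uses Python's standard `graphlib` module (3.9+); the function names and the `task id -> prerequisites` mapping are assumptions for illustration.

```python
from graphlib import CycleError, TopologicalSorter

CLARIFY_MARKER = "[NEEDS CLARIFICATION]"


def open_clarifications(spec_text: str) -> int:
    """Count unresolved clarification markers; gate G1 expects zero."""
    return spec_text.count(CLARIFY_MARKER)


def execution_order(depends_on: dict):
    """Return one valid task order, or None if the graph contains a cycle.

    `depends_on` maps each task id to the set of task ids it requires first.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError:
        return None


# Hypothetical task graph: a simple chain.
TASKS = {"T-001": set(), "T-002": {"T-001"}, "T-003": {"T-002"}}
```

`execution_order(TASKS)` yields the chain in dependency order, and a graph such as `{"A": {"B"}, "B": {"A"}}` returns None, which a G3 check could treat as a failure.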

Relationship to /meta-agent

This repository is the research and template foundation for an Agentlas planning-specialist agent built via /meta-agent. agent.md is the agent contract the meta-agent consumes: the role, the operating loop (mapped to the phase pipeline), the memory rules, the gates, and the done criteria. The research docs are the agent's reference library; the templates are what it emits.

Evaluation plan

Four axes — completeness (% REQ whose acceptance tests pass), continuity (survival across forced context resets), traceability (bidirectional REQ ↔ test ↔ code), and cost-to-correctness (tokens / Pass@1) — over a domain-stratified workload set, with ablations isolating EARS, the memory file, ADRs, overlays, and gate enforcement. Full method and thresholds in docs/evaluation.md.
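For intuition, the completeness axis and a simple forward-traceability check could be computed as below. These definitions are one plausible reading of the parenthetical descriptions above, not the method in docs/evaluation.md; in particular, treating a requirement with no tests as incomplete is an assumption, and the requirement and test ids are hypothetical.

```python
def completeness(req_tests: dict, passing: set) -> float:
    """Percent of requirements that have tests and whose tests all pass."""
    if not req_tests:
        return 0.0
    satisfied = sum(
        1
        for tests in req_tests.values()
        if tests and all(test in passing for test in tests)
    )
    return 100.0 * satisfied / len(req_tests)


def untraced_requirements(req_tests: dict) -> list:
    """Requirements with no linked acceptance test (a forward-traceability gap)."""
    return sorted(req for req, tests in req_tests.items() if not tests)


# Hypothetical snapshot: requirement id -> acceptance test ids.
REQ_TESTS = {
    "REQ-001": ["AC-001.1"],
    "REQ-007": ["AC-007.1", "AC-007.2"],
    "REQ-009": [],
}
PASSING = {"AC-001.1", "AC-007.1"}
```

In this snapshot only REQ-001 counts as complete: REQ-007 has a failing criterion and REQ-009 has no test at all, so it also shows up as untraced.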

Limitations

This is a research scaffold, not a validated claim. Known risks:

  • No RCT of the whole package. The strongest controlled evidence is at the prompt-pattern level (self-planning +25.4%; AgileCoder +5.58/+6.33; SpecFix +4.3%); the end-to-end effect rests on convergence of practice plus ablations.
  • Function-level benchmark numbers do not transfer. MetaGPT's 85.9% HumanEval collapses to 0.0536 project-level Req.acc on E2EDev — treat function-level Pass@1 as a ceiling, not a floor.
  • Spec-driven processes can re-introduce waterfall rigidity (Gojko Adzic's warning). Mitigation: keep the spec lean and living while keeping the constitution firm — the inverse of the classical pattern.
  • Context-rot evidence is vendor/blog-published (Chroma, Anthropic), corroborated by peer-reviewed RULER and NoLiMa; specific token thresholds should be re-measured per shipped model.
  • EARS suits functional requirements but is awkward for aesthetic / UX / exploratory work; hybrid spec styles are realistic.
  • The package is biased toward greenfield. Brownfield modernization needs an added reverse-engineering (code → spec) phase not covered here in depth.

Roadmap

| Stage | Deliverable | Advance when |
| --- | --- | --- |
| 1 | Universal core templates + gates | CRUD web app built end-to-end, ≥90% traceability, zero hand-edits |
| 2 | Domain expansion packs (SaaS, game, mobile, systems) | Each pack raises completeness ≥15 pts vs core-only |
| 3 | Continuity instrumentation (mandatory memory, compaction, clarify gate) | Agent resumes after context loss, finishes with ≤10% extra tokens, ≥90% of trials |
| 4 | Evaluation harness + published benchmark | Four-axis benchmark on stratified workloads is public |

References

  1. GitHub. "Spec Kit — spec-driven development."
    https://github.com/github/spec-kit
  2. Amazon. "Kiro — spec-driven AI IDE."
    https://kiro.dev
  3. BMAD-METHOD (Breakthrough Method for Agile AI-Driven Development).
    https://github.com/bmad-code-org/BMAD-METHOD
  4. Hong, S. et al. "MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework." ICLR 2024. arXiv:2308.00352.
    https://arxiv.org/abs/2308.00352
  5. Qian, C. et al. "ChatDev: Communicative Agents for Software Development." ACL 2024. arXiv:2307.07924.
    https://arxiv.org/abs/2307.07924
  6. Nguyen, M. H. et al. "AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology." FORGE 2025. arXiv:2406.11912.
    https://arxiv.org/abs/2406.11912
  7. Huang, D. et al. "AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation." arXiv:2312.13010.
    https://arxiv.org/abs/2312.13010
  8. Jiang, X. et al. "Self-planning Code Generation with Large Language Models." TOSEM 2024. arXiv:2303.06689.
    https://arxiv.org/abs/2303.06689
  9. Mavin, A. et al. "Easy Approach to Requirements Syntax (EARS)." IEEE RE'09.
    https://ieeexplore.ieee.org/document/5328509
  10. Hong, K., Troynikov, A., Huber, J. "Context Rot: How Increasing Input Tokens Impacts LLM Performance." Chroma, 2025.
    https://research.trychroma.com/context-rot
  11. Anthropic. "Effective context engineering for AI agents." 2025.
    https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
  12. Anthropic. "Building effective agents / agent harness & memory guidance." 2024–2025.
    https://www.anthropic.com/engineering
  13. Hsieh, C-P. et al. "RULER: What's the Real Context Size of Your Long-Context Language Models?" arXiv:2404.06654.
    https://arxiv.org/abs/2404.06654
  14. Modarressi, A. et al. "NoLiMa: Long-Context Evaluation Beyond Literal Matching." arXiv:2502.05167.
    https://arxiv.org/abs/2502.05167
  15. OpenAI. "AGENTS.md — a simple, open format for guiding coding agents."
    https://agents.md
  16. "MADR — Markdown Architectural Decision Records," v4.0.
    https://adr.github.io/madr/
  17. Fan, Z. et al. "SpecFix: Repairing Ambiguous Requirements for Code Generation." arXiv:2505.07270.
    https://arxiv.org/abs/2505.07270
  18. Li, J. et al. "DevBench: A Comprehensive Benchmark for Software Development." arXiv:2403.08604.
    https://arxiv.org/abs/2403.08604
  19. "E2EDev: End-to-End Project-Level Code Generation Benchmark." arXiv:2510.14509.
    https://arxiv.org/abs/2510.14509
  20. Procida, D. "Diátaxis — a systematic framework for technical documentation."
    https://diataxis.fr

License

MIT. Part of the Agentlas Agent Lab public research program.
