Update claude md by freunda · Pull Request #78 · generative-computing/granite-switch

freunda · 2026-05-27T12:25:13Z

Summary

Trim CLAUDE.md (~393 → ~200 lines): removed sections that duplicate what Claude can infer from code (import paths, full directory tree, weight compatibility, config JSON examples). Kept only non-obvious, actionable rules.
Split architecture docs out of CLAUDE.md: created docs/ARCHITECTURE.md for theory (control tokens, backends, SingleSwitch). Fixed factually wrong SingleSwitch description (it's a single attention head with cumsum, not N transformer layers + projection head).
Push module-specific gotchas into child CLAUDE.md files: HF attention backend caveats → src/granite_switch/hf/CLAUDE.md, vLLM Punica/TP gotchas → src/granite_switch/vllm/CLAUDE.md, compose infra rule → src/granite_switch/composer/CLAUDE.md.
Create tutorials/CLAUDE.md: notebook cell ordering, HF login cell convention, duration comment placement, utility module references.
Fix stale claims in docs:
- README.md: Python 3.11+, PyTorch 2.10+, CUDA version remarks preserved, [dev] description corrected
- PREREQUISITES.md: Python 3.11+, RAG adapter count 5 → 6
- SUPPORTED_MODELS.md: remove wrong single-GPU-only claim (TP is supported and tested), add 30B model
- build_your_own_adapter.md: custom adapters are supported via Mellea's Intrinsic API — updated Step 4 note

CLAUDE.md was 393 lines and contained content Claude could infer from code (import paths, full directory tree, weight compatibility example, Switch Configuration JSON). Applying the Anthropic best- practice test — "would removing this cause Claude to make mistakes?" — cut it to 204 lines. Key changes: - Remove Import Paths, full Project Structure tree, Architecture section, Weight Compatibility, and Switch Configuration JSON - Fix the SingleSwitch description: the old text claimed "N transformer layers + linear projection head + ~1-2% of parameters", all of which are wrong. Actual implementation is a single attention head with one-hot dim-0 pattern and attention-based cumsum, with negligible parameter cost - Architecture theory moved to docs/ARCHITECTURE.md so it stays accessible but doesn't bloat the per-session context - Create tutorials/CLAUDE.md: Claude loads it automatically when reading any tutorials/ file, keeping notebook conventions (cell ordering, HF login cell, duration comments, utility modules) scoped to the directory where they apply

Installs two Claude Code skills in .claude/skills/: - validate-links: scans all .ipynb/.md/.py files for broken local links, stale labels (link text names the wrong file), and broken first-party imports after renames or restructuring. Proposes fixes; never edits without user confirmation. - tutorial-notebook: 15-item checklist + template for polishing or creating tutorial notebooks. Covers structure, correctness bugs, imports, comments, diagrams, demo coverage, and next-steps wiring. References added: - tutorials/CLAUDE.md: when to invoke each skill - docs/GIT_WORKFLOW.md: run /validate-links before any PR that touches notebooks or docs .gitignore updated to track .claude/skills/ while keeping local Claude settings (settings.json, etc.) ignored.

- README.md: Python 3.9+ → 3.11–3.13, PyTorch 2.0+ → 2.10+; [dev] described as "Everything" → accurate description; [vllm20] removed wrong "CUDA 13+" claim - PREREQUISITES.md: Python 3.10+ → 3.11–3.13; RAG adapter count 5 → 6 (granitelib-rag-r1.0 ships 6 adapters) - build_your_own_adapter.md: custom adapters ARE supported via Mellea's Intrinsic API; updated Step 4 note to reflect this

Apply the best-practices doc's child-directory pattern: gotchas scoped to a single backend get loaded on demand from src/granite_switch/{hf,vllm,composer}/CLAUDE.md instead of paying token cost in every session. Moved out of root: - vLLM: Punica -1 index detail, TP row-parallel bias-doubling, deployment commands - HF: eager-backend causal-masking quirk, fused-projections / bit-exact skip - composer: e2e-tests-must-use-compose rule, compose CLI Root keeps universal items (file org, test cadence, config params, control-token generatability, ALORA/LORA placement, hidden-count offset) plus a pointer block listing the child files. Drops root from 204 → 157 lines.

Llama is no longer supported. Drop the Granite-vs-Llama comparison gotcha and the parenthetical "main architectural difference with Llama" framing on logits_scaling in both CLAUDE.md and docs/ARCHITECTURE.md. Renumber the remaining root-level gotchas (1-4). Code references to Llama in src/granite_switch/vllm/core/decoder.py are kept: they document why the RMSNorm dispatch helper exists (different vLLM model classes use different calling conventions) and are not support claims.

TP is supported and tested (tests cover TP specifically). The single-GPU-only note was stale and excluded the 30B model.

Three small cuts that pass the "would removing this cause Claude to make mistakes?" test: 1. Test Files section — drop the per-subdirectory enumeration; the same list already appears in Project Structure. Keep the load-bearing rule (regression tests only, use scratch/ for throwaway). 2. Naming Conventions — drop test_*.py (pytest default) and snake_case.py (PEP 8). Keep only the non-default UPPER_CASE.md rule, renamed to "Documentation Naming". 3. Git Workflow — collapse the bullet list that restated GIT_WORKFLOW.md. Keep one-line pointer plus the "never sign as Claude" rule (the only item not covered by the linked doc). Drops root from 149 → 132 lines.

- Keep CUDA 12.x / CUDA 13+ distinction — useful context for users choosing a vLLM backend - Use '3.11+' instead of '3.11-3.13' — upper bound in pyproject.toml reflects untested versions, not incompatibility

scratch/ is gitignored, which makes it a per-developer convention rather than a project rule. Mandating it in the shared CLAUDE.md presumes every developer wants that workflow. The load-bearing rule for the project is "don't put throwaway scripts in tests/" (because pytest tests/ would pick them up); the scratch/ recommendation is just one possible workaround. Anyone who wants that convention can put it in their own CLAUDE.local.md. Removes two mentions: the Project Structure bullet and the parenthetical in the Test Files section.

…4.0-micro - Revert README Python/PyTorch versions and [dev] description (moving to separate PR) - Revert PREREQUISITES.md RAG adapter count (moving to separate PR) - Remove granite-4.0-micro from SUPPORTED_MODELS.md per review comment

antonpibm · 2026-05-28T06:08:54Z

 | `ibm-granite/granite-4.1-3b` | 3B | Dense, instruct |
 | `ibm-granite/granite-4.1-8b` | 8B | Dense, instruct |
-| `ibm-granite/granite-4.0-micro` | 3B | Dense, instruct |
+| `ibm-granite/granite-4.1-30b` | 30B | Dense, instruct |


granite-4.0-micro is still supported

antonpibm · 2026-05-28T06:14:08Z

+  sequence) or right before the generation prompt
+- **LORA adapters**: token placed at sequence beginning
+
+### 4. Optional Trainable Router (SingleSwitch)


Can we adapt this section to better reflect how the switch really works?

antonpibm · 2026-05-28T06:15:32Z

+
+## Two Backends
+
+Both backends share the same checkpoint format (`save_pretrained` / `from_pretrained`).


from_pretrained is HF API. The checkpoint format is called Safetensors format

antonpibm · 2026-05-28T06:16:26Z

+
+Full `transformers` integration (`PreTrainedModel`, `GenerationMixin`). Used for training and
+debugging. Uses fused QKV and gate-up projections, which changes floating-point reduction order
+relative to the upstream `GraniteMoeHybridForCausalLM` (see Common Gotchas #9 in `CLAUDE.md`).


Enumeration is no longer correct

antonpibm · 2026-05-28T06:17:21Z

+| `num_adapters` | Number of embedded LoRA adapters |
+| `adapter_token_ids` | Token IDs for each adapter's control token |
+| `adapter_names` | Human-readable names for each adapter |
+| `hiding_groups` | Named groups of adapters for KV hiding |


Not up-to date. Are you referring to the object args or the model config?

antonpibm · 2026-05-28T06:19:36Z

+group-based control dimensions (`K=finfo.min`, `Q=per-adapter policy`). Control tokens are
+KV-hidden to prevent cross-request interference.
+
+### 3. Chat Template Integration


Should we add the fact that those are activated using the adapter_name arg in apply_chat_template?

Alon Freund and others added 11 commits May 27, 2026 11:18

Fix SUPPORTED_MODELS.md: remove wrong single-GPU claim, add 30B model

6cb87fb

TP is supported and tested (tests cover TP specifically). The single-GPU-only note was stale and excluded the 30B model.

Restore CUDA version remarks; use Python 3.11+ in README install section

053760c

- Keep CUDA 12.x / CUDA 13+ distinction — useful context for users choosing a vLLM backend - Use '3.11+' instead of '3.11-3.13' — upper bound in pyproject.toml reflects untested versions, not incompatibility

Remove skills — will be added separately

c8f23d8

Remove skill references until skills are added back

10b1e50

yairallouche reviewed May 27, 2026

View reviewed changes

Comment thread docs/SUPPORTED_MODELS.md Outdated

Comment thread README.md Outdated

Comment thread README.md

Comment thread tutorials/PREREQUISITES.md

AlonMalach force-pushed the feature/trim-claude-md branch from 5de1809 to dc04aea Compare May 27, 2026 14:39

freunda marked this pull request as ready for review May 27, 2026 15:54

freunda requested a review from antonpibm as a code owner May 27, 2026 15:54

freunda requested a review from yairallouche May 27, 2026 15:54

antonpibm reviewed May 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update claude md#78

Update claude md#78
freunda wants to merge 12 commits into
mainfrom
feature/trim-claude-md

freunda commented May 27, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

antonpibm May 28, 2026

Uh oh!

antonpibm May 28, 2026

Uh oh!

antonpibm May 28, 2026

Uh oh!

antonpibm May 28, 2026

Uh oh!

antonpibm May 28, 2026

Uh oh!

antonpibm May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants


		## Two Backends

		Both backends share the same checkpoint format (`save_pretrained` / `from_pretrained`).

Conversation

freunda commented May 27, 2026

Summary

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

antonpibm May 28, 2026

Choose a reason for hiding this comment

Uh oh!

antonpibm May 28, 2026

Choose a reason for hiding this comment

Uh oh!

antonpibm May 28, 2026

Choose a reason for hiding this comment

Uh oh!

antonpibm May 28, 2026

Choose a reason for hiding this comment

Uh oh!

antonpibm May 28, 2026

Choose a reason for hiding this comment

Uh oh!

antonpibm May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants