Update claude md #78
Open
freunda wants to merge 12 commits into
CLAUDE.md was 393 lines and contained content Claude could infer from code (import paths, full directory tree, weight compatibility example, Switch Configuration JSON). Applying the Anthropic best-practice test ("would removing this cause Claude to make mistakes?") cut it to 204 lines.

Key changes:
- Remove Import Paths, full Project Structure tree, Architecture section, Weight Compatibility, and Switch Configuration JSON
- Fix the SingleSwitch description: the old text claimed "N transformer layers + linear projection head + ~1-2% of parameters", all of which are wrong. Actual implementation is a single attention head with one-hot dim-0 pattern and attention-based cumsum, with negligible parameter cost
- Architecture theory moved to docs/ARCHITECTURE.md so it stays accessible but doesn't bloat the per-session context
- Create tutorials/CLAUDE.md: Claude loads it automatically when reading any tutorials/ file, keeping notebook conventions (cell ordering, HF login cell, duration comments, utility modules) scoped to the directory where they apply
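The phrase "attention-based cumsum" can be illustrated generically. The sketch below is not the SingleSwitch implementation; it only shows the general fact that a causal attention head with identical scores for every visible position produces a running mean, which becomes a cumulative sum once rescaled by the position count. The function names and the `flags` sequence are hypothetical.

```python
import math


def causal_uniform_attention(values):
    """One causal attention head whose scores are equal for all visible positions."""
    outputs = []
    for t in range(len(values)):
        scores = [0.0] * (t + 1)                 # equal logits over positions 0..t
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]      # softmax gives 1 / (t + 1) each
        outputs.append(sum(w * v for w, v in zip(weights, values[: t + 1])))
    return outputs


def attention_cumsum(values):
    """Rescale the running mean by the number of visible positions to get a cumsum."""
    means = causal_uniform_attention(values)
    return [m * (t + 1) for t, m in enumerate(means)]


# Hypothetical input: 1.0 marks a position of interest, 0.0 everything else.
flags = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
running = attention_cumsum(flags)  # approximately [0, 1, 1, 1, 2, 2]
```

The head needs no learned weights to do this, which is consistent with the "negligible parameter cost" noted above, although the actual parameterization in the repository may differ.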
Installs two Claude Code skills in .claude/skills/:
- validate-links: scans all .ipynb/.md/.py files for broken local links, stale labels (link text names the wrong file), and broken first-party imports after renames or restructuring. Proposes fixes; never edits without user confirmation.
- tutorial-notebook: 15-item checklist + template for polishing or creating tutorial notebooks. Covers structure, correctness bugs, imports, comments, diagrams, demo coverage, and next-steps wiring.

References added:
- tutorials/CLAUDE.md: when to invoke each skill
- docs/GIT_WORKFLOW.md: run /validate-links before any PR that touches notebooks or docs

.gitignore updated to track .claude/skills/ while keeping local Claude settings (settings.json, etc.) ignored.
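For a sense of what the broken-local-link part of validate-links checks, here is a minimal sketch. It is not the skill itself (which is a Claude Code skill and also handles .ipynb/.py files, stale labels, and imports); it covers only Markdown files, and the function name, regex, and demo files are assumptions made for illustration.

```python
import re
import tempfile
from pathlib import Path

# Matches inline Markdown links of the form [label](target).
LINK_RE = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")
# Matches targets that start with a URL scheme such as https: or mailto:.
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def find_broken_local_links(root):
    """Return (file, target) pairs for Markdown links whose local target is missing."""
    root = Path(root)
    broken = []
    for md_file in sorted(root.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        for _label, target in LINK_RE.findall(text):
            if SCHEME_RE.match(target) or target.startswith("#"):
                continue  # skip external URLs and in-page anchors
            path_part = target.split("#", 1)[0]  # drop any trailing anchor
            if not (md_file.parent / path_part).exists():
                broken.append((str(md_file.relative_to(root)), target))
    return broken


# Demo on a throwaway directory with one valid and one broken local link.
with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp) / "docs"
    docs.mkdir()
    (docs / "ARCHITECTURE.md").write_text("# Architecture\n", encoding="utf-8")
    (Path(tmp) / "README.md").write_text(
        "[arch](docs/ARCHITECTURE.md) [old](docs/MISSING.md) [web](https://example.com)\n",
        encoding="utf-8",
    )
    broken = find_broken_local_links(tmp)
```

Like the skill, the sketch only reports problems; it never edits files.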
- README.md: Python 3.9+ → 3.11–3.13, PyTorch 2.0+ → 2.10+; [dev] described as "Everything" → accurate description; [vllm20] removed wrong "CUDA 13+" claim
- PREREQUISITES.md: Python 3.10+ → 3.11–3.13; RAG adapter count 5 → 6 (granitelib-rag-r1.0 ships 6 adapters)
- build_your_own_adapter.md: custom adapters ARE supported via Mellea's Intrinsic API; updated Step 4 note to reflect this
Apply the best-practices doc's child-directory pattern: gotchas scoped to a
single backend get loaded on demand from src/granite_switch/{hf,vllm,composer}/CLAUDE.md
instead of paying token cost in every session.
Moved out of root:
- vLLM: Punica -1 index detail, TP row-parallel bias-doubling, deployment commands
- HF: eager-backend causal-masking quirk, fused-projections / bit-exact skip
- composer: e2e-tests-must-use-compose rule, compose CLI
Root keeps universal items (file org, test cadence, config params, control-token
generatability, ALORA/LORA placement, hidden-count offset) plus a pointer block
listing the child files. Drops root from 204 → 157 lines.
Llama is no longer supported. Drop the Granite-vs-Llama comparison gotcha and the parenthetical "main architectural difference with Llama" framing on logits_scaling in both CLAUDE.md and docs/ARCHITECTURE.md. Renumber the remaining root-level gotchas (1-4). Code references to Llama in src/granite_switch/vllm/core/decoder.py are kept: they document why the RMSNorm dispatch helper exists (different vLLM model classes use different calling conventions) and are not support claims.
TP is supported and tested (tests cover TP specifically). The single-GPU-only note was stale and excluded the 30B model.
Three small cuts that pass the "would removing this cause Claude to make mistakes?" test:
1. Test Files section: drop the per-subdirectory enumeration; the same list already appears in Project Structure. Keep the load-bearing rule (regression tests only, use scratch/ for throwaway).
2. Naming Conventions: drop test_*.py (pytest default) and snake_case.py (PEP 8). Keep only the non-default UPPER_CASE.md rule, renamed to "Documentation Naming".
3. Git Workflow: collapse the bullet list that restated GIT_WORKFLOW.md. Keep one-line pointer plus the "never sign as Claude" rule (the only item not covered by the linked doc).

Drops root from 149 → 132 lines.
- Keep CUDA 12.x / CUDA 13+ distinction: useful context for users choosing a vLLM backend
- Use '3.11+' instead of '3.11-3.13': the upper bound in pyproject.toml reflects untested versions, not incompatibility
scratch/ is gitignored, which makes it a per-developer convention rather than a project rule. Mandating it in the shared CLAUDE.md presumes every developer wants that workflow. The load-bearing rule for the project is "don't put throwaway scripts in tests/" (because pytest tests/ would pick them up); the scratch/ recommendation is just one possible workaround. Anyone who wants that convention can put it in their own CLAUDE.local.md. Removes two mentions: the Project Structure bullet and the parenthetical in the Test Files section.
…4.0-micro
- Revert README Python/PyTorch versions and [dev] description (moving to separate PR)
- Revert PREREQUISITES.md RAG adapter count (moving to separate PR)
- Remove granite-4.0-micro from SUPPORTED_MODELS.md per review comment
5de1809 to dc04aea
antonpibm
reviewed
May 28, 2026
| `ibm-granite/granite-4.1-3b` | 3B | Dense, instruct |
| `ibm-granite/granite-4.1-8b` | 8B | Dense, instruct |
| `ibm-granite/granite-4.0-micro` | 3B | Dense, instruct |
| `ibm-granite/granite-4.1-30b` | 30B | Dense, instruct |
Collaborator
granite-4.0-micro is still supported
antonpibm
reviewed
May 28, 2026
sequence) or right before the generation prompt
- **LORA adapters**: token placed at sequence beginning

### 4. Optional Trainable Router (SingleSwitch)
Collaborator
Can we adapt this section to better reflect how the switch really works?
antonpibm
reviewed
May 28, 2026
## Two Backends

Both backends share the same checkpoint format (`save_pretrained` / `from_pretrained`).
Collaborator
`from_pretrained` is the HF API. The checkpoint format itself is called the Safetensors format.
antonpibm
reviewed
May 28, 2026
Full `transformers` integration (`PreTrainedModel`, `GenerationMixin`). Used for training and
debugging. Uses fused QKV and gate-up projections, which changes floating-point reduction order
relative to the upstream `GraniteMoeHybridForCausalLM` (see Common Gotchas #9 in `CLAUDE.md`).
Collaborator
The gotcha numbering referenced here (#9) is no longer correct.
antonpibm
reviewed
May 28, 2026
| `num_adapters` | Number of embedded LoRA adapters |
| `adapter_token_ids` | Token IDs for each adapter's control token |
| `adapter_names` | Human-readable names for each adapter |
| `hiding_groups` | Named groups of adapters for KV hiding |
Collaborator
Not up to date. Are you referring to the object args or the model config?
antonpibm
reviewed
May 28, 2026
group-based control dimensions (`K=finfo.min`, `Q=per-adapter policy`). Control tokens are
KV-hidden to prevent cross-request interference.

### 3. Chat Template Integration
Collaborator
Should we add that these are activated using the `adapter_name` arg in `apply_chat_template`?
Summary
- `CLAUDE.md` (~393 → ~200 lines): removed sections that duplicate what Claude can infer from code (import paths, full directory tree, weight compatibility, config JSON examples). Kept only non-obvious, actionable rules.
- `CLAUDE.md`: created `docs/ARCHITECTURE.md` for theory (control tokens, backends, SingleSwitch). Fixed factually wrong SingleSwitch description (it's a single attention head with cumsum, not N transformer layers + projection head).
- `CLAUDE.md` files: HF attention backend caveats → `src/granite_switch/hf/CLAUDE.md`, vLLM Punica/TP gotchas → `src/granite_switch/vllm/CLAUDE.md`, compose infra rule → `src/granite_switch/composer/CLAUDE.md`.
- `tutorials/CLAUDE.md`: notebook cell ordering, HF login cell convention, duration comment placement, utility module references.
- `README.md`: Python 3.11+, PyTorch 2.10+, CUDA version remarks preserved, `[dev]` description corrected
- `PREREQUISITES.md`: Python 3.11+, RAG adapter count 5 → 6
- `SUPPORTED_MODELS.md`: remove wrong single-GPU-only claim (TP is supported and tested), add 30B model
- `build_your_own_adapter.md`: custom adapters are supported via Mellea's `Intrinsic` API; updated Step 4 note