Phase 3: write-path correctness, CRAM-writer hardening, contract docs#22
Merged
Conversation
Audits valid-input correctness and behavioral consistency that Phases 1–2 (cosmetic cleanups, malformed-input hardening) did not cover. Write-path error handling: - BAM encodeRecord now checks the return of every variable-length write (name/cigar/seq/qual/aux), matching the already-checked fixed fields, so a mid-stream I/O error propagates instead of silently truncating output. - Propagate the final gzip flush error in the CRAM encodeBlock path, and the gzip Close error on the SAM text reader's Close. CRAM writer correctness & contract: - Close() is now idempotent (closed flag), so a double-close no longer writes a second EOF container or double-closes the file; Write after Close errors. - New htsio.CigarQueryLen / htsio.ValidateCigarSeq; the CRAM writer rejects a record whose CIGAR query length disagrees with its SEQ length rather than silently dropping inserted/soft-clipped bases (it reconstructs SEQ from the CIGAR). The BAM writer stores SEQ verbatim, so it stays lenient by design. - Copy WriterOpts at construction so post-construction caller mutation cannot change writer behavior. - Clarify the RR (reference-required) flag (confirmed correct, not a bug). Concurrency audit (verify-then-fix): the sorted BAM writer and the parallel BGZF writer were reviewed and confirmed race-free (go test -race clean) and leak-free — no code changes needed. Documented the Semaphore.Close() contract. API contracts: aligned and documented Close() semantics (idempotent, flushes, writes trailer) across the SAM/BAM/CRAM writers — the SAM writer is now idempotent too — and documented SamReader's single-pass / shared-header / freshly-allocated-record semantics. Tests: cigar query-length + validation; CRAM double-Close, write-after-close, and CIGAR/SEQ-mismatch rejection; BAM mismatch leniency + failing-sink error propagation; SAM double-Close idempotency. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 3 added htsio.ValidateCigarSeq to the CRAM writer but left the BAM writer lenient. A BAM record whose CIGAR query length disagrees with len(SEQ) is malformed per the SAM spec, so the BAM writer now rejects it at Write time too, consistent with CRAM. Reworked the UMI-dedup tests to build well-formed records: the shared rec() helper now sizes SEQ to the CIGAR query length (via htsio.CigarQueryLen), using the caller's seq as a base pattern repeated/truncated to fit — a single helper change, no call-site edits. Flipped the BAM writer's mismatch test from "allows" to "rejects". No production path synthesizes mismatched records (UMI-dedup only rewrites the Flag on records read from a valid BAM), so this cannot reject legitimate output. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds a static-analysis gate (Printf format checks, mutex-by-value copies, struct tags, lost results, etc.) ahead of `go test`. Vet compiles the packages but needs no samtools; it runs after the samtools install for ordering simplicity. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Third QC pass, covering valid-input correctness and cross-implementation consistency that Phases 1–2 (cosmetic cleanups; malformed-input hardening) didn't address. Theme continues the prior phases: turn silent corruption/loss into explicit errors, and make SAM/BAM/CRAM behave consistently.
Write-path error handling
encodeRecordnow checks the return of every variable-length write (name/cigar/seq/qual/aux) — previously only the fixed fields were checked, so a mid-stream I/O error was silently swallowed and the writer kept emitting onto a failed sink.encodeBlock, and the gzipCloseerror in the SAM text reader'sClose.CRAM writer correctness & contract
Close()is now idempotent (matches BAM): a double-close no longer writes a second EOF container or double-closes the file;WriteafterCloseerrors.htsio.CigarQueryLen/htsio.ValidateCigarSeq; the CRAM writer rejects a record whose CIGAR query length ≠ SEQ length instead of silently dropping inserted/soft-clipped bases (it reconstructs SEQ from the CIGAR). The BAM writer stores SEQ verbatim, so it intentionally stays lenient.WriterOptsat construction so caller mutation can't change writer behavior mid-stream.RRflag (investigated — confirmed correct).Concurrency audit (verify-then-fix)
The sorted BAM writer and parallel BGZF writer were reviewed and confirmed race-free (
go test -raceclean) and leak-free — no code changes needed. Documented theSemaphore.Close()contract. (Reporting this explicitly so the "no change" is a deliberate, verified outcome, not an omission.)API contracts
Aligned and documented
Close()semantics (idempotent, flushes, writes trailer) across all three writers — the SAM writer is now idempotent too — and documentedSamReader's single-pass / shared-header / freshly-allocated-record semantics on the interface.Tests
CIGAR query-length + validation; CRAM double-Close, write-after-close, CIGAR/SEQ-mismatch rejection; BAM mismatch leniency + failing-sink error propagation; SAM double-Close idempotency.
Verification
go build ./...,go vet ./...,go test ./...all green (withsamtools 1.21), plusgo test -raceclean on the bgzf/bam/ontcmd/align packages.🤖 Generated with Claude Code