feat(detect): surface windowed match count/proportion in evidence#247
Open
Lioscro wants to merge 2 commits into
Open
feat(detect): surface windowed match count/proportion in evidence#247Lioscro wants to merge 2 commits into
Lioscro wants to merge 2 commits into
Conversation
Single-position `match_count`/`match_proportion` describe only the canonical position; they understate what `cyto map --remap-window N` would actually score when libraries have positional drift (V2 GEX `[:18]` spacer drift, V1 `[probe]` jitter). Add `windowed_match_count`/`windowed_match_proportion` that sum hits over `[pos ± remap_window]` on the same mate, so detect's stderr lets users see the effective match rate at the recommended window. In `validate_and_aggregate`, per-file `PositionAccumulator`s are now plumbed through and merged for an aggregated re-walk at `max_remap_window` -- naive sum-of-per-file under-counts when per-file `W` differs from the aggregated `W`. Test `test_validate_and_aggregate_windowed_cross_file` exercises this divergence (merged 12500 vs sum-of-per-file 12000). For short references like the 8bp `[probe]` multiplex barcode, a single read can contribute hits at multiple positions in the window, so `windowed_match_count` is an upper bound on what `cyto map` would score, not an exact count. The doc comment on `ComponentEvidence::windowed_match_count` calls this out explicitly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces windowed match counts and proportions to the geometry detection module to better estimate read counts within a remap window, addressing positional drift in short references like probes. It updates ComponentEvidence and PositionAccumulator to calculate and aggregate these metrics across files, and adds comprehensive unit and integration tests. The feedback suggests using saturating_add when calculating the upper bound of the window to defensively prevent potential overflow panics.
Mirror the saturating_sub already used for the lower bound. Overflow is unreachable in practice (both best_pos and window are bounded by read length) but the symmetry makes the intent obvious and matches the doc-comment range expression. Addresses Gemini PR #247 review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
windowed_match_count: usizeandwindowed_match_proportion: f64toComponentEvidence. They sum hits across[pos ± remap_window]on the same mate, surfacing whatcyto map --remap-window Nwould score for each component.PositionAccumulators intovalidate_and_aggregateso the aggregated path merges accumulators and re-walks atmax_remap_window(centered on file 0's best position).Why
Per-component evidence today reports counts at a single canonical position only. When a library has positional drift -- V2 GEX
[:18]spacer drift, V1[probe]jitter -- the headline proportion looks low even thoughcyto map --remap-window Nwould score most of those reads. The windowed values let users see the effective rate at the recommended window.What this looks like
Test plan
cargo test -p cyto-map-- 76 unit tests (69 + 7 new) + 4 integration tests, all pass.cargo test --workspacegreen.cargo clippy -p cyto-map --all-targets --no-deps -- -D warnings-- zero new errors beyond the 10 lib + 1 lib-test pre-existing on base.*p == best_pos) catches tests 2, 3, 5, 6, 7. Mutation B (formula short-circuit to0) catches tests 1, 2, 3 + integration assertion. Union covers every new test; production code restored after each.cyto detect gexandcyto detect crispr): windowed tokens emitted on every per-component line; probe showswindowed_count > match_counton V1 fixture; recommended-remap-window line still emitted; stdout contract unchanged.test_detect_gex_geometry_from_binseqasserts probe positional drift via strictwindowed_match_count > match_count.🤖 Generated with Claude Code