feat(detect): surface windowed match count/proportion in evidence #247

Open
Lioscro wants to merge 2 commits into joseph.min/remap-window-probe-inclusion from joseph.min/windowed-match-proportion

@Lioscro Lioscro commented Jun 23, 2026

Collaborator

Summary

  • Adds windowed_match_count: usize and windowed_match_proportion: f64 to ComponentEvidence. They sum hits across [pos ± remap_window] on the same mate, surfacing what cyto map --remap-window N would score for each component.
  • Plumbs per-file PositionAccumulators into validate_and_aggregate so the aggregated path merges accumulators and re-walks at max_remap_window (centered on file 0's best position).
  • Surfaces the new values inline in both stderr log functions. No CLI changes, no mapper changes, no stdout-contract change.
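A minimal sketch of the windowed sum described above, for readers who want the shape of the computation without opening the diff. It is an illustration only: `PositionCounts` and both function signatures are hypothetical stand-ins and are not the real `PositionAccumulator` API; only the field names `windowed_match_count` and `windowed_match_proportion` come from this PR.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a per-mate position histogram:
/// maps a match start position to the number of hits recorded there.
type PositionCounts = BTreeMap<usize, usize>;

/// Sum hits over the inclusive range [best_pos - window, best_pos + window].
/// Both bounds saturate, so a window that reaches past 0 or usize::MAX is clamped.
fn windowed_match_count(counts: &PositionCounts, best_pos: usize, window: usize) -> usize {
    let lo = best_pos.saturating_sub(window);
    let hi = best_pos.saturating_add(window);
    counts.range(lo..=hi).map(|(_, n)| *n).sum()
}

/// Windowed hits as a fraction of sampled reads; 0.0 when nothing was sampled.
/// Assumption: the denominator is the total number of sampled reads.
fn windowed_match_proportion(windowed: usize, total_reads: usize) -> f64 {
    if total_reads == 0 {
        0.0
    } else {
        windowed as f64 / total_reads as f64
    }
}
```

With `window = 0` the sum collapses to the single-position count, so in this sketch the windowed value can never be smaller than `match_count`.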

Why

Per-component evidence today reports counts at a single canonical position only. When a library has positional drift -- V2 GEX [:18] spacer drift, V1 [probe] jitter -- the headline proportion looks low even though cyto map --remap-window N would score most of those reads. The windowed values let users see the effective rate at the recommended window.

What this looks like

[2026-06-23T23:00:43.804Z INFO  cyto_map::detect] Detected geometry: `[barcode][umi:12][:13][probe] | [gex]`
[2026-06-23T23:00:43.804Z INFO  cyto_map::detect] Recommended --remap-window: 2
[2026-06-23T23:00:43.804Z INFO  cyto_map::detect] Detection sampled 400000 reads total (4 files)
[2026-06-23T23:00:43.804Z INFO  cyto_map::detect]   [barcode] R1 pos=0 count=371755 proportion=0.9294 windowed_count=372522 windowed_proportion=0.9313
[2026-06-23T23:00:43.804Z INFO  cyto_map::detect]     alt: R2 pos=17 count=19615
[2026-06-23T23:00:43.804Z INFO  cyto_map::detect]     alt: R2 pos=3 count=9423
[2026-06-23T23:00:43.804Z INFO  cyto_map::detect]     alt: R2 pos=31 count=6000
[2026-06-23T23:00:43.804Z INFO  cyto_map::detect]   [gex] R2 pos=0 count=371995 proportion=0.9300 windowed_count=371995 windowed_proportion=0.9300
[2026-06-23T23:00:43.804Z INFO  cyto_map::detect]   [probe] R1 pos=41 count=101140 proportion=0.2529 windowed_count=378998 windowed_proportion=0.9475
[2026-06-23T23:00:43.804Z INFO  cyto_map::detect]     alt: R1 pos=40 count=96612
[2026-06-23T23:00:43.804Z INFO  cyto_map::detect]     alt: R1 pos=42 count=90595
[2026-06-23T23:00:43.804Z INFO  cyto_map::detect]     alt: R1 pos=43 count=82843
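The numbers in the log above can be cross-checked with a few lines of arithmetic. The sketch below assumes the proportions are computed against the 400000 sampled reads reported in the log; the helper names are hypothetical and exist only for this check.

```rust
/// Format a count as a proportion of sampled reads with four decimals,
/// the precision shown in the log lines. Returns "0.0000" for an empty sample.
fn proportion_4dp(count: u64, total: u64) -> String {
    let p = if total == 0 { 0.0 } else { count as f64 / total as f64 };
    format!("{:.4}", p)
}

/// Hits that fall inside the window but not at the canonical position.
fn recovered_by_window(match_count: u64, windowed_count: u64) -> u64 {
    windowed_count.saturating_sub(match_count)
}
```

For `[probe]`, `proportion_4dp(378_998, 400_000)` gives `0.9475`, and `recovered_by_window(101_140, 378_998)` gives 277858 hits picked up by the window. The three listed alternates (pos 40, 42, 43) account for 270050 of those; if the window is the recommended 2, the remaining 7808 would sit at pos 39, which the log does not list.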

Test plan

  • cargo test -p cyto-map: 76 unit tests (69 existing + 7 new) and 4 integration tests, all passing.
  • cargo test --workspace: green.
  • cargo clippy -p cyto-map --all-targets --no-deps -- -D warnings: no new errors; the 10 lib + 1 lib-test errors already present on base are unchanged.
  • Mutation experiment (two rounds): Mutation A (range predicate collapsed to *p == best_pos) is caught by tests 2, 3, 5, 6, 7. Mutation B (formula short-circuited to 0) is caught by tests 1, 2, 3 and the integration assertion. Union covers every new test; production code restored after each.
  • Manual fixture smoke (cyto detect gex and cyto detect crispr): windowed tokens emitted on every per-component line; probe shows windowed_count > match_count on V1 fixture; recommended-remap-window line still emitted; stdout contract unchanged.
  • Integration test test_detect_gex_geometry_from_binseq asserts probe positional drift via strict windowed_match_count > match_count.

🤖 Generated with Claude Code

Single-position `match_count`/`match_proportion` describe only the canonical
position; they understate what `cyto map --remap-window N` would actually
score when libraries have positional drift (V2 GEX `[:18]` spacer drift,
V1 `[probe]` jitter). Add `windowed_match_count`/`windowed_match_proportion`
that sum hits over `[pos ± remap_window]` on the same mate, so detect's
stderr lets users see the effective match rate at the recommended window.

In `validate_and_aggregate`, per-file `PositionAccumulator`s are now plumbed
through and merged for an aggregated re-walk at `max_remap_window` -- naive
sum-of-per-file under-counts when per-file `W` differs from the aggregated
`W`. Test `test_validate_and_aggregate_windowed_cross_file` exercises this
divergence (merged 12500 vs sum-of-per-file 12000).
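To illustrate why the two aggregation strategies diverge, here is a toy sketch. Everything in it is hypothetical: the histogram type, the function names, and the counts are made up for illustration and do not reproduce the fixture behind the 12500 vs 12000 figures.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a per-file position histogram (position -> hits).
type PositionCounts = BTreeMap<usize, usize>;

/// Sum hits over the inclusive window [center - w, center + w].
fn windowed(counts: &PositionCounts, center: usize, w: usize) -> usize {
    let (lo, hi) = (center.saturating_sub(w), center.saturating_add(w));
    counts.range(lo..=hi).map(|(_, n)| *n).sum()
}

/// Naive aggregation: window each file at its own W, then add the results.
fn sum_of_per_file(files: &[(PositionCounts, usize)], center: usize) -> usize {
    files.iter().map(|(counts, w)| windowed(counts, center, *w)).sum()
}

/// Aggregated path: merge all histograms, then re-walk once at the largest W.
fn merged_rewalk(files: &[(PositionCounts, usize)], center: usize) -> usize {
    let max_w = files.iter().map(|(_, w)| *w).max().unwrap_or(0);
    let mut merged = PositionCounts::new();
    for (counts, _) in files {
        for (&pos, &n) in counts {
            *merged.entry(pos).or_insert(0) += n;
        }
    }
    windowed(&merged, center, max_w)
}
```

Take file 0 as `{41: 100, 42: 5}` with per-file `W = 0` and file 1 as `{40: 30, 41: 60, 43: 20}` with `W = 2`, both centered on 41. The naive sum is 100 + 110 = 210, while the merged re-walk at `W = 2` yields 215: the 5 hits at pos 42 in file 0 are only visible once that file is walked at the aggregated window.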

For short references like the 8bp `[probe]` multiplex barcode, a single
read can contribute hits at multiple positions in the window, so
`windowed_match_count` is an upper bound on what `cyto map` would score,
not an exact count. The doc comment on `ComponentEvidence::windowed_match_count`
calls this out explicitly.
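The upper-bound caveat can be shown with a toy example. This sketch assumes exact matching and a per-position tally; the 4bp reference, the reads, and the function names are hypothetical and chosen only to make the double hit easy to see.

```rust
/// Offsets in [center - window, center + window] where `reference`
/// matches `read` exactly. The upper bound is clamped to the read length.
fn hit_positions(read: &[u8], reference: &[u8], center: usize, window: usize) -> Vec<usize> {
    let lo = center.saturating_sub(window);
    let hi = center.saturating_add(window).min(read.len());
    (lo..=hi)
        .filter(|&p| read.get(p..p + reference.len()) == Some(reference))
        .collect()
}

/// Returns (sum of per-position hits, number of reads with at least one hit).
/// The first value is what a per-position tally sums to; the second is the
/// number of distinct reads that could be scored.
fn tally(reads: &[&[u8]], reference: &[u8], center: usize, window: usize) -> (usize, usize) {
    let mut position_hits = 0;
    let mut reads_hit = 0;
    for read in reads {
        let n = hit_positions(read, reference, center, window).len();
        position_hits += n;
        if n > 0 {
            reads_hit += 1;
        }
    }
    (position_hits, reads_hit)
}
```

The read `GGACACACGG` matches `ACAC` at both offset 2 and offset 4, so with `center = 2` and `window = 2` it contributes two hits to the per-position sum but is still a single read. Longer references make such self-overlapping matches less likely, which is why the caveat is framed around short ones.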

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces windowed match counts and proportions to the geometry detection module to better estimate read counts within a remap window, addressing positional drift in short references like probes. It updates ComponentEvidence and PositionAccumulator to calculate and aggregate these metrics across files, and adds comprehensive unit and integration tests. The feedback suggests using saturating_add when calculating the upper bound of the window to defensively prevent potential overflow panics.

Comment thread crates/cyto-map/src/detect.rs Outdated
Mirror the saturating_sub already used for the lower bound. Overflow
is unreachable in practice (both best_pos and window are bounded by
read length) but the symmetry makes the intent obvious and matches
the doc-comment range expression.
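For reference, the symmetric bounds look roughly like this. The helper name is hypothetical; `saturating_sub`, `saturating_add`, and `checked_add` are standard `usize` methods.

```rust
/// Inclusive window bounds around `best_pos`, clamped at 0 and usize::MAX
/// instead of wrapping or panicking.
fn window_bounds(best_pos: usize, window: usize) -> (usize, usize) {
    (
        best_pos.saturating_sub(window),
        best_pos.saturating_add(window),
    )
}
```

A plain `best_pos + window` would panic on overflow in debug builds and wrap in release builds; `(usize::MAX - 1).checked_add(2)` returns `None` for the same inputs, whereas the saturating form simply clamps the upper bound to `usize::MAX`.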

Addresses Gemini PR #247 review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>