Splice fixes for chanmon_consistency fuzz target#4655
Conversation
|
👋 Thanks for assigning @TheBlueMatt as a reviewer! |
| for (idx, node) in nodes.iter().enumerate() { | ||
| if node.broadcaster.txn_broadcasted.borrow().is_empty() { | ||
| continue; | ||
| } | ||
|
|
||
| let pending_events = node.get_and_clear_pending_events(); |
There was a problem hiding this comment.
The verification is one-directional: for each broadcast tx, you check a matching SpliceNegotiated event exists, but you don't verify the reverse — that every pending SpliceNegotiated event has a corresponding broadcast tx. A node with no broadcasts (skipped at line 2017) but with dangling SpliceNegotiated events passes silently.
Consider also checking nodes that have no broadcasts but do have pending SpliceNegotiated events, as that would indicate the splice tx was not broadcast despite an event being generated.
| for (idx, node) in nodes.iter().enumerate() { | |
| if node.broadcaster.txn_broadcasted.borrow().is_empty() { | |
| continue; | |
| } | |
| let pending_events = node.get_and_clear_pending_events(); | |
| for (idx, node) in nodes.iter().enumerate() { | |
| let pending_events = node.get_and_clear_pending_events(); | |
| if node.broadcaster.txn_broadcasted.borrow().is_empty() { | |
| assert!( | |
| !pending_events.iter().any(|e| matches!(e, events::Event::SpliceNegotiated { .. })), | |
| "node {} has pending SpliceNegotiated event(s) but no broadcast tx", | |
| idx, | |
| ); | |
| continue; | |
| } |
|
The current diff contains only two hunks, both of which I examined. The error string No new issues found in this diff beyond my prior review comment. The two changes are correct:
My previously posted comment on the one-directional broadcast/event verification in |
|
🔔 1st Reminder Hey @TheBlueMatt! This PR has been waiting for your review. |
|
🔔 2nd Reminder Hey @TheBlueMatt! This PR has been waiting for your review. |
|
🔔 3rd Reminder Hey @TheBlueMatt! This PR has been waiting for your review. |
| use std::sync::{Arc, Mutex}; | ||
|
|
||
| const MAX_FEE: u32 = 10_000; | ||
| const MAX_SETTLE_ITERATIONS: usize = 256; |
There was a problem hiding this comment.
Codex's analysis on why it's needed:
• They are draining real protocol work in a very serialized harness loop.
process_all_events() checks work in this order and restarts the loop after the
first thing that makes progress:
1. manager persistence / monitor completions
2. node 0 messages
3. node 1 messages
4. node 2 messages
5. node 0 events
6. node 1 events
7. node 2 events
8. quiet pass
So one commitment round can take several loop iterations: deliver
commitment_signed, complete monitor update, deliver revoke_and_ack, complete
another monitor update, process PaymentClaimable, send update_fulfill_htlc, then
another commitment_signed/revoke_and_ack round.
For payload 1, the long 4:102 loop is the final ff. It is settling:
- probe HTLCs left by the previous settle_all() channel-liveness check
- a0: A starts an A-B splice
- 41: B pays A
- 4d: C pays A via B
- a3: C starts a B-C splice
- 73: A pays C via B with MPP split over the B-C channels
In that one settle region, the normal log shows this wire work:
52 commitment_signed
48 revoke_and_ack
15 update_add_htlc
15 update_fulfill_htlc
8 tx_complete
5 tx_add_input
5 tx_add_output
4 tx_signatures
4 stfu
1 tx_init_rbf
1 tx_ack_rbf
1 splice_init
1 splice_ack
So it is mostly HTLC commitment handshakes. The splice/RBF messages are there
too, but the big multiplier is: multiple channels, MPP parts, previous settle
probe payments, and each add/fulfill needing the commitment dance.
For payload 2, the long 0:103 loop is the first explicit ff. Before it, the
payload has already staged both A-B and B-C splices, partially delivered some
splice messages manually, sent a tiny A->B payment, sent two MPP A->B->C
payments, two B->C direct payments, and started another B-C splice. Its config
byte also makes node B’s monitors start as InProgress and makes A/C use deferred
monitor writes.
That settle region delivers:
26 commitment_signed
18 revoke_and_ack
15 update_add_htlc
15 update_fulfill_htlc
16 tx_complete
10 tx_add_input
10 tx_add_output
8 tx_signatures
4 stfu
2 tx_init_rbf
2 tx_ack_rbf
2 splice_ack
The first visible HTLC burst is especially dense: A sends 7 update_add_htlcs to B
in one commitment update, from the tiny direct A->B payment plus the two 3-part
MPP first hops. B later sends 8 update_add_htlcs to C from the two direct B->C
payments plus the two 3-part MPP second hops. Then C claims, fulfills go back
C->B, then B fulfills back B->A, each with more commitment handshakes and monitor
completions.
The last iterations are not extra messages. They are cleanup:
- payload 1: iteration 99 is still monitor/persistence progress, then 100 and 101
are the two quiet passes.
- payload 2: iteration 100 is still monitor/persistence progress, then 101 and
102 are the two quiet passes.
That is why the old 100 cap was wrong: it fired while the harness was still
making legitimate forward progress, right before quiescence.
Now that the fuzz target supports canceling splice funding attempts, we may see failed signing attempts due to the cancellation.
LDK and the chanmon_consistency fuzz target have grown in complexity recently and thus require more iterations than previously assumed to fully settle the state of all active channels.
baa5d0c to
099bb09
Compare
| Err(APIError::APIMisuseError { ref err }) | ||
| if err.contains("not expecting funding signatures") => | ||
| { | ||
| // A queued signing event can be invalidated by a later `tx_abort` |
There was a problem hiding this comment.
Stricter invariant might check if the tx_abort has indeed happened.
There was a problem hiding this comment.
tx_abort can happen for other reasons as well. cancel_funding_contributed would be a better signal, but that's still not enough to identify this case, e.g., both nodes of the splice have to sign but one canceled, so the other node attempting to sign will fail even though they didn't cancel.
While this target found several issues in our production code, there were also issues with the fuzz target itself, which this PR addresses. It fixes the following payloads from #4363: