fix(fxconfig): make dropped notifications observable with logging and counter#218
Conversation
There was a problem hiding this comment.
Pull request overview
Improves observability of dropped transaction status notifications in fxconfig’s notification client, and also changes TLS configuration behavior to be secure-by-default across services.
Changes:
- Add drop observability to
NotificationClientvia WARN logging and an atomic dropped-notifications counter, and refactor dispatching for unit testability. - Flip TLS default to enabled-by-default, update tests/config samples accordingly, and warn in the CLI when TLS is explicitly disabled.
- Update docs to describe the new TLS default and the resulting validation behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/fxconfig/internal/client/notifications.go | Refactors notification dispatch and adds dropped-notification logging + counter. |
| tools/fxconfig/internal/client/notifications_test.go | Adds unit tests for the refactored collection/dispatch paths and drop counting. |
| tools/fxconfig/internal/config/config.go | Changes TLS default behavior to enabled-by-default via struct defaults/normalization. |
| tools/fxconfig/internal/config/config_test.go | Updates TLS normalization test expectations for the new default. |
| tools/fxconfig/internal/config/load_test.go | Expands TLS-related load tests, including a new “enabled by default” test. |
| tools/fxconfig/internal/cli/v1/root.go | Adds a CLI warning when any service has TLS disabled. |
| tools/fxconfig/integration/helpers_test.go | Updates integration test config generation to explicitly disable TLS. |
| tools/fxconfig/docs/README.md | Documents secure-by-default TLS behavior and migration guidance. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // goroutine, but every dropped event is logged and counted — the txID has | ||
| // already been removed from the subscribers map, so a drop is unrecoverable | ||
| // and must be observable. | ||
| func (n *NotificationClient) dispatchNotifications(notifications []notificationCall) { | ||
| for _, c := range notifications { | ||
| select { | ||
| case c.receiverQueue <- c.status: | ||
| default: | ||
| n.droppedNotifications.Add(1) | ||
| logger.Warnf( | ||
| "notification dropped (unrecoverable): txID=%s status=%d (subscriber buffer full or receiver gone)", | ||
| c.txID, c.status, | ||
| ) |
There was a problem hiding this comment.
dispatchNotifications logs a WARN for every dropped notification. Under sustained backpressure this can generate extremely high log volume and become a performance/operational issue. Consider rate-limiting/sampling this log (or logging periodic summaries keyed off droppedNotifications) while still incrementing the counter for full fidelity diagnostics.
| // goroutine, but every dropped event is logged and counted — the txID has | |
| // already been removed from the subscribers map, so a drop is unrecoverable | |
| // and must be observable. | |
| func (n *NotificationClient) dispatchNotifications(notifications []notificationCall) { | |
| for _, c := range notifications { | |
| select { | |
| case c.receiverQueue <- c.status: | |
| default: | |
| n.droppedNotifications.Add(1) | |
| logger.Warnf( | |
| "notification dropped (unrecoverable): txID=%s status=%d (subscriber buffer full or receiver gone)", | |
| c.txID, c.status, | |
| ) | |
| // goroutine. Dropped events are always counted, and warning logs are sampled | |
| // to avoid excessive log volume under sustained backpressure. The txID has | |
| // already been removed from the subscribers map, so a drop is unrecoverable | |
| // and must remain observable. | |
| func (n *NotificationClient) dispatchNotifications(notifications []notificationCall) { | |
| const droppedNotificationLogEvery = 100 | |
| for _, c := range notifications { | |
| select { | |
| case c.receiverQueue <- c.status: | |
| default: | |
| dropped := n.droppedNotifications.Add(1) | |
| if dropped == 1 || dropped%droppedNotificationLogEvery == 0 { | |
| logger.Warnf( | |
| "notification dropped (unrecoverable): droppedNotifications=%d latestTxID=%s status=%d (subscriber buffer full or receiver gone)", | |
| dropped, c.txID, c.status, | |
| ) | |
| } |
| Enabled *bool `mapstructure:"enabled" yaml:"enabled,omitempty" desc:"Enable/disable TLS" default:"true"` | ||
| ClientKeyPath string `mapstructure:"clientKey" yaml:"clientKey,omitempty" desc:"Path to TLS client private key"` |
There was a problem hiding this comment.
This PR’s title/description focuses on notification drop observability, but this change flips the default TLS behavior to enabled-by-default (a breaking behavioral change that will affect existing configs and validation). Please either update the PR description/title to explicitly call out the TLS default change and migration implications, or split the TLS work into a separate PR.
| // Normalize ensures the TLS config has an explicit enabled flag. | ||
| // Sets enabled to false if not specified. | ||
| // Sets enabled to true if not specified. | ||
| func (c *TLSConfig) Normalize() { | ||
| if c.Enabled == nil { | ||
| enabled := false | ||
| enabled := true | ||
| c.Enabled = &enabled | ||
| } | ||
| } |
There was a problem hiding this comment.
Normalize now defaults Enabled to true, but IsEnabled still returns false when Enabled is nil. That makes the effective default depend on whether callers remembered to run Normalize/ResolveTLS, which is easy to miss in programmatic uses. Consider aligning IsEnabled with the new secure-by-default behavior (or documenting clearly that callers must normalize before checking).
The notification dispatcher in NotificationClient.listen used a bare non-blocking send with a comment, silently losing status events whenever the subscriber buffer was full or the receiver had given up. Because the txID is removed from the subscribers map before delivery is attempted, a dropped event is unrecoverable — and indistinguishable from a genuine committer timeout from the caller's perspective. Signed-off-by: mayanksharmaCSE <mayanksharmacse1@gmail.com>
07b1b8d to
3c5c794
Compare
Type of change
NotificationClient#217Description
The notification dispatcher in
NotificationClient.listen(tools/fxconfig/internal/client/notifications.go:223–228) used a non-blocking send with a bare// message droppedcomment in thedefaultbranch. Any status event that could not be delivered to a subscriber's channel was silently discarded - no log line, no counter, no error returned to the caller.This is particularly harmful because the txID is removed from the subscribers map before the send is attempted, making every drop unrecoverable. From the caller's perspective the drop is indistinguishable from a genuine committer timeout, corrupting any retry or SLA decision built on top.
Changes:
droppedNotifications atomic.Uint64field toNotificationClientand an exportedDroppedNotifications() uint64accessor for diagnostics and tests.WARN-level log via the existingflogging"client"logger, including thetxIDandstatus.listeninto two focused methodscollectNotificationsanddispatchNotificationsand promotednotificationCallto package scope with atxIDfield, making the drop path directly unit-testable.fmt.Printflistener-error line withlogger.Errorf(the commented-outlogger.Errorfdirectly beneath it already showed the intent).