Harden pipeline interpreter state reload against transient MongoDB errors (7.1)#26506

Merged
patrickmann merged 2 commits into 7.1 from backport-7.1/fix/harden-pipeline-state-reload on Jun 30, 2026
@patrickmann patrickmann commented Jun 29, 2026

Note: This is a backport of #25894 to 7.1.

Closes #25750

Description

PipelineInterpreterStateUpdater.reloadAndSave() could silently replace a valid pipeline state with an empty one when MongoDB hit a transient error during an event-triggered reload. Messages processed in that window bypassed all pipeline rules and landed in the default stream.
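The failure mode described above can be illustrated with a minimal sketch. It is not Graylog's code: `SwallowedErrorSketch`, `loadAllSwallowing`, and `loadAllPropagating` are hypothetical names, and an `IllegalStateException` stands in for the MongoDB driver's exception so the example runs without any dependencies.

```java
import java.util.Collections;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical illustration; none of these names exist in Graylog. */
final class SwallowedErrorSketch {

    /** Swallowing pattern: a failed query looks the same as "nothing configured". */
    static Set<String> loadAllSwallowing(Supplier<Set<String>> query) {
        try {
            return query.get();
        } catch (RuntimeException e) {
            return Collections.emptySet(); // the error is lost here
        }
    }

    /** Propagating pattern: the caller sees the failure and can keep its old state. */
    static Set<String> loadAllPropagating(Supplier<Set<String>> query) {
        return query.get();
    }

    public static void main(String[] args) {
        // Stand-in for a transient database failure (assumption: unchecked exception).
        Supplier<Set<String>> failing = () -> {
            throw new IllegalStateException("simulated transient error");
        };

        Set<String> state = Set.of("rule-1");
        state = loadAllSwallowing(failing); // valid state silently replaced
        System.out.println("rules after swallowed error: " + state.size());

        try {
            loadAllPropagating(failing);
        } catch (IllegalStateException e) {
            System.out.println("caller saw the failure: " + e.getMessage());
        }
    }
}
```

With the swallowing variant, a reload that coincides with the failure ends up holding zero rules, which matches the symptom of messages bypassing all pipeline rules.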

This PR makes the following changes, covering fixes proposed in #25750:

  1. Let MongoException propagate from MongoDbRuleService and MongoDbPipelineService loadAll() and friends, instead of swallowing it and returning an empty set. Transient MongoDB failures now fail loudly, so callers can react. Other callers (REST resources, content packs, migrations) get a 500 on transient failure, which is correct.

  2. Migrate state reload to a new PipelineInterpreterStateReloadJob (SystemJob) submitted via SystemJobManager. On failure the job retries with a 1-second delay. The constructor of PipelineInterpreterStateUpdater now performs the initial state load synchronously before registering on the event bus, closing the startup race window described in #25745 ("Pipeline rules not applied during multi-node restart due to async state reload race"). The pattern follows the existing PipelineMetadataUpdateJob.

  3. PipelineInterpreterStateUpdater.updateState() refuses to replace a non-empty state with an empty one and logs at WARN. Defense in depth.

  4. Null safety in PipelineInterpreter.process(). If getLatestState() returns null, messages pass through unchanged with a warning log instead of NPE. The companion change for IlluminateMessageProcessor.process() is in Graylog2/graylog-plugin-enterprise#14157.

  5. Synchronize metric updates on state reload to eliminate a race condition.
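Items 3 and 4 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `SketchState` and `GuardedStateSketch` are hypothetical classes, the state is reduced to a set of pipeline IDs, and logging is replaced by `System.err`.

```java
import java.util.Set;

/** Hypothetical stand-in for the interpreter state; not Graylog's actual class. */
final class SketchState {
    final Set<String> pipelineIds;

    SketchState(Set<String> pipelineIds) {
        this.pipelineIds = Set.copyOf(pipelineIds);
    }

    boolean isEmpty() {
        return pipelineIds.isEmpty();
    }
}

/** Hypothetical holder combining the empty-state guard and the null check. */
final class GuardedStateSketch {
    private volatile SketchState latest; // null until the first successful load

    /** Installs next unless that would replace a non-empty state with an empty one. */
    synchronized boolean updateState(SketchState next) {
        SketchState current = latest;
        if (current != null && !current.isEmpty() && next.isEmpty()) {
            System.err.println("WARN: refusing to replace non-empty pipeline state with an empty one");
            return false;
        }
        latest = next;
        return true;
    }

    /** Passes the message through unchanged when no state has been loaded yet. */
    String process(String message) {
        SketchState state = latest;
        if (state == null) {
            System.err.println("WARN: no pipeline state available, passing message through");
            return message;
        }
        // Placeholder for real rule evaluation: just tag the message.
        return message + " [pipelines=" + state.pipelineIds.size() + "]";
    }
}
```

One consequence of a guard written this way is that a legitimate transition to zero pipelines is refused as well. How the actual implementation treats that case is not stated in this description.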

Note on retry policy: SystemJobResult.withRetry requires maxRetries == Integer.MAX_VALUE until per-trigger retry tracking lands in the system scheduler.
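For readers unfamiliar with the retry behavior, here is a minimal sketch of retrying a reload with a fixed delay. It deliberately does not use Graylog's SystemJob or SystemJobResult API; `ReloadRetrySketch`, `runWithRetry`, and `Sleeper` are hypothetical names, and the retry runs as a plain loop instead of being rescheduled by a job scheduler. Passing `Integer.MAX_VALUE` as `maxRetries` mirrors the effectively unbounded policy in the note above.

```java
import java.time.Duration;
import java.util.function.Supplier;

/** Hypothetical retry helper; not part of Graylog's scheduler API. */
final class ReloadRetrySketch {

    /** Injectable wait so tests do not have to sleep for real. */
    interface Sleeper {
        void sleep(Duration delay);
    }

    /** Calls reload until it succeeds or more than maxRetries attempts have failed. */
    static <T> T runWithRetry(Supplier<T> reload, Duration delay, int maxRetries, Sleeper sleeper) {
        int failures = 0;
        while (true) {
            try {
                return reload.get();
            } catch (RuntimeException e) {
                if (failures >= maxRetries) {
                    throw e; // retry budget exhausted
                }
                failures++;
                System.err.println("Failed to reload pipeline interpreter state, retrying");
                sleeper.sleep(delay);
            }
        }
    }

    /** Real sleeper for production-like use; restores the interrupt flag if interrupted. */
    static void threadSleep(Duration delay) {
        try {
            Thread.sleep(delay.toMillis());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }
}
```

A call such as `ReloadRetrySketch.runWithRetry(loader, Duration.ofSeconds(1), Integer.MAX_VALUE, ReloadRetrySketch::threadSleep)` would keep retrying once per second until `loader` succeeds, where `loader` is whatever supplier performs the reload.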

How Tested

  • Manual:
    • Start a single-node Graylog with one pipeline attached to a stream and verify message processing.
    • Edit the pipeline rule via the UI, confirm the new rule takes effect within a few seconds.
    • Stop MongoDB briefly while editing another rule, then restart it. Verify that the system job retries (the server log shows Failed to reload pipeline interpreter state, retrying) and that pipeline state is eventually rebuilt, with no empty-state interval observed in message processing. Messages should never lose their pipeline rule effects (fields still set, routing still works), and the server log should show retry messages, not silent empty-state replacements.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactoring (non-breaking change)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have requested a documentation update.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.

/prd Graylog2/graylog-plugin-enterprise#14645

…rors (#25894)

* Harden pipeline interpreter state reload against transient MongoDB errors

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* sync metric reload

* revert to per-node scheduling

* coalescing and unit tests

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Ismail Belkacim <xd4rker@users.noreply.github.com>
(cherry picked from commit 20259ca)
@patrickmann patrickmann requested a review from xd4rker June 29, 2026 15:34
@patrickmann patrickmann marked this pull request as ready for review June 30, 2026 05:34

@xd4rker xd4rker left a comment

LGTM!

@patrickmann patrickmann merged commit cb23b5c into 7.1 Jun 30, 2026
24 checks passed
@patrickmann patrickmann deleted the backport-7.1/fix/harden-pipeline-state-reload branch June 30, 2026 09:41