Harden pipeline interpreter state reload against transient MongoDB errors (7.1)#26506
Merged
…rors (#25894)

* Harden pipeline interpreter state reload against transient MongoDB errors

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* sync metric reload
* revert to per-node scheduling
* coalescing and unit tests

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Ismail Belkacim <xd4rker@users.noreply.github.com>

(cherry picked from commit 20259ca)
Note: This is a backport of #25894 to 7.1.

Closes #25750
Description
`PipelineInterpreterStateUpdater.reloadAndSave()` could silently replace a valid pipeline state with an empty one when MongoDB hit a transient error during an event-triggered reload. Messages processed in that window bypassed all pipeline rules and landed in the default stream.

This PR implements three of the four fixes from #25750:

- Let `MongoException` propagate from `MongoDbRuleService` and `MongoDbPipelineService` `loadAll()` and friends, instead of swallowing it and returning an empty set. Transient MongoDB failures now fail loudly, so callers can react. Other callers (REST resources, content packs, migrations) get a 500 on transient failure, which is correct.
- Migrate state reload to a new `PipelineInterpreterStateReloadJob` (`SystemJob`) submitted via `SystemJobManager`. On failure the job retries with a 1 second delay. The constructor of `PipelineInterpreterStateUpdater` now performs the synchronous initial state load before registering on the event bus, closing the startup race window described in #25745 (Pipeline rules not applied during multi-node restart due to async state reload race). Pattern follows the existing `PipelineMetadataUpdateJob`.
- `PipelineInterpreterStateUpdater.updateState()` refuses to replace a non-empty state with an empty one and logs at WARN. Defense in depth.
- Null safety in `PipelineInterpreter.process()`. If `getLatestState()` returns null, messages pass through unchanged with a warning log instead of NPE. The companion change for `IlluminateMessageProcessor.process()` is in Graylog2/graylog-plugin-enterprise#14157.
- Synchronize metric updates on state reload to eliminate a race condition.
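For reviewers who want the intended behavior in one place, here is a minimal, self-contained sketch of the guards described above. It is not the code in this PR: `StateReloadSketch`, `TransientLoadException`, `reloadWithRetry` and the bounded `maxAttempts` parameter are hypothetical stand-ins, the state is modeled as a plain set of rule names, and the real change uses `SystemJobManager` for scheduling rather than an in-process loop.

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical, simplified model of the reload guards; not Graylog's actual classes. */
final class StateReloadSketch {

    /** Stand-in for a transient database failure such as MongoException. */
    static final class TransientLoadException extends RuntimeException {
        TransientLoadException(String message) { super(message); }
    }

    // The "state" here is just the set of loaded rule names (an assumption for illustration).
    private final AtomicReference<Set<String>> latest = new AtomicReference<>();

    Set<String> latestState() { return latest.get(); }

    /** Replaces the state unless that would swap a non-empty state for an empty one. */
    boolean updateState(Set<String> candidate) {
        Set<String> current = latest.get();
        if (current != null && !current.isEmpty() && candidate.isEmpty()) {
            System.err.println("WARN: refusing to replace non-empty state with an empty one");
            return false;
        }
        latest.set(candidate);
        return true;
    }

    /**
     * Calls the loader until it succeeds. A failed load surfaces as an exception and is
     * retried after a delay; it is never turned into an empty state. Returns the number
     * of attempts used. The attempt bound is an assumption of this sketch.
     */
    int reloadWithRetry(Supplier<Set<String>> loader, long delayMillis, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                updateState(loader.get());
                return attempt;
            } catch (TransientLoadException e) {
                if (attempt >= maxAttempts) {
                    throw e;
                }
                System.err.println("Failed to reload pipeline interpreter state, retrying");
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
            }
        }
    }

    /** Null-safe processing: with no state loaded, the message passes through unchanged. */
    String process(String message) {
        Set<String> state = latest.get();
        if (state == null) {
            System.err.println("WARN: no pipeline state available, passing message through");
            return message;
        }
        return message + " [rules applied: " + state.size() + "]";
    }

    public static void main(String[] args) {
        StateReloadSketch updater = new StateReloadSketch();
        int[] calls = {0};
        // The first two loads fail transiently, the third one succeeds.
        int attempts = updater.reloadWithRetry(() -> {
            if (calls[0]++ < 2) {
                throw new TransientLoadException("simulated transient failure");
            }
            return Set.of("rule-a", "rule-b");
        }, 10L, 5);
        System.out.println("attempts=" + attempts + ", state=" + updater.latestState());
        System.out.println("empty replacement accepted: " + updater.updateState(Set.of()));
        System.out.println(updater.process("message"));
    }
}
```

The point of the sketch is the ordering of the defenses: the loader signals failure by throwing, the retry loop keeps the previous state in place while it waits, and `updateState` acts as a last line of defense if an empty result slips through anyway.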
Note on retry policy: `SystemJobResult.withRetry` requires `maxRetries == Integer.MAX_VALUE` until per-trigger retry tracking lands in the system scheduler.

How Tested
Failed to reload pipeline interpreter state, retrying) and pipeline state is eventually rebuilt with no empty-state interval observed in message processing. Verify that messages never lose their pipeline rule effects (fields still set, routing still works). The server log should show retry messages, not silent empty-state replacements.

Types of changes
Checklist:
/prd Graylog2/graylog-plugin-enterprise#14645