fix: use API reader for namespace label checks in event filters#3000
fix: use API reader for namespace label checks in event filters#3000RobuRishabh wants to merge 1 commit into
Conversation
Signed-off-by: roburishabh <roburishabh@outlook.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
This PR fixes a race condition in namespace-selector event filtering by switching namespace label lookups in predicates from the controller’s cached client to the manager’s uncached API reader, preventing events from being dropped when the namespace informer cache is temporarily stale.
Changes:
- Updated
NamespaceMatcher.MatchesWithClient(and related predicate wiring) to use aclient.Readerfor namespace lookups. - Switched SparkApplication (pods + apps), ScheduledSparkApplication, and SparkConnect controllers to pass
mgr.GetAPIReader()into namespace-based predicates/filters. - Adjusted SparkApplication event-filter unit tests to match the updated constructor signature.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| pkg/util/predicates.go | NewNamespacePredicate now uses a client.Reader for namespace matching in predicate callbacks. |
| pkg/util/namespace.go | MatchesWithClient now uses a client.Reader for namespace retrieval to avoid cache lag. |
| internal/controller/sparkconnect/reconciler.go | SparkConnect controller now builds namespace predicate with mgr.GetAPIReader(). |
| internal/controller/sparkapplication/event_filter.go | SparkApplication filters now use an API reader for namespace label checks while retaining cached client for writes. |
| internal/controller/sparkapplication/event_filter_test.go | Updated tests to provide the new namespaceReader parameter. |
| internal/controller/sparkapplication/controller.go | Wiring updated to pass mgr.GetAPIReader() into pod/app event filters. |
| internal/controller/scheduledsparkapplication/event_filter.go | ScheduledSparkApplication filter now uses an API reader for namespace label checks. |
| internal/controller/scheduledsparkapplication/controller.go | Wiring updated to pass mgr.GetAPIReader() into the scheduled app event filter. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if err := r.Get(ctx, client.ObjectKey{Name: namespaceName}, ns); err != nil { | ||
| return false, fmt.Errorf("failed to get namespace %s: %v", namespaceName, err) | ||
| } |
There was a problem hiding this comment.
whenever GetAPIReader is used, I think of API costs.
In this case with namespace selector used, no caching is used? So this mean we will hit the API for pod updates and app updates?
also combining ctx TODO with the changes are problematic. It can block indefinitely.
@nabuskey I traced the code paths to clarify scope. Caching is not removed. This PR only changes namespace label lookups in predicates from cached client → live API is wired (4 sites only):
the namespace GET actually happens: only Not every pod/app event hits the API:
Before this PR we already called |
|
When |
@nabuskey Your both concerns make sense to me. On API cost: You're right that with the current PR, namespace label lookups in predicates always use the API reader when On On "just to fix a test": The e2e test exposes a real production race, create labeled namespace + SparkApp before the namespace informer indexes it → predicate returns false → create event is silently dropped (no retry) → app stuck in What I can do, I can
if you prefer a different approach (e.g. namespace watch + requeue). or just close this PR. I am open to anything. |
Purpose of this PR
Fixes a race condition in namespace selector filtering that causes flaky e2e failures (
Namespace Filtering > With namespace selector > should process SparkApp in labeled namespace).When
jobNamespaceSelectoris configured, event filters evaluate namespace labels via the controller's cached client. If a SparkApplication create event arrives before the namespace informer has indexed the new namespace, the lookup fails, the predicate returnsfalse, and the event is silently dropped with no retry. The SparkApplication stays inNewstate until the test times out.Observed CI failures
This race has been seen across multiple PRs and Kubernetes versions:
New, 5min timeoutAlso reproduced on PR #2990 and other unrelated PRs (e.g. REST submitter).
Proposed changes:
mgr.GetAPIReader()for live namespace label lookups in SparkApplication, ScheduledSparkApplication, and SparkConnect event filters/predicatesMatchesWithClientto acceptclient.Readerinstead ofclient.Clientclient.Clienton SparkApplication event filters only where writes are needed (e.g.Status().Update()in the Update predicate)Change Category
Rationale
The operator maintains two independent informer caches (namespaces and SparkApplications). A test (or user) can create a labeled namespace and immediately submit a SparkApplication — the test client sees the namespace, but the controller cache may not yet.
Using the uncached API reader for label checks eliminates this lag while preserving correct fail-closed behavior: unlabeled namespaces still return
false, and non-existent namespaces still return a realNotFound.This is preferred over returning
trueon lookup error, which would fix the flake but incorrectly allow events through for unlabeled namespaces during the same race window.Checklist