fix: use API reader for namespace label checks in event filters by RobuRishabh · Pull Request #3000 · kubeflow/spark-operator

RobuRishabh · 2026-06-22T19:41:01Z

Purpose of this PR

Fixes a race condition in namespace selector filtering that causes flaky e2e failures (Namespace Filtering > With namespace selector > should process SparkApp in labeled namespace).

When jobNamespaceSelector is configured, event filters evaluate namespace labels via the controller's cached client. If a SparkApplication create event arrives before the namespace informer has indexed the new namespace, the lookup fails, the predicate returns false, and the event is silently dropped with no retry. The SparkApplication stays in New state until the test times out.

Observed CI failures

This race has been seen across multiple PRs and Kubernetes versions:

CI run	K8s version	Symptom
job/79602618303	v1.32.11	Namespace filtering flake — SparkApp stuck in `New`, 5min timeout
job/79746346579	v1.33.7	BeforeSuite timeout (likely separate Helm readiness issue)
job/79606205486	v1.34.3	BeforeSuite timeout (likely separate Helm readiness issue)

Also reproduced on PR #2990 and other unrelated PRs (e.g. REST submitter).

Proposed changes:

Use mgr.GetAPIReader() for live namespace label lookups in SparkApplication, ScheduledSparkApplication, and SparkConnect event filters/predicates
Update MatchesWithClient to accept client.Reader instead of client.Client
Keep the cached client.Client on SparkApplication event filters only where writes are needed (e.g. Status().Update() in the Update predicate)

Change Category

Bugfix (non-breaking change which fixes an issue)
Feature (non-breaking change which adds functionality)
Breaking change (fix or feature that could affect existing functionality)
Documentation update

Rationale

The operator maintains two independent informer caches (namespaces and SparkApplications). A test (or user) can create a labeled namespace and immediately submit a SparkApplication — the test client sees the namespace, but the controller cache may not yet.

Using the uncached API reader for label checks eliminates this lag while preserving correct fail-closed behavior: unlabeled namespaces still return false, and non-existent namespaces still return a real NotFound.

This is preferred over returning true on lookup error, which would fix the flake but incorrectly allow events through for unlabeled namespaces during the same race window.

Checklist

I have conducted a self-review of my own code.
I have updated documentation accordingly.
I have added tests that prove my changes are effective or that my feature works.
Existing unit tests pass locally with my changes.

Signed-off-by: roburishabh <roburishabh@outlook.com>

google-oss-prow · 2026-06-22T19:41:07Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chenyi015 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

This PR fixes a race condition in namespace-selector event filtering by switching namespace label lookups in predicates from the controller’s cached client to the manager’s uncached API reader, preventing events from being dropped when the namespace informer cache is temporarily stale.

Changes:

Updated NamespaceMatcher.MatchesWithClient (and related predicate wiring) to use a client.Reader for namespace lookups.
Switched SparkApplication (pods + apps), ScheduledSparkApplication, and SparkConnect controllers to pass mgr.GetAPIReader() into namespace-based predicates/filters.
Adjusted SparkApplication event-filter unit tests to match the updated constructor signature.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
pkg/util/predicates.go	`NewNamespacePredicate` now uses a `client.Reader` for namespace matching in predicate callbacks.
pkg/util/namespace.go	`MatchesWithClient` now uses a `client.Reader` for namespace retrieval to avoid cache lag.
internal/controller/sparkconnect/reconciler.go	SparkConnect controller now builds namespace predicate with `mgr.GetAPIReader()`.
internal/controller/sparkapplication/event_filter.go	SparkApplication filters now use an API reader for namespace label checks while retaining cached client for writes.
internal/controller/sparkapplication/event_filter_test.go	Updated tests to provide the new `namespaceReader` parameter.
internal/controller/sparkapplication/controller.go	Wiring updated to pass `mgr.GetAPIReader()` into pod/app event filters.
internal/controller/scheduledsparkapplication/event_filter.go	ScheduledSparkApplication filter now uses an API reader for namespace label checks.
internal/controller/scheduledsparkapplication/controller.go	Wiring updated to pass `mgr.GetAPIReader()` into the scheduled app event filter.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+	if err := r.Get(ctx, client.ObjectKey{Name: namespaceName}, ns); err != nil {
 		return false, fmt.Errorf("failed to get namespace %s: %v", namespaceName, err)
 	}


nabuskey

whenever GetAPIReader is used, I think of API costs.

In this case with namespace selector used, no caching is used? So this mean we will hit the API for pod updates and app updates?

also combining ctx TODO with the changes are problematic. It can block indefinitely.

RobuRishabh · 2026-06-26T01:13:31Z

whenever GetAPIReader is used, I think of API costs.

In this case with namespace selector used, no caching is used? So this mean we will hit the API for pod updates and app updates?

@nabuskey I traced the code paths to clarify scope.

Caching is not removed. This PR only changes namespace label lookups in predicates from cached client → GetAPIReader(). Pod/SparkApp reconciles, event handlers, and informer caches are unchanged (newCacheOptions untouched).

live API is wired (4 sites only):

sparkapplication/controller.go — pod + app event filters
scheduledsparkapplication/controller.go — event filter
sparkconnect/reconciler.go — namespace predicate

the namespace GET actually happens: only pkg/util/namespace.go → MatchesWithClient(), and only when jobNamespaceSelector is set and the namespace is not in the explicit --namespaces list. Three fast-paths return without any GET.

Not every pod/app event hits the API:

Reconcile: r.client.Get() (cache) — see getSparkApplication()
Pod updates: filtered out if phase unchanged or pod is not Spark-operator-labeled before namespace check runs

Before this PR we already called Get(namespace) on these paths, the difference is cache vs live API. The live lookup fixes the informer cache race that silently dropped create events.

nabuskey · 2026-06-26T18:15:49Z

When namespaceSelector is used, caching is effectively removed, correct? It doesn't use cache for anything and it fires every single event. Namespaces do not change often in most clusters. Far fewer than pods. I don't think we should introduce this just to fix a test.
The use of todo context with direct GET is also a big concern. We now hit the api directly without bounded by the cache. Previously, the use of todo context was not an issue because it hits the cache, but now it hits the API directly with no constraints as to when it exists. This can be quite bad when API is under high load which is often the case for production spark clusters due to the nature of how Spark works in k8s.

RobuRishabh · 2026-06-26T18:36:43Z

When namespaceSelector is used, caching is effectively removed, correct? It doesn't use cache for anything and it fires every single event. Namespaces do not change often in most clusters. Far fewer than pods. I don't think we should introduce this just to fix a test. The use of todo context with direct GET is also a big concern. We now hit the api directly without bounded by the cache. Previously, the use of todo context was not an issue because it hits the cache, but now it hits the API directly with no constraints as to when it exists. This can be quite bad when API is under high load which is often the case for production spark clusters due to the nature of how Spark works in k8s.

@nabuskey Your both concerns make sense to me.

On API cost: You're right that with the current PR, namespace label lookups in predicates always use the API reader when jobNamespaceSelector is set. I should have been clearer: workload caching (pods, SparkApps, reconciler reads) is unchanged, but namespace label checks in predicates no longer use cache, so yes, those paths can hit the apiserver on every filtered event. That's heavier than I'd like for production Spark clusters.

On context.TODO(): Agree this is problematic once we call the live API. Predicate callbacks don't receive a context today, but we can use a bounded timeout (e.g. context.WithTimeout) for the API fallback path rather than unbounded context.TODO().

On "just to fix a test": The e2e test exposes a real production race, create labeled namespace + SparkApp before the namespace informer indexes it → predicate returns false → create event is silently dropped (no retry) → app stuck in New. Same pattern can happen with CI applying both resources in one pass. So I'd like to fix the bug, but I agree pure GetAPIReader() everywhere isn't the right tradeoff.

What I can do, I can

Try cached client first for namespace label lookup
On NotFound only, fall back to API reader with a short timeout
Use %w for error wrapping

if you prefer a different approach (e.g. namespace watch + requeue). or just close this PR. I am open to anything.

fix: use API reader for namespace label checks in event filters

c57b36b

Signed-off-by: roburishabh <roburishabh@outlook.com>

Copilot AI review requested due to automatic review settings June 22, 2026 19:41

google-oss-prow Bot requested review from ImpSy, andreyvelich and mwielgus June 22, 2026 19:41

google-oss-prow Bot added the size/M label Jun 22, 2026

Copilot started reviewing on behalf of RobuRishabh June 22, 2026 19:41 View session

Copilot AI reviewed Jun 22, 2026

View reviewed changes

Comment thread pkg/util/namespace.go

Comment on lines +119 to 121

if err := r.Get(ctx, client.ObjectKey{Name: namespaceName}, ns); err != nil {

return false, fmt.Errorf("failed to get namespace %s: %v", namespaceName, err)

}

RobuRishabh closed this Jun 23, 2026

RobuRishabh deleted the fix/namespace-predicate-api-reader branch June 23, 2026 16:19

RobuRishabh restored the fix/namespace-predicate-api-reader branch June 25, 2026 14:55

RobuRishabh reopened this Jun 25, 2026

nabuskey reviewed Jun 25, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: use API reader for namespace label checks in event filters#3000

fix: use API reader for namespace label checks in event filters#3000
RobuRishabh wants to merge 1 commit into
kubeflow:masterfrom
RobuRishabh:fix/namespace-predicate-api-reader

RobuRishabh commented Jun 22, 2026

Uh oh!

google-oss-prow Bot commented Jun 22, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

nabuskey left a comment •

edited

Loading

Uh oh!

RobuRishabh commented Jun 26, 2026

Uh oh!

nabuskey commented Jun 26, 2026

Uh oh!

RobuRishabh commented Jun 26, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

RobuRishabh commented Jun 22, 2026

Purpose of this PR

Observed CI failures

Change Category

Rationale

Checklist

Uh oh!

google-oss-prow Bot commented Jun 22, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

nabuskey left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

RobuRishabh commented Jun 26, 2026

Uh oh!

nabuskey commented Jun 26, 2026

Uh oh!

RobuRishabh commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nabuskey left a comment •

edited

Loading

RobuRishabh commented Jun 26, 2026 •

edited

Loading