Skip to content

fix: use API reader for namespace label checks in event filters#3000

Open
RobuRishabh wants to merge 1 commit into
kubeflow:masterfrom
RobuRishabh:fix/namespace-predicate-api-reader
Open

fix: use API reader for namespace label checks in event filters#3000
RobuRishabh wants to merge 1 commit into
kubeflow:masterfrom
RobuRishabh:fix/namespace-predicate-api-reader

Conversation

@RobuRishabh

Copy link
Copy Markdown
Contributor

Purpose of this PR

Fixes a race condition in namespace selector filtering that causes flaky e2e failures (Namespace Filtering > With namespace selector > should process SparkApp in labeled namespace).

When jobNamespaceSelector is configured, event filters evaluate namespace labels via the controller's cached client. If a SparkApplication create event arrives before the namespace informer has indexed the new namespace, the lookup fails, the predicate returns false, and the event is silently dropped with no retry. The SparkApplication stays in New state until the test times out.

Observed CI failures

This race has been seen across multiple PRs and Kubernetes versions:

CI run K8s version Symptom
job/79602618303 v1.32.11 Namespace filtering flake — SparkApp stuck in New, 5min timeout
job/79746346579 v1.33.7 BeforeSuite timeout (likely separate Helm readiness issue)
job/79606205486 v1.34.3 BeforeSuite timeout (likely separate Helm readiness issue)

Also reproduced on PR #2990 and other unrelated PRs (e.g. REST submitter).

Proposed changes:

  • Use mgr.GetAPIReader() for live namespace label lookups in SparkApplication, ScheduledSparkApplication, and SparkConnect event filters/predicates
  • Update MatchesWithClient to accept client.Reader instead of client.Client
  • Keep the cached client.Client on SparkApplication event filters only where writes are needed (e.g. Status().Update() in the Update predicate)

Change Category

  • Bugfix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that could affect existing functionality)
  • Documentation update

Rationale

The operator maintains two independent informer caches (namespaces and SparkApplications). A test (or user) can create a labeled namespace and immediately submit a SparkApplication — the test client sees the namespace, but the controller cache may not yet.

Using the uncached API reader for label checks eliminates this lag while preserving correct fail-closed behavior: unlabeled namespaces still return false, and non-existent namespaces still return a real NotFound.

This is preferred over returning true on lookup error, which would fix the flake but incorrectly allow events through for unlabeled namespaces during the same race window.

Checklist

  • I have conducted a self-review of my own code.
  • I have updated documentation accordingly.
  • I have added tests that prove my changes are effective or that my feature works.
  • Existing unit tests pass locally with my changes.

Signed-off-by: roburishabh <roburishabh@outlook.com>
Copilot AI review requested due to automatic review settings June 22, 2026 19:41
@google-oss-prow

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chenyi015 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a race condition in namespace-selector event filtering by switching namespace label lookups in predicates from the controller’s cached client to the manager’s uncached API reader, preventing events from being dropped when the namespace informer cache is temporarily stale.

Changes:

  • Updated NamespaceMatcher.MatchesWithClient (and related predicate wiring) to use a client.Reader for namespace lookups.
  • Switched SparkApplication (pods + apps), ScheduledSparkApplication, and SparkConnect controllers to pass mgr.GetAPIReader() into namespace-based predicates/filters.
  • Adjusted SparkApplication event-filter unit tests to match the updated constructor signature.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
pkg/util/predicates.go NewNamespacePredicate now uses a client.Reader for namespace matching in predicate callbacks.
pkg/util/namespace.go MatchesWithClient now uses a client.Reader for namespace retrieval to avoid cache lag.
internal/controller/sparkconnect/reconciler.go SparkConnect controller now builds namespace predicate with mgr.GetAPIReader().
internal/controller/sparkapplication/event_filter.go SparkApplication filters now use an API reader for namespace label checks while retaining cached client for writes.
internal/controller/sparkapplication/event_filter_test.go Updated tests to provide the new namespaceReader parameter.
internal/controller/sparkapplication/controller.go Wiring updated to pass mgr.GetAPIReader() into pod/app event filters.
internal/controller/scheduledsparkapplication/event_filter.go ScheduledSparkApplication filter now uses an API reader for namespace label checks.
internal/controller/scheduledsparkapplication/controller.go Wiring updated to pass mgr.GetAPIReader() into the scheduled app event filter.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/util/namespace.go
Comment on lines +119 to 121
if err := r.Get(ctx, client.ObjectKey{Name: namespaceName}, ns); err != nil {
return false, fmt.Errorf("failed to get namespace %s: %v", namespaceName, err)
}
@RobuRishabh RobuRishabh deleted the fix/namespace-predicate-api-reader branch June 23, 2026 16:19
@RobuRishabh RobuRishabh restored the fix/namespace-predicate-api-reader branch June 25, 2026 14:55
@RobuRishabh RobuRishabh reopened this Jun 25, 2026

@nabuskey nabuskey left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whenever GetAPIReader is used, I think of API costs.

In this case with namespace selector used, no caching is used? So this mean we will hit the API for pod updates and app updates?

also combining ctx TODO with the changes are problematic. It can block indefinitely.

@RobuRishabh

Copy link
Copy Markdown
Contributor Author

whenever GetAPIReader is used, I think of API costs.

In this case with namespace selector used, no caching is used? So this mean we will hit the API for pod updates and app updates?

@nabuskey I traced the code paths to clarify scope.

Caching is not removed. This PR only changes namespace label lookups in predicates from cached client → GetAPIReader(). Pod/SparkApp reconciles, event handlers, and informer caches are unchanged (newCacheOptions untouched).

live API is wired (4 sites only):

  • sparkapplication/controller.go — pod + app event filters
  • scheduledsparkapplication/controller.go — event filter
  • sparkconnect/reconciler.go — namespace predicate

the namespace GET actually happens: only pkg/util/namespace.goMatchesWithClient(), and only when jobNamespaceSelector is set and the namespace is not in the explicit --namespaces list. Three fast-paths return without any GET.

Not every pod/app event hits the API:

  • Reconcile: r.client.Get() (cache) — see getSparkApplication()
  • Pod updates: filtered out if phase unchanged or pod is not Spark-operator-labeled before namespace check runs

Before this PR we already called Get(namespace) on these paths, the difference is cache vs live API. The live lookup fixes the informer cache race that silently dropped create events.

@nabuskey

Copy link
Copy Markdown
Contributor

When namespaceSelector is used, caching is effectively removed, correct? It doesn't use cache for anything and it fires every single event. Namespaces do not change often in most clusters. Far fewer than pods. I don't think we should introduce this just to fix a test.
The use of todo context with direct GET is also a big concern. We now hit the api directly without bounded by the cache. Previously, the use of todo context was not an issue because it hits the cache, but now it hits the API directly with no constraints as to when it exists. This can be quite bad when API is under high load which is often the case for production spark clusters due to the nature of how Spark works in k8s.

@RobuRishabh

RobuRishabh commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

When namespaceSelector is used, caching is effectively removed, correct? It doesn't use cache for anything and it fires every single event. Namespaces do not change often in most clusters. Far fewer than pods. I don't think we should introduce this just to fix a test. The use of todo context with direct GET is also a big concern. We now hit the api directly without bounded by the cache. Previously, the use of todo context was not an issue because it hits the cache, but now it hits the API directly with no constraints as to when it exists. This can be quite bad when API is under high load which is often the case for production spark clusters due to the nature of how Spark works in k8s.

@nabuskey Your both concerns make sense to me.

On API cost: You're right that with the current PR, namespace label lookups in predicates always use the API reader when jobNamespaceSelector is set. I should have been clearer: workload caching (pods, SparkApps, reconciler reads) is unchanged, but namespace label checks in predicates no longer use cache, so yes, those paths can hit the apiserver on every filtered event. That's heavier than I'd like for production Spark clusters.

On context.TODO(): Agree this is problematic once we call the live API. Predicate callbacks don't receive a context today, but we can use a bounded timeout (e.g. context.WithTimeout) for the API fallback path rather than unbounded context.TODO().

On "just to fix a test": The e2e test exposes a real production race, create labeled namespace + SparkApp before the namespace informer indexes it → predicate returns false → create event is silently dropped (no retry) → app stuck in New. Same pattern can happen with CI applying both resources in one pass. So I'd like to fix the bug, but I agree pure GetAPIReader() everywhere isn't the right tradeoff.

What I can do, I can

  1. Try cached client first for namespace label lookup
  2. On NotFound only, fall back to API reader with a short timeout
  3. Use %w for error wrapping

if you prefer a different approach (e.g. namespace watch + requeue). or just close this PR. I am open to anything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants