add support for default nvidiadriver and migration from clusterpolicy to nvidiadriver by rahulait · Pull Request #2518 · NVIDIA/gpu-operator

rahulait · 2026-06-04T19:24:59Z

Fixes:

Summary

This PR adds support for a default NVIDIADriver CR and enables controlled migration from legacy ClusterPolicy driver management to NVIDIADriver-based driver management.

When driver.nvidiaDriverCRD.enabled=true and driver.nvidiaDriverCRD.deployDefaultCR=true, Helm renders a default NVIDIADriver CR. This CR is identified by spec.default=true:

spec:
  default: true

The CR name is not semantically important. Helm renders it as default, but users can create an arbitrarily named NVIDIADriver and mark it as the fallback driver by setting spec.default=true.

Behavior

Non-default NVIDIADriver CRs own nodes selected by their spec.nodeSelector.
The default NVIDIADriver acts as fallback for GPU nodes that do not match any non-default driver.
If no NVIDIADriver has spec.default=true, fallback ownership is skipped.
If more than one NVIDIADriver has spec.default=true, reconciliation fails closed.
A default NVIDIADriver cannot define spec.nodeSelector; the fallback always applies to remaining GPU nodes that do not match non-default drivers.
If spec.default is changed to false, that CR becomes a normal user-defined NVIDIADriver and participates in nodeSelector conflict detection.
User-defined driver conflicts do not relabel nodes to the default driver or clear existing owner labels.
NVIDIADriver node selectors cannot include the operator-managed owner label nvidia.com/gpu.driver.owner.
In NVIDIADriver mode, NVIDIADriver ownership conflicts mark ClusterPolicy as notReady and surface the conflict in its status conditions.

Migration Support

This PR supports migration from ClusterPolicy driver management to NVIDIADriver driver management by:

rendering a default NVIDIADriver from Helm values when CRD mode is enabled
preserving legacy ClusterPolicy driver fields when driver.nvidiaDriverCRD.enabled=false
routing GPU nodes to NVIDIADriver ownership through the operator-managed node label nvidia.com/gpu.driver.owner
cleaning up ClusterPolicy-owned driver daemonsets after NVIDIADriver ownership takes over
preserving running driver pods during migration by orphaning old ClusterPolicy driver daemonsets before NVIDIADriver rolls replacement pods

Migration requires the target GPU nodes to be matched by at least one NVIDIADriver CR. If driver.nvidiaDriverCRD.enabled=true but no matching NVIDIADriver exists, the controller has no driver owner to assign to those nodes, so fallback ownership is skipped until a matching CR is created.

Upgrade Example

When upgrading from a release that does not contain this migration support, migration from ClusterPolicy driver management to NVIDIADriver driver management must be performed in two upgrades. This applies only to clusters that are currently using ClusterPolicy for driver management and are switching to NVIDIADriver. It applies to releases older than v26.7.0.

Clusters that are already using NVIDIADriver in an older release do not need this two-phase migration flow.

Starting with v26.7.0, the migration support is already present in the running controller. Clusters already running v26.7.0 or newer can switch to NVIDIADriver mode directly by upgrading or updating configuration with driver.nvidiaDriverCRD.enabled=true, as long as the target GPU nodes are matched by NVIDIADriver CRs. The recommended path is to also set driver.nvidiaDriverCRD.deployDefaultCR=true so Helm renders the default fallback CR. Alternatively, users can create their own NVIDIADriver CRs and mark one as the default fallback with spec.default=true. See the Behavior section for details on default and non-default ownership.

Do not jump directly from an old release to the new release with driver.nvidiaDriverCRD.enabled=true. The old controller does not know about the controlled migration flow. If the old controller exits while the new release is already configured for NVIDIADriver mode, it can tear down all existing driver pods immediately before the new controller has a chance to take ownership.

Use this sequence instead:

Upgrade to the new release with ClusterPolicy driver management still enabled.

Example Helm settings:
```
driver:
  enabled: true
  nvidiaDriverCRD:
    enabled: false
    deployDefaultCR: false
```
This gets the new controller logic running while the existing ClusterPolicy-owned driver pods remain in place.
Upgrade again to the same new release, this time enabling NVIDIADriver mode.

Example Helm settings:
```
driver:
  enabled: true
  nvidiaDriverCRD:
    enabled: true
    deployDefaultCR: true
```
This renders the default NVIDIADriver and performs the controlled migration from ClusterPolicy ownership to NVIDIADriver ownership.

Issues Found And Fixed

During testing, we found several edge cases:

Treating the literal name default as special was too restrictive and could conflict with user-created CRs. The default driver is now identified by spec.default=true instead.
If multiple default drivers existed, reconciliation could behave ambiguously. The controller now fails closed when multiple defaults are found.
Allowing the default driver to carry a node selector made fallback behavior ambiguous. Default drivers now reject non-empty spec.nodeSelector.
On user-driver nodeSelector conflicts, nodes could be relabeled away from their existing owner, causing existing driver pods to disappear. Owner assignment now preflights conflicts and preserves existing owner labels on conflict.
User-supplied spec.nodeSelector could include nvidia.com/gpu.driver.owner and override the operator-managed owner selector used by driver daemonsets. The controller, admission validation, and daemonset nodeSelector construction now reject or defensively preserve that invariant.
ClusterPolicy reconciliation previously attempted NVIDIADriver owner assignment even when NVIDIADriver mode was disabled. Owner assignment and orphaned-pod migration are now gated on driver.enabled=true and driver.nvidiaDriverCRD.enabled=true.
ClusterPolicy status did not consistently move to notReady for early NVIDIADriver ownership failures. It now sets status.state to notReady and records the concrete conflict message in status conditions.
Moving default ownership is now a spec change through spec.default, so it bumps .metadata.generation and appears in the kubectl get nvidiadriver Default print column.
Conflict messages did not identify the other conflicting drivers. They now include the first conflicting node and the conflicting NVIDIADriver names.

E2E Coverage

The NVIDIADriver e2e flow now covers:

Helm renders a default NVIDIADriver only when driver.nvidiaDriverCRD.enabled=true and driver.nvidiaDriverCRD.deployDefaultCR=true.
Helm does not render a default CR when CRD mode is disabled.
Helm does not render a default CR when deployDefaultCR=false.
Operator installs in NVIDIADriver CRD mode.
Default driver can use an arbitrary name as long as it has spec.default=true.
User-defined NVIDIADriver can take ownership of GPU nodes.
Driver image/version update works through a user-defined NVIDIADriver.
Custom driver pod labels update through a user-defined NVIDIADriver.
Setting spec.default=false makes that CR a normal driver and conflict detection applies.
Conflicts preserve existing node owner labels.
ClusterPolicy reports notReady with the NVIDIADriver conflict message when ownership assignment fails in NVIDIADriver mode.
Multiple default drivers are rejected or fail closed without disrupting existing ownership.
The ClusterPolicy e2e workflow now performs an in-place migration to NVIDIADriver mode before teardown. It verifies the legacy driver daemonset is removed, the existing legacy pod is orphaned, the default NVIDIADriver takes ownership, the orphaned pod is deleted by the upgrade flow within 5 minutes, and the replacement driver pod becomes ready.
Helm uninstall remains covered by the existing e2e teardown.

Checklist

No secrets, sensitive information, or unrelated changes
Lint checks passing (make lint)
Generated assets in-sync (make validate-generated-assets)
Go mod artifacts in-sync (make validate-modules)
Test cases are added for new code paths

Testing

Added unit-tests, e2e tests and also did manual testing on a 3 node cluster with multiple nvidiadriver CRs. Everything worked as expected even with migration.

copy-pr-bot · 2026-06-10T13:20:16Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

rahulait · 2026-06-10T13:21:05Z

/ok to test a1dea10

rajathagasthya

Overall looks solid, just a few questions.

rajathagasthya · 2026-06-10T18:43:33Z

+
+	nvidiaDriverEnabled := nvidiaDriverCRDEnabled(instance)
+	if nvidiaDriverEnabled {
+		if err := assignNVIDIADriverOwners(ctx, r.Client); err != nil {


Any error here (such as conflicting NVIDIADriver nodeSelectors) will block the whole operation and no other operand gets processed until this is fixed. Are we fine with that? I feel like any error here should pause driver management only.

I think thats how it is even today. The driver pod also controls rollout of other operands when it comes up. So if driver is stuck, others get stuck too and this should be fine. Until the conflict is resolved, reconciles will not do anything.

rajathagasthya · 2026-06-10T18:49:29Z

+		if node.Labels == nil {
+			node.Labels = map[string]string{}
+		}
+		node.Labels[consts.NVIDIADriverOwnerLabel] = owner


Can these node labels go stale? For example: on a helm uninstall or if NFD reports gpu.present=false, are these labels cleaned up? Not sure if this is even possible, but just thinking about edge cases.

Helm uninstall won't remove the labels, so they remain on the node even after uninstall. helm pre-uninstall hook might not be useful as there are other workflows with which users can install gpu-operator, like openshift doesn't use helm. Even if the label remains, when gpu-operator is installed again, clusterpolicy will update the node label to right value. If its not supposed to have any label, it will remove it. So I think it should be fine if stale labels remain after uninstall. I'll check if we can remove them somehow.

… to nvidiadriver changes include: * added default nvidiadriver which doesn't conflict with other user-specified nvidiadrivers * added migration support from clusterpolicy to nvidiadriver * added/updated e2e tests for nvidiadriver workflow * use new field to mark a nvidiadriver as default instead of using a label * don't allow adding nodeselector to default nvidiadriver, it should cover entire cluster * move default NVIDIADriver ownership helpers into internal/nvidiadriver and re-use * requeue after node owner label changes so later reconcile steps read updated cache state * use patch instead of update for node label updates Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>

rahulait · 2026-06-10T22:32:04Z

/ok to test 452812c

cdesiniotis · 2026-06-10T22:27:23Z

+		if changed {
+			return ctrl.Result{RequeueAfter: time.Second}, nil
+		}
+	}


I am trying to reason about why we are doing this. Does the below statement (from the PR description) capture the reasoning?

ClusterPolicy status did not consistently move to notReady for early NVIDIADriver ownership failures. It now sets status.state to notReady and records the concrete conflict message in status conditions.

It feels wrong to call AssignOwners() here in the ClusterPolicy controller when it is already being called in the NVIDIADriver controller (and I would argue it should only be called there). If the goal is to mark ClusterPolicy as notReady when any NVIDIADriver CR is notReady could we not check the status of all the NVIDIADriver CRs in the cluster instead? And even then, I question whether we want the ClusterPolicy status to take the NVIDIADriver CR statuses into account.

@rahulait it looks like my comment only got added to L164. I meant to attach this to the entire code block.

These changes were for migration from clusterpolicy to nvidiadriver. I can see these are causing confusion, let me see if I can optimize them further and only let nvidiadriver add ownership labels.

Regarding status field of clusterpolicy: how should we model ClusterPolicy status when driver management is moved to NVIDIADriver?
With NVIDIADriver enabled, ownership of the driver operand moves out of ClusterPolicy, while the remaining operands are still managed by ClusterPolicy. During a node driver upgrade, the driver pod is replaced first, and then other operands may restart or go through init again.
Do we want ClusterPolicy status to reflect only the operands it still owns, which may cause it to become Ready during driver uninstall/replacement and then flip back to NotReady when dependent operands restart? Or should ClusterPolicy continue to represent the combined health of the overall GPU Operator stack, including NVIDIADriver-managed driver state?

Good question.

Do we want ClusterPolicy status to reflect only the operands it still owns, which may cause it to become Ready during driver uninstall/replacement and then flip back to NotReady when dependent operands restart?

I lean towards this.

cdesiniotis · 2026-06-10T22:29:15Z

 		return err
 	}

+	err = c.Watch(


Just an observation -- it feels wrong for the ClusterPolicy controller to have to watch for updates to NVIDIADriver CRs.

In current implementation, clusterpolicy controller is responsible for adding driver ownership labels to nodes during migration from clusterpolicy to nvidiadriver. In case there is no nvidiadriver, or there is a conflict between nvidiadrivers and it is resolved, clusterpolicy reconciler should wait and then re-evaluate everything and add ownership labels. Hence, its keeping a watch on nvidiadriver CRs

I think we can get rid of this defensive approach if we think its cleaner. Right now, it does ownership transfer to nvidiadriver and then deletes daemonset with cascade delete orphan. If we think its better we delete with cascade orphan and nvidiadriver will set the ownership, thats also fine in that case.

Looked further, this is needed because if a NVIDIADriver is created/updated after clusterpolicy has already reconciled,
we want to requeue clusterpolicy so it can label nodes with orphaned legacy driver pods with "upgrade-required" to have upgrade-controller upgrade them in controlled manner.

In a post-ClusterPolicy world (DRA with #2513), who is responsible for the owner assignment? I wonder if there should be a dedicated controller to do this?

nvidiadriver would be responsible for ownership assignment. Right now, just to trigger migration when nvidiadriver is added and clusterpolicy has already reconciled, we need to retrigger its reconciliation so that it can add upgrade_required labels to facilitate migration. But once we are in DRA, we wouldn't need this. So it can be dropped later on. This is required just for clusterpolicy -> nvidiadriver migration and that too the edge case when user didn't added any nvidiadriver initially but later on added it. Basically any time when clusterpolicy ran and didn't find any nvidiadriver and we need it to reconcile when nvidiadriver shows up.

Looked further, this is needed because if a NVIDIADriver is created/updated after clusterpolicy has already reconciled,
we want to requeue clusterpolicy so it can label nodes with orphaned legacy driver pods with "upgrade-required" to have upgrade-controller upgrade them in controlled manner.

Got it.

Question -- couldn't the NVIDIADriver controller technically be able to do this? For example, a new NVIDIADriver CR is created and upon reconciling it, the NVIDIADriver controller iterates over all nodes that match its node selector and adds the upgrad-required label if an orphan driver pod is running there?

The NVIDIADriver controller already iterates over all nodes by calling AssignOwners, so I am wondering if this logic of setting the upgrade-required label can be taken care of there (and eliminating the need to duplicate the AssignOwners call in the ClusterPolicy controller).

Yes, it could. Let me think more about this.

Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>

rahulait · 2026-06-11T03:27:05Z

/ok to test c028c0c

tariq1890 · 2026-06-11T15:28:16Z

+		if changed {
+			return ctrl.Result{RequeueAfter: time.Second}, nil
+		}
+	}


Good question.

Do we want ClusterPolicy status to reflect only the operands it still owns, which may cause it to become Ready during driver uninstall/replacement and then flip back to NotReady when dependent operands restart?

I lean towards this.

tariq1890 · 2026-06-11T15:54:27Z

+		return false, fmt.Errorf("failed to list GPU nodes: %w", err)
+	}
+
+	for i := range nodes.Items {


The AssignOwners method will be called during every reconcile of a NVIDIADriver CR and each reconciler involves a loop through the entire list of gpu nodes. How will the performance be in large clusters ?

rahulait · 2026-06-11T19:24:33Z

/ok to test 93152ba

Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>

rahulait · 2026-06-11T19:56:46Z

/ok to test bfc12e5

rahulait force-pushed the default-nvidiadriver branch 7 times, most recently from cf23d79 to 4adfc71 Compare June 5, 2026 17:12

rahulait marked this pull request as ready for review June 5, 2026 17:14

rahulait requested review from cdesiniotis, karthikvetrivel, rajathagasthya, shivamerla and tariq1890 as code owners June 5, 2026 17:14

rahulait force-pushed the default-nvidiadriver branch 2 times, most recently from 8125f47 to b21a20b Compare June 5, 2026 20:12

rahulait mentioned this pull request Jun 5, 2026

[WIP] : Adds support for migrating from clusterpolicy to nvidiadriver and vice-versa #2117

Closed

5 tasks

rahulait force-pushed the default-nvidiadriver branch from b21a20b to a1dea10 Compare June 10, 2026 13:20

rajathagasthya reviewed Jun 10, 2026

View reviewed changes

rahulait force-pushed the default-nvidiadriver branch from a1dea10 to 2e803be Compare June 10, 2026 21:04

rahulait force-pushed the default-nvidiadriver branch from 2e803be to 452812c Compare June 10, 2026 22:02

cdesiniotis reviewed Jun 10, 2026

View reviewed changes

Comment thread internal/nvidiadriver/nvidiadriver.go Outdated

tariq1890 reviewed Jun 10, 2026

View reviewed changes

Comment thread controllers/clusterpolicy_controller.go Outdated

Comment thread controllers/object_controls.go Outdated

Comment thread internal/consts/consts.go Outdated

Comment thread internal/state/driver.go Outdated

Comment thread controllers/clusterpolicy_controller.go Outdated

rajathagasthya reviewed Jun 11, 2026

View reviewed changes

Comment thread controllers/nvidiadriver_controller.go Outdated

address review comments

c028c0c

Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>

tariq1890 reviewed Jun 11, 2026

View reviewed changes

Comment thread tests/scripts/update-nvidiadriver.sh Outdated

address review comments

bfc12e5

Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>

rahulait force-pushed the default-nvidiadriver branch from 93152ba to bfc12e5 Compare June 11, 2026 19:53

Conversation

rahulait commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fixes:

Summary

Behavior

Migration Support

Upgrade Example

Issues Found And Fixed

E2E Coverage

Checklist

Testing

Uh oh!

copy-pr-bot Bot commented Jun 10, 2026

Uh oh!

rahulait commented Jun 10, 2026

Uh oh!

rajathagasthya left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rahulait commented Jun 10, 2026

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rahulait Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rahulait Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rahulait commented Jun 4, 2026 •

edited

Loading

rahulait Jun 10, 2026 •

edited

Loading

rahulait Jun 11, 2026 •

edited

Loading