add support for default nvidiadriver and migration from clusterpolicy to nvidiadriver#2518
add support for default nvidiadriver and migration from clusterpolicy to nvidiadriver#2518rahulait wants to merge 3 commits into
Conversation
cf23d79 to
4adfc71
Compare
8125f47 to
b21a20b
Compare
b21a20b to
a1dea10
Compare
|
/ok to test a1dea10 |
rajathagasthya
left a comment
There was a problem hiding this comment.
Overall looks solid, just a few questions.
|
|
||
| nvidiaDriverEnabled := nvidiaDriverCRDEnabled(instance) | ||
| if nvidiaDriverEnabled { | ||
| if err := assignNVIDIADriverOwners(ctx, r.Client); err != nil { |
There was a problem hiding this comment.
Any error here (such as conflicting NVIDIADriver nodeSelectors) will block the whole operation and no other operand gets processed until this is fixed. Are we fine with that? I feel like any error here should pause driver management only.
There was a problem hiding this comment.
I think thats how it is even today. The driver pod also controls rollout of other operands when it comes up. So if driver is stuck, others get stuck too and this should be fine. Until the conflict is resolved, reconciles will not do anything.
| if node.Labels == nil { | ||
| node.Labels = map[string]string{} | ||
| } | ||
| node.Labels[consts.NVIDIADriverOwnerLabel] = owner |
There was a problem hiding this comment.
Can these node labels go stale? For example: on a helm uninstall or if NFD reports gpu.present=false, are these labels cleaned up? Not sure if this is even possible, but just thinking about edge cases.
There was a problem hiding this comment.
Helm uninstall won't remove the labels, so they remain on the node even after uninstall. helm pre-uninstall hook might not be useful as there are other workflows with which users can install gpu-operator, like openshift doesn't use helm. Even if the label remains, when gpu-operator is installed again, clusterpolicy will update the node label to right value. If its not supposed to have any label, it will remove it. So I think it should be fine if stale labels remain after uninstall. I'll check if we can remove them somehow.
a1dea10 to
2e803be
Compare
… to nvidiadriver changes include: * added default nvidiadriver which doesn't conflict with other user-specified nvidiadrivers * added migration support from clusterpolicy to nvidiadriver * added/updated e2e tests for nvidiadriver workflow * use new field to mark a nvidiadriver as default instead of using a label * don't allow adding nodeselector to default nvidiadriver, it should cover entire cluster * move default NVIDIADriver ownership helpers into internal/nvidiadriver and re-use * requeue after node owner label changes so later reconcile steps read updated cache state * use patch instead of update for node label updates Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>
2e803be to
452812c
Compare
|
/ok to test 452812c |
| if changed { | ||
| return ctrl.Result{RequeueAfter: time.Second}, nil | ||
| } | ||
| } |
There was a problem hiding this comment.
I am trying to reason about why we are doing this. Does the below statement (from the PR description) capture the reasoning?
ClusterPolicy status did not consistently move to notReady for early NVIDIADriver ownership failures. It now sets status.state to notReady and records the concrete conflict message in status conditions.
It feels wrong to call AssignOwners() here in the ClusterPolicy controller when it is already being called in the NVIDIADriver controller (and I would argue it should only be called there). If the goal is to mark ClusterPolicy as notReady when any NVIDIADriver CR is notReady could we not check the status of all the NVIDIADriver CRs in the cluster instead? And even then, I question whether we want the ClusterPolicy status to take the NVIDIADriver CR statuses into account.
There was a problem hiding this comment.
@rahulait it looks like my comment only got added to L164. I meant to attach this to the entire code block.
There was a problem hiding this comment.
These changes were for migration from clusterpolicy to nvidiadriver. I can see these are causing confusion, let me see if I can optimize them further and only let nvidiadriver add ownership labels.
There was a problem hiding this comment.
Regarding status field of clusterpolicy: how should we model ClusterPolicy status when driver management is moved to NVIDIADriver?
With NVIDIADriver enabled, ownership of the driver operand moves out of ClusterPolicy, while the remaining operands are still managed by ClusterPolicy. During a node driver upgrade, the driver pod is replaced first, and then other operands may restart or go through init again.
Do we want ClusterPolicy status to reflect only the operands it still owns, which may cause it to become Ready during driver uninstall/replacement and then flip back to NotReady when dependent operands restart? Or should ClusterPolicy continue to represent the combined health of the overall GPU Operator stack, including NVIDIADriver-managed driver state?
There was a problem hiding this comment.
Good question.
Do we want ClusterPolicy status to reflect only the operands it still owns, which may cause it to become Ready during driver uninstall/replacement and then flip back to NotReady when dependent operands restart?
I lean towards this.
| return err | ||
| } | ||
|
|
||
| err = c.Watch( |
There was a problem hiding this comment.
Just an observation -- it feels wrong for the ClusterPolicy controller to have to watch for updates to NVIDIADriver CRs.
There was a problem hiding this comment.
In current implementation, clusterpolicy controller is responsible for adding driver ownership labels to nodes during migration from clusterpolicy to nvidiadriver. In case there is no nvidiadriver, or there is a conflict between nvidiadrivers and it is resolved, clusterpolicy reconciler should wait and then re-evaluate everything and add ownership labels. Hence, its keeping a watch on nvidiadriver CRs
There was a problem hiding this comment.
I think we can get rid of this defensive approach if we think its cleaner. Right now, it does ownership transfer to nvidiadriver and then deletes daemonset with cascade delete orphan. If we think its better we delete with cascade orphan and nvidiadriver will set the ownership, thats also fine in that case.
There was a problem hiding this comment.
Looked further, this is needed because if a NVIDIADriver is created/updated after clusterpolicy has already reconciled,
we want to requeue clusterpolicy so it can label nodes with orphaned legacy driver pods with "upgrade-required" to have upgrade-controller upgrade them in controlled manner.
There was a problem hiding this comment.
In a post-ClusterPolicy world (DRA with #2513), who is responsible for the owner assignment? I wonder if there should be a dedicated controller to do this?
There was a problem hiding this comment.
nvidiadriver would be responsible for ownership assignment. Right now, just to trigger migration when nvidiadriver is added and clusterpolicy has already reconciled, we need to retrigger its reconciliation so that it can add upgrade_required labels to facilitate migration. But once we are in DRA, we wouldn't need this. So it can be dropped later on. This is required just for clusterpolicy -> nvidiadriver migration and that too the edge case when user didn't added any nvidiadriver initially but later on added it. Basically any time when clusterpolicy ran and didn't find any nvidiadriver and we need it to reconcile when nvidiadriver shows up.
There was a problem hiding this comment.
Looked further, this is needed because if a NVIDIADriver is created/updated after clusterpolicy has already reconciled,
we want to requeue clusterpolicy so it can label nodes with orphaned legacy driver pods with "upgrade-required" to have upgrade-controller upgrade them in controlled manner.
Got it.
Question -- couldn't the NVIDIADriver controller technically be able to do this? For example, a new NVIDIADriver CR is created and upon reconciling it, the NVIDIADriver controller iterates over all nodes that match its node selector and adds the upgrad-required label if an orphan driver pod is running there?
There was a problem hiding this comment.
The NVIDIADriver controller already iterates over all nodes by calling AssignOwners, so I am wondering if this logic of setting the upgrade-required label can be taken care of there (and eliminating the need to duplicate the AssignOwners call in the ClusterPolicy controller).
There was a problem hiding this comment.
Yes, it could. Let me think more about this.
Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>
|
/ok to test c028c0c |
| if changed { | ||
| return ctrl.Result{RequeueAfter: time.Second}, nil | ||
| } | ||
| } |
There was a problem hiding this comment.
Good question.
Do we want ClusterPolicy status to reflect only the operands it still owns, which may cause it to become Ready during driver uninstall/replacement and then flip back to NotReady when dependent operands restart?
I lean towards this.
| return false, fmt.Errorf("failed to list GPU nodes: %w", err) | ||
| } | ||
|
|
||
| for i := range nodes.Items { |
There was a problem hiding this comment.
The AssignOwners method will be called during every reconcile of a NVIDIADriver CR and each reconciler involves a loop through the entire list of gpu nodes. How will the performance be in large clusters ?
|
/ok to test 93152ba |
Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>
93152ba to
bfc12e5
Compare
|
/ok to test bfc12e5 |
Fixes:
#2353
#2355
Summary
This PR adds support for a default
NVIDIADriverCR and enables controlled migration from legacyClusterPolicydriver management toNVIDIADriver-based driver management.When
driver.nvidiaDriverCRD.enabled=trueanddriver.nvidiaDriverCRD.deployDefaultCR=true, Helm renders a defaultNVIDIADriverCR. This CR is identified byspec.default=true:The CR name is not semantically important. Helm renders it as
default, but users can create an arbitrarily namedNVIDIADriverand mark it as the fallback driver by settingspec.default=true.Behavior
NVIDIADriverCRs own nodes selected by theirspec.nodeSelector.NVIDIADriveracts as fallback for GPU nodes that do not match any non-default driver.NVIDIADriverhasspec.default=true, fallback ownership is skipped.NVIDIADriverhasspec.default=true, reconciliation fails closed.NVIDIADrivercannot definespec.nodeSelector; the fallback always applies to remaining GPU nodes that do not match non-default drivers.spec.defaultis changed tofalse, that CR becomes a normal user-definedNVIDIADriverand participates in nodeSelector conflict detection.NVIDIADrivernode selectors cannot include the operator-managed owner labelnvidia.com/gpu.driver.owner.NVIDIADrivermode,NVIDIADriverownership conflicts markClusterPolicyasnotReadyand surface the conflict in its status conditions.Migration Support
This PR supports migration from
ClusterPolicydriver management toNVIDIADriverdriver management by:NVIDIADriverfrom Helm values when CRD mode is enabledClusterPolicydriver fields whendriver.nvidiaDriverCRD.enabled=falseNVIDIADriverownership through the operator-managed node labelnvidia.com/gpu.driver.ownerClusterPolicy-owned driver daemonsets afterNVIDIADriverownership takes overClusterPolicydriver daemonsets beforeNVIDIADriverrolls replacement podsMigration requires the target GPU nodes to be matched by at least one
NVIDIADriverCR. Ifdriver.nvidiaDriverCRD.enabled=truebut no matchingNVIDIADriverexists, the controller has no driver owner to assign to those nodes, so fallback ownership is skipped until a matching CR is created.Upgrade Example
When upgrading from a release that does not contain this migration support, migration from
ClusterPolicydriver management toNVIDIADriverdriver management must be performed in two upgrades. This applies only to clusters that are currently usingClusterPolicyfor driver management and are switching toNVIDIADriver. It applies to releases older thanv26.7.0.Clusters that are already using
NVIDIADriverin an older release do not need this two-phase migration flow.Starting with
v26.7.0, the migration support is already present in the running controller. Clusters already runningv26.7.0or newer can switch toNVIDIADrivermode directly by upgrading or updating configuration withdriver.nvidiaDriverCRD.enabled=true, as long as the target GPU nodes are matched byNVIDIADriverCRs. The recommended path is to also setdriver.nvidiaDriverCRD.deployDefaultCR=trueso Helm renders the default fallback CR. Alternatively, users can create their ownNVIDIADriverCRs and mark one as the default fallback withspec.default=true. See theBehaviorsection for details on default and non-default ownership.Do not jump directly from an old release to the new release with
driver.nvidiaDriverCRD.enabled=true. The old controller does not know about the controlled migration flow. If the old controller exits while the new release is already configured forNVIDIADrivermode, it can tear down all existing driver pods immediately before the new controller has a chance to take ownership.Use this sequence instead:
Upgrade to the new release with
ClusterPolicydriver management still enabled.Example Helm settings:
This gets the new controller logic running while the existing
ClusterPolicy-owned driver pods remain in place.Upgrade again to the same new release, this time enabling
NVIDIADrivermode.Example Helm settings:
This renders the default
NVIDIADriverand performs the controlled migration fromClusterPolicyownership toNVIDIADriverownership.Issues Found And Fixed
During testing, we found several edge cases:
defaultas special was too restrictive and could conflict with user-created CRs. The default driver is now identified byspec.default=trueinstead.spec.nodeSelector.spec.nodeSelectorcould includenvidia.com/gpu.driver.ownerand override the operator-managed owner selector used by driver daemonsets. The controller, admission validation, and daemonset nodeSelector construction now reject or defensively preserve that invariant.ClusterPolicyreconciliation previously attemptedNVIDIADriverowner assignment even whenNVIDIADrivermode was disabled. Owner assignment and orphaned-pod migration are now gated ondriver.enabled=trueanddriver.nvidiaDriverCRD.enabled=true.ClusterPolicystatus did not consistently move tonotReadyfor earlyNVIDIADriverownership failures. It now setsstatus.statetonotReadyand records the concrete conflict message in status conditions.spec.default, so it bumps.metadata.generationand appears in thekubectl get nvidiadriverDefaultprint column.NVIDIADrivernames.E2E Coverage
The
NVIDIADrivere2e flow now covers:NVIDIADriveronly whendriver.nvidiaDriverCRD.enabled=trueanddriver.nvidiaDriverCRD.deployDefaultCR=true.deployDefaultCR=false.NVIDIADriverCRD mode.spec.default=true.NVIDIADrivercan take ownership of GPU nodes.NVIDIADriver.NVIDIADriver.spec.default=falsemakes that CR a normal driver and conflict detection applies.ClusterPolicyreportsnotReadywith theNVIDIADriverconflict message when ownership assignment fails inNVIDIADrivermode.NVIDIADrivermode before teardown. It verifies the legacy driver daemonset is removed, the existing legacy pod is orphaned, the defaultNVIDIADrivertakes ownership, the orphaned pod is deleted by the upgrade flow within 5 minutes, and the replacement driver pod becomes ready.Checklist
make lint)make validate-generated-assets)make validate-modules)Testing
Added unit-tests, e2e tests and also did manual testing on a 3 node cluster with multiple nvidiadriver CRs. Everything worked as expected even with migration.