[state-device-plugin] infer feature enablement envars from kernel modules #2525

Merged
tariq1890 merged 1 commit into main from device-plugin-feature-driver-ctr-ready on Jun 10, 2026
@tariq1890 tariq1890 commented Jun 8, 2026

Contributor

Problem

This PR is linked to NVIDIA/k8s-device-plugin#1837. After PR NVIDIA/k8s-device-plugin#1550 was merged to support NVIDIADriver use cases, we observed that enabling feature flags by default, especially MOFED_ENABLED, caused disruptive side effects. Users of GPU Operator v26.3.2 (which includes k8s-device-plugin v0.19.2) found that GPU workload containers had the entire list of ibverbs device nodes injected into them. This caused failures in RDMA/NCCL-based workloads, which referenced the wrong NICs instead of picking only the NICs suited to that workload. For an example, see NVIDIA/k8s-device-plugin#1692.

Solution

Instead of unconditionally enabling the feature flags, we enable them dynamically based on the kernel modules that are currently loaded on the node. This ensures that the ClusterPolicy, NVIDIADriver, and pre-installed driver use cases are all supported without depending on a cluster-wide setting in the daemonset spec.
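
To illustrate the idea, here is a minimal sketch of module-based flag inference. It is not the script from this PR: the helper names `module_loaded` and `write_feature_flags` are hypothetical, and the mapping of `nvidia_fs` to GDS and `mlx5_core` to MOFED is an assumption made for the example. It only reuses the `feature-flags.env` pattern and the "do not override an explicitly set variable" check visible in the review snippet.

```shell
#!/bin/sh
# Hypothetical sketch: derive feature flags from the loaded kernel modules.
# The module names below are assumptions for illustration only.

# module_loaded NAME [MODULES_FILE]
# Succeeds if NAME is the first field of any line in MODULES_FILE.
# Defaults to /proc/modules, which lists one loaded module per line.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# write_feature_flags OUT_FILE [MODULES_FILE]
# Writes KEY=true lines only for features whose module is loaded and whose
# variable was not already set explicitly in the environment.
write_feature_flags() {
    out="$1"
    mods="${2:-/proc/modules}"
    : > "$out"   # start from an empty file

    # Assumption: nvidia_fs being loaded implies GPUDirect Storage is in use.
    if [ -z "${GDS_ENABLED:-}" ] && module_loaded nvidia_fs "$mods"; then
        echo "GDS_ENABLED=true" >> "$out"
    fi

    # Assumption: mlx5_core being loaded implies MOFED is in use.
    if [ -z "${MOFED_ENABLED:-}" ] && module_loaded mlx5_core "$mods"; then
        echo "MOFED_ENABLED=true" >> "$out"
    fi
}
```

Under these assumptions, running `write_feature_flags feature-flags.env` on a node without `mlx5_core` loaded would leave `MOFED_ENABLED` out of the file, so the ibverbs device nodes would not be injected there.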

Comment thread assets/state-device-plugin/0400_configmap.yaml Outdated
@tariq1890 tariq1890 force-pushed the device-plugin-feature-driver-ctr-ready branch 6 times, most recently from fc4e2ea to 3d4f1aa Compare June 9, 2026 02:20
@tariq1890 tariq1890 changed the title from "[state-device-plugin] infer feature enablement envars from the .driver-ctr-ready file" to "[state-device-plugin] infer feature enablement envars from kernel modules" Jun 9, 2026
@tariq1890 tariq1890 force-pushed the device-plugin-feature-driver-ctr-ready branch 2 times, most recently from 00ea318 to 8ebad89 Compare June 9, 2026 02:35
Comment thread assets/state-device-plugin/0400_configmap.yaml Outdated
@tariq1890 tariq1890 force-pushed the device-plugin-feature-driver-ctr-ready branch 2 times, most recently from 9a10e36 to ef03eb7 Compare June 9, 2026 21:36
@tariq1890 tariq1890 self-assigned this Jun 9, 2026

@rahulait rahulait left a comment

Contributor

LGTM

echo "GDS_ENABLED=true" >> feature-flags.env
fi
if [ -z "$MOFED_ENABLED" ]; then
echo "MOFED_ENABLED=true" >> feature-flags.env

Contributor Author

@rahulait @cdesiniotis Note that I am pushing this change right now.

Suggested change
echo "MOFED_ENABLED=true" >> feature-flags.env
echo "MOFED_ENABLED=true" >> feature-flags.env

@tariq1890 tariq1890 force-pushed the device-plugin-feature-driver-ctr-ready branch from ef03eb7 to 8dc251a Compare June 10, 2026 00:57
@tariq1890 tariq1890 enabled auto-merge June 10, 2026 01:04
…ules

Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
@tariq1890 tariq1890 force-pushed the device-plugin-feature-driver-ctr-ready branch from 8dc251a to 18220f4 Compare June 10, 2026 02:39
@tariq1890

Contributor Author

/cherry-pick release-26.3

@tariq1890 tariq1890 merged commit 27b9f65 into main Jun 10, 2026
18 checks passed
@github-actions

Contributor

🤖 Backport PR created for release-26.3: #2528

@tariq1890 tariq1890 deleted the device-plugin-feature-driver-ctr-ready branch June 10, 2026 03:44