feat(dra): DRA driver installation#8713

Open
runzhen wants to merge 29 commits into main from runzhen/dra2

Conversation

runzhen commented Jun 15, 2026

Contributor

What this PR does / why we need it:
Initial check-in of the dra-driver-nvidia-gpu installation.

The following items will be handled in a separate PR:

  1. Add an "enable DRA" field to the protocol buffer definition.
  2. Support AzureLinux.

unit_test.txt


Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  37s   default-scheduler  Successfully assigned default/pod to 3718-2026-06-16-ubuntu2404dradrivernvidiagpurunningdefaul000000
  Normal  Created    13s   kubelet            spec.containers{ctr0}: Created container: ctr0
  Normal  Started    13s   kubelet            spec.containers{ctr0}: Started container ctr0
  Normal  Pulled     13s   kubelet            spec.containers{ctr1}: Container image "mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu" already present on machine
  Normal  Created    13s   kubelet            spec.containers{ctr1}: Created container: ctr1
  Normal  Started    13s   kubelet            spec.containers{ctr1}: Started container ctr1

k logs -f pod --all-containers
GPU 0: NVIDIA A10-4Q (UUID: GPU-920b9300-69c2-11f1-9f17-62dd377f7a08)
GPU 0: NVIDIA A10-4Q (UUID: GPU-920b9300-69c2-11f1-9f17-62dd377f7a08)
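
The two identical log lines above presumably come from ctr0 and ctr1, one line per container. A minimal sketch of how that could be checked mechanically, assuming the log format shown above: it extracts the UUIDs and counts the distinct values. The helper name `extract_gpu_uuids` is hypothetical and not part of this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): extracts GPU UUIDs from
# lines shaped like "GPU 0: <name> (UUID: GPU-...)" read on stdin.
extract_gpu_uuids() {
  sed -n 's/.*(UUID: \(GPU-[0-9a-fA-F-]*\)).*/\1/p'
}

# Sample input copied from the log output in the PR description.
sample_logs='GPU 0: NVIDIA A10-4Q (UUID: GPU-920b9300-69c2-11f1-9f17-62dd377f7a08)
GPU 0: NVIDIA A10-4Q (UUID: GPU-920b9300-69c2-11f1-9f17-62dd377f7a08)'

# Count distinct UUIDs; a count of 1 means every line reports the same device.
unique_count=$(printf '%s\n' "$sample_logs" | extract_gpu_uuids | sort -u | wc -l | tr -d ' ')
echo "distinct GPU UUIDs: ${unique_count}"
```

Against a live cluster, the same function could be fed from `kubectl logs pod --all-containers` instead of the sample string.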

Which issue(s) this PR fixes:

Fixes #

Copilot AI review requested due to automatic review settings June 15, 2026 20:58

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

parts/linux/cloud-init/artifacts/cse_helpers.sh:80

  • ERR_DRA_DRIVER_START_FAIL is assigned exit code 124, but this file explicitly documents that 124 is the standard exit code returned by the timeout command when it times out (without --preserve-status). Reusing 124 will make failures ambiguous in logs/diagnostics and can break any tooling that interprets 124 as a timeout.
ERR_ENABLE_MANAGED_GPU_EXPERIENCE=123 # Error configuring managed GPU experience
ERR_DRA_DRIVER_START_FAIL=124 # dra-driver-nvidia-gpu could not be started by systemctl

# 123 is free for use

# Error code 124 is returned when a `timeout` command times out, and --preserve-status is not specified: https://man7.org/linux/man-pages/man1/timeout.1.html
ERR_VHD_BUILD_ERROR=125 # Reserved for VHD CI exit conditions
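
The ambiguity described in this comment can be reproduced with a small sketch, assuming GNU coreutils `timeout`. The value 200 and the names `ERR_EXAMPLE_START_FAIL` and `start_driver_stub` are placeholders for illustration only; they are not codes or functions chosen in this PR.

```shell
#!/usr/bin/env bash
# Placeholder value for illustration only; not an exit code defined in this PR.
ERR_EXAMPLE_START_FAIL=200

# Runs a command that overruns its time limit and returns the status
# reported by `timeout` (no --preserve-status).
run_with_timeout() {
  timeout 1 sleep 3
  return $?
}

# Hypothetical stand-in for a provisioning step that fails for its own
# reason, unrelated to any timeout.
start_driver_stub() {
  return "$ERR_EXAMPLE_START_FAIL"
}

run_with_timeout
timeout_rc=$?
start_driver_stub
start_rc=$?

echo "timeout exit status: ${timeout_rc}"
echo "driver start exit status: ${start_rc}"
```

If the stub returned 124 instead, the two statuses would be identical and a log reader could not tell a timed-out command from a failed driver start, which is the concern raised here.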

Comment thread parts/linux/cloud-init/artifacts/cse_config.sh
Comment thread spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh
Comment thread e2e/scenario_gpu_managed_experience_test.go Outdated
Comment thread aks-node-controller/parser/parser.go Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 15, 2026 21:20
runzhen and others added 2 commits June 15, 2026 14:21
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

Comment thread parts/linux/cloud-init/artifacts/cse_config.sh
Comment thread vhdbuilder/packer/install-dependencies.sh Outdated
Comment thread e2e/scenario_gpu_managed_experience_test.go Outdated
Comment thread parts/linux/cloud-init/artifacts/cse_helpers.sh
Comment thread parts/common/components.json
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 15, 2026 21:32

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

parts/linux/cloud-init/artifacts/cse_helpers.sh:79

  • ERR_DRA_DRIVER_START_FAIL is assigned exit code 124, but this file explicitly documents 124 as the default exit status from the timeout command. Reusing 124 for a CSE-specific failure makes it hard to distinguish an actual DRA failure from a generic timeout, and conflicts with the existing reservation comment.
ERR_ENABLE_MANAGED_GPU_EXPERIENCE=123 # Error configuring managed GPU experience
ERR_DRA_DRIVER_START_FAIL=124 # dra-driver-nvidia-gpu could not be started by systemctl

# 123 is free for use

# Error code 124 is returned when a `timeout` command times out, and --preserve-status is not specified: https://man7.org/linux/man-pages/man1/timeout.1.html

Comment thread parts/linux/cloud-init/artifacts/cse_config.sh
Comment thread parts/linux/cloud-init/artifacts/cse_main.sh
Copilot AI review requested due to automatic review settings June 15, 2026 22:23

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Comment thread parts/linux/cloud-init/artifacts/cse_config.sh
Comment thread e2e/validators.go
Comment thread e2e/scenario_gpu_managed_experience_test.go Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 15, 2026 22:32
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comment thread parts/linux/cloud-init/artifacts/cse_config.sh
Comment thread parts/linux/cloud-init/artifacts/cse_config.sh
Copilot AI review requested due to automatic review settings June 16, 2026 17:28

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comment thread parts/linux/cloud-init/artifacts/cse_config.sh
Comment thread parts/linux/cloud-init/artifacts/cse_config.sh
runzhen changed the title from "feat(dra): add DRA driver for Nvidia GPU" to "feat(dra): DRA driver installation" on Jun 16, 2026
Labels

components This pull request updates cached components on Linux or Windows VHDs

2 participants