Add GPUCluster CRD and controller for DRA-based stack #2571
karthikvetrivel wants to merge 6 commits
# NVIDIADriver CR). GPUClusterConfig does not manage the driver or device plugin
# itself; it waits for driver readiness before deploying the DRA driver.
gpuClusterConfig:
  enabled: false
We should think a bit more on the right interface for this. A few questions come to mind:
- Is `enabled` the right name for this field? As currently implemented, setting `gpuClusterConfig.enabled=true` will create a default GPUClusterConfig CR and will NOT create a default ClusterPolicy CR when the helm chart gets rendered. This may not be clear to the user.
- Should the `draDriver` struct be embedded under the top-level `gpuClusterConfig` struct?
affinity: {{ .KubeletPluginAffinity | toJson }}
{{- else }}
affinity:
  nodeAffinity:
Question -- should we add a nodeAntiAffinity rule here to prevent the kubelet-plugin from running on a node where the k8s-device-plugin is running? E.g. don't run on nodes labeled with nvidia.com/gpu.deploy.device-plugin=true
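A minimal sketch of how the exclusion asked about above could be written. Kubernetes has no dedicated nodeAntiAffinity field, so the rule is expressed as a nodeAffinity term with a negative operator. The label key comes from the comment; where this fragment would sit in the kubelet-plugin template, and whether it would be merged with the existing default terms, are assumptions.

```yaml
# Hypothetical fragment for the kubelet-plugin pod spec (placement is assumed).
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # Skip nodes where the k8s-device-plugin is deployed.
        # NotIn also matches nodes that do not carry the label at all.
        - key: nvidia.com/gpu.deploy.device-plugin
          operator: NotIn
          values:
          - "true"
```

Because `requiredDuringSchedulingIgnoredDuringExecution` is only evaluated at scheduling time, a pod that is already running would not be evicted if the label later flips to `true`.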
{{- else }}
deviceClassName: gpu.nvidia.com
allocationMode: All
adminAccess: true
Question -- does the GPU Operator namespace have to be labeled with resource.kubernetes.io/admin-access: true for this? From https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#admin-access:
Only users authorized to create ResourceClaim or ResourceClaimTemplate objects in namespaces labeled with resource.kubernetes.io/admin-access: "true" (case-sensitive) can use the adminAccess field.
Yes, I believe it does (as the link you found mentioned). We already handle this.
gpu-operator/internal/state/gfd.go
Lines 115 to 133 in 5691bbc
As it exists right now, it isn't pre-baked in.
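To make the requirement discussed in this thread concrete, here is a sketch of the two objects involved: a namespace carrying the admin-access label, and a claim template using the three fields from the diff. The namespace name, the template name, and the request name are hypothetical, and the `v1beta1` layout is an assumption; in `v1beta2` and `v1` the same three fields are expected to nest under an `exactly` block.

```yaml
# Namespace label required by Kubernetes before adminAccess can be used.
# The namespace name is an assumption for illustration.
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
  labels:
    resource.kubernetes.io/admin-access: "true"
---
# Hypothetical claim template; only deviceClassName, allocationMode, and
# adminAccess are taken from the diff under review.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: example-admin-access
  namespace: gpu-operator
spec:
  spec:
    devices:
      requests:
      - name: gpus
        deviceClassName: gpu.nvidia.com
        allocationMode: All
        adminAccess: true
```

The label value must be the quoted string `"true"`, since label values are strings and the match is case-sensitive.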
5691bbc to e5dcecd
f00b187 to a4c09b7
Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
a4c09b7 to 6b5ac48
1. Overview
We introduce a new CRD named `GPUCluster` and a new controller for reconciling it. Like `ClusterPolicy` today, it is a singleton, cluster-scoped CRD that configures the operands needed to enable GPUs in Kubernetes. `GPUCluster` represents the new DRA-based software-enablement stack; it is an evolution of `ClusterPolicy`. Unlike `ClusterPolicy`, it does not manage the driver or a device plugin: the driver is either preinstalled on the host or managed by `NVIDIADriver` CRs, and GPUs are surfaced to workloads through DRA. A `GPUCluster` may coexist with a `ClusterPolicy` in the same cluster, with every GPU node served by exactly one of the two stacks.
Change Log
3e1c3a0: Add GPUCluster v1alpha1 API, CRD, and generated client
The operand blocks (`dcgm`, `dcgmExporter`, `gfd`) and shared building blocks (`daemonsets`, `hostPaths`) reuse the v1 `ClusterPolicy` spec types directly instead of mirrored copies, so the two stacks share one definition per operand and cannot drift; only the `draDriver` tree is defined fresh in v1alpha1. There are deliberately no driver, toolkit, or device-plugin blocks, since `GPUCluster` does not manage them.
3fae53b: Add DRA driver operand and dra-driver-validator init container
The operand renders through the same `state.Manager` + Go-template engine that `NVIDIADriver` uses (rather than `assets/` + object transforms), with per-capability enablement (`gpus`, `computeDomains`) and the served `resource.k8s.io` apiVersion auto-detected via API discovery (v1 > v1beta2 > v1beta1). The new `dra-driver-validator` init container probes for a host-installed driver first and falls back to a containerized install, validates with `nvidia-smi --version` only (safe when GPUs are bound to vfio-pci for passthrough), and writes the minimal `driver-ready` contract (`NVIDIA_DRIVER_ROOT`, `DRIVER_ROOT_CTR_PATH`) that the kubelet-plugin containers source on startup.
40a79cf: Add GFD, DCGM, DCGM Exporter, and DRA validation operands to GPUCluster
These operands share a single `configurableState` implementation, so each operand file declares only its enablement check, image resolution, and render data; they acquire GPU access through DRA `adminAccess` ResourceClaims instead of the legacy privileged `/run/nvidia` mounts. Standalone DCGM defaults to disabled: dcgm-exporter runs its embedded hostengine and re-points to the `nvidia-dcgm-dra` Service only when DCGM is enabled. The ServiceMonitor renders only when the Prometheus Operator CRD is actually served, so a default install does not require it.
ed1d381: Add GPUCluster controller with singleton status and GPU node labeling
The controller mirrors `ClusterPolicy`'s singleton first-wins semantics and relies on owner references for cleanup, so deleting the CR garbage-collects every rendered object. Node labeling applies the DRA operand deploy labels (`nvidia.com/gpu.deploy.*`) only when absent, preserving the k8s-driver-manager's ability to pause an operand by flipping its label during a driver reload.
c806c0a: Add per-node stack selection between device-plugin and DRA planes
A new `nvidia.com/gpu-operator.mode` node label (`device-plugin` | `dra`) routes each GPU node to exactly one plane, and every operand DaemonSet of both stacks gates its nodeSelector on it, which is what makes `ClusterPolicy`/`GPUCluster` coexistence safe. The label is only ever set on unlabeled nodes (when both CRs exist, the default comes from the operator's `PREFERRED_MODE` env) and is never overwritten or removed, so changing the preference never migrates a node that is already serving GPUs.
6b5ac48: Add GPUCluster Helm install with ClusterPolicy/NVIDIADriver coexistence
Setting `gpuCluster.enabled=true` renders the singleton `GPUCluster` CR from a new template that reuses the existing `dcgm`/`dcgmExporter`/`gfd` values, so one values file drives whichever stack is enabled. The chart-level mutual exclusion with `ClusterPolicy` from earlier revisions is lifted, since per-node mode selection now provides the isolation.
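As a rough illustration of the Helm interface described in this change-log entry, a values override might look like the sketch below. Only `gpuCluster.enabled` and the names of the `dcgm`, `dcgmExporter`, and `gfd` blocks come from the description; the assumption that each block is toggled through an `enabled` key, and the specific values shown, are illustrative rather than taken from the chart in this PR.

```yaml
# Hypothetical values override; key layout below gpuCluster is assumed.
gpuCluster:
  enabled: true        # render the singleton GPUCluster CR

# Shared operand values, described as being reused by the GPUCluster template.
dcgm:
  enabled: false       # standalone DCGM is described as disabled by default
dcgmExporter:
  enabled: true
gfd:
  enabled: true
```

With the chart-level mutual exclusion lifted, enabling this block would not by itself move any node between stacks; per the description, that is governed by the `nvidia.com/gpu-operator.mode` node label.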