Add GPUCluster CRD and controller for DRA-based stack by karthikvetrivel · Pull Request #2571 · NVIDIA/gpu-operator

karthikvetrivel · 2026-06-23T15:47:15Z

1. Overview

We introduce a new CRD named GPUCluster and a new controller for reconciling it. Like ClusterPolicy today, it is a singleton, cluster-scoped CRD that configures the operands needed to enable GPUs in Kubernetes. GPUCluster represents the new DRA-based software-enablement stack; it is an evolution of ClusterPolicy. Unlike ClusterPolicy, it does not manage the driver or a device plugin: the driver is either preinstalled on the host or managed by NVIDIADriver CRs, and GPUs are surfaced to workloads through DRA. A GPUCluster may coexist with a ClusterPolicy in the same cluster, with every GPU node served by exactly one of the two stacks.

Change Log

3e1c3a0 — Add GPUCluster v1alpha1 API, CRD, and generated client
The operand blocks (dcgm, dcgmExporter, gfd) and shared building blocks (daemonsets, hostPaths) reuse the v1 ClusterPolicy spec types directly instead of mirrored copies, so the two stacks share one definition per operand and cannot drift; only the draDriver tree is defined fresh in v1alpha1. There are deliberately no driver, toolkit, or device-plugin blocks, since GPUCluster does not manage them.

3fae53b — Add DRA driver operand and dra-driver-validator init container
The operand renders through the same state.Manager + Go-template engine that NVIDIADriver uses (rather than assets/ + object transforms), with per-capability enablement (gpus, computeDomains) and the served resource.k8s.io apiVersion auto-detected via API discovery (v1 > v1beta2 > v1beta1). The new dra-driver-validator init container probes for a host-installed driver first and falls back to a containerized install, validates with nvidia-smi --version only (safe when GPUs are bound to vfio-pci for passthrough), and writes the minimal driver-ready contract (NVIDIA_DRIVER_ROOT, DRIVER_ROOT_CTR_PATH) that the kubelet-plugin containers source on startup.
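The apiVersion preference order can be illustrated with a minimal sketch. It assumes the served versions of the resource.k8s.io group have already been collected from API discovery, for example through client-go's discovery client. The helper name pickResourceAPIVersion is hypothetical and is not taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// preferredResourceVersions lists resource.k8s.io versions from most to least
// preferred, following the v1 > v1beta2 > v1beta1 order described in 3fae53b.
var preferredResourceVersions = []string{"v1", "v1beta2", "v1beta1"}

// pickResourceAPIVersion (hypothetical) returns the most preferred
// resource.k8s.io version that the API server reports as served. In a real
// operator the served list would come from discovery; here it is passed in.
func pickResourceAPIVersion(served []string) (string, error) {
	servedSet := make(map[string]bool, len(served))
	for _, v := range served {
		servedSet[v] = true
	}
	for _, v := range preferredResourceVersions {
		if servedSet[v] {
			return "resource.k8s.io/" + v, nil
		}
	}
	return "", errors.New("resource.k8s.io is not served; DRA is unavailable")
}

func main() {
	v, err := pickResourceAPIVersion([]string{"v1beta1", "v1beta2"})
	fmt.Println(v, err) // resource.k8s.io/v1beta2 <nil>
}
```

Keeping the preference in one ordered slice makes adding a future version a one-line change.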

40a79cf — Add GFD, DCGM, DCGM Exporter, and DRA validation operands to GPUCluster
These operands share a single configurableState implementation, so each operand file declares only its enablement check, image resolution, and render data; they acquire GPU access through DRA adminAccess ResourceClaims instead of the legacy privileged /run/nvidia mounts. Standalone DCGM defaults to disabled — dcgm-exporter runs its embedded hostengine and re-points to the nvidia-dcgm-dra Service only when DCGM is enabled — and the ServiceMonitor renders only when the Prometheus Operator CRD is actually served, so a default install does not require it.
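The two render-time decisions for dcgm-exporter can be sketched as a small pure function. The Service name nvidia-dcgm-dra comes from the commit description. The following are assumptions for illustration only: the helper planExporter and its exporterPlan type, port 5555 (the usual nv-hostengine default), and the "group/version/Kind" key format used to represent served kinds.

```go
package main

import "fmt"

// exporterPlan captures two render-time decisions for the dcgm-exporter operand.
type exporterPlan struct {
	// RemoteHostengine is empty when dcgm-exporter should use its embedded
	// hostengine, or "host:port" of the standalone DCGM Service otherwise.
	RemoteHostengine string
	// RenderServiceMonitor is true only when the Prometheus Operator's
	// ServiceMonitor kind is served by the API server.
	RenderServiceMonitor bool
}

// planExporter is hypothetical. Port 5555 and the servedKinds key format are
// assumptions, not values taken from the PR.
func planExporter(dcgmEnabled bool, namespace string, servedKinds map[string]bool) exporterPlan {
	p := exporterPlan{
		// Indexing a nil or empty map yields false, so a cluster without the
		// Prometheus Operator CRDs never gets a ServiceMonitor.
		RenderServiceMonitor: servedKinds["monitoring.coreos.com/v1/ServiceMonitor"],
	}
	if dcgmEnabled {
		p.RemoteHostengine = fmt.Sprintf("nvidia-dcgm-dra.%s.svc:5555", namespace)
	}
	return p
}

func main() {
	// Default install: standalone DCGM disabled, no Prometheus Operator CRDs.
	fmt.Printf("%+v\n", planExporter(false, "gpu-operator", map[string]bool{}))
	// DCGM enabled on a cluster that serves ServiceMonitor.
	fmt.Printf("%+v\n", planExporter(true, "gpu-operator",
		map[string]bool{"monitoring.coreos.com/v1/ServiceMonitor": true}))
}
```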

ed1d381 — Add GPUCluster controller with singleton status and GPU node labeling
The controller mirrors ClusterPolicy's singleton first-wins semantics and relies on owner references for cleanup, so deleting the CR garbage-collects every rendered object. Node labeling applies the DRA operand deploy labels (nvidia.com/gpu.deploy.*) only when absent, preserving the k8s-driver-manager's ability to pause an operand by flipping its label during a driver reload.
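The "only when absent" rule is what keeps a paused operand paused. A minimal sketch, assuming a hypothetical helper ensureDeployLabels and illustrative operand names (dra-driver, dcgm-exporter) that may differ from the label suffixes the PR actually uses:

```go
package main

import "fmt"

// ensureDeployLabels (hypothetical) sets nvidia.com/gpu.deploy.<operand>=true
// for each operand, but only when that label is missing. An existing value,
// such as "false" written by the k8s-driver-manager to pause an operand during
// a driver reload, is left untouched. The labels map must be non-nil.
// It reports whether the map was modified.
func ensureDeployLabels(labels map[string]string, operands []string) bool {
	changed := false
	for _, op := range operands {
		key := "nvidia.com/gpu.deploy." + op
		if _, ok := labels[key]; !ok {
			labels[key] = "true"
			changed = true
		}
	}
	return changed
}

func main() {
	// dcgm-exporter has been paused by the driver manager; dra-driver is unset.
	labels := map[string]string{"nvidia.com/gpu.deploy.dcgm-exporter": "false"}
	changed := ensureDeployLabels(labels, []string{"dra-driver", "dcgm-exporter"})
	fmt.Println(changed, labels["nvidia.com/gpu.deploy.dra-driver"], labels["nvidia.com/gpu.deploy.dcgm-exporter"]) // true true false
}
```

Because repeated reconciles are no-ops once every label exists, the controller only needs to patch a node when the function reports a change.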

c806c0a — Add per-node stack selection between device-plugin and DRA planes
A new nvidia.com/gpu-operator.mode node label (device-plugin | dra) routes each GPU node to exactly one plane, and every operand DaemonSet of both stacks gates its nodeSelector on it, which is what makes ClusterPolicy/GPUCluster coexistence safe. The label is only ever set on unlabeled nodes (when both CRs exist, the default comes from the operator's PREFERRED_MODE env) and is never overwritten or removed, so changing the preference never migrates a node that is already serving GPUs.
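The per-node selection rule can be sketched as follows. The label key, the two modes, and PREFERRED_MODE come from the commit description. Two parts are inferred rather than stated in the PR: mapping a lone ClusterPolicy to device-plugin and a lone GPUCluster to dra, and leaving the node unlabeled when PREFERRED_MODE is unset or invalid. The helper name selectMode is hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

const modeLabel = "nvidia.com/gpu-operator.mode"

// selectMode (hypothetical) decides the mode label for a GPU node and whether
// it should be written. A node that already carries the label is never
// re-labeled, so changing PREFERRED_MODE does not migrate it.
func selectMode(labels map[string]string, hasClusterPolicy, hasGPUCluster bool, preferred string) (string, bool) {
	if mode, ok := labels[modeLabel]; ok {
		return mode, false // never overwrite or remove an existing assignment
	}
	switch {
	case hasClusterPolicy && hasGPUCluster:
		if preferred == "dra" || preferred == "device-plugin" {
			return preferred, true
		}
		return "", false // assumption: leave unlabeled if the preference is unset or invalid
	case hasGPUCluster:
		return "dra", true // assumption: a lone GPUCluster implies the DRA plane
	case hasClusterPolicy:
		return "device-plugin", true // assumption: a lone ClusterPolicy implies device-plugin
	default:
		return "", false
	}
}

func main() {
	preferred := os.Getenv("PREFERRED_MODE")
	fmt.Println(selectMode(map[string]string{}, true, true, preferred))
	// An already-labeled node keeps its mode even if the preference changes.
	fmt.Println(selectMode(map[string]string{modeLabel: "device-plugin"}, true, true, "dra"))
}
```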

6b5ac48 — Add GPUCluster Helm install with ClusterPolicy/NVIDIADriver coexistence
Setting gpuCluster.enabled=true renders the singleton GPUCluster CR from a new template that reuses the existing dcgm/dcgmExporter/gfd values, so one values file drives whichever stack is enabled. The chart-level mutual exclusion with ClusterPolicy from earlier revisions is lifted, since per-node mode selection now provides the isolation.

Moved from #2513 (re-created with the head branch on NVIDIA/gpu-operator instead of a fork, to enable stacked PRs). The earlier review discussion — including the GPUClusterConfig → GPUCluster naming suggestion, since adopted — lives in #2513.

cdesiniotis · 2026-06-24T15:35:40Z

+# NVIDIADriver CR). GPUClusterConfig does not manage the driver or device plugin
+# itself; it waits for driver readiness before deploying the DRA driver.
+gpuClusterConfig:
+  enabled: false


We should think a bit more on the right interface for this. A few questions come to mind:

Is enabled the right name for this field? As currently implemented, setting gpuClusterConfig.enabled=true will create a default GPUClusterConfig CR and will NOT create a default ClusterPolicy CR when the helm chart gets rendered. This may not be clear to the user.

Should the draDriver struct be embedded under the top-level gpuClusterConfig struct?

cdesiniotis · 2026-06-24T15:53:44Z

+      affinity: {{ .KubeletPluginAffinity | toJson }}
+      {{- else }}
+      affinity:
+        nodeAffinity:


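Question -- should we add a node anti-affinity rule here (expressed as a nodeAffinity match expression with operator NotIn, since Kubernetes has no nodeAntiAffinity field) to prevent the kubelet-plugin from running on a node where the k8s-device-plugin is running? E.g. don't run on nodes labeled with nvidia.com/gpu.deploy.device-plugin=true.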
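One detail worth keeping in mind for this suggestion: a NotIn requirement also matches nodes where the label is absent, so it excludes device-plugin nodes without excluding anything else. The sketch below mirrors those semantics in plain Go; matchesNotIn is a hypothetical helper, not a client-go API. In the manifest, the expression would be added to the existing term's matchExpressions, where expressions are ANDed.

```go
package main

import "fmt"

// matchesNotIn (hypothetical) mirrors a Kubernetes node selector requirement
// with operator NotIn: it matches when the label is absent, or when its value
// is not one of the excluded values.
func matchesNotIn(labels map[string]string, key string, values ...string) bool {
	v, ok := labels[key]
	if !ok {
		return true
	}
	for _, excluded := range values {
		if v == excluded {
			return false
		}
	}
	return true
}

func main() {
	const key = "nvidia.com/gpu.deploy.device-plugin"
	fmt.Println(matchesNotIn(map[string]string{key: "true"}, key, "true"))  // false: device plugin runs here
	fmt.Println(matchesNotIn(map[string]string{key: "false"}, key, "true")) // true
	fmt.Println(matchesNotIn(map[string]string{}, key, "true"))             // true: label absent
}
```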

cdesiniotis · 2026-06-24T16:07:54Z

+        {{- else }}
+        deviceClassName: gpu.nvidia.com
+        allocationMode: All
+        adminAccess: true


Question -- does the GPU Operator namespace have to be labeled with resource.kubernetes.io/admin-access: true for this? From https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#admin-access:

Only users authorized to create ResourceClaim or ResourceClaimTemplate objects in namespaces labeled with resource.kubernetes.io/admin-access: "true" (case-sensitive) can use the adminAccess field.

Yes, I believe it does (as the link you found mentioned). We already handle this.

gpu-operator/internal/state/gfd.go

Lines 115 to 133 in 5691bbc

// ensureAdminAccessLabel patches the operator namespace with the label required by the
// kube-scheduler to allow adminAccess: true in ResourceClaim/ResourceClaimTemplate
// objects. The label is deliberately never removed: it is namespace-level configuration
// that other adminAccess consumers in the namespace may rely on.
func (s *stateGFD) ensureAdminAccessLabel(ctx context.Context) error {
	ns := &corev1.Namespace{}
	if err := s.client.Get(ctx, client.ObjectKey{Name: s.namespace}, ns); err != nil {
		return fmt.Errorf("could not get namespace %s: %w", s.namespace, err)
	}
	if ns.Labels[draAdminNamespaceLabelKey] == "true" {
		return nil
	}
	patch := client.MergeFrom(ns.DeepCopy())
	if ns.Labels == nil {
		ns.Labels = make(map[string]string)
	}
	ns.Labels[draAdminNamespaceLabelKey] = "true"
	return s.client.Patch(ctx, ns, patch)
}

As it exists right now, the label isn't pre-baked in: the operator adds it to its own namespace during reconciliation via ensureAdminAccessLabel.
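For reference, the patch that ensureAdminAccessLabel sends should amount to a small JSON merge patch (RFC 7386) that lists only the added label, so other namespace labels are preserved. The sketch builds that body by hand with encoding/json. It assumes draAdminNamespaceLabelKey equals resource.kubernetes.io/admin-access, the label named in the Kubernetes documentation quoted above; adminAccessLabelPatch is a hypothetical helper, not part of the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// adminAccessLabelPatch (hypothetical) builds a JSON merge patch that adds the
// DRA admin-access label to a Namespace. A merge patch carries only the fields
// being changed, so existing labels on the namespace are left as they are.
func adminAccessLabelPatch() ([]byte, error) {
	patch := map[string]any{
		"metadata": map[string]any{
			"labels": map[string]string{
				"resource.kubernetes.io/admin-access": "true",
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	b, err := adminAccessLabelPatch()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
	// {"metadata":{"labels":{"resource.kubernetes.io/admin-access":"true"}}}
}
```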

coveralls · 2026-07-01T15:31:44Z

coverage: 31.944% (+0.7%) from 31.227% — kv-gpuclusterconfig-crd into main

Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>

karthikvetrivel requested review from cdesiniotis and rahulait as code owners June 23, 2026 15:47

karthikvetrivel requested a review from cdesiniotis June 23, 2026 15:47

karthikvetrivel requested review from rajathagasthya and shivamerla as code owners June 23, 2026 15:47

karthikvetrivel requested review from rahulait and shivamerla June 23, 2026 15:47

karthikvetrivel requested a review from tariq1890 as a code owner June 23, 2026 15:47

karthikvetrivel requested a review from tariq1890 June 23, 2026 15:47

This was referenced Jun 23, 2026

Add GPUClusterConfig CRD and controller for DRA-based stack #2513

Closed

Support NVIDIADriver reconciliation without a ClusterPolicy #2572

Draft

cdesiniotis reviewed Jun 24, 2026


karthikvetrivel force-pushed the kv-gpuclusterconfig-crd branch from 5691bbc to e5dcecd on July 1, 2026 15:24

karthikvetrivel force-pushed the kv-gpuclusterconfig-crd branch 3 times, most recently from f00b187 to a4c09b7 on July 1, 2026 19:42

karthikvetrivel added 6 commits July 2, 2026 10:54

Add GPUCluster v1alpha1 API, CRD, and generated client

3e1c3a0

Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>

Add DRA driver operand and dra-driver-validator init container

3fae53b

Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>

Add GFD, DCGM, DCGM Exporter, and DRA validation operands to GPUCluster

40a79cf

Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>

Add GPUCluster controller with singleton status and GPU node labeling

ed1d381

Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>

Add per-node stack selection between device-plugin and DRA planes

c806c0a

Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>

Add GPUCluster Helm install with ClusterPolicy/NVIDIADriver coexistence

6b5ac48

Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>

karthikvetrivel force-pushed the kv-gpuclusterconfig-crd branch from a4c09b7 to 6b5ac48 on July 2, 2026 19:50

karthikvetrivel changed the title from "Add GPUClusterConfig CRD and controller for DRA-based stack" to "Add GPUCluster CRD and controller for DRA-based stack" on Jul 2, 2026

karthikvetrivel marked this pull request as draft July 2, 2026 20:19

karthikvetrivel wants to merge 6 commits into main from kv-gpuclusterconfig-crd


3 participants
