Add GPUCluster CRD and controller for DRA-based stack#2571

Draft
karthikvetrivel wants to merge 6 commits into main from kv-gpuclusterconfig-crd
@karthikvetrivel karthikvetrivel commented Jun 23, 2026

Member

1. Overview

We introduce a new CRD named GPUCluster and a new controller for reconciling it. Like ClusterPolicy today, it is a singleton, cluster-scoped CRD that configures the operands needed to enable GPUs in Kubernetes. GPUCluster represents the new DRA-based software-enablement stack; it is an evolution of ClusterPolicy. Unlike ClusterPolicy, it does not manage the driver or a device plugin: the driver is either preinstalled on the host or managed by NVIDIADriver CRs, and GPUs are surfaced to workloads through DRA. A GPUCluster may coexist with a ClusterPolicy in the same cluster, with every GPU node served by exactly one of the two stacks.

Change Log

3e1c3a0 Add GPUCluster v1alpha1 API, CRD, and generated client
The operand blocks (dcgm, dcgmExporter, gfd) and shared building blocks (daemonsets, hostPaths) reuse the v1 ClusterPolicy spec types directly instead of mirrored copies, so the two stacks share one definition per operand and cannot drift; only the draDriver tree is defined fresh in v1alpha1. There are deliberately no driver, toolkit, or device-plugin blocks, since GPUCluster does not manage them.

3fae53b Add DRA driver operand and dra-driver-validator init container
The operand renders through the same state.Manager + Go-template engine that NVIDIADriver uses (rather than assets/ + object transforms), with per-capability enablement (gpus, computeDomains) and the served resource.k8s.io apiVersion auto-detected via API discovery (v1 > v1beta2 > v1beta1). The new dra-driver-validator init container probes for a host-installed driver first and falls back to a containerized install, validates with nvidia-smi --version only (safe when GPUs are bound to vfio-pci for passthrough), and writes the minimal driver-ready contract (NVIDIA_DRIVER_ROOT, DRIVER_ROOT_CTR_PATH) that the kubelet-plugin containers source on startup.
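
The version preference described above (v1 > v1beta2 > v1beta1) can be sketched as a small selection function. This is a minimal illustration, not the code in this PR: selectDRAVersion and its served argument are hypothetical names, and served stands in for whatever version list API discovery reports for the resource.k8s.io group.

```go
package main

import "fmt"

// preferredDRAVersions lists resource.k8s.io versions from most to least
// preferred, following the v1 > v1beta2 > v1beta1 order in the description.
var preferredDRAVersions = []string{"v1", "v1beta2", "v1beta1"}

// selectDRAVersion returns the most preferred apiVersion present in served.
// served is a hypothetical input: the versions a discovery lookup reported
// for the resource.k8s.io group on the target cluster.
func selectDRAVersion(served []string) (string, error) {
	available := make(map[string]bool, len(served))
	for _, v := range served {
		available[v] = true
	}
	for _, v := range preferredDRAVersions {
		if available[v] {
			return "resource.k8s.io/" + v, nil
		}
	}
	return "", fmt.Errorf("no supported resource.k8s.io version served: %v", served)
}

func main() {
	// A cluster serving only beta versions resolves to the newest beta.
	version, err := selectDRAVersion([]string{"v1beta1", "v1beta2"})
	fmt.Println(version, err)
}
```

Returning an error when no known version is served lets the caller report the operand as not ready instead of rendering manifests the API server would reject.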

40a79cf Add GFD, DCGM, DCGM Exporter, and DRA validation operands to GPUCluster
These operands share a single configurableState implementation, so each operand file declares only its enablement check, image resolution, and render data. They acquire GPU access through DRA adminAccess ResourceClaims instead of the legacy privileged /run/nvidia mounts. Standalone DCGM is disabled by default: dcgm-exporter runs its embedded hostengine and re-points to the nvidia-dcgm-dra Service only when DCGM is enabled. The ServiceMonitor renders only when the Prometheus Operator CRD is actually served, so a default install does not require the Prometheus Operator.
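
The two render decisions for dcgm-exporter can be sketched as a pure function. This is a rough illustration under stated assumptions: exporterPlan and planExporter are hypothetical names, the serviceMonitorEnabled input is assumed to come from the CR, and port 5555 is assumed as the hostengine port; only the nvidia-dcgm-dra Service name comes from the description.

```go
package main

import "fmt"

// exporterPlan captures the two render decisions described for dcgm-exporter.
// The type and its field names are hypothetical, for illustration only.
type exporterPlan struct {
	// RemoteHostEngine is empty when the exporter uses its embedded hostengine.
	RemoteHostEngine string
	// RenderServiceMonitor reports whether a ServiceMonitor should be rendered.
	RenderServiceMonitor bool
}

// planExporter decides how dcgm-exporter is rendered.
// Assumption: the standalone hostengine listens on port 5555.
func planExporter(dcgmEnabled, serviceMonitorEnabled, serviceMonitorCRDServed bool) exporterPlan {
	plan := exporterPlan{}
	if dcgmEnabled {
		// Re-point to the standalone DCGM Service only when DCGM is enabled.
		plan.RemoteHostEngine = "nvidia-dcgm-dra:5555"
	}
	// Render the ServiceMonitor only if it is requested and its CRD is served.
	plan.RenderServiceMonitor = serviceMonitorEnabled && serviceMonitorCRDServed
	return plan
}

func main() {
	// Default install: DCGM disabled and no Prometheus Operator CRD served.
	fmt.Printf("%+v\n", planExporter(false, true, false))
}
```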

ed1d381 Add GPUCluster controller with singleton status and GPU node labeling
The controller mirrors ClusterPolicy's singleton first-wins semantics and relies on owner references for cleanup, so deleting the CR garbage-collects every rendered object. Node labeling applies the DRA operand deploy labels (nvidia.com/gpu.deploy.*) only when absent, preserving the k8s-driver-manager's ability to pause an operand by flipping its label during a driver reload.
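
The "only when absent" labeling rule can be sketched as follows. This is a simplified illustration, not the controller's code: applyDeployLabels is a hypothetical helper, and the label suffixes and the "paused" value are placeholders chosen for the example.

```go
package main

import "fmt"

// applyDeployLabels returns a copy of nodeLabels with each default deploy
// label added only if its key is absent. Existing values are never replaced,
// so a label another component flipped to pause an operand survives.
// The second return value reports whether anything was added.
func applyDeployLabels(nodeLabels, defaults map[string]string) (map[string]string, bool) {
	out := make(map[string]string, len(nodeLabels)+len(defaults))
	for k, v := range nodeLabels {
		out[k] = v
	}
	changed := false
	for k, v := range defaults {
		if _, exists := out[k]; !exists {
			out[k] = v
			changed = true
		}
	}
	return out, changed
}

func main() {
	// Hypothetical label keys under the nvidia.com/gpu.deploy.* prefix.
	node := map[string]string{"nvidia.com/gpu.deploy.dra-driver": "paused"}
	defaults := map[string]string{
		"nvidia.com/gpu.deploy.dra-driver":    "true",
		"nvidia.com/gpu.deploy.dcgm-exporter": "true",
	}
	labels, changed := applyDeployLabels(node, defaults)
	fmt.Println(labels, changed)
}
```

Because the function only adds missing keys, running it on every reconcile is idempotent: a second pass over the same node reports no change and needs no API write.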

c806c0a Add per-node stack selection between device-plugin and DRA planes
A new nvidia.com/gpu-operator.mode node label (device-plugin | dra) routes each GPU node to exactly one plane, and every operand DaemonSet of both stacks gates its nodeSelector on it, which is what makes ClusterPolicy/GPUCluster coexistence safe. The label is only ever set on unlabeled nodes (when both CRs exist, the default comes from the operator's PREFERRED_MODE env) and is never overwritten or removed, so changing the preference never migrates a node that is already serving GPUs.
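
The per-node routing rule can be sketched as a small decision function. This is an illustration under stated assumptions rather than the controller's actual logic: resolveMode is a hypothetical name, preferred is assumed to be an already validated PREFERRED_MODE value, and the behavior when only one CR exists (use that CR's plane) is an assumption not spelled out in the description.

```go
package main

import "fmt"

const modeLabel = "nvidia.com/gpu-operator.mode"

// resolveMode returns the mode a node should run in and whether the label
// needs to be written. A node that already carries the label keeps it.
// Assumption: with only one CR present, the node gets that CR's plane.
func resolveMode(nodeLabels map[string]string, hasClusterPolicy, hasGPUCluster bool, preferred string) (string, bool) {
	if existing, ok := nodeLabels[modeLabel]; ok {
		// Never overwrite or remove an existing mode label.
		return existing, false
	}
	switch {
	case hasClusterPolicy && hasGPUCluster:
		// Both stacks exist: fall back to the operator-wide preference.
		return preferred, true
	case hasGPUCluster:
		return "dra", true
	case hasClusterPolicy:
		return "device-plugin", true
	default:
		// No CR to serve the node, so leave it unlabeled.
		return "", false
	}
}

func main() {
	labeled := map[string]string{modeLabel: "device-plugin"}
	fmt.Println(resolveMode(labeled, true, true, "dra"))
	fmt.Println(resolveMode(map[string]string{}, true, true, "dra"))
}
```

The early return on an existing label is what keeps a change of preference from migrating a node that is already serving GPUs.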

6b5ac48 Add GPUCluster Helm install with ClusterPolicy/NVIDIADriver coexistence
Setting gpuCluster.enabled=true renders the singleton GPUCluster CR from a new template that reuses the existing dcgm/dcgmExporter/gfd values, so one values file drives whichever stack is enabled. The chart-level mutual exclusion with ClusterPolicy from earlier revisions is lifted, since per-node mode selection now provides the isolation.


Moved from #2513 (re-created with the head branch on NVIDIA/gpu-operator instead of a fork, to enable stacked PRs). The earlier review discussion, including the since-adopted suggestion to rename GPUClusterConfig to GPUCluster, lives in #2513.

Comment thread deployments/gpu-operator/values.yaml Outdated
Comment thread deployments/gpu-operator/templates/cleanup_crd.yaml
# NVIDIADriver CR). GPUClusterConfig does not manage the driver or device plugin
# itself; it waits for driver readiness before deploying the DRA driver.
gpuClusterConfig:
  enabled: false

Contributor

We should think a bit more on the right interface for this. A few questions come to mind:

  1. Is enabled the right name for this field? As currently implemented, setting gpuClusterConfig.enabled=true will create a default GPUClusterConfig CR and will NOT create a default ClusterPolicy CR when the helm chart gets rendered. This may not be clear to the user.
  2. Should the draDriver struct be embedded under the top-level gpuClusterConfig struct?

Comment thread api/nvidia/v1alpha1/gpuclusterconfig_types.go Outdated
Comment thread api/nvidia/v1alpha1/gpuclusterconfig_types.go Outdated
Comment thread api/nvidia/v1alpha1/gpuclusterconfig_types.go Outdated
Comment thread internal/state/dra_driver.go Outdated
Comment thread manifests/state-dra-driver/0500_daemonset.yaml Outdated
affinity: {{ .KubeletPluginAffinity | toJson }}
{{- else }}
affinity:
  nodeAffinity:

Contributor

Question -- should we add a nodeAntiAffinity rule here to prevent the kubelet-plugin from running on a node where the k8s-device-plugin is running? E.g. don't run on nodes labeled with nvidia.com/gpu.deploy.device-plugin=true

Comment thread manifests/state-dra-driver/0500_daemonset.yaml Outdated
{{- else }}
deviceClassName: gpu.nvidia.com
allocationMode: All
adminAccess: true

Contributor

Question -- does the GPU Operator namespace have to be labeled with resource.kubernetes.io/admin-access: true for this? From https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#admin-access:

Only users authorized to create ResourceClaim or ResourceClaimTemplate objects in namespaces labeled with resource.kubernetes.io/admin-access: "true" (case-sensitive) can use the adminAccess field.

@karthikvetrivel karthikvetrivel Jun 24, 2026

Member Author

Yes, I believe it does (as the link you found mentioned). We already handle this.

// ensureAdminAccessLabel patches the operator namespace with the label required by the
// kube-scheduler to allow adminAccess: true in ResourceClaim/ResourceClaimTemplate
// objects. The label is deliberately never removed: it is namespace-level configuration
// that other adminAccess consumers in the namespace may rely on.
func (s *stateGFD) ensureAdminAccessLabel(ctx context.Context) error {
	ns := &corev1.Namespace{}
	if err := s.client.Get(ctx, client.ObjectKey{Name: s.namespace}, ns); err != nil {
		return fmt.Errorf("could not get namespace %s: %w", s.namespace, err)
	}
	if ns.Labels[draAdminNamespaceLabelKey] == "true" {
		return nil
	}
	patch := client.MergeFrom(ns.DeepCopy())
	if ns.Labels == nil {
		ns.Labels = make(map[string]string)
	}
	ns.Labels[draAdminNamespaceLabelKey] = "true"
	return s.client.Patch(ctx, ns, patch)
}

As it exists right now, it isn't pre-baked in.

@karthikvetrivel karthikvetrivel force-pushed the kv-gpuclusterconfig-crd branch from 5691bbc to e5dcecd Compare July 1, 2026 15:24
coveralls commented Jul 1, 2026

Coverage Status

coverage: 31.944% (+0.7%) from 31.227% — kv-gpuclusterconfig-crd into main

@karthikvetrivel karthikvetrivel force-pushed the kv-gpuclusterconfig-crd branch 3 times, most recently from f00b187 to a4c09b7 Compare July 1, 2026 19:42
Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
@karthikvetrivel karthikvetrivel force-pushed the kv-gpuclusterconfig-crd branch from a4c09b7 to 6b5ac48 Compare July 2, 2026 19:50
@karthikvetrivel karthikvetrivel changed the title Add GPUClusterConfig CRD and controller for DRA-based stack Add GPUCluster CRD and controller for DRA-based stack Jul 2, 2026
@karthikvetrivel karthikvetrivel marked this pull request as draft July 2, 2026 20:19