
feat(server): support TLS certificate hot-reload#1870

Open
lunarwhite wants to merge 2 commits into NVIDIA:main from lunarwhite:cm-renew

Conversation

@lunarwhite

@lunarwhite lunarwhite commented Jun 11, 2026


Summary

Add file-watcher-based TLS certificate hot-reload to the gateway, allowing cert/key/CA rotation without restarting. Uses notify to watch parent directories and ArcSwap for atomic config swapping so in-flight TLS handshakes are never blocked.

Related Issue

Fixes #1836

Changes

  • Replace TlsAcceptor internals with ArcSwap<ServerConfig> for lock-free atomic config swaps
  • Add file-watcher-based reload worker using notify::recommended_watcher — watches cert/key/CA parent directories and reloads on filesystem changes with a 1-second debounce (handles Kubernetes Secret atomic-swap patterns)
  • Add AtomicBool guard to prevent duplicate spawn_reload_worker() calls
  • Emit OCSF ConfigStateChange events on reload success (Informational) and failure (Medium)
  • Add early private key type validation via rustls::crypto::ring::sign — surfaces bad key types at startup instead of handshake time
  • Extract shared TLS test utilities (generate_test_certs_with_ca, install_rustls_provider, write_test_file) into tls_test_utils.rs
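
The swap-on-success behavior described in the list above can be sketched with the standard library alone. This is not the PR's code: `RwLock<Arc<_>>` stands in for `ArcSwap`, and `Config`, `ReloadableConfig`, and `cert_fingerprint` are hypothetical names standing in for `rustls::ServerConfig` and the acceptor internals. The point is only the ordering: build the new config first, swap the pointer last, so a failed reload leaves the old config active.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for rustls::ServerConfig.
struct Config {
    cert_fingerprint: String,
}

// Holds the active config. std's RwLock<Arc<_>> is used here in place of
// ArcSwap so the sketch needs no external crates.
struct ReloadableConfig {
    current: RwLock<Arc<Config>>,
}

impl ReloadableConfig {
    fn new(initial: Config) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    // Each handshake clones the Arc and keeps using that snapshot even if a
    // reload replaces the pointer afterwards.
    fn load(&self) -> Arc<Config> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    // Build the new config first and swap only if building succeeded, so a
    // bad cert or key on disk leaves the previous config in place.
    fn reload(&self, build: impl FnOnce() -> Result<Config, String>) -> Result<(), String> {
        let fresh = Arc::new(build()?);
        *self.current.write().unwrap() = fresh;
        Ok(())
    }
}

fn main() {
    let tls = ReloadableConfig::new(Config { cert_fingerprint: "old".to_string() });
    let in_flight = tls.load(); // snapshot held by a handshake in progress

    // A failed reload keeps serving the old config.
    let failed = tls.reload(|| Err("unsupported key type".to_string()));
    println!("failed reload: {:?}, active: {}", failed, tls.load().cert_fingerprint);

    // A successful reload is visible to new handshakes only.
    tls.reload(|| Ok(Config { cert_fingerprint: "new".to_string() })).unwrap();
    println!("active: {}, in-flight: {}", tls.load().cert_fingerprint, in_flight.cert_fingerprint);
}
```

With `ArcSwap` the read path avoids taking a lock at all, which is the property the PR relies on; the `RwLock` version only illustrates the same snapshot-and-swap shape.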

Testing

  • mise run pre-commit passes
  • Unit tests added — 7 tests: config build, reload success, reload failure preserves old config, concurrent handshake+reload, cert rotation detection, file-watcher change detection, mTLS CA rotation
  • E2E tests added/updated (if applicable)
  • E2E tests executed manually in local k3s cluster

E2E test record

Step 1: Create cluster

mise run helm:k3s:create

Step 2: Enable TLS

Create deploy/helm/openshell/ci/values-reload-test.yaml:

server:
  disableTls: false

Add to deploy/helm/openshell/skaffold.yaml after - ci/values-skaffold.yaml:

          - ci/values-reload-test.yaml

Step 3: Deploy

mise run helm:skaffold:run

Output:

Build [openshell/gateway] succeeded
Build [openshell/supervisor] succeeded
Images loaded in 9.79 seconds
NAME: openshell
LAST DEPLOYED: Fri Jun 12 14:32:48 2026
NAMESPACE: openshell
STATUS: deployed
REVISION: 1
Deployments stabilized in 15.093 seconds

Step 4: Verify file watcher started

KUBECONFIG=kubeconfig kubectl -n openshell logs openshell-0 | grep -i watcher

Output:

INFO openshell_server::tls: TLS certificate file watcher started dirs=["/etc/openshell-tls/server", "/etc/openshell-tls/client-ca"]
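
The two directories in that log line are consistent with watching each file's parent directory and de-duplicating the result. A minimal sketch of that step follows; the function name `watch_dirs` and the file names `tls.crt`, `tls.key`, and `ca.crt` are assumptions for illustration, not taken from the PR.

```rust
use std::collections::BTreeSet;
use std::path::{Path, PathBuf};

// Collect the distinct parent directories of the TLS files, so a directory
// shared by the cert and key is watched only once.
fn watch_dirs(files: &[&Path]) -> BTreeSet<PathBuf> {
    files
        .iter()
        .filter_map(|f| f.parent())
        .map(Path::to_path_buf)
        .collect()
}

fn main() {
    // Assumed file names; only the directories appear in the log output.
    let files = [
        Path::new("/etc/openshell-tls/server/tls.crt"),
        Path::new("/etc/openshell-tls/server/tls.key"),
        Path::new("/etc/openshell-tls/client-ca/ca.crt"),
    ];
    for dir in watch_dirs(&files) {
        println!("watching {}", dir.display());
    }
}
```

Watching the directory rather than the file matters for Secret volumes, where the update replaces a symlink in the directory instead of rewriting the file in place.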

Step 5: Capture initial cert fingerprint

KUBECONFIG=kubeconfig kubectl -n openshell port-forward pod/openshell-0 18443:8080 &>/dev/null &
sleep 3
echo "" | openssl s_client -connect 127.0.0.1:18443 -servername openshell 2>/dev/null \
  | openssl x509 -fingerprint -sha256 -noout
kill %1 2>/dev/null

Output:

sha256 Fingerprint=AC:47:<...>:FB:2C

Step 6: Overwrite TLS secret (simulate cert-manager renewal)

openssl req -x509 -newkey rsa:2048 -keyout /tmp/new-tls.key -out /tmp/new-tls.crt \
  -days 1 -nodes -subj "/CN=openshell"

KUBECONFIG=kubeconfig kubectl -n openshell create secret tls openshell-server-tls \
  --cert=/tmp/new-tls.crt --key=/tmp/new-tls.key \
  --dry-run=client -o yaml | KUBECONFIG=kubeconfig kubectl apply -f -

Output:

secret/openshell-server-tls configured

Step 7: Verify new cert is served — no pod restart

Wait for kubelet sync + watcher detection:

sleep 60

7a. Pod restarts

KUBECONFIG=kubeconfig kubectl -n openshell get pods

Output:

NAME          READY   STATUS    RESTARTS   AGE
openshell-0   1/1     Running   0          10m

7b. Current cert from gateway

KUBECONFIG=kubeconfig kubectl -n openshell port-forward pod/openshell-0 28443:8080 &>/dev/null &
sleep 3
echo "" | openssl s_client -connect 127.0.0.1:28443 -servername openshell 2>/dev/null \
  | openssl x509 -fingerprint -sha256 -noout
kill %1 2>/dev/null

Output:

sha256 Fingerprint=4F:01:<...>:FF

7c. Cert stored in Secret (should match gateway)

KUBECONFIG=kubeconfig kubectl -n openshell get secret openshell-server-tls \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -fingerprint -sha256 -noout

Output:

sha256 Fingerprint=4F:01:<...>:FF

7d. Reload logs

KUBECONFIG=kubeconfig kubectl -n openshell logs openshell-0 | grep -i "watcher\|reload\|TLS cert"

Output:

INFO openshell_server::tls: TLS certificate file watcher started dirs=["/etc/openshell-tls/server", "/etc/openshell-tls/client-ca"]
INFO ocsf: sandbox_id="" CONFIG:RELOADED [INFO] TLS certificate config reloaded successfully

Results

| Checkpoint | Fingerprint | Source |
|---|---|---|
| Before renewal | AC:47:<...>:2C | PKI init job |
| After renewal | 4F:01:<...>:FF | Our replacement |
| Secret matches gateway? | ✅ Yes | |
| Pod restarts | 0 | |
| File watcher active? | ✅ watching /etc/openshell-tls/server, /etc/openshell-tls/client-ca | |
| OCSF reload event? | CONFIG:RELOADED [INFO] | |

The file watcher detected the updated certificate in the watched directory, triggered a reload after the 1-second debounce window, and atomically swapped the active TLS config. Zero pod restarts. The OCSF log entry provides observability that was completely missing before.
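
The debounce step can be sketched as follows, under stated assumptions: a std `mpsc` channel and a blocking thread stand in for whatever event channel and async runtime the PR uses, `wait_for_burst` is a hypothetical name, and the window is taken to be a fixed wait after the first event followed by a drain (the PR does not say whether the window resets on each event).

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;
use std::time::Duration;

// Block until the first filesystem event arrives, wait out the debounce
// window, then drain everything that queued up in the meantime. Returns how
// many events were coalesced, or None if every sender was dropped before any
// event arrived.
fn wait_for_burst<T>(rx: &Receiver<T>, window: Duration) -> Option<usize> {
    rx.recv().ok()?; // first event of the burst
    thread::sleep(window); // let the rest of the atomic swap land
    let drained = rx.try_iter().count(); // discard the queued follow-ups
    Some(1 + drained)
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // A Secret update typically surfaces as several events in quick succession.
    for name in ["..data", "tls.crt", "tls.key"] {
        tx.send(name).unwrap();
    }
    // A short window keeps the example fast; the PR uses 1 second.
    if let Some(n) = wait_for_burst(&rx, Duration::from_millis(50)) {
        println!("coalesced {n} events into one reload");
    }
}
```

Calling the reload once per burst, rather than once per event, is what lets the cert, key, and CA be read after the whole update has settled.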

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)

@lunarwhite lunarwhite requested review from a team, derekwaynecarr and mrunalp as code owners June 11, 2026 12:14
@copy-pr-bot

copy-pr-bot Bot commented Jun 11, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions Bot commented Jun 11, 2026


All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@lunarwhite

Author

I have read the DCO document and I hereby sign the DCO.

@lunarwhite

Author

recheck

@TaylorMutch

Collaborator

Thanks for taking this on and working to resolve the cert-manager renewal issue.

One design concern: this currently implements TLS hot-reload as a timed polling loop rather than a file watcher. The worker uses tokio::time::interval() and reloads on every tick, gated by the new reload_interval_secs / reloadIntervalSecs config.

Can we make this event-driven instead? A file watcher over the parent directories of cert_path, key_path, and client_ca_path would remove the interval tuning knob and avoid leaving the fix disabled by default. For Kubernetes Secret volumes, the watcher should watch the containing directory and handle the ..data symlink/file update pattern, ideally with a short debounce/retry before calling reload() so cert/key/CA are read from a consistent update.

That would make renewal detection immediate, reduce configuration surface, and avoid polling races where a tick lands during a projected-volume update and then waits a full interval before recovering.

@TaylorMutch TaylorMutch self-assigned this Jun 11, 2026
Signed-off-by: Yuedong Wu <dwcn22@outlook.com>
Signed-off-by: Yuedong Wu <dwcn22@outlook.com>
@lunarwhite

Author

@TaylorMutch Appreciate your key design review and pointers.

The implementation has now been refactored from polling-based to watch-based, and the PR description has been updated. I have tested the new changes end-to-end locally; please find the steps and results in the "E2E test record" toggle under Testing in the description.

Refactor summary

Changes from polling-based to watch-based

| Aspect | Polling | Watch |
|---|---|---|
| Detection mechanism | tokio::time::interval timer | notify::recommended_watcher (inotify/kqueue/FSEvents) |
| Config required | reloadIntervalSecs: N (N > 0) | None (always-on when TLS configured) |
| Debounce | N/A (fixed interval) | 1-second drain window for atomic-swap patterns |
| Duplicate spawn guard | None | AtomicBool (warns if called twice) |
| Logging on reload success | Silent | OCSF CONFIG:RELOADED event |
| Logging on reload failure | warn! with error | OCSF CONFIG:RELOADED FAILURE event + warn! |

Key behavioral differences vs polling approach

  1. Always-on — no reloadIntervalSecs config needed; watcher starts automatically when TLS is configured.
  2. Event-driven — reload happens within seconds of the filesystem change (after 1s debounce), rather than waiting for the next poll tick.
  3. Observable — OCSF CONFIG:RELOADED events provide audit trail for successful reloads and failures.
  4. Duplicate guard: AtomicBool prevents accidentally spawning two watchers.
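
For reference, the duplicate guard in item 4 can be as small as an atomic swap. The sketch below is illustrative only: `ReloadWorker` and `try_start` are hypothetical names, and the warning goes to stderr rather than through the project's logging macros.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical holder for the "already spawned" flag.
struct ReloadWorker {
    started: AtomicBool,
}

impl ReloadWorker {
    fn new() -> Self {
        Self { started: AtomicBool::new(false) }
    }

    // Returns true only for the first caller. swap() stores `true` and
    // returns the previous value in one atomic step, so two racing callers
    // cannot both observe `false`.
    fn try_start(&self) -> bool {
        if self.started.swap(true, Ordering::AcqRel) {
            eprintln!("warn: reload worker already running; ignoring duplicate spawn");
            return false;
        }
        // The real worker would set up the file watcher here.
        true
    }
}

fn main() {
    let worker = ReloadWorker::new();
    println!("first call started worker: {}", worker.try_start());
    println!("second call started worker: {}", worker.try_start());
}
```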



Development

Successfully merging this pull request may close these issues.

bug: gateway does not reload TLS certificate after cert-manager renewal

2 participants