feat:Make customers experience better#100
Merged
Merged
Conversation
14df72f to
73492b4
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
This PR significantly expands the tools/cluster_setup/ installer UX and adds a documented air‑gapped installation workflow for k0s-based customer deployments.
Changes:
- Add an air-gap bundle workflow (bundle prep + offline installer runner) with a dedicated guide and a detailed test plan.
- Improve
k0s_cluster_with_stack.shoperator experience (timestamps, log rotation, install plan confirmation, dependency waits, step/phase progress markers, validate + diagnose subcommands). - Update docs/configs to reflect supported GPU tiers (removing H100 from k0s docs) and to document the staged model set.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/cluster_setup/TEST_PLAN.md | Adds a multi-tier test plan covering syntax, docs consistency, prompts, air-gap workflow, and live-cluster checks. |
| tools/cluster_setup/prepare_airgap_bundle.sh | New script to build a transferable tarball of binaries/manifests/charts plus metadata. |
| tools/cluster_setup/install_from_airgap_bundle.sh | New script to extract/verify bundle, set env overrides, and invoke the main installer offline. |
| tools/cluster_setup/AIRGAP.md | New end-to-end documentation for disconnected installs and override variables. |
| tools/cluster_setup/k0s_cluster_with_stack.sh | Adds logging UX, prompts, dependency waits, progress tracking, and new validate/diagnose commands; wires env overrides for several downloads. |
| tools/cluster_setup/k0s-cluster-config.yaml | Documents cluster.airgap and removes the H100 comment from the accelerator type. |
| tools/cluster_setup/K0S_README.md | Adds AIRGAP guide links, documents air-gap mode + overrides, updates accelerator tier docs, and lists staged models. |
| tools/cluster_setup/K0S_QUICKSTART.md | Removes H100 references and updates accelerator tier docs. |
| tools/cluster_setup/EKS_README.md | Documents staged models during EKS install flow. |
| tools/artifacts_download_upload_scripts/README.md | Updates the documented set of pre-configured models (removes llama entries; adds Gemma/GPT OSS). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
spl-arif
approved these changes
Jun 15, 2026
- Remove all H100/p5 references from k0s-specific files (K0S_QUICKSTART.md, K0S_README.md, k0s-cluster-config.yaml, k0s_cluster_with_stack.sh); EKS H100/Capacity Block path is intentionally kept as it has real code support - Add gemma-4-31b-it and full model staging table to K0S_README.md, EKS_README.md, and artifacts/README.md to address missing Gemma references - Sync artifacts/README.md model list to match model_artifacts_configs.yaml (removes stale llama references, adds gemma-4-31b-it and gpt-oss-20b)
- prepare_airgap_bundle.sh: downloads k0s binary, yq, Helm charts (prometheus,
otel, kuberay, metallb), and static manifests (cert-manager, local-path-
provisioner, nvidia-device-plugin) into a versioned tar.gz bundle with
checksums; resolves unpinned chart versions at bundle time; prints container
image list and model-weight staging instructions
- install_from_airgap_bundle.sh: extracts bundle, verifies checksums, installs
k0s/yq binaries, sets up local Helm chart repo, exports env-var overrides,
then delegates to k0s_cluster_with_stack.sh
- k0s_cluster_with_stack.sh: replace every hardcoded internet URL with
${VAR:-default} overrides (K0S_INSTALL_URL, YQ_DOWNLOAD_URL,
CERT_MANAGER_MANIFEST_URL, LOCAL_PATH_MANIFEST_URL,
NVIDIA_DEVICE_PLUGIN_MANIFEST_URL, PROMETHEUS_CHART_PATH, OTEL_CHART_PATH,
KUBERAY_CHART_PATH, METALLB_CHART_PATH) so air-gap installs work without
modifying installer logic
- AIRGAP.md: complete step-by-step air-gap guide covering bundle prep, image mirroring, model staging, file transfer, and install; includes full env-var reference table and troubleshooting section - prepare_airgap_bundle.sh: add --help flag documenting all flags, env-var overrides, bundle contents, and next-steps workflow - install_from_airgap_bundle.sh: add --help flag documenting all flags, what the script does (7 steps), env-var overrides set automatically, and manual-use instructions for advanced users - K0S_README.md: expand Environment Variables section with air-gap URL override table; replace vague Air-Gapped Deployment section with concrete quick-start commands pointing at AIRGAP.md; add ToC callout
…rm prompt, validate/diagnose subcommands
P0 — Timestamps on every log line ([YYYY-MM-DD HH:MM:SS INFO/WARN/ERROR])
P0 — err() now prints log file path and 'run diagnose' hint on every error exit
P0 — delete/clean-all now require typing the cluster name to confirm (bypass: AUTO_APPROVE=true)
P0 — Log rotation: keep last 10 installer log files, delete older ones automatically
P1 — show_install_plan(): pre-install summary table (nodes, registry, object store, steps)
with yes/no confirmation gate before any changes are made
P1 — wait_for_dependency(): interactive pause-and-retry with 30s auto-advance and
Enter-to-retry; exits clearly after configurable max_wait
P1 — Step progress tracker (step_start/step_ok/step_fail/step_skip) wired into
main_install; show_step_summary() prints final pass/fail table at end of install
P1 — phase_start/phase_end section markers for grep-friendly log phases
P2 — need() enriched with per-tool install instructions (kubectl, helm, yq, jq, ssh, curl, aws, git)
P2 — validate subcommand: config completeness checker (IPs, credentials, file paths) — safe, no-op
P2 — diagnose subcommand: collects cluster state, pod logs, events, installer logs into
a tar.gz support bundle; redacts credentials from config copy
…ouch-points
Object store (always external, air-gap or not):
- ensure_s3compat_credentials() now calls wait_for_dependency() before
creating the minio-credentials Secret; probes the endpoint with curl
and retries every 30s for up to 5 min — works for MinIO, SeaweedFS,
and AWS S3 regional endpoints
HuggingFace (model download, online mode only):
- stage_model_artifacts() waits for huggingface.co reachability before
calling download_from_huggingface.sh
- Skipped entirely when AIRGAP_MODE=true (models must be pre-staged)
NVIDIA container-toolkit repos (GPU nodes, online mode only):
- install_nvidia_host_drivers() checks nvidia.github.io reachability
via SSH on each GPU worker before attempting toolkit install
- Skipped when AIRGAP_MODE=true (drivers must be pre-installed)
show_install_plan() updated to reflect air-gap model staging skip message.
TEST_PLAN.md updated with T2-13..T2-15 (code-review) and T3-9..T3-11 (live).
…v var override) Previously AIRGAP_MODE was only set by install_from_airgap_bundle.sh, invisible to customers running k0s_cluster_with_stack.sh directly. Changes: - load_config() now reads cluster.airgap from YAML and exports AIRGAP_MODE; env var AIRGAP_MODE=true still takes precedence (override without editing file) - k0s-cluster-config.yaml: add commented cluster.airgap field with full explanation of what it controls and when to use it - K0S_README.md: new "Air-Gap Mode" section between General and URL Override tables — explains the YAML field, env var, precedence rules, and a 3-row decision table (YAML / env var / neither) - AIRGAP.md: Step 5 now opens with "Configure air-gap mode in your config" showing the cluster.airgap: true snippet; explains why it matters, when install_from_airgap_bundle.sh makes it optional, and updates the env var reference row for AIRGAP_MODE from "reserved for future use" to accurate desc
Adds full air-gap coverage for the OS packages that GPU worker nodes
require (EPEL, DKMS, CUDA repo, nvidia-container-toolkit, python3-pyyaml):
k0s_cluster_with_stack.sh:
- AIRGAP_MODE=true now hard-fails with a clear error when nvidia-smi is
absent on a GPU node instead of attempting internet package downloads
that would time out silently
- EPEL_RPM_URL_OVERRIDE, CUDA_REPO_URL_OVERRIDE, NVIDIA_CTK_REPO_URL_OVERRIDE
env vars added (${VAR:-default} pattern) so customers with a local HTTP
mirror can redirect each repo URL without editing the script
- AIRGAP_PYYAML_WHEEL_PATH: when set, all nodes install pyyaml from the
bundled wheel via pip3 --no-index instead of dnf/apt
prepare_airgap_bundle.sh:
- New packages/ section: downloads EPEL release RPM, CUDA .repo file,
nvidia-container-toolkit .repo file, and PyYAML wheel into the bundle
- New --gpu-os flag (rhel9 | rhel10 | amzn2023) selects the correct
EPEL major and CUDA repo file for the target GPU node OS
- AIRGAP_PYYAML_WHEEL_PATH exported in airgap-env.sh
install_from_airgap_bundle.sh:
- Exports AIRGAP_PYYAML_WHEEL_PATH from bundle at runtime
- Help text updated with GPU node package strategy notes
AIRGAP.md:
- New "GPU Node OS Packages" section: package table, Strategy 1
(pre-install), Strategy 2 (local RPM mirror), Strategy 3 (partial
air-gap), and env-var reference for all 4 new variables
…air-gapped deployments
…ipts Only RHEL 9 (and compatible: Rocky 9, AlmaLinux 9, CentOS Stream 9) is tested and supported. RHEL 10, Amazon Linux 2023, and Debian/Ubuntu code paths are kept in the installer for internal testing but are now gated: k0s_cluster_with_stack.sh: - New _check_node_os() helper: SSHes to each node, reads /etc/os-release, errors if OS family is not RHEL 9. Set FORCE_UNSUPPORTED_OS=1 to downgrade to a warning and continue (for internal testing only). - Called in prepare_nodes_for_k0s (all nodes) and _install_nvidia_on_node (GPU workers) before any package install is attempted. prepare_airgap_bundle.sh: - Added OS validation gate after arg parsing: errors if GPU_NODE_OS != rhel9. - --gpu-os help text and examples updated to reflect rhel9-only support. Docs (K0S_README.md, K0S_QUICKSTART.md, AIRGAP.md, DEPLOYMENT_GUIDE.md): - All RHEL 10, Amazon Linux 2023, Debian/Ubuntu OS claims removed. - Supported OS stated as: RHEL 9 (Rocky 9, AlmaLinux 9, CentOS Stream 9). - Internet dependency tables updated to rhel9 URLs only.
…GiB RAM, 100 Gbps per node)
…gram colors - TEST_PLAN.md: add 16 T1 tests (T1-14..T1-29), 6 T2 tests (T2-16..T2-21), 7 T3 tests (T3-12..T3-18), visual review section for Mermaid diagrams, and 13 new implementation chart rows covering OS gate, airgap GPU packages, URL overrides, DEPLOYMENT_GUIDE, staging requirements, and GPU spec changes - DEPLOYMENT_GUIDE.md: replace unreadable sequenceDiagram Install Flow (dark rect backgrounds with invisible text) with a flowchart TD using explicit light backgrounds and dark text via classDef; fix remaining dark node fills
Updated all references across DEPLOYMENT_GUIDE.md, AIRGAP.md, and K0S_QUICKSTART.md — flowchart node, transfer diagram, staging section, disk requirement table, and internet bandwidth note.
Covers every error and warning path in k0s_cluster_with_stack.sh: preflight failures, SSH/sudo issues, OS compatibility gate, k0s bootstrap, node join/labeling, NVIDIA driver install, Helm chart failures, model staging, object store credentials, MetalLB, AIPlatform CR, air-gap specific errors, config validation, and delete/clean-all failures. Includes diagnostic command reference and first-steps checklist.
Only RHEL 9 is now claimed as supported. Updated error messages in k0s_cluster_with_stack.sh and prepare_airgap_bundle.sh, and removed all Rocky/AlmaLinux/CentOS mentions from K0S_README, K0S_QUICKSTART, TROUBLESHOOTING, and TEST_PLAN. OS detection code kept for internal testing (same approach as RHEL 10 / AL2023 removal).
Covers finding the Splunk Web URL (NodePort, LoadBalancer, port-forward), installing the app via the Splunk UI, air-gapped install via kubectl cp, verifying installation via Standalone status and REST API, configuring the SAIA API endpoint via UI and splunkaiassistant.conf, and troubleshooting common failures.
…LOYMENT_GUIDE - K0S_README.md: add full 'Splunk AI Assistant App' section covering URL discovery, UI install, air-gapped kubectl cp install, verification (deployStatus table), splunkaiassistant.conf config, and troubleshooting - DEPLOYMENT_GUIDE.md: add concise 'Install the Splunk AI Assistant App' section with 4 key steps and a link to K0S_README for full details - Delete SAIA_APP_INSTALL.md (content merged into the above two files)
- Delete AIRGAP.md and K0S_QUICKSTART.md - K0S_README.md becomes the single reference doc: absorbs full air-gap guide (5 phases, GPU strategies, env var reference), S3 bucket directory layout, staging machine requirements, manual upload scripts table, additional utilities, detailed GPU hardware spec - DEPLOYMENT_GUIDE.md keeps all step-by-step deployment content including full air-gap phase detail; config section replaced with reference link to K0S_README; Cluster Architecture and Network Ports diagrams moved - TROUBLESHOOTING.md: update AIRGAP.md cross-links to K0S_README - TEST_PLAN.md: update all test cases referencing deleted files - Mark HF_TOKEN as optional for the current release across all docs
…e link to K0S_README
P1: fix ssh typo in troubleshooting tree, define SAIA on first use, add app tgz provenance, add install time estimates, link to TROUBLESHOOTING.md P2: add namespace quick-reference table, expected output samples, describe diagnose bundle contents, add validate as first troubleshooting step, clarify non-idempotent steps, add version compatibility + licensing note, add upgrade section to Common Operations P3: minimum viable topology note, object storage sizing guidance
- Fix step_ok unconditionally clobbering step_fail in install summary - Fix curl -sf rejecting valid 301/403 responses in object store check - Remove redundant file:// curl in join_workers; prepare_nodes_for_k0s handles both paths - Remove non-existent 'upgrade' subcommand from install_from_airgap_bundle.sh help - Fix misleading 'Sort oldest-first' comment in log rotation (ls -1t is newest-first) - Replace warn-and-continue chart path check with glob resolver that fails fast - Expand checksums to cover all bundle files, not just binaries/manifests/charts - Fix TROUBLESHOOTING.md: remove explicit unsupported OS names (RHEL 10, AL2023, Rocky, AlmaLinux) - Fix TEST_PLAN.md: correct T1-5 YAML field and README target, T1-6 grep -F for braces, T1-8 anchor stripping, T1-29 mermaid count
…LOYMENT_GUIDE T3-16 previously ran a full cluster install to test URL overrides, requiring internet for all non-GPU dependencies. Rewritten to call install_nvidia_host_drivers directly with a local python3 HTTP server as the mirror — no internet needed. DEPLOYMENT_GUIDE scenario table now links each row's Use column to the corresponding section anchor. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The two air-gapped rows ("install machine online, cluster isolated" and
"everything offline") follow identical steps — the only difference is which
machine runs prepare_airgap_bundle.sh. Merged into a single row to avoid
implying a different deployment path exists.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
92acbe2 to
dee3d7d
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Related Issues
Type of Change
Changes Made
Testing Performed
make test)make lint)Test Environment
Test Steps
Documentation
Checklist
Breaking Changes
Impact:
Migration Path:
Screenshots/Recordings
Additional Notes
Reviewer Notes
Please pay special attention to:
Commit Message Convention: This PR follows Conventional Commits