Skip to content

feat:Make customers experience better#100

Merged
kupratyu-splunk merged 27 commits into
mainfrom
make_customers_exp_better
Jun 15, 2026
Merged

feat:Make customers experience better#100
kupratyu-splunk merged 27 commits into
mainfrom
make_customers_exp_better

Conversation

@kupratyu-splunk

Copy link
Copy Markdown
Collaborator

Description

Related Issues

  • Related to #

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test improvement
  • CI/CD improvement
  • Chore (dependency updates, etc.)

Changes Made

Testing Performed

  • Unit tests pass (make test)
  • Linting passes (make lint)
  • Integration tests pass (if applicable)
  • E2E tests pass (if applicable)
  • Manual testing performed

Test Environment

  • Kubernetes Version:
  • Cloud Provider:
  • Deployment Method:

Test Steps

Documentation

  • Updated inline code comments
  • Updated README.md (if adding features)
  • Updated API documentation
  • Updated deployment guides
  • Updated CHANGELOG.md
  • No documentation needed

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published
  • I have updated the Helm chart version (if applicable)
  • I have updated CRD schemas (if applicable)

Breaking Changes

Impact:

Migration Path:

Screenshots/Recordings

Additional Notes

Reviewer Notes

Please pay special attention to:


Commit Message Convention: This PR follows Conventional Commits

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR significantly expands the tools/cluster_setup/ installer UX and adds a documented air‑gapped installation workflow for k0s-based customer deployments.

Changes:

  • Add an air-gap bundle workflow (bundle prep + offline installer runner) with a dedicated guide and a detailed test plan.
  • Improve k0s_cluster_with_stack.sh operator experience (timestamps, log rotation, install plan confirmation, dependency waits, step/phase progress markers, validate + diagnose subcommands).
  • Update docs/configs to reflect supported GPU tiers (removing H100 from k0s docs) and to document the staged model set.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
tools/cluster_setup/TEST_PLAN.md Adds a multi-tier test plan covering syntax, docs consistency, prompts, air-gap workflow, and live-cluster checks.
tools/cluster_setup/prepare_airgap_bundle.sh New script to build a transferable tarball of binaries/manifests/charts plus metadata.
tools/cluster_setup/install_from_airgap_bundle.sh New script to extract/verify bundle, set env overrides, and invoke the main installer offline.
tools/cluster_setup/AIRGAP.md New end-to-end documentation for disconnected installs and override variables.
tools/cluster_setup/k0s_cluster_with_stack.sh Adds logging UX, prompts, dependency waits, progress tracking, and new validate/diagnose commands; wires env overrides for several downloads.
tools/cluster_setup/k0s-cluster-config.yaml Documents cluster.airgap and removes the H100 comment from the accelerator type.
tools/cluster_setup/K0S_README.md Adds AIRGAP guide links, documents air-gap mode + overrides, updates accelerator tier docs, and lists staged models.
tools/cluster_setup/K0S_QUICKSTART.md Removes H100 references and updates accelerator tier docs.
tools/cluster_setup/EKS_README.md Documents staged models during EKS install flow.
tools/artifacts_download_upload_scripts/README.md Updates the documented set of pre-configured models (removes llama entries; adds Gemma/GPT OSS).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tools/cluster_setup/k0s_cluster_with_stack.sh Outdated
Comment thread tools/cluster_setup/k0s_cluster_with_stack.sh
Comment thread tools/cluster_setup/k0s_cluster_with_stack.sh Outdated
Comment thread tools/cluster_setup/install_from_airgap_bundle.sh Outdated
Comment thread tools/cluster_setup/k0s_cluster_with_stack.sh Outdated
Comment thread tools/cluster_setup/install_from_airgap_bundle.sh Outdated
Comment thread tools/cluster_setup/prepare_airgap_bundle.sh
- Remove all H100/p5 references from k0s-specific files (K0S_QUICKSTART.md,
  K0S_README.md, k0s-cluster-config.yaml, k0s_cluster_with_stack.sh);
  EKS H100/Capacity Block path is intentionally kept as it has real code support
- Add gemma-4-31b-it and full model staging table to K0S_README.md,
  EKS_README.md, and artifacts/README.md to address missing Gemma references
- Sync artifacts/README.md model list to match model_artifacts_configs.yaml
  (removes stale llama references, adds gemma-4-31b-it and gpt-oss-20b)
- prepare_airgap_bundle.sh: downloads k0s binary, yq, Helm charts (prometheus,
  otel, kuberay, metallb), and static manifests (cert-manager, local-path-
  provisioner, nvidia-device-plugin) into a versioned tar.gz bundle with
  checksums; resolves unpinned chart versions at bundle time; prints container
  image list and model-weight staging instructions

- install_from_airgap_bundle.sh: extracts bundle, verifies checksums, installs
  k0s/yq binaries, sets up local Helm chart repo, exports env-var overrides,
  then delegates to k0s_cluster_with_stack.sh

- k0s_cluster_with_stack.sh: replace every hardcoded internet URL with
  ${VAR:-default} overrides (K0S_INSTALL_URL, YQ_DOWNLOAD_URL,
  CERT_MANAGER_MANIFEST_URL, LOCAL_PATH_MANIFEST_URL,
  NVIDIA_DEVICE_PLUGIN_MANIFEST_URL, PROMETHEUS_CHART_PATH, OTEL_CHART_PATH,
  KUBERAY_CHART_PATH, METALLB_CHART_PATH) so air-gap installs work without
  modifying installer logic
- AIRGAP.md: complete step-by-step air-gap guide covering bundle prep,
  image mirroring, model staging, file transfer, and install; includes
  full env-var reference table and troubleshooting section

- prepare_airgap_bundle.sh: add --help flag documenting all flags,
  env-var overrides, bundle contents, and next-steps workflow

- install_from_airgap_bundle.sh: add --help flag documenting all flags,
  what the script does (7 steps), env-var overrides set automatically,
  and manual-use instructions for advanced users

- K0S_README.md: expand Environment Variables section with air-gap URL
  override table; replace vague Air-Gapped Deployment section with
  concrete quick-start commands pointing at AIRGAP.md; add ToC callout
…rm prompt, validate/diagnose subcommands

P0 — Timestamps on every log line ([YYYY-MM-DD HH:MM:SS INFO/WARN/ERROR])
P0 — err() now prints log file path and 'run diagnose' hint on every error exit
P0 — delete/clean-all now require typing the cluster name to confirm (bypass: AUTO_APPROVE=true)
P0 — Log rotation: keep last 10 installer log files, delete older ones automatically

P1 — show_install_plan(): pre-install summary table (nodes, registry, object store, steps)
     with yes/no confirmation gate before any changes are made
P1 — wait_for_dependency(): interactive pause-and-retry with 30s auto-advance and
     Enter-to-retry; exits clearly after configurable max_wait
P1 — Step progress tracker (step_start/step_ok/step_fail/step_skip) wired into
     main_install; show_step_summary() prints final pass/fail table at end of install
P1 — phase_start/phase_end section markers for grep-friendly log phases

P2 — need() enriched with per-tool install instructions (kubectl, helm, yq, jq, ssh, curl, aws, git)
P2 — validate subcommand: config completeness checker (IPs, credentials, file paths) — safe, no-op
P2 — diagnose subcommand: collects cluster state, pod logs, events, installer logs into
     a tar.gz support bundle; redacts credentials from config copy
…ouch-points

Object store (always external, air-gap or not):
  - ensure_s3compat_credentials() now calls wait_for_dependency() before
    creating the minio-credentials Secret; probes the endpoint with curl
    and retries every 30s for up to 5 min — works for MinIO, SeaweedFS,
    and AWS S3 regional endpoints

HuggingFace (model download, online mode only):
  - stage_model_artifacts() waits for huggingface.co reachability before
    calling download_from_huggingface.sh
  - Skipped entirely when AIRGAP_MODE=true (models must be pre-staged)

NVIDIA container-toolkit repos (GPU nodes, online mode only):
  - install_nvidia_host_drivers() checks nvidia.github.io reachability
    via SSH on each GPU worker before attempting toolkit install
  - Skipped when AIRGAP_MODE=true (drivers must be pre-installed)

show_install_plan() updated to reflect air-gap model staging skip message.
TEST_PLAN.md updated with T2-13..T2-15 (code-review) and T3-9..T3-11 (live).
…v var override)

Previously AIRGAP_MODE was only set by install_from_airgap_bundle.sh, invisible
to customers running k0s_cluster_with_stack.sh directly.

Changes:
- load_config() now reads cluster.airgap from YAML and exports AIRGAP_MODE;
  env var AIRGAP_MODE=true still takes precedence (override without editing file)
- k0s-cluster-config.yaml: add commented cluster.airgap field with full
  explanation of what it controls and when to use it
- K0S_README.md: new "Air-Gap Mode" section between General and URL Override
  tables — explains the YAML field, env var, precedence rules, and a 3-row
  decision table (YAML / env var / neither)
- AIRGAP.md: Step 5 now opens with "Configure air-gap mode in your config"
  showing the cluster.airgap: true snippet; explains why it matters, when
  install_from_airgap_bundle.sh makes it optional, and updates the env var
  reference row for AIRGAP_MODE from "reserved for future use" to accurate desc
Adds full air-gap coverage for the OS packages that GPU worker nodes
require (EPEL, DKMS, CUDA repo, nvidia-container-toolkit, python3-pyyaml):

k0s_cluster_with_stack.sh:
- AIRGAP_MODE=true now hard-fails with a clear error when nvidia-smi is
  absent on a GPU node instead of attempting internet package downloads
  that would time out silently
- EPEL_RPM_URL_OVERRIDE, CUDA_REPO_URL_OVERRIDE, NVIDIA_CTK_REPO_URL_OVERRIDE
  env vars added (${VAR:-default} pattern) so customers with a local HTTP
  mirror can redirect each repo URL without editing the script
- AIRGAP_PYYAML_WHEEL_PATH: when set, all nodes install pyyaml from the
  bundled wheel via pip3 --no-index instead of dnf/apt

prepare_airgap_bundle.sh:
- New packages/ section: downloads EPEL release RPM, CUDA .repo file,
  nvidia-container-toolkit .repo file, and PyYAML wheel into the bundle
- New --gpu-os flag (rhel9 | rhel10 | amzn2023) selects the correct
  EPEL major and CUDA repo file for the target GPU node OS
- AIRGAP_PYYAML_WHEEL_PATH exported in airgap-env.sh

install_from_airgap_bundle.sh:
- Exports AIRGAP_PYYAML_WHEEL_PATH from bundle at runtime
- Help text updated with GPU node package strategy notes

AIRGAP.md:
- New "GPU Node OS Packages" section: package table, Strategy 1
  (pre-install), Strategy 2 (local RPM mirror), Strategy 3 (partial
  air-gap), and env-var reference for all 4 new variables
…ipts

Only RHEL 9 (and compatible: Rocky 9, AlmaLinux 9, CentOS Stream 9) is
tested and supported. RHEL 10, Amazon Linux 2023, and Debian/Ubuntu code
paths are kept in the installer for internal testing but are now gated:

k0s_cluster_with_stack.sh:
- New _check_node_os() helper: SSHes to each node, reads /etc/os-release,
  errors if OS family is not RHEL 9. Set FORCE_UNSUPPORTED_OS=1 to
  downgrade to a warning and continue (for internal testing only).
- Called in prepare_nodes_for_k0s (all nodes) and _install_nvidia_on_node
  (GPU workers) before any package install is attempted.

prepare_airgap_bundle.sh:
- Added OS validation gate after arg parsing: errors if GPU_NODE_OS != rhel9.
- --gpu-os help text and examples updated to reflect rhel9-only support.

Docs (K0S_README.md, K0S_QUICKSTART.md, AIRGAP.md, DEPLOYMENT_GUIDE.md):
- All RHEL 10, Amazon Linux 2023, Debian/Ubuntu OS claims removed.
- Supported OS stated as: RHEL 9 (Rocky 9, AlmaLinux 9, CentOS Stream 9).
- Internet dependency tables updated to rhel9 URLs only.
…gram colors

- TEST_PLAN.md: add 16 T1 tests (T1-14..T1-29), 6 T2 tests (T2-16..T2-21),
  7 T3 tests (T3-12..T3-18), visual review section for Mermaid diagrams,
  and 13 new implementation chart rows covering OS gate, airgap GPU packages,
  URL overrides, DEPLOYMENT_GUIDE, staging requirements, and GPU spec changes
- DEPLOYMENT_GUIDE.md: replace unreadable sequenceDiagram Install Flow (dark
  rect backgrounds with invisible text) with a flowchart TD using explicit
  light backgrounds and dark text via classDef; fix remaining dark node fills
Updated all references across DEPLOYMENT_GUIDE.md, AIRGAP.md, and
K0S_QUICKSTART.md — flowchart node, transfer diagram, staging section,
disk requirement table, and internet bandwidth note.
Covers every error and warning path in k0s_cluster_with_stack.sh:
preflight failures, SSH/sudo issues, OS compatibility gate, k0s bootstrap,
node join/labeling, NVIDIA driver install, Helm chart failures, model
staging, object store credentials, MetalLB, AIPlatform CR, air-gap
specific errors, config validation, and delete/clean-all failures.
Includes diagnostic command reference and first-steps checklist.
Only RHEL 9 is now claimed as supported. Updated error messages in
k0s_cluster_with_stack.sh and prepare_airgap_bundle.sh, and removed
all Rocky/AlmaLinux/CentOS mentions from K0S_README, K0S_QUICKSTART,
TROUBLESHOOTING, and TEST_PLAN. OS detection code kept for internal
testing (same approach as RHEL 10 / AL2023 removal).
Covers finding the Splunk Web URL (NodePort, LoadBalancer, port-forward),
installing the app via the Splunk UI, air-gapped install via kubectl cp,
verifying installation via Standalone status and REST API, configuring
the SAIA API endpoint via UI and splunkaiassistant.conf, and
troubleshooting common failures.
…LOYMENT_GUIDE

- K0S_README.md: add full 'Splunk AI Assistant App' section covering URL
  discovery, UI install, air-gapped kubectl cp install, verification
  (deployStatus table), splunkaiassistant.conf config, and troubleshooting
- DEPLOYMENT_GUIDE.md: add concise 'Install the Splunk AI Assistant App'
  section with 4 key steps and a link to K0S_README for full details
- Delete SAIA_APP_INSTALL.md (content merged into the above two files)
- Delete AIRGAP.md and K0S_QUICKSTART.md
- K0S_README.md becomes the single reference doc: absorbs full air-gap
  guide (5 phases, GPU strategies, env var reference), S3 bucket directory
  layout, staging machine requirements, manual upload scripts table,
  additional utilities, detailed GPU hardware spec
- DEPLOYMENT_GUIDE.md keeps all step-by-step deployment content including
  full air-gap phase detail; config section replaced with reference link
  to K0S_README; Cluster Architecture and Network Ports diagrams moved
- TROUBLESHOOTING.md: update AIRGAP.md cross-links to K0S_README
- TEST_PLAN.md: update all test cases referencing deleted files
- Mark HF_TOKEN as optional for the current release across all docs
kupratyu-splunk and others added 4 commits June 15, 2026 11:28
P1: fix ssh typo in troubleshooting tree, define SAIA on first use,
add app tgz provenance, add install time estimates, link to TROUBLESHOOTING.md

P2: add namespace quick-reference table, expected output samples,
describe diagnose bundle contents, add validate as first troubleshooting
step, clarify non-idempotent steps, add version compatibility + licensing
note, add upgrade section to Common Operations

P3: minimum viable topology note, object storage sizing guidance
- Fix step_ok unconditionally clobbering step_fail in install summary
- Fix curl -sf rejecting valid 301/403 responses in object store check
- Remove redundant file:// curl in join_workers; prepare_nodes_for_k0s handles both paths
- Remove non-existent 'upgrade' subcommand from install_from_airgap_bundle.sh help
- Fix misleading 'Sort oldest-first' comment in log rotation (ls -1t is newest-first)
- Replace warn-and-continue chart path check with glob resolver that fails fast
- Expand checksums to cover all bundle files, not just binaries/manifests/charts
- Fix TROUBLESHOOTING.md: remove explicit unsupported OS names (RHEL 10, AL2023, Rocky, AlmaLinux)
- Fix TEST_PLAN.md: correct T1-5 YAML field and README target, T1-6 grep -F for braces, T1-8 anchor stripping, T1-29 mermaid count
…LOYMENT_GUIDE

T3-16 previously ran a full cluster install to test URL overrides, requiring
internet for all non-GPU dependencies. Rewritten to call install_nvidia_host_drivers
directly with a local python3 HTTP server as the mirror — no internet needed.

DEPLOYMENT_GUIDE scenario table now links each row's Use column to the
corresponding section anchor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The two air-gapped rows ("install machine online, cluster isolated" and
"everything offline") follow identical steps — the only difference is which
machine runs prepare_airgap_bundle.sh. Merged into a single row to avoid
implying a different deployment path exists.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kupratyu-splunk kupratyu-splunk force-pushed the make_customers_exp_better branch from 92acbe2 to dee3d7d Compare June 15, 2026 05:58
@kupratyu-splunk kupratyu-splunk self-assigned this Jun 15, 2026
@kupratyu-splunk kupratyu-splunk merged commit 5be2d0e into main Jun 15, 2026
10 checks passed
@kupratyu-splunk kupratyu-splunk deleted the make_customers_exp_better branch June 15, 2026 06:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants