Add k8s prod readiness checklist #1352
Important: Review skipped
Auto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI.

📝 Walkthrough
This PR adds comprehensive Kubernetes production readiness documentation for Redpanda deployments. A new production checklist document is introduced with detailed validation steps covering cluster health, resource configuration, security, storage, monitoring, and operational readiness. Four existing Kubernetes documentation pages are updated with cross-references to the new checklist, creating an integrated guidance flow from requirements through production deployment to readiness validation.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~15–30 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✅ Deploy Preview for redpanda-docs-preview ready!
Built without sensitive environment variables
@JakeSCahill @vuldin Is this PR still a WIP?
Yes it is, I need to focus on this over the upcoming week and hopefully it will be in good shape soon.
8ff9de1 to 5571dbd
This PR is ready for review, thanks! @KavyaShivashankar @JakeSCahill
Actionable comments posted: 1
🧹 Nitpick comments (2)
modules/deploy/pages/redpanda/kubernetes/k-production-checklist.adoc (2)
18-25: SASL credential placeholders are clear but may give the false impression that SASL is required.
The cluster health check command includes SASL flags, but it's not immediately clear that these are optional for clusters without SASL enabled. The note at line 23 addresses this, but users with non-SASL clusters may be confused by seeing SASL flags everywhere.
Consider adding a clarification that SASL flags can be omitted for non-SASL deployments, or indicate this more prominently at the beginning of the critical section.
Suggested improvement (optional):
Add a subsection header before cluster health status explaining credential requirements:
=== Authentication Note

The commands in this section include SASL authentication flags (--user, --pass, --sasl.mechanism). If your cluster does not use SASL authentication, you can omit these flags from all commands.

This helps non-SASL users immediately understand they can simplify commands.
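The "omit the flags for non-SASL clusters" advice can be sketched as a small helper that only appends credentials when they are supplied. This is an illustrative sketch, not part of the PR: the function name `build_rpk_args` is hypothetical, and the flag spellings are copied verbatim from the review comment above rather than verified against any particular rpk release.

```python
from typing import List, Optional, Sequence


def build_rpk_args(
    subcommand: Sequence[str],
    user: Optional[str] = None,
    password: Optional[str] = None,
    mechanism: Optional[str] = None,
) -> List[str]:
    """Return an rpk argument list, adding SASL flags only when credentials are given.

    Flag names (--user, --pass, --sasl.mechanism) are taken from the review
    comment; treat them as assumptions and check them against your rpk version.
    """
    args = ["rpk", *subcommand]
    if user is not None and password is not None:
        args += ["--user", user, "--pass", password]
        if mechanism is not None:
            args += ["--sasl.mechanism", mechanism]
    return args


if __name__ == "__main__":
    # Non-SASL cluster: no authentication flags are emitted.
    print(build_rpk_args(["cluster", "health"]))
    # SASL cluster: placeholder credentials for illustration only.
    print(build_rpk_args(["cluster", "health"], "<username>", "<password>", "SCRAM-SHA-256"))
```

The returned list could be appended to a `kubectl exec ... --` prefix, which keeps the SASL and non-SASL variants of each checklist command in one place.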
559-581: Security configuration sections (TLS, Authentication, Network) lack detail compared to other checks.
While other critical sections provide detailed commands and expected outputs, the TLS, Authentication, and Network Security sections (lines 559-581) only list items without verification commands or expected outputs. This may leave users uncertain about how to validate these critical security configurations.
Consider adding verification commands for TLS and authentication similar to other sections:
**TLS verification:**

[,bash]
----
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster config export | grep -A 5 "kafka_api:"
----

Expected output showing tls enabled on listeners.

This would provide users with actionable verification steps for security configurations.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Jira integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- modules/deploy/pages/redpanda/kubernetes/k-production-checklist.adoc (1 hunks)
- modules/deploy/pages/redpanda/kubernetes/k-production-deployment.adoc (1 hunks)
- modules/deploy/pages/redpanda/kubernetes/k-production-workflow.adoc (1 hunks)
- modules/deploy/pages/redpanda/kubernetes/k-requirements.adoc (1 hunks)
- modules/deploy/partials/high-availability.adoc (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-12T16:18:28.126Z
Learnt from: andrewstucki
Repo: redpanda-data/docs PR: 1514
File: modules/troubleshoot/partials/errors-and-solutions.adoc:618-632
Timestamp: 2025-12-12T16:18:28.126Z
Learning: In Redpanda ShadowLink CRD usage, do not copy CA certificates into pods via kubectl cp because they won't persist across pod restarts. Store CA certs in Kubernetes Secrets and reference them in the ShadowLink's staticConfiguration block (or inline the cert data in that block) to ensure persistence and proper secret management.
Applied to files:
- modules/deploy/pages/redpanda/kubernetes/k-requirements.adoc
- modules/deploy/pages/redpanda/kubernetes/k-production-deployment.adoc
- modules/deploy/pages/redpanda/kubernetes/k-production-checklist.adoc
- modules/deploy/pages/redpanda/kubernetes/k-production-workflow.adoc
- modules/deploy/partials/high-availability.adoc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Redirect rules - redpanda-docs-preview
- GitHub Check: Header rules - redpanda-docs-preview
- GitHub Check: Pages changed - redpanda-docs-preview
🔇 Additional comments (16)
modules/deploy/partials/high-availability.adoc (1)
534-541: Cross-references correctly integrated into documentation flow.
The conditional Kubernetes production readiness link and Tiered Storage reference are properly added. Line 535 correctly targets the new checklist document using standard xref syntax.
modules/deploy/pages/redpanda/kubernetes/k-production-workflow.adoc (1)
13-13: Logical workflow addition with clear role assignment.
Adding validation as a final step involving all users improves workflow completeness and ensures production readiness checks are part of the standard deployment process. The step description is clear and actionable.
modules/deploy/pages/redpanda/kubernetes/k-production-deployment.adoc (1)
780-782: Clear next-step guidance directing users to production validation.
The addition properly positions production readiness validation immediately after deployment. The xref syntax is correct and the descriptive text clearly explains the checklist's purpose. This creates a natural workflow progression.
modules/deploy/pages/redpanda/kubernetes/k-requirements.adoc (1)
14-17: Improved guidance flow from requirements to deployment to validation.
Restructuring the next steps section with explicit guidance to both deploy and validate creates a clear path forward. The bullet structure and xref syntax are consistent with documentation standards.
modules/deploy/pages/redpanda/kubernetes/k-production-checklist.adoc (12)
37-47: Pod listing command and expected output are correct.
The kubectl get pods command and expected output showing three running brokers are accurate and match standard Kubernetes conventions.
51-83: Tab structure for Helm/Operator deployment comparison is correct.
The tabs follow standard AsciiDoc conventions with proper syntax for Helm and Operator options. Content is clearly separated and readable.
144-235: Version pinning section comprehensively covers importance and implementation.
The section explains why version pinning matters with clear examples for both Helm and Operator. Examples show realistic version tags (e.g., v24.2.4, v2.4.5) and include verification commands with expected output. The warning about avoiding `latest` tags and version ranges is crucial.
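The distinction between a pinned tag and a floating one (`latest` or a version range) can be made concrete with a small check. This is a minimal sketch under an assumption of my own: "pinned" is taken to mean a full `major.minor.patch` tag with an optional leading `v`, matching the example tags quoted above. The helper name `is_pinned` is hypothetical and is not a rule stated in the checklist itself.

```python
import re

# Assumption: a pinned tag is an exact major.minor.patch version, optionally
# prefixed with "v" (for example v24.2.4). Anything else is treated as floating.
_PINNED_TAG = re.compile(r"^v?\d+\.\d+\.\d+$")


def is_pinned(tag: str) -> bool:
    """Return True when the tag names one exact version rather than a moving target."""
    return bool(_PINNED_TAG.match(tag.strip()))


if __name__ == "__main__":
    for candidate in ["v24.2.4", "v2.4.5", "latest", "^24.2.0", ">=24.2.0", "24.2"]:
        status = "pinned" if is_pinned(candidate) else "NOT pinned"
        print(f"{candidate!r}: {status}")
```

A check like this could run in CI against the image tag and chart version in a values file, so that a floating tag is caught before it reaches production.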
823-848: Continuous data balancing section properly explains enterprise feature.
The section clearly states this feature should be enabled for "all licensed production clusters" and explains what it does. The command and expected output are correct. Cross-reference to xref:manage:cluster-maintenance/continuous-data-balancing.adoc is appropriate.
888-927: Debug bundle generation section provides excellent proactive validation.
This section thoughtfully includes debug bundle generation as a test to verify permissions and configuration before issues occur. The explanation of what bundles collect (lines 911-916) and common issues (lines 918-923) provides valuable context. The xref to diagnostics bundle docs is helpful.
1092-1128: Monitoring and observability section appropriately covers key observability areas.
The section covers Prometheus setup, Grafana dashboards, alerting, log aggregation, and health checks at a high level. While less detailed than critical checks, this is appropriate for a checklist that points to items to implement. The structure with bullet points makes it scannable.
1130-1162: Operational readiness section covers important governance and procedure aspects.
Sections on deployment automation, non-production environments, upgrade procedures, incident response, and resource quotas address crucial operational readiness areas. While brief, these appropriately serve as reminders of what should be in place before production deployment.
1164-1172: Next steps section provides logical progression after checklist completion.
The five post-checklist activities (performance testing, DR testing, security review, operational validation, documentation) create a clear path forward. This helps users understand that completing the checklist is not the end of preparation but the beginning of operational validation.
1-9: Document metadata and reference to Linux checklist are clear.
The header provides appropriate description and context links. The note at line 8 directing Linux users to the parallel Linux checklist is helpful for users who might be reading in the wrong context.
661-688: Operator CRDs validation section properly emphasizes criticality of CRD setup.
This section appropriately highlights that missing or incompatible CRDs is a CRITICAL issue that will break Operator functionality. The list of required CRDs (lines 681-686) is clear, and the consequences of missing CRDs (line 688) are well explained.
1-1172: Comprehensive production checklist appropriately structured for Kubernetes deployments.
The new k-production-checklist.adoc file provides extensive, well-organized guidance covering critical requirements, recommended enhancements, observability, and operational readiness. The file successfully:
- Separates critical (must-have) from recommended (should-have) checks
- Provides actionable verification commands for most checks
- Shows expected outputs to help users validate results
- Includes tabs for both Helm and Operator deployment methods
- Uses appropriate callouts (NOTE, WARNING) for important guidance
- Cross-references remediation and detailed documentation
The file effectively serves as the central validation hub that the workflow documents reference.
14-25: The expected output for `rpk cluster health` should match the actual command format.
The command on line 20 is correct, but the expected output should show the structured format that the command actually produces. Update the output to include the labeled fields such as `Healthy:`, `Controller ID:`, `All nodes:`, `Nodes down:`, `Leaderless partitions:`, and `Under-replicated partitions:` with example values (e.g., `Healthy: true`), rather than generic placeholders.
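To show why labeled fields make the expected output easier to validate, here is a minimal sketch that parses `Label: value` lines into a dictionary and checks the `Healthy` field. The sample text is hypothetical: it uses only the field names listed in the comment above with made-up values, and it was not captured from a real cluster, so the exact layout of real `rpk cluster health` output may differ.

```python
from typing import Dict

# Hypothetical sample built from the field names in the review comment.
SAMPLE_OUTPUT = """\
Healthy: true
Controller ID: 0
All nodes: [0 1 2]
Nodes down: []
Leaderless partitions: []
Under-replicated partitions: []
"""


def parse_health(text: str) -> Dict[str, str]:
    """Split each 'Label: value' line on the first colon; skip lines without a value."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def is_healthy(text: str) -> bool:
    """Return True only when the Healthy field is present and equals 'true'."""
    return parse_health(text).get("Healthy") == "true"


if __name__ == "__main__":
    print(parse_health(SAMPLE_OUTPUT))
    print("healthy:", is_healthy(SAMPLE_OUTPUT))
```

With generic placeholders in the documented output, a reader has nothing to compare against; with labeled fields, even a simple script like this can confirm the result.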
5571dbd to 0519a3d
I pushed an update to handle the nitpick comments from automated review.

Description
Adds k8s prod readiness checklist.
Page previews
https://deploy-preview-1352--redpanda-docs-preview.netlify.app/current/deploy/redpanda/kubernetes/k-production-checklist/
Checks