Add k8s prod readiness checklist #1352

vuldin · 2025-09-09T13:19:04Z

Description

Adds k8s prod readiness checklist.

Page previews

https://deploy-preview-1352--redpanda-docs-preview.netlify.app/current/deploy/redpanda/kubernetes/k-production-checklist/

Checks

New feature
Content gap
Support Follow-up
Small fix (typos, links, copyedits, etc)

coderabbitai · 2025-09-09T13:19:12Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This PR adds comprehensive Kubernetes production readiness documentation for Redpanda deployments. A new production checklist document is introduced with detailed validation steps covering cluster health, resource configuration, security, storage, monitoring, and operational readiness. Four existing Kubernetes documentation pages are updated with cross-references to the new checklist, creating an integrated guidance flow from requirements through production deployment to readiness validation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15–30 minutes

Areas requiring attention:
- Verify all xref links point to correct document paths and sections
- Validate that kubectl, rpk, and helm example commands in the new checklist are syntactically correct and produce expected outputs
- Confirm the production checklist content aligns with current Redpanda best practices and Kubernetes deployment patterns
- Check that the updated cross-references in k-requirements.adoc, k-production-workflow.adoc, and high-availability.adoc maintain proper document structure and logical flow
- Ensure no broken inter-document references or orphaned sections
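The xref check in the list above could be partly automated. The following is a minimal sketch, assuming the pages use Antora-style `xref:target[label]` macros like the one quoted later in this review. The helper names, the sample text, and the `known` inventory are hypothetical and not part of the PR.

```python
import re

# Matches an Antora-style xref macro and captures its target, for example
# xref:manage:cluster-maintenance/continuous-data-balancing.adoc[Label]
XREF_PATTERN = re.compile(r"xref:([^\[\s]+)\[")


def extract_xref_targets(asciidoc_text):
    """Return every xref target in the text, with any #fragment removed."""
    targets = []
    for match in XREF_PATTERN.finditer(asciidoc_text):
        target = match.group(1).split("#", 1)[0]
        targets.append(target)
    return targets


def find_unknown_xrefs(asciidoc_text, known_targets):
    """Return xref targets that are not in the set of known page targets."""
    return [t for t in extract_xref_targets(asciidoc_text) if t not in known_targets]


# Hypothetical sample content and page inventory, for illustration only.
sample = (
    "See xref:manage:cluster-maintenance/continuous-data-balancing.adoc[Continuous Data Balancing] "
    "and xref:deploy:missing-page.adoc#section[a page that does not exist]."
)
known = {"manage:cluster-maintenance/continuous-data-balancing.adoc"}
print(find_unknown_xrefs(sample, known))
```

In practice the set of known targets would be built from the repository's page files; this sketch only shows the matching step and does not replace the site build's own link validation.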

Possibly related PRs

v25.2.x release of Redpanda Operator #1271 — Modifies the same k-production-deployment.adoc file to integrate production readiness validation references
fix DR links #1481 — Updates modules/deploy/partials/high-availability.adoc to add and refine cross-reference links for related guidance

Suggested reviewers

Feediver1
JakeSCahill
david-yu

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description is largely incomplete. It lacks the JIRA ticket reference and review deadline, and it does not provide page previews for the other modified files (k-production-deployment.adoc, k-production-workflow.adoc, k-requirements.adoc, high-availability.adoc). Only one of the five modified files has a preview URL.	Add the missing JIRA ticket reference and review deadline, and include preview URLs for all modified documentation files to enable thorough review of cross-references and related changes.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Add k8s prod readiness checklist' accurately summarizes the main change, which is adding a comprehensive Kubernetes production readiness checklist document to the Redpanda documentation.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

netlify · 2025-09-09T13:19:26Z

✅ Deploy Preview for redpanda-docs-preview ready!

Built without sensitive environment variables

Name	Link
🔨 Latest commit	`aa80805`
🔍 Latest deploy log	https://app.netlify.com/projects/redpanda-docs-preview/deploys/69440e00aaf8c000087720ab
😎 Deploy Preview	https://deploy-preview-1352--redpanda-docs-preview.netlify.app

Feediver1 · 2025-11-19T19:32:34Z

@JakeSCahill @vuldin Is this PR still a WIP?

vuldin · 2025-11-22T04:59:33Z

> @JakeSCahill @vuldin Is this PR still a WIP?

Yes, it is. I need to focus on this over the upcoming week, and hopefully it will be in good shape soon.

vuldin · 2025-12-18T00:07:50Z

This PR is ready for review, thanks! @KavyaShivashankar @JakeSCahill

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

modules/deploy/pages/redpanda/kubernetes/k-production-checklist.adoc (2)
18-25: SASL credential placeholders are clear but may give the false impression that SASL is required.

The cluster health check command includes SASL flags, but it's not immediately clear that these are optional for clusters without SASL enabled. The note at line 23 addresses this, but users with non-SASL clusters may be confused by seeing SASL flags everywhere.

Consider adding a clarification that SASL flags can be omitted for non-SASL deployments, or indicate this more prominently at the beginning of the critical section.

Suggested improvement (optional):

Add a subsection header before cluster health status explaining credential requirements:

=== Authentication Note

The commands in this section include SASL authentication flags (--user, --pass, --sasl.mechanism).
If your cluster does not use SASL authentication, you can omit these flags from all commands.

This helps non-SASL users immediately understand they can simplify commands.
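To illustrate the point that the SASL flags are optional, here is a minimal sketch of a helper that builds the health-check command with or without authentication flags. The function name and its parameters are hypothetical. The `kubectl exec ... -c redpanda -- rpk ...` shape follows the command quoted elsewhere in this review, and the flag names in the usage example are spelled as this comment names them, which is an assumption to confirm against `rpk cluster health --help` for your rpk version.

```python
def build_health_command(namespace, pod, auth_flags=()):
    """Build the argv for `rpk cluster health` run inside a broker pod.

    `auth_flags` is an optional sequence of extra rpk arguments. Leave it
    empty for clusters that do not use SASL authentication.
    """
    argv = [
        "kubectl", "exec", "-n", namespace, pod, "-c", "redpanda", "--",
        "rpk", "cluster", "health",
    ]
    # Append authentication flags only when the caller supplies them.
    argv.extend(auth_flags)
    return argv


# Non-SASL cluster: no extra flags. Namespace and pod names are placeholders.
print(" ".join(build_health_command("redpanda", "redpanda-0")))

# SASL cluster: flag spellings copied from the review comment (unverified).
sasl_flags = ["--user", "<username>", "--pass", "<password>",
              "--sasl.mechanism", "<mechanism>"]
print(" ".join(build_health_command("redpanda", "redpanda-0", sasl_flags)))
```

Keeping the flags in one optional argument mirrors the suggestion here: the base command stays the same for every cluster, and authentication is layered on only where it applies.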

559-581: Security configuration sections (TLS, Authentication, Network) lack detail compared to other checks.

While other critical sections provide detailed commands and expected outputs, the TLS, Authentication, and Network Security sections (lines 559-581) only list items without verification commands or expected outputs. This may leave users uncertain about how to validate these critical security configurations.

Consider adding verification commands for TLS and authentication similar to other sections:

**TLS verification:**

[,bash]
----
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster config export | grep -A 5 "kafka_api:"
----

Expected output showing tls enabled on listeners.

This would provide users with actionable verification steps for security configurations.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Jira integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3ee30e4 and 5571dbd.

📒 Files selected for processing (5)

modules/deploy/pages/redpanda/kubernetes/k-production-checklist.adoc (1 hunks)
modules/deploy/pages/redpanda/kubernetes/k-production-deployment.adoc (1 hunks)
modules/deploy/pages/redpanda/kubernetes/k-production-workflow.adoc (1 hunks)
modules/deploy/pages/redpanda/kubernetes/k-requirements.adoc (1 hunks)
modules/deploy/partials/high-availability.adoc (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-12-12T16:18:28.126Z

Learnt from: andrewstucki
Repo: redpanda-data/docs PR: 1514
File: modules/troubleshoot/partials/errors-and-solutions.adoc:618-632
Timestamp: 2025-12-12T16:18:28.126Z
Learning: In Redpanda ShadowLink CRD usage, do not copy CA certificates into pods via kubectl cp because they won't persist across pod restarts. Store CA certs in Kubernetes Secrets and reference them in the ShadowLink's staticConfiguration block (or inline the cert data in that block) to ensure persistence and proper secret management.

Applied to files:

modules/deploy/pages/redpanda/kubernetes/k-requirements.adoc
modules/deploy/pages/redpanda/kubernetes/k-production-deployment.adoc
modules/deploy/pages/redpanda/kubernetes/k-production-checklist.adoc
modules/deploy/pages/redpanda/kubernetes/k-production-workflow.adoc
modules/deploy/partials/high-availability.adoc

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: Redirect rules - redpanda-docs-preview
GitHub Check: Header rules - redpanda-docs-preview
GitHub Check: Pages changed - redpanda-docs-preview

🔇 Additional comments (16)

modules/deploy/partials/high-availability.adoc (1)

534-541: Cross-references correctly integrated into documentation flow.

The conditional Kubernetes production readiness link and Tiered Storage reference are properly added. Line 535 correctly targets the new checklist document using standard xref syntax.

modules/deploy/pages/redpanda/kubernetes/k-production-workflow.adoc (1)

13-13: Logical workflow addition with clear role assignment.

Adding validation as a final step involving all users improves workflow completeness and ensures production readiness checks are part of the standard deployment process. The step description is clear and actionable.

modules/deploy/pages/redpanda/kubernetes/k-production-deployment.adoc (1)

780-782: Clear next-step guidance directing users to production validation.

The addition properly positions production readiness validation immediately after deployment. The xref syntax is correct and the descriptive text clearly explains the checklist's purpose. This creates a natural workflow progression.

modules/deploy/pages/redpanda/kubernetes/k-requirements.adoc (1)

14-17: Improved guidance flow from requirements to deployment to validation.

Restructuring the next steps section with explicit guidance to both deploy and validate creates a clear path forward. The bullet structure and xref syntax are consistent with documentation standards.

modules/deploy/pages/redpanda/kubernetes/k-production-checklist.adoc (12)

37-47: Pod listing command and expected output are correct.

The kubectl get pods command and expected output showing three running brokers are accurate and match standard Kubernetes conventions.

51-83: Tab structure for Helm/Operator deployment comparison is correct.

The tabs follow standard AsciiDoc conventions with proper syntax for Helm and Operator options. Content is clearly separated and readable.

144-235: Version pinning section comprehensively covers importance and implementation.

The section explains why version pinning matters with clear examples for both Helm and Operator. Examples show realistic version tags (e.g., v24.2.4, v2.4.5) and include verification commands with expected output. The warning about avoiding latest tags and version ranges is crucial.
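The version pinning rule described here can be sketched as a small check. This is an illustration only: the pattern below treats a tag as pinned when it is an exact three-part version such as the `v24.2.4` example mentioned above, and that definition, along with the function name, is an assumption rather than something taken from the checklist.

```python
import re

# Assumption: a pinned tag is an exact MAJOR.MINOR.PATCH version,
# with an optional leading "v", such as v24.2.4.
PINNED_TAG = re.compile(r"v?\d+\.\d+\.\d+")


def is_pinned(tag):
    """Return True if the tag is an exact version, not 'latest' or a range."""
    return PINNED_TAG.fullmatch(tag) is not None


for tag in ["v24.2.4", "v2.4.5", "latest", "^24.2", "24.x"]:
    print(tag, is_pinned(tag))
```

A check like this could run against rendered Helm values or a Redpanda resource before deployment, but where the tag is read from is left out of the sketch.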

823-848: Continuous data balancing section properly explains enterprise feature.

The section clearly states this feature should be enabled for "all licensed production clusters" and explains what it does. The command and expected output are correct. Cross-reference to xref:manage:cluster-maintenance/continuous-data-balancing.adoc is appropriate.

888-927: Debug bundle generation section provides excellent proactive validation.

This section thoughtfully includes debug bundle generation as a test to verify permissions and configuration before issues occur. The explanation of what bundles collect (lines 911-916) and common issues (lines 918-923) provides valuable context. The xref to diagnostics bundle docs is helpful.

1092-1128: Monitoring and observability section appropriately covers key observability areas.

The section covers Prometheus setup, Grafana dashboards, alerting, log aggregation, and health checks at a high level. While less detailed than critical checks, this is appropriate for a checklist that points to items to implement. The structure with bullet points makes it scannable.

1130-1162: Operational readiness section covers important governance and procedure aspects.

Sections on deployment automation, non-production environments, upgrade procedures, incident response, and resource quotas address crucial operational readiness areas. While brief, these appropriately serve as reminders of what should be in place before production deployment.

1164-1172: Next steps section provides logical progression after checklist completion.

The five post-checklist activities (performance testing, DR testing, security review, operational validation, documentation) create a clear path forward. This helps users understand that completing the checklist is not the end of preparation but the beginning of operational validation.

1-9: Document metadata and reference to Linux checklist are clear.

The header provides appropriate description and context links. The note at line 8 directing Linux users to the parallel Linux checklist is helpful for users who might be reading in the wrong context.

661-688: Operator CRDs validation section properly emphasizes criticality of CRD setup.

This section appropriately highlights that missing or incompatible CRDs are a CRITICAL issue that will break Operator functionality. The list of required CRDs (lines 681-686) is clear, and the consequences of missing CRDs (line 688) are well explained.

1-1172: Comprehensive production checklist appropriately structured for Kubernetes deployments.

The new k-production-checklist.adoc file provides extensive, well-organized guidance covering critical requirements, recommended enhancements, observability, and operational readiness. The file successfully:

Separates critical (must-have) from recommended (should-have) checks

Provides actionable verification commands for most checks

Shows expected outputs to help users validate results

Includes tabs for both Helm and Operator deployment methods

Uses appropriate callouts (NOTE, WARNING) for important guidance

Cross-references remediation and detailed documentation

The file effectively serves as the central validation hub that the workflow documents reference.

14-25: The expected output for rpk cluster health should match the actual command format.

The command on line 20 is correct, but the expected output should show the structured format that the command actually produces. Update the output to include the labeled fields such as Healthy:, Controller ID:, All nodes:, Nodes down:, Leaderless partitions:, and Under-replicated partitions: with example values (e.g., Healthy: true), rather than generic placeholders.
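As a rough illustration of why labeled fields help, the sketch below parses `Label: value` lines into a dictionary and checks the `Healthy` field. The sample text is hypothetical: it is built only from the field labels named in this comment, and real `rpk cluster health` output may include a header or additional fields.

```python
def parse_health_output(text):
    """Parse 'Label: value' lines from health output into a dict."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip headers, separators, and blank lines
        label, value = line.split(":", 1)
        fields[label.strip()] = value.strip()
    return fields


def is_cluster_healthy(text):
    """Return True only when the Healthy field is exactly 'true'."""
    return parse_health_output(text).get("Healthy") == "true"


# Hypothetical sample using the field labels listed in the review comment.
SAMPLE = """\
Healthy:                     true
Controller ID:               0
All nodes:                   [0 1 2]
Nodes down:                  []
Leaderless partitions:       []
Under-replicated partitions: []
"""

print(is_cluster_healthy(SAMPLE))
print(parse_health_output(SAMPLE)["All nodes"])
```

With concrete values in the documented output, readers can compare each field directly against what their own cluster reports.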

modules/deploy/pages/redpanda/kubernetes/k-production-checklist.adoc

vuldin · 2025-12-18T00:23:39Z

I pushed an update to handle the nitpick comments from automated review.

david-yu · 2025-12-18T17:42:38Z

This looks good. What are your thoughts on numbering the production requirements in order of importance (perhaps they already are)? That should make the doc easier to follow. I am worried that, given its length, it will be unwieldy without some form of organization.

vuldin self-assigned this Sep 9, 2025

vuldin force-pushed the k8s-prod-checklist branch 9 times, most recently from 8ff9de1 to 5571dbd Compare December 18, 2025 00:06

vuldin marked this pull request as ready for review December 18, 2025 00:07

vuldin requested a review from a team as a code owner December 18, 2025 00:07

coderabbitai bot reviewed Dec 18, 2025

add k8s prod readiness checklist

0519a3d

vuldin force-pushed the k8s-prod-checklist branch from 5571dbd to 0519a3d Compare December 18, 2025 00:21

add link to choosing replica factor in k8s prod checklist

aa80805
