
Conversation

@vuldin
Member

@vuldin vuldin commented Sep 9, 2025

Description

Adds k8s prod readiness checklist.

Page previews

https://deploy-preview-1352--redpanda-docs-preview.netlify.app/current/deploy/redpanda/kubernetes/k-production-checklist/

Checks

  • New feature
  • Content gap
  • Support Follow-up
  • Small fix (typos, links, copyedits, etc)

@vuldin vuldin self-assigned this Sep 9, 2025
@coderabbitai
Contributor

coderabbitai bot commented Sep 9, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough


This PR adds comprehensive Kubernetes production readiness documentation for Redpanda deployments. A new production checklist document is introduced with detailed validation steps covering cluster health, resource configuration, security, storage, monitoring, and operational readiness. Four existing Kubernetes documentation pages are updated with cross-references to the new checklist, creating an integrated guidance flow from requirements through production deployment to readiness validation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15–30 minutes

  • Areas requiring attention:
    • Verify all xref links point to correct document paths and sections
    • Validate that kubectl, rpk, and helm example commands in the new checklist are syntactically correct and produce expected outputs
    • Confirm the production checklist content aligns with current Redpanda best practices and Kubernetes deployment patterns
    • Check that the updated cross-references in k-requirements.adoc, k-production-workflow.adoc, and high-availability.adoc maintain proper document structure and logical flow
    • Ensure no broken inter-document references or orphaned sections
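The xref checks listed above could be partly automated. The following is a minimal sketch, not part of the PR: it pulls xref targets out of AsciiDoc text and maps them to repository paths, assuming the standard Antora layout (`modules/<module>/pages/<path>`) that the file paths in this PR follow. The function names are hypothetical, and targets that carry a component, version, or family prefix (such as `partial$`) are not handled.

```python
import re

# Matches AsciiDoc xref macros such as xref:manage:some/page.adoc#anchor[Link text]
XREF_PATTERN = re.compile(r"xref:([^\s\[\]]+)\[")


def extract_xref_targets(asciidoc_text):
    """Return every xref target (the text between 'xref:' and '[') in order."""
    return XREF_PATTERN.findall(asciidoc_text)


def xref_to_repo_path(target, current_module):
    """Map an xref target to a repo path, assuming modules/<module>/pages/<path>."""
    resource = target.split("#", 1)[0]  # drop any in-page anchor
    if ":" in resource:
        module, relative = resource.split(":", 1)
    else:
        # No module prefix: assume the page lives in the current module
        module, relative = current_module, resource
    return f"modules/{module}/pages/{relative}"
```

Checking each returned path with `pathlib.Path(path).exists()` from the repository root would then flag broken page references. Anchors after `#` would still need a separate check.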

Possibly related PRs

Suggested reviewers

  • Feediver1
  • JakeSCahill
  • david-yu

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The description is largely incomplete. It lacks the JIRA ticket reference and review deadline, and does not provide actual page previews for the other modified files (k-production-deployment.adoc, k-production-workflow.adoc, k-requirements.adoc, high-availability.adoc). Only one of the five modified files has a preview URL. | Add the missing JIRA ticket reference and review deadline, and include preview URLs for all modified documentation files to enable thorough review of cross-references and related changes. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Add k8s prod readiness checklist' accurately summarizes the main change, which is adding a comprehensive Kubernetes production readiness checklist document to the Redpanda documentation. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@netlify

netlify bot commented Sep 9, 2025

Deploy Preview for redpanda-docs-preview ready!

Built without sensitive environment variables

| Name | Link |
|---|---|
| 🔨 Latest commit | aa80805 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/redpanda-docs-preview/deploys/69440e00aaf8c000087720ab |
| 😎 Deploy Preview | https://deploy-preview-1352--redpanda-docs-preview.netlify.app |

@Feediver1
Contributor

@JakeSCahill @vuldin Is this PR still a WIP?

@vuldin
Member Author

vuldin commented Nov 22, 2025

> @JakeSCahill @vuldin Is this PR still a WIP?

Yes, it is. I need to focus on this over the upcoming week, and hopefully it will be in good shape soon.

@vuldin vuldin force-pushed the k8s-prod-checklist branch 9 times, most recently from 8ff9de1 to 5571dbd on December 18, 2025 00:06
@vuldin vuldin marked this pull request as ready for review December 18, 2025 00:07
@vuldin vuldin requested a review from a team as a code owner December 18, 2025 00:07
@vuldin
Member Author

vuldin commented Dec 18, 2025

This PR is ready for review, thanks! @KavyaShivashankar @JakeSCahill

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
modules/deploy/pages/redpanda/kubernetes/k-production-checklist.adoc (2)

18-25: SASL credential placeholders are clear but may give the false impression that SASL is required.

The cluster health check command includes SASL flags, but it's not immediately clear that these are optional for clusters without SASL enabled. The note at line 23 addresses this, but users with non-SASL clusters may be confused by seeing SASL flags everywhere.

Consider adding a clarification that SASL flags can be omitted for non-SASL deployments, or indicate this more prominently at the beginning of the critical section.

Suggested improvement (optional):

Add a subsection header before cluster health status explaining credential requirements:

=== Authentication Note

The commands in this section include SASL authentication flags (--user, --pass, --sasl.mechanism). 
If your cluster does not use SASL authentication, you can omit these flags from all commands.

This helps non-SASL users immediately understand they can simplify commands.


559-581: Security configuration sections (TLS, Authentication, Network) lack detail compared to other checks.

While other critical sections provide detailed commands and expected outputs, the TLS, Authentication, and Network Security sections (lines 559-581) only list items without verification commands or expected outputs. This may leave users uncertain about how to validate these critical security configurations.

Consider adding verification commands for TLS and authentication similar to other sections:

**TLS verification:**
[,bash]
----
kubectl exec -n <namespace> <pod-name> -c redpanda -- rpk cluster config export | grep -A 5 "kafka_api:"
----

Expected output showing tls enabled on listeners.

This would provide users with actionable verification steps for security configurations.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Jira integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3ee30e4 and 5571dbd.

📒 Files selected for processing (5)
  • modules/deploy/pages/redpanda/kubernetes/k-production-checklist.adoc (1 hunks)
  • modules/deploy/pages/redpanda/kubernetes/k-production-deployment.adoc (1 hunks)
  • modules/deploy/pages/redpanda/kubernetes/k-production-workflow.adoc (1 hunks)
  • modules/deploy/pages/redpanda/kubernetes/k-requirements.adoc (1 hunks)
  • modules/deploy/partials/high-availability.adoc (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-12T16:18:28.126Z
Learnt from: andrewstucki
Repo: redpanda-data/docs PR: 1514
File: modules/troubleshoot/partials/errors-and-solutions.adoc:618-632
Timestamp: 2025-12-12T16:18:28.126Z
Learning: In Redpanda ShadowLink CRD usage, do not copy CA certificates into pods via kubectl cp because they won't persist across pod restarts. Store CA certs in Kubernetes Secrets and reference them in the ShadowLink's staticConfiguration block (or inline the cert data in that block) to ensure persistence and proper secret management.

Applied to files:

  • modules/deploy/pages/redpanda/kubernetes/k-requirements.adoc
  • modules/deploy/pages/redpanda/kubernetes/k-production-deployment.adoc
  • modules/deploy/pages/redpanda/kubernetes/k-production-checklist.adoc
  • modules/deploy/pages/redpanda/kubernetes/k-production-workflow.adoc
  • modules/deploy/partials/high-availability.adoc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Redirect rules - redpanda-docs-preview
  • GitHub Check: Header rules - redpanda-docs-preview
  • GitHub Check: Pages changed - redpanda-docs-preview
🔇 Additional comments (16)
modules/deploy/partials/high-availability.adoc (1)

534-541: Cross-references correctly integrated into documentation flow.

The conditional Kubernetes production readiness link and Tiered Storage reference are properly added. Line 535 correctly targets the new checklist document using standard xref syntax.

modules/deploy/pages/redpanda/kubernetes/k-production-workflow.adoc (1)

13-13: Logical workflow addition with clear role assignment.

Adding validation as a final step involving all users improves workflow completeness and ensures production readiness checks are part of the standard deployment process. The step description is clear and actionable.

modules/deploy/pages/redpanda/kubernetes/k-production-deployment.adoc (1)

780-782: Clear next-step guidance directing users to production validation.

The addition properly positions production readiness validation immediately after deployment. The xref syntax is correct and the descriptive text clearly explains the checklist's purpose. This creates a natural workflow progression.

modules/deploy/pages/redpanda/kubernetes/k-requirements.adoc (1)

14-17: Improved guidance flow from requirements to deployment to validation.

Restructuring the next steps section with explicit guidance to both deploy and validate creates a clear path forward. The bullet structure and xref syntax are consistent with documentation standards.

modules/deploy/pages/redpanda/kubernetes/k-production-checklist.adoc (12)

37-47: Pod listing command and expected output are correct.

The kubectl get pods command and expected output showing three running brokers are accurate and match standard Kubernetes conventions.
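A check like this one can also be scripted. As a rough sketch (not taken from the checklist), the function below reads the table that `kubectl get pods` prints and confirms that the expected number of pods are all in the `Running` state. The function name and the sample pod names are hypothetical.

```python
def all_pods_running(table, expected_count):
    """Return True if the table lists exactly expected_count pods, all Running."""
    lines = [line for line in table.splitlines() if line.strip()]
    if not lines:
        return False
    # Locate the STATUS column from the header row rather than hard-coding it
    status_index = lines[0].split().index("STATUS")
    rows = [line.split() for line in lines[1:]]
    return len(rows) == expected_count and all(
        row[status_index] == "Running" for row in rows
    )


# Hypothetical sample output for a three-broker cluster
SAMPLE_PODS = """\
NAME         READY   STATUS    RESTARTS   AGE
redpanda-0   2/2     Running   0          10m
redpanda-1   2/2     Running   0          10m
redpanda-2   2/2     Running   0          10m
"""
```

Calling `all_pods_running(SAMPLE_PODS, 3)` on the sample returns `True`. A pod in any other state, or a missing broker, makes it return `False`.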


51-83: Tab structure for Helm/Operator deployment comparison is correct.

The tabs follow standard AsciiDoc conventions with proper syntax for Helm and Operator options. Content is clearly separated and readable.


144-235: Version pinning section comprehensively covers importance and implementation.

The section explains why version pinning matters with clear examples for both Helm and Operator. Examples show realistic version tags (e.g., v24.2.4, v2.4.5) and include verification commands with expected output. The warning about avoiding latest tags and version ranges is crucial.
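To illustrate the pinning rule, here is a minimal sketch of a tag check. It treats a tag as pinned only if it is an exact three-part version with an optional leading `v`, like the example tags mentioned above. The function name is hypothetical, and the strict pattern is an assumption: it would also reject pre-release or build-suffixed tags, which a real check might want to allow.

```python
import re

# Exact three-part version with an optional leading "v", for example v24.2.4
PINNED_TAG = re.compile(r"v?\d+\.\d+\.\d+")


def is_pinned(tag):
    """Return True only for an exact version tag, not 'latest' or a range."""
    return PINNED_TAG.fullmatch(tag) is not None
```

Under this rule `is_pinned("v24.2.4")` is true, while `is_pinned("latest")` and `is_pinned(">=24.2.0")` are false.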


823-848: Continuous data balancing section properly explains enterprise feature.

The section clearly states this feature should be enabled for "all licensed production clusters" and explains what it does. The command and expected output are correct. Cross-reference to xref:manage:cluster-maintenance/continuous-data-balancing.adoc is appropriate.


888-927: Debug bundle generation section provides excellent proactive validation.

This section thoughtfully includes debug bundle generation as a test to verify permissions and configuration before issues occur. The explanation of what bundles collect (lines 911-916) and common issues (lines 918-923) provides valuable context. The xref to diagnostics bundle docs is helpful.


1092-1128: Monitoring and observability section appropriately covers key observability areas.

The section covers Prometheus setup, Grafana dashboards, alerting, log aggregation, and health checks at a high level. While less detailed than critical checks, this is appropriate for a checklist that points to items to implement. The structure with bullet points makes it scannable.


1130-1162: Operational readiness section covers important governance and procedure aspects.

Sections on deployment automation, non-production environments, upgrade procedures, incident response, and resource quotas address crucial operational readiness areas. While brief, these appropriately serve as reminders of what should be in place before production deployment.


1164-1172: Next steps section provides logical progression after checklist completion.

The five post-checklist activities (performance testing, DR testing, security review, operational validation, documentation) create a clear path forward. This helps users understand that completing the checklist is not the end of preparation but the beginning of operational validation.


1-9: Document metadata and reference to Linux checklist are clear.

The header provides appropriate description and context links. The note at line 8 directing Linux users to the parallel Linux checklist is helpful for users who might be reading in the wrong context.


661-688: Operator CRDs validation section properly emphasizes criticality of CRD setup.

This section appropriately highlights that missing or incompatible CRDs are a CRITICAL issue that will break Operator functionality. The list of required CRDs (lines 681-686) is clear, and the consequences of missing CRDs (line 688) are well explained.


1-1172: Comprehensive production checklist appropriately structured for Kubernetes deployments.

The new k-production-checklist.adoc file provides extensive, well-organized guidance covering critical requirements, recommended enhancements, observability, and operational readiness. The file successfully:

  • Separates critical (must-have) from recommended (should-have) checks
  • Provides actionable verification commands for most checks
  • Shows expected outputs to help users validate results
  • Includes tabs for both Helm and Operator deployment methods
  • Uses appropriate callouts (NOTE, WARNING) for important guidance
  • Cross-references remediation and detailed documentation

The file effectively serves as the central validation hub that the workflow documents reference.


14-25: The expected output for rpk cluster health should match the actual command format.

The command on line 20 is correct, but the expected output should show the structured format that the command actually produces. Update the output to include the labeled fields such as Healthy:, Controller ID:, All nodes:, Nodes down:, Leaderless partitions:, and Under-replicated partitions: with example values (e.g., Healthy: true), rather than generic placeholders.
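A labeled format like this is also easy to check by script. The sketch below is illustrative only: it parses `Label: value` lines into a dictionary and tests the `Healthy` field. The labels come from the comment above. Every sample value other than `Healthy: true` is a hypothetical placeholder, and the exact layout of real `rpk cluster health` output is not confirmed here.

```python
def parse_health(output):
    """Parse 'Label: value' lines into a dict keyed by label."""
    fields = {}
    for line in output.splitlines():
        label, separator, value = line.partition(":")
        if separator:  # skip lines that contain no colon
            fields[label.strip()] = value.strip()
    return fields


def is_healthy(fields):
    """Return True when the Healthy field reads 'true'."""
    return fields.get("Healthy") == "true"


# Hypothetical sample; only the field labels come from the review comment
SAMPLE_HEALTH = """\
Healthy:                     true
Controller ID:               0
All nodes:                   [0 1 2]
Nodes down:                  []
Leaderless partitions:       []
Under-replicated partitions: []
"""
```

Splitting on the first colon keeps the parser tolerant of changes in column alignment.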

@vuldin vuldin force-pushed the k8s-prod-checklist branch from 5571dbd to 0519a3d on December 18, 2025 00:21
@vuldin
Copy link
Member Author

vuldin commented Dec 18, 2025

I pushed an update to handle the nitpick comments from automated review.

@david-yu
Copy link
Contributor

david-yu commented Dec 18, 2025

This looks good. What are your thoughts on making the production requirements a numbered list in order of importance (perhaps they already are)? That should make the doc easier to follow. I am worried it will be a bit unwieldy given its length without some form of organization.



3 participants