Treat unauthenticated backends as routable in vMCP health#4866
Treat unauthenticated backends as routable in vMCP health#4866
Conversation
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4866 +/- ##
==========================================
- Coverage 69.12% 69.10% -0.03%
==========================================
Files 531 531
Lines 55183 55215 +32
==========================================
+ Hits 38146 38156 +10
- Misses 14113 14137 +24
+ Partials 2924 2922 -2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
6db33a2 to
a88b0fe
Compare
ChrisJBurns
left a comment
There was a problem hiding this comment.
Multi-Agent Consensus Review
Agents consulted: kubernetes-expert, go-expert-developer, code-reviewer, toolhive-expert
Consensus Summary
| # | Finding | Consensus | Severity | Action |
|---|---|---|---|---|
| 1 | pluralBackend() breaks formatBackendMessage godoc |
10/10 | MEDIUM | Fix |
| 2 | Grammar: "1 require authentication" subject-verb disagreement | 10/10 | MEDIUM | Fix |
| 3 | /status endpoint inconsistency with health monitor |
10/10 | MEDIUM | Discuss |
| 4 | Condition Reason "AllBackendsHealthy" misleading for unauthenticated | 8/10 | LOW | Discuss |
| 5 | CRD YAML manifests may need regeneration | 7/10 | MEDIUM | Verify |
| 6 | Missing TestBuildConditions cases for unauthenticated branches |
10/10 | MEDIUM | Fix |
| 7 | "All 1 backend" phrasing awkward for single-backend case | 7/10 | LOW | Discuss |
Overall
This PR correctly addresses the root cause of #4824: unauthenticated backends are reachable and should be treated as routable for phase determination. The Summary.Routable() abstraction is clean, and the fix is applied consistently across the health monitor, operator controller, and status collector. The test additions cover the key scenarios well.
The findings are primarily polish — grammar in user-facing status messages, a misplaced helper function that breaks godoc, and a gap in test coverage for buildConditions. The /status endpoint inconsistency is acknowledged via TODO and deferred to #4824, which is reasonable for scoping but worth discussing since this PR is what introduces the observable divergence. None of these are blocking in a "will break production" sense, but addressing the grammar and godoc issues before merge would improve the user experience.
Documentation
The BackendCount doc comment in virtualmcpserver_types.go and DiscoveredBackend.Status comment in types.go were updated, but the generated CRD YAML manifests may not have been regenerated. Run task operator-manifests to verify. If the CRD descriptions changed, include the regenerated manifests. (Finding #5)
Generated with Claude Code
| // This returns generic error categories to avoid exposing sensitive error details in status. | ||
| // Detailed errors are logged when they occur (in performHealthCheck) for debugging. | ||
| // pluralBackend returns "backend" or "backends" based on count. | ||
| func pluralBackend(n int) string { |
There was a problem hiding this comment.
[MEDIUM] pluralBackend() breaks formatBackendMessage godoc (Consensus: 10/10)
This function is inserted between the doc comment for formatBackendMessage and its function signature. In the resulting file, the three-line comment block ("formatBackendMessage creates a human-readable message...") is now separated from func formatBackendMessage(...) by the entire pluralBackend function body — formatBackendMessage becomes undocumented in godoc.
Move pluralBackend() above the formatBackendMessage doc comment block, or near formatStatusMessage where it's primarily used.
Raised by: go-expert-developer, code-reviewer, kubernetes-expert, toolhive-expert
| return fmt.Sprintf("All %d %s require authentication", | ||
| summary.Unauthenticated, pluralBackend(summary.Unauthenticated)) | ||
| } | ||
| return fmt.Sprintf("%d %s healthy, %d require authentication", | ||
| summary.Healthy, pluralBackend(summary.Healthy), summary.Unauthenticated) |
There was a problem hiding this comment.
[MEDIUM] Grammar: "1 require authentication" subject-verb disagreement (Consensus: 10/10)
When summary.Unauthenticated == 1, these format strings produce "1 require authentication" (should be "1 requires"). Affects formatStatusMessage (lines 647, 650) and buildConditions (~lines 859, 862). The test at status_builder_test.go:312 encodes this incorrect grammar.
Additionally, "All 1 backend require authentication" (line 647) reads awkwardly — consider dropping "All" when count == 1. (Finding #7, Consensus: 7/10)
Options: (a) add a verb pluralization helper, (b) rephrase to avoid the verb (e.g., "%d unauthenticated"), or (c) restructure as "%d awaiting authentication".
Raised by: go-expert-developer, code-reviewer, kubernetes-expert, toolhive-expert
| } | ||
| backendStatuses = append(backendStatuses, status) | ||
|
|
||
| // TODO(#4824): Consider treating BackendUnauthenticated as routable here too. |
There was a problem hiding this comment.
[MEDIUM] /status endpoint now contradicts health monitor for all-unauthenticated backends (Consensus: 10/10)
After this PR, the health monitor reports PhaseReady for all-unauthenticated backends, but /status still reports healthy: false (since hasHealthyBackend only checks BackendHealthy). This is a new divergence — previously both agreed. External consumers (load balancers, monitoring) could see contradictory signals.
The TODO is appropriate for scoping. Consider either fixing it here (adding || healthStatus == vmcp.BackendUnauthenticated) or calling out the gap explicitly in the PR description so reviewers are aware.
Raised by: go-expert-developer, code-reviewer, kubernetes-expert, toolhive-expert
| @@ -822,18 +849,25 @@ func buildConditions(summary Summary, phase vmcp.Phase, configuredBackendCount i | |||
| case vmcp.PhaseReady: | |||
| readyCondition.Status = metav1.ConditionTrue | |||
| readyCondition.Reason = "AllBackendsHealthy" | |||
There was a problem hiding this comment.
[LOW] Condition Reason "AllBackendsHealthy" misleading for unauthenticated-only vMCPs (Consensus: 8/10)
When all backends are unauthenticated, this Ready condition uses Reason "AllBackendsHealthy" but message "All N backends require authentication" — none are actually healthy. The PR notes mention keeping Reason strings stable for existing tooling, which is reasonable.
Consider: (a) adding a brief comment explaining why Reason was not updated, or (b) since this is v1alpha1, adding "AllBackendsRoutable" for the unauthenticated-only case.
Raised by: code-reviewer, kubernetes-expert, toolhive-expert
| }, | ||
| expectedPhase: vmcp.PhaseDegraded, | ||
| expectedCount: 2, | ||
| }, |
There was a problem hiding this comment.
[MEDIUM] Missing TestBuildConditions cases for unauthenticated branches (Consensus: 10/10)
buildConditions now has three new branches under PhaseReady (all healthy, all unauthenticated, mixed), but TestBuildConditions has no entries exercising the new paths. These are tested indirectly via TestBuildStatus_PhaseLogic, but direct unit tests for condition messages would improve coverage and failure diagnostics.
Add entries for: (1) all unauthenticated (PhaseReady), (2) mixed healthy + unauthenticated (PhaseReady), (3) PhaseDegraded with unauthenticated.
Raised by: go-expert-developer, code-reviewer, kubernetes-expert, toolhive-expert
Backends returning 401 due to missing user-level OAuth tokens (e.g., upstreamInject auth) are reachable and running — they just require per-request user auth. Health probes lack user tokens, so they correctly detect the unauthenticated state, but this should not cause PhaseFailed. Introduce Summary.Routable() (healthy + unauthenticated) and use it for phase determination, BackendCount, and status messages across the health monitor, operator controller, and status collector. Give BackendUnauthenticated its own CRD status "unauthenticated" instead of mapping to "unavailable". Fixes #4824 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix godoc: move pluralBackend() above formatBackendMessage doc comment
- Fix grammar: add pluralRequire()/quantifyBackends() helpers for
correct subject-verb agreement ("1 requires" not "1 require") and
drop "All" prefix for singular counts
- Fix /status endpoint: treat BackendUnauthenticated as routable,
consistent with health monitor
- Rename condition Reason from AllBackendsHealthy to AllBackendsRoutable
(v1alpha1, no stability guarantee)
- Add TestBuildConditions cases for unauthenticated branches
- Regenerate CRD manifests
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
a88b0fe to
6349cc0
Compare
|
This was great. My agent talked to yours lol. |
yrobla
left a comment
There was a problem hiding this comment.
Overall the fix is correct and well-scoped — the pre-PR behavior (PhaseFailed for all-unauthenticated groups) was clearly wrong, and the approach is sound for v1alpha1. A few observations below, none of which are blockers.
I'd suggest merging this and filing a follow-up issue scoped to: "eliminate BackendUnauthenticated state via auth-server health probe tokens once epic stacklok-epics#198 lands" — at that point the health monitor can request short-lived service-account tokens and probe OAuth backends properly, closing the capability-aggregation gap (discovery still excludes unauthenticated backends) and aligning NoHealthyBackends → NoRoutableBackends cleanly.
| readyCondition.Message = fmt.Sprintf("%d/%d backends healthy", summary.Healthy, summary.Total) | ||
| readyCondition.Message = fmt.Sprintf("%d/%d backends routable", summary.Routable(), summary.Total) | ||
| case vmcp.PhaseFailed: | ||
| readyCondition.Reason = "NoHealthyBackends" |
There was a problem hiding this comment.
This reason string is now semantically mismatched: the condition message in this branch says "No routable backends available" but the reason is still NoHealthyBackends. The PR description says condition reasons were intentionally left unchanged to avoid breaking tooling — but AllBackendsHealthy was renamed to AllBackendsRoutable a few lines above (line 868), so the freeze isn't consistent. Worth either aligning this to NoRoutableBackends here too, or documenting explicitly why the PhaseFailed reason is frozen while the PhaseReady one was not. At minimum, a follow-up issue to track the rename would prevent the inconsistency from calcifying.
| backendStatuses = append(backendStatuses, status) | ||
|
|
||
| if healthStatus == vmcp.BackendHealthy { | ||
| if healthStatus == vmcp.BackendHealthy || healthStatus == vmcp.BackendUnauthenticated { |
There was a problem hiding this comment.
Good fix — this aligns the /status HTTP endpoint's healthy field with the CRD phase behavior. One note: the PR description's "Special notes for reviewers" section still says "the status endpoint still reports healthy: false for all-unauthenticated groups" as a remaining TODO, but this commit already resolved it. Worth updating the description so reviewers don't go looking for a problem that no longer exists.
| // or degraded. Backends that are unhealthy, unknown, or unauthenticated are excluded | ||
| // from capability aggregation to prevent exposing tools from unavailable backends. | ||
| // | ||
| // TODO(#4824): Unauthenticated backends are treated as routable for phase determination |
There was a problem hiding this comment.
The exclusion is intentional and the comment explains why clearly. One suggestion: reference a dedicated follow-up issue here rather than #4824 (which is the bug this PR fixes, not the future work). When the auth server (stacklok-epics#198) can provide service-account tokens for health probes, this is the exact place to revisit — a linked issue would make that easier to track and avoids the TODO pointing at a closed bug.
Summary
upstreamInjectauth) permanently reportedPhaseFailedbecause health probes lack user tokens and get 401s, whichdeterminePhase()counted as zero healthy backends. These backends are reachable and running — they just need per-request user auth.Summary.Routable()(healthy + unauthenticated) and use it for phase determination,BackendCount, and status messages across the health monitor, operator controller, and status collector. GiveBackendUnauthenticatedits own CRD status"unauthenticated"instead of mapping to"unavailable".Fixes #4824
Type of change
Test plan
task test)task lint-fix)Changes
pkg/vmcp/health/monitor.goSummary.Routable()method; updatedeterminePhase(),formatStatusMessage(),buildConditions(),BackendCountto use routable count; addpluralBackend()helperpkg/vmcp/types.goToCRDStatus()mapsBackendUnauthenticatedto"unauthenticated"instead of"unavailable"; updateDiscoveredBackend.Statusdoccmd/thv-operator/api/v1alpha1/virtualmcpserver_types.goBackendStatusUnauthenticatedconstant; updateBackendCountdoccmd/thv-operator/controllers/virtualmcpserver_controller.gocountBackendHealthtreats unauthenticated as routablecmd/thv-operator/pkg/virtualmcpserverstatus/collector.goBackendCountpkg/vmcp/discovery/middleware.gopkg/vmcp/server/status.go/statusendpoint healthy field; update Health field docpkg/vmcp/health/status_builder_test.gopkg/vmcp/types_test.go"unauthenticated"mappingpkg/vmcp/status/k8s_reporter_test.gotest/e2e/.../virtualmcp_yardstick_base_test.goDoes this introduce a user-facing change?
Yes. VirtualMCPServer CRD status changes:
"unauthenticated"instead of"unavailable"BackendCountnow includes unauthenticated backends (they are routable)Readyinstead ofFailedwhen all backends require user-level authThis is a v1alpha1 API — additive status value change.
Implementation plan
Approved implementation plan
When all backends in an MCPGroup require upstream OAuth, health check probes lack user tokens so the health checker marks them as BackendUnauthenticated. However, determinePhase() only counted BackendHealthy backends toward availability, causing PhaseFailed.
The fix introduces Summary.Routable() (healthy + unauthenticated) and uses it for phase determination. BackendUnauthenticated gets its own CRD status "unauthenticated" instead of being lumped with "unavailable".
Code review by 5 agents identified critical bugs in the operator controller and status collector (independent backend counting that did not account for the new CRD status), stale comments, grammar issues, and DRY violations — all addressed.
Special notes for reviewers
countBackendHealthand status collector had independent backend-counting logic that needed the same fix — without those changes, the operator would override the vMCP-reported Ready status with Degraded during reconciliation./statusendpoint and discovery middleware have TODO comments for follow-up: the status endpoint still reportshealthy: falsefor all-unauthenticated groups, and discovery excludes unauthenticated backends from capability aggregation (intentional — discovery probes cannot carry user tokens).AllBackendsHealthy,NoHealthyBackends) are intentionally unchanged to avoid breaking existing tooling.Generated with Claude Code