Skip to content

Treat unauthenticated backends as routable in vMCP health#4866

Open
lorr1 wants to merge 2 commits intomainfrom
fix/vmcp-unauthenticated-backends-routable
Open

Treat unauthenticated backends as routable in vMCP health#4866
lorr1 wants to merge 2 commits intomainfrom
fix/vmcp-unauthenticated-backends-routable

Conversation

@lorr1
Copy link
Copy Markdown

@lorr1 lorr1 commented Apr 15, 2026

Summary

  • vMCPs with OAuth-protected backends (e.g., upstreamInject auth) permanently reported PhaseFailed because health probes lack user tokens and get 401s, which determinePhase() counted as zero healthy backends. These backends are reachable and running — they just need per-request user auth.
  • Introduce Summary.Routable() (healthy + unauthenticated) and use it for phase determination, BackendCount, and status messages across the health monitor, operator controller, and status collector. Give BackendUnauthenticated its own CRD status "unauthenticated" instead of mapping to "unavailable".

Fixes #4824

Type of change

  • Bug fix

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)

Changes

File Change
pkg/vmcp/health/monitor.go Add Summary.Routable() method; update determinePhase(), formatStatusMessage(), buildConditions(), BackendCount to use routable count; add pluralBackend() helper
pkg/vmcp/types.go ToCRDStatus() maps BackendUnauthenticated to "unauthenticated" instead of "unavailable"; update DiscoveredBackend.Status doc
cmd/thv-operator/api/v1alpha1/virtualmcpserver_types.go Add BackendStatusUnauthenticated constant; update BackendCount doc
cmd/thv-operator/controllers/virtualmcpserver_controller.go countBackendHealth treats unauthenticated as routable
cmd/thv-operator/pkg/virtualmcpserverstatus/collector.go Count unauthenticated backends in BackendCount
pkg/vmcp/discovery/middleware.go TODO comment noting unauthenticated exclusion from discovery is intentional
pkg/vmcp/server/status.go TODO comment for /status endpoint healthy field; update Health field doc
pkg/vmcp/health/status_builder_test.go Add 4 test cases for unauthenticated phase logic
pkg/vmcp/types_test.go Update ToCRDStatus tests for new "unauthenticated" mapping
pkg/vmcp/status/k8s_reporter_test.go Update fixture strings
test/e2e/.../virtualmcp_yardstick_base_test.go Update stale BackendCount comment

Does this introduce a user-facing change?

Yes. VirtualMCPServer CRD status changes:

  • Backends requiring per-request OAuth now report status "unauthenticated" instead of "unavailable"
  • BackendCount now includes unauthenticated backends (they are routable)
  • Phase is Ready instead of Failed when all backends require user-level auth

This is a v1alpha1 API — additive status value change.

Implementation plan

Approved implementation plan

When all backends in an MCPGroup require upstream OAuth, health check probes lack user tokens so the health checker marks them as BackendUnauthenticated. However, determinePhase() only counted BackendHealthy backends toward availability, causing PhaseFailed.

The fix introduces Summary.Routable() (healthy + unauthenticated) and uses it for phase determination. BackendUnauthenticated gets its own CRD status "unauthenticated" instead of being lumped with "unavailable".

Code review by 5 agents identified critical bugs in the operator controller and status collector (independent backend counting that did not account for the new CRD status), stale comments, grammar issues, and DRY violations — all addressed.

Special notes for reviewers

  • The operator controller countBackendHealth and status collector had independent backend-counting logic that needed the same fix — without those changes, the operator would override the vMCP-reported Ready status with Degraded during reconciliation.
  • The /status endpoint and discovery middleware have TODO comments for follow-up: the status endpoint still reports healthy: false for all-unauthenticated groups, and discovery excludes unauthenticated backends from capability aggregation (intentional — discovery probes cannot carry user tokens).
  • Condition reason strings (AllBackendsHealthy, NoHealthyBackends) are intentionally unchanged to avoid breaking existing tooling.

Generated with Claude Code

Copy link
Copy Markdown
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Apr 15, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 72.88136% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.10%. Comparing base (12f25e5) to head (6349cc0).

Files with missing lines Patch % Lines
...perator/controllers/virtualmcpserver_controller.go 0.00% 9 Missing ⚠️
...v-operator/pkg/virtualmcpserverstatus/collector.go 0.00% 5 Missing ⚠️
pkg/vmcp/health/monitor.go 95.12% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4866      +/-   ##
==========================================
- Coverage   69.12%   69.10%   -0.03%     
==========================================
  Files         531      531              
  Lines       55183    55215      +32     
==========================================
+ Hits        38146    38156      +10     
- Misses      14113    14137      +24     
+ Partials     2924     2922       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@lorr1 lorr1 force-pushed the fix/vmcp-unauthenticated-backends-routable branch from 6db33a2 to a88b0fe Compare April 15, 2026 21:23
@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Apr 15, 2026
Copy link
Copy Markdown
Collaborator

@ChrisJBurns ChrisJBurns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-Agent Consensus Review

Agents consulted: kubernetes-expert, go-expert-developer, code-reviewer, toolhive-expert

Consensus Summary

# Finding Consensus Severity Action
1 pluralBackend() breaks formatBackendMessage godoc 10/10 MEDIUM Fix
2 Grammar: "1 require authentication" subject-verb disagreement 10/10 MEDIUM Fix
3 /status endpoint inconsistency with health monitor 10/10 MEDIUM Discuss
4 Condition Reason "AllBackendsHealthy" misleading for unauthenticated 8/10 LOW Discuss
5 CRD YAML manifests may need regeneration 7/10 MEDIUM Verify
6 Missing TestBuildConditions cases for unauthenticated branches 10/10 MEDIUM Fix
7 "All 1 backend" phrasing awkward for single-backend case 7/10 LOW Discuss

Overall

This PR correctly addresses the root cause of #4824: unauthenticated backends are reachable and should be treated as routable for phase determination. The Summary.Routable() abstraction is clean, and the fix is applied consistently across the health monitor, operator controller, and status collector. The test additions cover the key scenarios well.

The findings are primarily polish — grammar in user-facing status messages, a misplaced helper function that breaks godoc, and a gap in test coverage for buildConditions. The /status endpoint inconsistency is acknowledged via TODO and deferred to #4824, which is reasonable for scoping but worth discussing since this PR is what introduces the observable divergence. None of these are blocking in a "will break production" sense, but addressing the grammar and godoc issues before merge would improve the user experience.

Documentation

The BackendCount doc comment in virtualmcpserver_types.go and DiscoveredBackend.Status comment in types.go were updated, but the generated CRD YAML manifests may not have been regenerated. Run task operator-manifests to verify. If the CRD descriptions changed, include the regenerated manifests. (Finding #5)


Generated with Claude Code

// This returns generic error categories to avoid exposing sensitive error details in status.
// Detailed errors are logged when they occur (in performHealthCheck) for debugging.
// pluralBackend returns "backend" or "backends" based on count.
func pluralBackend(n int) string {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] pluralBackend() breaks formatBackendMessage godoc (Consensus: 10/10)

This function is inserted between the doc comment for formatBackendMessage and its function signature. In the resulting file, the three-line comment block ("formatBackendMessage creates a human-readable message...") is now separated from func formatBackendMessage(...) by the entire pluralBackend function body — formatBackendMessage becomes undocumented in godoc.

Move pluralBackend() above the formatBackendMessage doc comment block, or near formatStatusMessage where it's primarily used.

Raised by: go-expert-developer, code-reviewer, kubernetes-expert, toolhive-expert

Comment thread pkg/vmcp/health/monitor.go Outdated
Comment on lines +647 to +651
return fmt.Sprintf("All %d %s require authentication",
summary.Unauthenticated, pluralBackend(summary.Unauthenticated))
}
return fmt.Sprintf("%d %s healthy, %d require authentication",
summary.Healthy, pluralBackend(summary.Healthy), summary.Unauthenticated)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] Grammar: "1 require authentication" subject-verb disagreement (Consensus: 10/10)

When summary.Unauthenticated == 1, these format strings produce "1 require authentication" (should be "1 requires"). Affects formatStatusMessage (lines 647, 650) and buildConditions (~lines 859, 862). The test at status_builder_test.go:312 encodes this incorrect grammar.

Additionally, "All 1 backend require authentication" (line 647) reads awkwardly — consider dropping "All" when count == 1. (Finding #7, Consensus: 7/10)

Options: (a) add a verb pluralization helper, (b) rephrase to avoid the verb (e.g., "%d unauthenticated"), or (c) restructure as "%d awaiting authentication".

Raised by: go-expert-developer, code-reviewer, kubernetes-expert, toolhive-expert

Comment thread pkg/vmcp/server/status.go Outdated
}
backendStatuses = append(backendStatuses, status)

// TODO(#4824): Consider treating BackendUnauthenticated as routable here too.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] /status endpoint now contradicts health monitor for all-unauthenticated backends (Consensus: 10/10)

After this PR, the health monitor reports PhaseReady for all-unauthenticated backends, but /status still reports healthy: false (since hasHealthyBackend only checks BackendHealthy). This is a new divergence — previously both agreed. External consumers (load balancers, monitoring) could see contradictory signals.

The TODO is appropriate for scoping. Consider either fixing it here (adding || healthStatus == vmcp.BackendUnauthenticated) or calling out the gap explicitly in the PR description so reviewers are aware.

Raised by: go-expert-developer, code-reviewer, kubernetes-expert, toolhive-expert

Comment thread pkg/vmcp/health/monitor.go Outdated
@@ -822,18 +849,25 @@ func buildConditions(summary Summary, phase vmcp.Phase, configuredBackendCount i
case vmcp.PhaseReady:
readyCondition.Status = metav1.ConditionTrue
readyCondition.Reason = "AllBackendsHealthy"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] Condition Reason "AllBackendsHealthy" misleading for unauthenticated-only vMCPs (Consensus: 8/10)

When all backends are unauthenticated, this Ready condition uses Reason "AllBackendsHealthy" but message "All N backends require authentication" — none are actually healthy. The PR notes mention keeping Reason strings stable for existing tooling, which is reasonable.

Consider: (a) adding a brief comment explaining why Reason was not updated, or (b) since this is v1alpha1, adding "AllBackendsRoutable" for the unauthenticated-only case.

Raised by: code-reviewer, kubernetes-expert, toolhive-expert

},
expectedPhase: vmcp.PhaseDegraded,
expectedCount: 2,
},
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] Missing TestBuildConditions cases for unauthenticated branches (Consensus: 10/10)

buildConditions now has three new branches under PhaseReady (all healthy, all unauthenticated, mixed), but TestBuildConditions has no entries exercising the new paths. These are tested indirectly via TestBuildStatus_PhaseLogic, but direct unit tests for condition messages would improve coverage and failure diagnostics.

Add entries for: (1) all unauthenticated (PhaseReady), (2) mixed healthy + unauthenticated (PhaseReady), (3) PhaseDegraded with unauthenticated.

Raised by: go-expert-developer, code-reviewer, kubernetes-expert, toolhive-expert

lorr1 and others added 2 commits April 16, 2026 00:43
Backends returning 401 due to missing user-level OAuth tokens (e.g.,
upstreamInject auth) are reachable and running — they just require
per-request user auth. Health probes lack user tokens, so they correctly
detect the unauthenticated state, but this should not cause PhaseFailed.

Introduce Summary.Routable() (healthy + unauthenticated) and use it for
phase determination, BackendCount, and status messages across the health
monitor, operator controller, and status collector. Give
BackendUnauthenticated its own CRD status "unauthenticated" instead of
mapping to "unavailable".

Fixes #4824

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix godoc: move pluralBackend() above formatBackendMessage doc comment
- Fix grammar: add pluralRequire()/quantifyBackends() helpers for
  correct subject-verb agreement ("1 requires" not "1 require") and
  drop "All" prefix for singular counts
- Fix /status endpoint: treat BackendUnauthenticated as routable,
  consistent with health monitor
- Rename condition Reason from AllBackendsHealthy to AllBackendsRoutable
  (v1alpha1, no stability guarantee)
- Add TestBuildConditions cases for unauthenticated branches
- Regenerate CRD manifests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@lorr1 lorr1 force-pushed the fix/vmcp-unauthenticated-backends-routable branch from a88b0fe to 6349cc0 Compare April 16, 2026 07:43
@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Apr 16, 2026
@lorr1
Copy link
Copy Markdown
Author

lorr1 commented Apr 16, 2026

This was great. My agent talked to yours lol.

Audit: Review Findings vs. Follow-up Commit (6349cc0d)                                                                                   
   
  ┌─────┬───────────────────────────────┬────────┬─────────────────────────────────────────────────────────────────────────────────────┐   
  │  #  │            Finding            │ Status │                                       Details                                       │ 
  ├─────┼───────────────────────────────┼────────┼─────────────────────────────────────────────────────────────────────────────────────┤
  │     │                               │        │ Helpers (pluralBackend, pluralRequire, quantifyBackends) are now placed above the   │
  │ 1   │ pluralBackend() breaks        │ Fixed  │ formatBackendMessage doc comment (lines 735-757), each with their own doc comment.  │
  │     │ formatBackendMessage godoc    │        │ formatBackendMessage's godoc at line 759-761 is directly above its signature at     │   
  │     │                               │        │ line 762.                                                                           │
  ├─────┼───────────────────────────────┼────────┼─────────────────────────────────────────────────────────────────────────────────────┤   
  │     │ Grammar: "1 require           │        │ Added pluralRequire() (returns "requires"/"require") and quantifyBackends()         │ 
  │ 2   │ authentication" subject-verb  │ Fixed  │ (returns "1 backend" vs "All N backends"). Test expectations updated: "1 backend    │   
  │     │ disagreement                  │        │ healthy, 1 requires authentication" — correct grammar.                              │
  ├─────┼───────────────────────────────┼────────┼─────────────────────────────────────────────────────────────────────────────────────┤   
  │     │ /status endpoint              │        │                                                                                     │ 
  │ 3   │ inconsistency with health     │ Fixed  │ status.go:96 now includes `                                                         │   
  │     │ monitor                       │        │                                                                                     │
  ├─────┼───────────────────────────────┼────────┼─────────────────────────────────────────────────────────────────────────────────────┤   
  │     │ Condition Reason              │        │                                                                                     │   
  │ 4   │ "AllBackendsHealthy"          │ Fixed  │ Changed to "AllBackendsRoutable" (line 868). Tests updated across                   │
  │     │ misleading for                │        │ status_builder_test.go and k8s_reporter_test.go.                                    │   
  │     │ unauthenticated               │        │                                                                                     │ 
  ├─────┼───────────────────────────────┼────────┼─────────────────────────────────────────────────────────────────────────────────────┤   
  │     │ CRD YAML manifests may need   │        │ Both CRD YAML files updated in the diff (deploy/charts/operator-crds/files/crds/    │
  │ 5   │ regeneration                  │ Fixed  │ and deploy/charts/operator-crds/templates/). BackendCount description and status    │   
  │     │                               │        │ field enum updated.                                                                 │ 
  ├─────┼───────────────────────────────┼────────┼─────────────────────────────────────────────────────────────────────────────────────┤
  │     │ Missing TestBuildConditions   │        │ 3 new test cases added: "all backends unauthenticated" (PhaseReady), "mixed healthy │
  │ 6   │ cases for unauthenticated     │ Fixed  │  and unauthenticated" (PhaseReady), "degraded with unauthenticated and degraded     │   
  │     │ branches                      │        │ backend" (PhaseDegraded).                                                           │
  ├─────┼───────────────────────────────┼────────┼─────────────────────────────────────────────────────────────────────────────────────┤   
  │ 7   │ "All 1 backend" phrasing      │ Fixed  │ quantifyBackends(1) returns "1 backend" (no "All" prefix); quantifyBackends(n>1)    │ 
  │     │ awkward for singular case     │        │ returns "All N backends".                                                           │   
  └─────┴───────────────────────────────┴────────┴─────────────────────────────────────────────────────────────────────────────────────┘
                                                                                                                                           
  All 7 findings from the review are addressed. The follow-up commit went beyond what was strictly required — findings #3 and #4 were      
  tagged "Discuss" (not "Fix"), but both got actual code fixes rather than just comments. Finding #7 was LOW severity but was also fixed.

Copy link
Copy Markdown
Contributor

@yrobla yrobla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the fix is correct and well-scoped — the pre-PR behavior (PhaseFailed for all-unauthenticated groups) was clearly wrong, and the approach is sound for v1alpha1. A few observations below, none of which are blockers.

I'd suggest merging this and filing a follow-up issue scoped to: "eliminate BackendUnauthenticated state via auth-server health probe tokens once epic stacklok-epics#198 lands" — at that point the health monitor can request short-lived service-account tokens and probe OAuth backends properly, closing the capability-aggregation gap (discovery still excludes unauthenticated backends) and aligning NoHealthyBackendsNoRoutableBackends cleanly.

readyCondition.Message = fmt.Sprintf("%d/%d backends healthy", summary.Healthy, summary.Total)
readyCondition.Message = fmt.Sprintf("%d/%d backends routable", summary.Routable(), summary.Total)
case vmcp.PhaseFailed:
readyCondition.Reason = "NoHealthyBackends"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reason string is now semantically mismatched: the condition message in this branch says "No routable backends available" but the reason is still NoHealthyBackends. The PR description says condition reasons were intentionally left unchanged to avoid breaking tooling — but AllBackendsHealthy was renamed to AllBackendsRoutable a few lines above (line 868), so the freeze isn't consistent. Worth either aligning this to NoRoutableBackends here too, or documenting explicitly why the PhaseFailed reason is frozen while the PhaseReady one was not. At minimum, a follow-up issue to track the rename would prevent the inconsistency from calcifying.

Comment thread pkg/vmcp/server/status.go
backendStatuses = append(backendStatuses, status)

if healthStatus == vmcp.BackendHealthy {
if healthStatus == vmcp.BackendHealthy || healthStatus == vmcp.BackendUnauthenticated {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good fix — this aligns the /status HTTP endpoint's healthy field with the CRD phase behavior. One note: the PR description's "Special notes for reviewers" section still says "the status endpoint still reports healthy: false for all-unauthenticated groups" as a remaining TODO, but this commit already resolved it. Worth updating the description so reviewers don't go looking for a problem that no longer exists.

// or degraded. Backends that are unhealthy, unknown, or unauthenticated are excluded
// from capability aggregation to prevent exposing tools from unavailable backends.
//
// TODO(#4824): Unauthenticated backends are treated as routable for phase determination
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exclusion is intentional and the comment explains why clearly. One suggestion: reference a dedicated follow-up issue here rather than #4824 (which is the bug this PR fixes, not the future work). When the auth server (stacklok-epics#198) can provide service-account tokens for health probes, this is the exact place to revisit — a linked issue would make that easier to track and avoids the TODO pointing at a closed bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/S Small PR: 100-299 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: vMCP marks unauthenticated backends as unhealthy, causing PhaseFailed for OAuth-protected groups

3 participants