Skip to content

Add policy_stopped workload status and enforce policy gate on restart#4857

Open
reyortiz3 wants to merge 8 commits intomainfrom
reyortiz3/policy-stopped-status
Open

Add policy_stopped workload status and enforce policy gate on restart#4857
reyortiz3 wants to merge 8 commits intomainfrom
reyortiz3/policy-stopped-status

Conversation

@reyortiz3
Copy link
Copy Markdown
Contributor

@reyortiz3 reyortiz3 commented Apr 15, 2026

Summary

  • Policy enforcement currently only gates workload creation — running workloads are unaffected when a policy activates, and restart paths have no policy check at all. This PR adds the upstream groundwork to enforce the non_registry_servers policy on existing workloads.
  • Adds WorkloadStatusPolicyStopped as a first-class status (following the precedent of WorkloadStatusUnauthenticated) so the policy layer can set it when stopping a non-compliant workload, giving the UI and CLI a programmatic signal rather than freeform text.
  • Calls EagerCheckCreateServer in restartSingleWorkload so the policy gate blocks any restart attempt for workloads that violate the active policy.

Changes

File Change
pkg/container/runtime/types.go Add WorkloadStatusPolicyStopped constant
pkg/core/workload.go Add policy_stopped to enums tag for swagger
pkg/workloads/manager.go Map policy_stoppedBackendUnhealthy; call EagerCheckCreateServer in restart path
cmd/thv/app/list.go, status.go Add 🚫 indicator for policy_stopped in CLI output
docs/server/swagger.*, docs.go Regenerated to include new enum value

Type of change

  • New feature (non-breaking change which adds functionality)

Test plan

  • task build passes
  • task test passes (pre-existing failures in pkg/client and pkg/telemetry are unrelated to this change and reproduce on main)
  • Swagger enum updated and regenerated via task docs

Does this introduce a user-facing change?

CLI list and status commands will show 🚫 policy_stopped for workloads stopped by policy, making the reason immediately visible rather than showing a plain stopped or error.

Generated with Claude Code

Introduces WorkloadStatusPolicyStopped as a first-class workload status
to support enterprise policy enforcement. When non-registry servers are
blocked by policy, workloads can be set to this status so the UI and CLI
can surface the reason clearly, rather than showing an opaque "error".

- Add WorkloadStatusPolicyStopped constant to pkg/container/runtime/types.go
- Update enums tag in pkg/core/workload.go to include policy_stopped
- Map policy_stopped to BackendUnhealthy in mapWorkloadStatusToVMCPHealth
- Call EagerCheckCreateServer in restartSingleWorkload so the policy gate
  blocks restart of workloads that violate the active policy
- Add 🚫 indicator for policy_stopped in CLI list and status output,
  following the existing ⚠️ pattern for unauthenticated workloads
- Regenerate swagger docs to include the new status enum value

Part of stacklok/stacklok-enterprise-platform#406

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the size/XS Extra small PR: < 100 lines changed label Apr 15, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 0% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.07%. Comparing base (e92dba4) to head (d1d1dfb).

Files with missing lines Patch % Lines
pkg/workloads/manager.go 0.00% 9 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4857      +/-   ##
==========================================
+ Coverage   69.06%   69.07%   +0.01%     
==========================================
  Files         531      531              
  Lines       55172    55182      +10     
==========================================
+ Hits        38104    38117      +13     
+ Misses      14140    14138       -2     
+ Partials     2928     2927       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread docs/server/docs.go
Comment thread pkg/workloads/manager.go
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Apr 15, 2026
Add all WorkloadStatus cases to the switches in list and status
commands to satisfy the exhaustive linter.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Apr 15, 2026
@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Apr 15, 2026
@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Apr 15, 2026
@reyortiz3
Copy link
Copy Markdown
Contributor Author

@claude can you verify if the comments were addressed?

@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Apr 15, 2026
- Add policy_stopped to workloadStatusResponse enums tag so the swagger
  spec for the status endpoint includes the new value
- Add EagerCheckCreateServer check in maybeSetupContainerWorkload after
  loadRunnerFromState to cover the path where the outer LoadState call
  fails on a partial name but succeeds after label-based name resolution
- Regenerate swagger docs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Apr 15, 2026
@stacklok stacklok deleted a comment from claude bot Apr 15, 2026
@reyortiz3
Copy link
Copy Markdown
Contributor Author

@claude can you review if the comments have been addressed?

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 15, 2026

Claude finished @reyortiz3's task in 56s —— View job


Review comment verification

  • Read pkg/api/v1/workload_types.go — check enums tag on workloadStatusResponse
  • Read pkg/workloads/manager.go — check placement of EagerCheckCreateServer call
  • Report findings

Both previous review comments have been addressed. ✅

Comment 1 — Missing policy_stopped in workloadStatusResponse enums tag

Status: Fixed

pkg/api/v1/workload_types.go:35 now includes policy_stopped at the end of the enums tag:

Status runtime.WorkloadStatus `json:"status" enums:"running,stopped,error,starting,stopping,unhealthy,removing,unknown,unauthenticated,policy_stopped"`

The swagger-generated files should now consistently emit all 10 enum values for the workloadStatusResponse schema, matching the other blocks.


Comment 2 — EagerCheckCreateServer bypass when LoadState fails

Status: Fixed

A second call to EagerCheckCreateServer was added inside maybeSetupContainerWorkload (pkg/workloads/manager.go:1306-1311), which is invoked by restartContainerWorkload. This covers the path where the outer runner.LoadState fails in restartSingleWorkload and execution falls through to restartContainerWorkload:

// Check policy gates before restarting. This covers the case where the caller
// could not load state via the original name but we resolved the canonical name
// from container labels above, so the check must happen here.
if err := runner.EagerCheckCreateServer(ctx, mcpRunner.Config); err != nil {
    return "", nil, fmt.Errorf("server restart blocked by policy: %w", err)
}

The policy gate is now enforced regardless of whether LoadState succeeds or fails in the outer function, closing the bypass described in the review.


Both raised issues are resolved. The PR looks good from a review-comment-resolution perspective.

rdimitrov
rdimitrov previously approved these changes Apr 15, 2026
Comment thread pkg/workloads/manager.go
// Check policy gates before restarting — the loaded RunConfig carries the same
// fields (RegistryAPIURL, RegistryURL, RemoteURL) that the gate evaluates on create.
if err := runner.EagerCheckCreateServer(ctx, runConfig); err != nil {
return fmt.Errorf("server restart blocked by policy: %w", err)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both policy gate check sites return an error here but never call SetWorkloadStatus(ctx, name, rt.WorkloadStatusPolicyStopped, ...). The constant is defined, the CLI renders the indicator, and the swagger docs list it, but nothing ever transitions a workload into this status. A blocked restart just leaves the workload as stopped.

Is this intentional -- e.g. the operator or some downstream consumer will set this? If so it might be worth a comment saying so. Otherwise the display code and health mapping for policy_stopped are unreachable.

Comment thread pkg/workloads/manager.go
return vmcp.BackendUnauthenticated
case rt.WorkloadStatusPolicyStopped:
return vmcp.BackendUnhealthy
default:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mapping PolicyStopped to BackendUnhealthy means vMCP can't distinguish "crashed" from "blocked by policy". Compare with Unauthenticated which gets its own BackendUnauthenticated variant. Is the generic mapping intentional here, or would a dedicated BackendPolicyStopped be worth adding for parity?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed this, but it's the same pattern as above, for example:

case rt.WorkloadStatusStopped, rt.WorkloadStatusError, rt.WorkloadStatusStopping, rt.WorkloadStatusRemoving:
		return vmcp.BackendUnhealthy

Comment thread cmd/thv/app/list.go Outdated
case rt.WorkloadStatusPolicyStopped:
status = "🚫 " + status
case rt.WorkloadStatusRunning, rt.WorkloadStatusStopped, rt.WorkloadStatusError,
rt.WorkloadStatusStarting, rt.WorkloadStatusStopping, rt.WorkloadStatusUnhealthy,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this switch block is duplicated verbatim in status.go. Might be worth extracting a small helper in common.go so future status indicators only need updating in one place.

…ndicator

- Call SetWorkloadStatus with WorkloadStatusPolicyStopped in both policy
  gate failure paths in restartSingleWorkload and maybeSetupContainerWorkload,
  so a blocked restart transitions the workload into the new status instead
  of leaving it as stopped
- Extract the status indicator switch (⚠️ / 🚫) from list.go and status.go
  into a single workloadStatusIndicator helper in common.go

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Apr 15, 2026
@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Apr 15, 2026
@reyortiz3
Copy link
Copy Markdown
Contributor Author

@claude can you review what comments are pending address?

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 15, 2026

Claude finished @reyortiz3's task in 1m 18s —— View job


Pending comment review

  • Gather context from review comments
  • Check if SetWorkloadStatus(policy_stopped) is called anywhere (jhrozek comment)
  • Check if switch block deduplication was done in common.go (jhrozek nit)
  • Check jhrozek's BackendUnhealthy mapping comment status
  • Report findings

All three of @jhrozek's review comments have been addressed. Here's the breakdown:

Comment 1 — SetWorkloadStatus never transitions to policy_stopped (manager.go:1097)

Status: Addressed ✅

Both policy gate sites now call SetWorkloadStatus with WorkloadStatusPolicyStopped:

  • pkg/workloads/manager.go:1094 — in restartSingleWorkload when the outer LoadState path blocks restart
  • pkg/workloads/manager.go:1313 — in maybeSetupContainerWorkload when the inner path blocks restart
if statusErr := d.statuses.SetWorkloadStatus(ctx, name, rt.WorkloadStatusPolicyStopped, err.Error()); statusErr != nil {
    slog.Warn("Failed to set workload status to policy_stopped", ...)
}

The status is now reachable and will be set when a restart is blocked by policy.


Comment 2 — PolicyStopped maps to BackendUnhealthy instead of a dedicated variant (manager.go:248)

Status: Author provided justification, no code change

@reyortiz3 replied noting this follows the same pattern as Stopped, Error, Stopping, and Removing — all of which map to BackendUnhealthy. The comment did not request a change, it asked whether the generic mapping was intentional. The author's response clarifies intent. This is still open as an unresolved conversation thread; @jhrozek has not yet replied to confirm whether the justification is acceptable.

If you want this to be considered resolved, a follow-up reply from @jhrozek (or dismissing the thread) would close it out.


Comment 3 (Nit) — Duplicated switch block in list.go and status.go

Status: Addressed ✅

The switch was extracted to workloadStatusIndicator() in cmd/thv/app/common.go:162–174. Both list.go:168 and status.go:102 now call the helper:

status := workloadStatusIndicator(c.Status)   // list.go
status := workloadStatusIndicator(workload.Status)  // status.go

Summary

Comment Commenter Status
SetWorkloadStatus never called for policy_stopped jhrozek ✅ Fixed
BackendUnhealthy vs dedicated BackendPolicyStopped jhrozek 💬 Author justified; awaiting reviewer acknowledgement
Duplicated switch block in list.go/status.go jhrozek ✅ Fixed (extracted to common.go)

The only item that is technically still "open" is the BackendUnhealthy mapping question — but it's a design discussion, not a required change. The thread needs @jhrozek to respond or be dismissed.

Copy link
Copy Markdown
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good. The status-setting and deduplication fixes address the review comments well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XS Extra small PR: < 100 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants