Skip to content

fix(agentex): bootstrap OTel auto-instrumentation in uvicorn spawn workers#305

Merged
james-cardenas merged 9 commits into
mainfrom
jamesc-fix-auto-intrumentation
Jun 15, 2026
Merged

fix(agentex): bootstrap OTel auto-instrumentation in uvicorn spawn workers#305
james-cardenas merged 9 commits into
mainfrom
jamesc-fix-auto-intrumentation

Conversation

@james-cardenas

@james-cardenas james-cardenas commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Bootstrap OpenTelemetry auto-instrumentation at import time so uvicorn spawn workers get HTTP/library instrumentation and custom metrics, not just the parent process
  • Assign a per-worker service.instance.id by patching OTEL_RESOURCE_ATTRIBUTES before initialize(), fixing shared timeseries when multiple workers inherit the same pod-level resource attrs (opentelemetry-python#4390)
  • Move otel_metrics to the first import in app.py so instrumentors patch FastAPI/httpx/SQLAlchemy at import time; init_otel_metrics() at lifespan startup attaches custom instruments to the existing global MeterProvider

Problem

The OTel Operator injects auto-instrumentation via sitecustomize, which runs initialize() in the parent process and then strips auto-instrumentation from PYTHONPATH. Uvicorn spawn workers are fresh Python processes without sitecustomize, so they previously served plain FastAPI with no OTel HTTP middleware or metrics.

Separately, spawn workers share the pod-level OTEL_RESOURCE_ATTRIBUTES env var. Auto-instrumentation builds provider resources from env at initialize() time via Resource.create() — so all workers would emit on the same service.instance.id without a per-process suffix (opentelemetry-python#4390).

Solution

  1. bootstrap_auto_instrumentation() — runs on otel_metrics import; syncs service.instance.id.<pid> into env, then calls initialize() to create global TracerProvider/MeterProvider and load instrumentors (auto-instrumentation reference)
  2. init_otel_metrics() — unchanged coexistence model: attaches custom app metrics (auth_cache_*, db_*) to the bootstrap provider when present; standalone OTLP pipeline only when no global provider exists
  3. Import orderapp.py imports otel_metrics first, before FastAPI and other auto-instrumented libraries

References

Notes

  • Helm (single worker): bootstrap runs in the worker process; operator k8s resource labels are preserved; duplicate initialize() on --workers 1 only produces set-once warnings
  • Dockerfile (--workers 4): each spawn worker bootstraps independently with a distinct pid-suffixed service.instance.id
  • ddtrace coexistence: documented in module docstring; Helm uses ddtrace-run only when datadog.env is set
  • Removed ineffective NoOpMeterProvider reset on shutdown — OTel global MeterProvider is set-once and cannot be replaced

Test plan

  • pytest agentex/tests/unit/utils/test_otel_metrics.py (24 tests)
  • Deploy to cluster; confirm _InstrumentedFastAPI in spawn workers
  • Confirm distinct service.instance.id per worker when --workers > 1
  • Confirm FastAPI HTTP metrics include http_route and k8s resource labels from operator env
  • Confirm custom metrics (auth_cache_*, db_*) export on the same provider resource

Made with Cursor

Greptile Summary

This PR bootstraps OTel auto-instrumentation in uvicorn spawn workers by importing otel_metrics first in app.py and calling bootstrap_auto_instrumentation() before any auto-instrumented library is imported. It also assigns a per-worker service.instance.id by appending the process PID to avoid shared timeseries across workers.

  • Bootstrap flow: bootstrap_auto_instrumentation() syncs a PID-suffixed service.instance.id into OTEL_RESOURCE_ATTRIBUTES, then calls initialize() to install global TracerProvider/MeterProvider and patch libraries; failures are caught and logged without crashing the service.
  • Custom metrics coexistence: init_otel_metrics() attaches to the bootstrap provider when present, or creates a standalone OTLP pipeline when no global provider exists; the NoOpMeterProvider reset on shutdown is removed because OTel globals are set-once.
  • Test coverage: 24 unit tests added for bootstrap, idempotency, failure recovery, and resource attribute construction.

Confidence Score: 5/5

Safe to merge — the bootstrap path is well-guarded with try/except and idempotency checks; the import-order change in app.py achieves the intended patching window.

Bootstrap failures are caught and logged without crashing the service. The per-worker service.instance.id logic correctly handles the idempotency case and the env-mutation is scoped to resource attribute construction before initialize(). The two observations noted are edge cases with no current practical impact given how init_otel_metrics is called today.

agentex/src/utils/otel_metrics.py — _sync_instance_id_to_env comma splitting and the resource merge precedence in init_otel_metrics standalone mode.

Important Files Changed

Filename Overview
agentex/src/utils/otel_metrics.py Adds bootstrap_auto_instrumentation() with proper exception handling and idempotency; introduces _sync_instance_id_to_env that splits OTEL_RESOURCE_ATTRIBUTES by bare comma without handling quoted values, and a resource merge pattern where env-detected attributes in the pid_resource can silently override explicit service_name/service_version/environment args passed to init_otel_metrics in standalone mode.
agentex/src/api/app.py otel_metrics moved to first import with bootstrap_auto_instrumentation() call before FastAPI/httpx/SQLAlchemy; ruff E402 suppressed with a clear comment; no functional regressions.
agentex/tests/unit/utils/test_otel_metrics.py Comprehensive bootstrap test coverage: import blocking, failure/retry semantics, idempotency, env mutation, and PID-suffixed resource attributes; autouse fixture correctly resets bootstrap flag between tests.

Sequence Diagram

sequenceDiagram
    participant UV as uvicorn spawn worker
    participant APP as app.py
    participant OTEL as otel_metrics.py
    participant SDK as OTel SDK
    participant LS as FastAPI lifespan

    UV->>APP: import app
    APP->>OTEL: import otel_metrics
    APP->>OTEL: bootstrap_auto_instrumentation()
    OTEL->>OTEL: _sync_instance_id_to_env with pid suffix
    OTEL->>SDK: initialize()
    SDK-->>OTEL: TracerProvider + MeterProvider set globally
    Note over SDK: FastAPI, httpx, SQLAlchemy patched
    OTEL-->>APP: True
    APP->>APP: import FastAPI and other libraries

    UV->>LS: startup lifespan
    LS->>OTEL: init_otel_metrics()
    alt Bootstrap provider exists
        OTEL-->>LS: attach custom instruments to bootstrap provider
    else No global provider standalone mode
        OTEL->>SDK: MeterProvider with pid-suffixed resource
        OTEL-->>LS: new standalone MeterProvider
    end

    UV->>LS: shutdown lifespan
    LS->>OTEL: shutdown_otel_metrics()
    Note over OTEL: Only shuts down module-created provider
Loading

Fix All in Cursor Fix All in Claude Code Fix All in Codex

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
agentex/src/utils/otel_metrics.py:110-114
The comma-split on `OTEL_RESOURCE_ATTRIBUTES` doesn't account for quoted values (e.g., `some.key="hello,world"`). The OTel spec allows commas inside quoted values, and the `OTELResourceDetector` handles them correctly. This bare `split(",")` would shatter a quoted entry into fragments and reassemble a corrupted string. If the operator ever injects a label whose value contains a comma, the attribute will be silently mangled in the environment for every subsequent call that reads it (including `initialize()` itself).

```suggestion
    import re

    # Split only on commas that are NOT inside double-quoted values.
    raw = os.environ.get("OTEL_RESOURCE_ATTRIBUTES", "")
    parts = [
        part.strip()
        for part in re.split(r',(?=(?:[^"]*"[^"]*")*[^"]*$)', raw)
        if part.strip() and not part.strip().startswith(f"{key}=")
    ]
```

### Issue 2 of 2
agentex/src/utils/otel_metrics.py:273-277
**Resource merge silently drops explicit constructor args**

`Resource.create({"service.instance.id": ...})` runs `OTELResourceDetector` internally, so the second resource carries ALL env attributes at "env" priority. Because `Resource.merge` gives the `other` argument (the pid resource) precedence, any env value for `service.name`, `service.version`, or `deployment.environment` in `OTEL_RESOURCE_ATTRIBUTES` will silently override the `service_name`, `service_version`, and `environment` keyword arguments passed to `init_otel_metrics`. Today's call site passes no args so the values are identical, but a caller passing `service_name="custom"` with `OTEL_SERVICE_NAME=env-name` set would find `custom` discarded. Consider creating the pid resource with only the one explicit attribute and merging the detected env resource separately so precedence is unambiguous.

Reviews (5): Last reviewed commit: "Merge branch 'main' into jamesc-fix-auto..." | Re-trigger Greptile

@james-cardenas james-cardenas requested a review from a team as a code owner June 12, 2026 05:14
Comment thread agentex/src/utils/otel_metrics.py Outdated
Comment thread agentex/src/utils/otel_metrics.py

@smoreinis smoreinis left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good to me - my main question is whether this actually works in a deployed environment? I see that we try to import opentelemetry.instrumentation.auto_instrumentation but the agentex pyproject.toml only lists opentelemetry-api, opentelemetry-sdk, and opentelemetry-exporter-otlp so I'm wondering if this would just still end up being a no-op or if there's any other mechanism that helps ensure that this import works (like a config on the helm chart?)

@james-cardenas

Copy link
Copy Markdown
Contributor Author

Thanks for the review @smoreinis ! And to answer your questions, this works in our deployed environment when the deployment has auto-instrumentation enabled via the appropriate annotations, in which case the OTel Operator copies a pre-packaged bundle of all the OpenTelemetry Python libraries (the SDK, Exporters, and auto-instrumentors, etc. ) into the pod via an initContainer.

@james-cardenas james-cardenas merged commit d65011a into main Jun 15, 2026
31 checks passed
@james-cardenas james-cardenas deleted the jamesc-fix-auto-intrumentation branch June 15, 2026 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants