fix(agentex): bootstrap OTel auto-instrumentation in uvicorn spawn workers#305
Conversation
smoreinis
left a comment
There was a problem hiding this comment.
this looks good to me - my main question is whether this actually works in a deployed environment? I see that we try to import opentelemetry.instrumentation.auto_instrumentation but the agentex pyproject.toml only lists opentelemetry-api, opentelemetry-sdk, and opentelemetry-exporter-otlp so I'm wondering if this would just still end up being a no-op or if there's any other mechanism that helps ensure that this import works (like a config on the helm chart?)
|
Thanks for the review @smoreinis ! And to answer your questions, this works in our deployed environment when the deployment has auto-instrumentation enabled via the appropriate annotations, in which case the OTel Operator copies a pre-packaged bundle of all the OpenTelemetry Python libraries (the SDK, Exporters, and auto-instrumentors, etc. ) into the pod via an initContainer. |
Summary
service.instance.idby patchingOTEL_RESOURCE_ATTRIBUTESbeforeinitialize(), fixing shared timeseries when multiple workers inherit the same pod-level resource attrs (opentelemetry-python#4390)otel_metricsto the first import inapp.pyso instrumentors patch FastAPI/httpx/SQLAlchemy at import time;init_otel_metrics()at lifespan startup attaches custom instruments to the existing globalMeterProviderProblem
The OTel Operator injects auto-instrumentation via
sitecustomize, which runsinitialize()in the parent process and then strips auto-instrumentation fromPYTHONPATH. Uvicorn spawn workers are fresh Python processes withoutsitecustomize, so they previously served plainFastAPIwith no OTel HTTP middleware or metrics.Separately, spawn workers share the pod-level
OTEL_RESOURCE_ATTRIBUTESenv var. Auto-instrumentation builds provider resources from env atinitialize()time viaResource.create()— so all workers would emit on the sameservice.instance.idwithout a per-process suffix (opentelemetry-python#4390).Solution
bootstrap_auto_instrumentation()— runs onotel_metricsimport; syncsservice.instance.id.<pid>into env, then callsinitialize()to create globalTracerProvider/MeterProviderand load instrumentors (auto-instrumentation reference)init_otel_metrics()— unchanged coexistence model: attaches custom app metrics (auth_cache_*,db_*) to the bootstrap provider when present; standalone OTLP pipeline only when no global provider existsapp.pyimportsotel_metricsfirst, before FastAPI and other auto-instrumented librariesReferences
initialize()creates providers viaResource.create()from env, not patching-onlyservice.instance.idfor multi-worker processesNotes
initialize()on--workers 1only produces set-once warnings--workers 4): each spawn worker bootstraps independently with a distinct pid-suffixedservice.instance.idddtrace-runonly whendatadog.envis setNoOpMeterProviderreset on shutdown — OTel globalMeterProvideris set-once and cannot be replacedTest plan
pytest agentex/tests/unit/utils/test_otel_metrics.py(24 tests)_InstrumentedFastAPIin spawn workersservice.instance.idper worker when--workers > 1http_routeand k8s resource labels from operator envauth_cache_*,db_*) export on the same provider resourceMade with Cursor
Greptile Summary
This PR bootstraps OTel auto-instrumentation in uvicorn spawn workers by importing
otel_metricsfirst inapp.pyand callingbootstrap_auto_instrumentation()before any auto-instrumented library is imported. It also assigns a per-workerservice.instance.idby appending the process PID to avoid shared timeseries across workers.bootstrap_auto_instrumentation()syncs a PID-suffixedservice.instance.idintoOTEL_RESOURCE_ATTRIBUTES, then callsinitialize()to install globalTracerProvider/MeterProviderand patch libraries; failures are caught and logged without crashing the service.init_otel_metrics()attaches to the bootstrap provider when present, or creates a standalone OTLP pipeline when no global provider exists; theNoOpMeterProviderreset on shutdown is removed because OTel globals are set-once.Confidence Score: 5/5
Safe to merge — the bootstrap path is well-guarded with try/except and idempotency checks; the import-order change in app.py achieves the intended patching window.
Bootstrap failures are caught and logged without crashing the service. The per-worker service.instance.id logic correctly handles the idempotency case and the env-mutation is scoped to resource attribute construction before initialize(). The two observations noted are edge cases with no current practical impact given how init_otel_metrics is called today.
agentex/src/utils/otel_metrics.py — _sync_instance_id_to_env comma splitting and the resource merge precedence in init_otel_metrics standalone mode.
Important Files Changed
Sequence Diagram
sequenceDiagram participant UV as uvicorn spawn worker participant APP as app.py participant OTEL as otel_metrics.py participant SDK as OTel SDK participant LS as FastAPI lifespan UV->>APP: import app APP->>OTEL: import otel_metrics APP->>OTEL: bootstrap_auto_instrumentation() OTEL->>OTEL: _sync_instance_id_to_env with pid suffix OTEL->>SDK: initialize() SDK-->>OTEL: TracerProvider + MeterProvider set globally Note over SDK: FastAPI, httpx, SQLAlchemy patched OTEL-->>APP: True APP->>APP: import FastAPI and other libraries UV->>LS: startup lifespan LS->>OTEL: init_otel_metrics() alt Bootstrap provider exists OTEL-->>LS: attach custom instruments to bootstrap provider else No global provider standalone mode OTEL->>SDK: MeterProvider with pid-suffixed resource OTEL-->>LS: new standalone MeterProvider end UV->>LS: shutdown lifespan LS->>OTEL: shutdown_otel_metrics() Note over OTEL: Only shuts down module-created providerPrompt To Fix All With AI
Reviews (5): Last reviewed commit: "Merge branch 'main' into jamesc-fix-auto..." | Re-trigger Greptile