refactor: optimize Spanner batching and standardize OpenTelemetry deployment #2570
Open
jcscottiii wants to merge 2 commits into
…loyment

This commit optimizes Spanner write performance and standardizes OpenTelemetry (OTel) deployment across the backend API, background workers, and ingestion jobs.

1. Spanner Batching:
   * Refactored UpsertWPTRunFeatureMetrics in lib/gcpspanner/wpt_run_feature_metric.go to execute two batch queries at the transaction start instead of executing queries inside a loop, reducing database roundtrips.
   * Extracted mutation building to buildWPTRunFeatureMetricMutations to reduce cognitive complexity.

2. Centralized Terraform Telemetry Configuration:
   * Created root-level infra/telemetry.tf to define a single shared Secret Manager secret containing the OTel collector configuration.
   * Swapped the custom in-tree OTel collector image for the official Google-managed image (otelcol-google:0.151.0). This is to match this new documentation: https://docs.cloud.google.com/stackdriver/docs/instrumentation/opentelemetry-collector-cloud-run
   * Defined a centralized otel_collector_config_mount_path local variable in infra/telemetry.tf set to "/etc/otelcol-google" and propagated it to all submodules, replacing hardcoded paths.

3. OTel Sidecar Deployment (Go & Terraform):
   * Added opentelemetry.MaybeSetup helper to lib/opentelemetry/setup.go to encapsulate environment checks and OTel SDK initialization.
   * Refactored backend/cmd/server/main.go and all 11 background workers and daily ingestion scraper entrypoints to call MaybeSetup and defer shutdown.
   * Deployed the OTel sidecar container and mounted the shared config secret across all 4 worker pools and the reusable job module.
   * Granted roles/cloudtrace.agent, roles/monitoring.metricWriter, and roles/logging.logWriter to the backend and worker service accounts.

4. GCP Error Reporting & Structured Logging:
   * Updated the custom slog handler in lib/opentelemetry/slog.go to capture and append runtime/debug.Stack() to ERROR logs.
   * Structured logs to enable automatic GCP Error Reporting aggregation and trace-log linking using the trace field.

5. Go Startup Logging & Refactoring:
   * Added verbose BOOT: log statements before each client initialization phase (Datastore, Spanner, Valkey, OTel) in backend/cmd/server/main.go to provide startup phase visibility.
   * Refactored the inline OpenTelemetry setup block in main.go to use the new opentelemetry.MaybeSetup helper, keeping the startup sequence synchronous.

6. Repository Cleanup:
   * Deleted the unused custom in-tree OTel collector Dockerfile (otel/Dockerfile).
   * Removed the /otel Docker package-ecosystem update entry from .github/dependabot.yml.

BUG=526562255
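For reviewers who want the batching change from section 1 in miniature, here is a rough stdlib-only sketch of the general pattern: resolve every key in one query up front, then loop over the in-memory result. It is not the code in this PR. `fakeStore`, `lookupOne`, `lookupMany`, `resolveInLoop`, and `resolveBatched` are hypothetical names, and the roundtrip counter only simulates what a real Spanner transaction would do.

```go
package main

import "fmt"

// fakeStore is a hypothetical stand-in for a database.
// It counts simulated roundtrips so the two strategies can be compared.
type fakeStore struct {
	ids        map[string]int64
	roundtrips int
}

// lookupOne resolves a single key and costs one roundtrip per call.
func (s *fakeStore) lookupOne(key string) (int64, bool) {
	s.roundtrips++
	id, ok := s.ids[key]
	return id, ok
}

// lookupMany resolves all keys at once and costs a single roundtrip.
func (s *fakeStore) lookupMany(keys []string) map[string]int64 {
	s.roundtrips++
	out := make(map[string]int64, len(keys))
	for _, k := range keys {
		if id, ok := s.ids[k]; ok {
			out[k] = id
		}
	}
	return out
}

// resolveInLoop issues one query per key (the pattern being replaced).
func resolveInLoop(s *fakeStore, keys []string) []int64 {
	var out []int64
	for _, k := range keys {
		if id, ok := s.lookupOne(k); ok {
			out = append(out, id)
		}
	}
	return out
}

// resolveBatched issues one query up front, then reads from memory in the loop.
func resolveBatched(s *fakeStore, keys []string) []int64 {
	found := s.lookupMany(keys)
	var out []int64
	for _, k := range keys {
		if id, ok := found[k]; ok {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	keys := []string{"a", "b", "c"}
	data := map[string]int64{"a": 1, "b": 2, "c": 3}

	loop := &fakeStore{ids: data}
	resolveInLoop(loop, keys)

	batched := &fakeStore{ids: data}
	resolveBatched(batched, keys)

	fmt.Printf("loop roundtrips: %d, batched roundtrips: %d\n", loop.roundtrips, batched.roundtrips)
}
```

In this toy version the batched path costs one roundtrip regardless of how many keys are passed, while the loop path grows linearly with the key count.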
ef066fd to
531e508