refactor: optimize Spanner batching and standardize OpenTelemetry deployment #2570

Open
jcscottiii wants to merge 2 commits into main from fixup/june22-final

@jcscottiii (Collaborator)

This commit optimizes Spanner write performance and standardizes OpenTelemetry (OTel) deployment across the backend API, background workers, and ingestion jobs.

  1. Spanner Batching:
  • Refactored UpsertWPTRunFeatureMetrics in lib/gcpspanner/wpt_run_feature_metric.go to execute two batch queries at the transaction start instead of executing queries inside a loop, reducing database roundtrips.
  • Extracted mutation building to buildWPTRunFeatureMetricMutations to reduce cognitive complexity.
  2. Centralized Terraform Telemetry Configuration:
  • Created root-level infra/telemetry.tf to define a single shared Secret Manager secret containing the OTel collector configuration.
  • Swapped the custom in-tree OTel collector image for the official Google-managed image (otelcol-google:0.151.0), following the new documentation: https://docs.cloud.google.com/stackdriver/docs/instrumentation/opentelemetry-collector-cloud-run
  • Defined a centralized otel_collector_config_mount_path local variable in infra/telemetry.tf set to "/etc/otelcol-google" and propagated it to all submodules, replacing hardcoded paths.
  3. OTel Sidecar Deployment (Go & Terraform):
  • Added opentelemetry.MaybeSetup helper to lib/opentelemetry/setup.go to encapsulate environment checks and OTel SDK initialization.
  • Refactored backend/cmd/server/main.go and all 11 background workers and daily ingestion scraper entrypoints to call MaybeSetup and defer shutdown.
  • Deployed the OTel sidecar container and mounted the shared config secret across all 4 worker pools and the reusable job module.
  • Granted roles/cloudtrace.agent, roles/monitoring.metricWriter, and roles/logging.logWriter to the backend and worker service accounts.
  4. GCP Error Reporting & Structured Logging:
  • Updated the custom slog handler in lib/opentelemetry/slog.go to capture and append runtime/debug.Stack() to ERROR logs.
  • Structured logs to enable automatic GCP Error Reporting aggregation and trace-log linking using the trace field.
  5. Go Startup Logging & Refactoring:
  • Added verbose BOOT: log statements before each client initialization phase (Datastore, Spanner, Valkey, OTel) in backend/cmd/server/main.go to provide startup phase visibility.
  • Refactored the inline OpenTelemetry setup block in main.go to use the new opentelemetry.MaybeSetup helper, keeping the startup sequence synchronous.
  6. Repository Cleanup:
  • Deleted the unused custom in-tree OTel collector Dockerfile (otel/Dockerfile).
  • Removed the /otel Docker package-ecosystem update entry from .github/dependabot.yml.
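
For reviewers, the Spanner batching change described under "Spanner Batching" can be pictured with the minimal sketch below. It is not the code in lib/gcpspanner/wpt_run_feature_metric.go: the `reader` interface, the `metric` and `mutation` types, and the guess that the two batch queries resolve feature IDs and fetch existing rows are all assumptions made for illustration. The point is only the shape: two lookups up front, then a pure mutation builder that touches no database.

```go
package main

import "fmt"

// metric is a hypothetical input row: pass/total counts for one feature key.
type metric struct {
	FeatureKey  string
	Pass, Total int64
}

// mutation is a stand-in for a Spanner mutation on a single row.
type mutation struct {
	Op          string // "insert" or "update"
	FeatureID   string
	Pass, Total int64
}

// reader is a hypothetical view of the read side of a transaction.
// Each method call stands for one database round trip.
type reader interface {
	FeatureIDsByKey(keys []string) map[string]string
	ExistingMetricFeatureIDs(runID int64, featureIDs []string) map[string]bool
}

// buildMutations is pure: it consults only the pre-fetched maps,
// so the loop performs no queries.
func buildMutations(in []metric, ids map[string]string, existing map[string]bool) []mutation {
	out := make([]mutation, 0, len(in))
	for _, m := range in {
		id, ok := ids[m.FeatureKey]
		if !ok {
			continue // unknown feature key: skipped in this sketch
		}
		op := "insert"
		if existing[id] {
			op = "update"
		}
		out = append(out, mutation{Op: op, FeatureID: id, Pass: m.Pass, Total: m.Total})
	}
	return out
}

// upsertMetrics issues exactly two batch lookups, regardless of len(in).
func upsertMetrics(r reader, runID int64, in []metric) []mutation {
	keys := make([]string, 0, len(in))
	for _, m := range in {
		keys = append(keys, m.FeatureKey)
	}
	ids := r.FeatureIDsByKey(keys) // batch query 1

	featureIDs := make([]string, 0, len(ids))
	for _, id := range ids {
		featureIDs = append(featureIDs, id)
	}
	existing := r.ExistingMetricFeatureIDs(runID, featureIDs) // batch query 2

	return buildMutations(in, ids, existing)
}

// fakeReader is an in-memory reader that counts round trips.
type fakeReader struct {
	ids      map[string]string
	existing map[string]bool
	queries  int
}

func (f *fakeReader) FeatureIDsByKey(keys []string) map[string]string {
	f.queries++
	out := map[string]string{}
	for _, k := range keys {
		if id, ok := f.ids[k]; ok {
			out[k] = id
		}
	}
	return out
}

func (f *fakeReader) ExistingMetricFeatureIDs(runID int64, featureIDs []string) map[string]bool {
	f.queries++
	out := map[string]bool{}
	for _, id := range featureIDs {
		if f.existing[id] {
			out[id] = true
		}
	}
	return out
}

func main() {
	r := &fakeReader{
		ids:      map[string]string{"grid": "f1", "flexbox": "f2"},
		existing: map[string]bool{"f1": true},
	}
	muts := upsertMetrics(r, 1, []metric{{"grid", 5, 10}, {"flexbox", 1, 2}, {"unknown", 0, 1}})
	fmt.Printf("%d mutations, %d queries\n", len(muts), r.queries)
}
```

Keeping the builder free of I/O is what makes the extraction testable in isolation: the round-trip count stays at two whether the input holds three metrics or three thousand.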

BUG=526562255

@jcscottiii jcscottiii requested a review from neilv-g June 23, 2026 20:43
@jcscottiii jcscottiii force-pushed the fixup/june22-final branch from ef066fd to 531e508 Compare June 24, 2026 13:20