Skip to content

[CELEBORN-2316] Introduce metadata operation metrics#3673

Open
AmandeepSingh285 wants to merge 2 commits intoapache:mainfrom
AmandeepSingh285:adding-metadata-metrics
Open

[CELEBORN-2316] Introduce metadata operation metrics#3673
AmandeepSingh285 wants to merge 2 commits intoapache:mainfrom
AmandeepSingh285:adding-metadata-metrics

Conversation

@AmandeepSingh285
Copy link
Copy Markdown

What changes were proposed in this pull request?

Adding metrics around rocks DB operations. Metrics to have success and failure count for metadata operations for rocksDB observability.

Why are the changes needed?

Adding metrics around metadata operations. Current implementation, metadata operations do not have any observability added. RocksDB goes into a read only mode when any critical errors are encountered which results in all write operations failing.
Observability around metadata operations is critical and failures could result in metadata entering an inconsistent state.

Does this PR resolve a correctness bug?

No

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested in local staging setup -
image

@SteNicholas SteNicholas changed the title [CELEBORN-2316] Added metrics for metadata operations [CELEBORN-2316] Introduce metadata operation metrics Apr 30, 2026
@SteNicholas
Copy link
Copy Markdown
Member

SteNicholas commented Apr 30, 2026

@AmandeepSingh285, thanks for first contribution. Please run dev/reformat to format code changes. Then add introduced metrics into celeborn-dashboard.json.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds worker-side metrics to improve observability of RocksDB-backed metadata operations by counting read/write successes and failures.

Changes:

  • Introduces MetadataMetrics and wires it into RocksDB put/get/delete and iterator reads/seeks to track success/failure counts.
  • Registers a new labeled counter in WorkerSource for metadata operation status (operation = read/write, status = success/fail).
  • Threads WorkerSource / AbstractSource through DBProvider.initDB(...) call sites so RocksDB can emit metrics.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
worker/src/test/java/org/apache/celeborn/service/deploy/worker/shuffledb/DBProviderSuiteJ.java Updates DB init calls to pass a WorkerSource after initDB signature change.
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/storage/StorageManager.scala Passes workerSource into DB initialization for graceful-shutdown recovery DB.
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/WorkerSource.scala Adds labeled counters for metadata read/write success/failure.
worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/PartitionFilesSorter.java Passes source into DB initialization for sorted-files recovery DB.
worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/RocksDBIterator.java Wraps iterator seek/next in metric-recording reads (including status() checks).
worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/RocksDB.java Wraps RocksDB put/get/delete with metric recording and passes metrics into iterator.
worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/MetadataMetrics.java New helper to record labeled success/failure counters around RocksDB operations.
worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/DBProvider.java Extends initDB to accept an AbstractSource and constructs RocksDB with it.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@AmandeepSingh285
Copy link
Copy Markdown
Author

@SteNicholas will work on updating the celeborn dashboard. Could you help with an initial review for the change. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants