[CELEBORN-2316] Introduce metadata operation metrics#3673
Open
AmandeepSingh285 wants to merge 2 commits intoapache:mainfrom
Open
[CELEBORN-2316] Introduce metadata operation metrics#3673AmandeepSingh285 wants to merge 2 commits intoapache:mainfrom
AmandeepSingh285 wants to merge 2 commits intoapache:mainfrom
Conversation
Member
|
@AmandeepSingh285, thanks for first contribution. Please run |
There was a problem hiding this comment.
Pull request overview
This PR adds worker-side metrics to improve observability of RocksDB-backed metadata operations by counting read/write successes and failures.
Changes:
- Introduces
MetadataMetricsand wires it into RocksDBput/get/deleteand iterator reads/seeks to track success/failure counts. - Registers a new labeled counter in
WorkerSourcefor metadata operation status (operation= read/write,status= success/fail). - Threads
WorkerSource/AbstractSourcethroughDBProvider.initDB(...)call sites so RocksDB can emit metrics.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| worker/src/test/java/org/apache/celeborn/service/deploy/worker/shuffledb/DBProviderSuiteJ.java | Updates DB init calls to pass a WorkerSource after initDB signature change. |
| worker/src/main/scala/org/apache/celeborn/service/deploy/worker/storage/StorageManager.scala | Passes workerSource into DB initialization for graceful-shutdown recovery DB. |
| worker/src/main/scala/org/apache/celeborn/service/deploy/worker/WorkerSource.scala | Adds labeled counters for metadata read/write success/failure. |
| worker/src/main/java/org/apache/celeborn/service/deploy/worker/storage/PartitionFilesSorter.java | Passes source into DB initialization for sorted-files recovery DB. |
| worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/RocksDBIterator.java | Wraps iterator seek/next in metric-recording reads (including status() checks). |
| worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/RocksDB.java | Wraps RocksDB put/get/delete with metric recording and passes metrics into iterator. |
| worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/MetadataMetrics.java | New helper to record labeled success/failure counters around RocksDB operations. |
| worker/src/main/java/org/apache/celeborn/service/deploy/worker/shuffledb/DBProvider.java | Extends initDB to accept an AbstractSource and constructs RocksDB with it. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Author
|
@SteNicholas will work on updating the celeborn dashboard. Could you help with an initial review for the change. Thanks! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
Adding metrics around rocks DB operations. Metrics to have success and failure count for metadata operations for rocksDB observability.
Why are the changes needed?
Adding metrics around metadata operations. Current implementation, metadata operations do not have any observability added. RocksDB goes into a read only mode when any critical errors are encountered which results in all write operations failing.
Observability around metadata operations is critical and failures could result in metadata entering an inconsistent state.
Does this PR resolve a correctness bug?
No
Does this PR introduce any user-facing change?
No
How was this patch tested?
Tested in local staging setup -
