disagg: Add metrics about disaggregated arch #10631
Signed-off-by: JaySon-Huang <[email protected]>
Pull request overview
This work-in-progress PR adds comprehensive metrics and monitoring capabilities for TiFlash's disaggregated storage architecture. The changes focus on improving observability for S3 operations, delta index placement, and segment read tasks.
Key Changes:
- Added new HTTP API endpoint `/tiflash/remote/info` to fetch remote storage summary statistics
- Introduced metrics for S3 operations (request counts, errors, retries, durations)
- Added metrics for delta index placement operations (reuse, placement counts, row/delete statistics)
- Added current metrics for segment read task pools and active tasks
- Enhanced error tracking for S3 RandomAccessFile operations
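The S3 operation metrics listed above (request counts, errors, retries, durations) can be pictured with a minimal sketch like the one below. It is not the code from this PR: `S3OpMetricsSketch` and `runWithMetrics` are hypothetical names, the counters are plain integers rather than TiFlash metric objects, and the retry policy (a fixed attempt limit with no backoff) is an assumption made only for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>

// Hypothetical counters; the PR registers its metrics in TiFlashMetrics.h instead.
struct S3OpMetricsSketch
{
    std::uint64_t requests = 0; // every attempt sent to the object store
    std::uint64_t errors = 0; // attempts that failed
    std::uint64_t retries = 0; // attempts after the first one
    double total_seconds = 0.0; // wall time spent, including retries
};

// Runs `op` up to `max_attempts` times and records what happened.
// Returns true as soon as one attempt succeeds.
inline bool runWithMetrics(const std::function<bool()> & op, int max_attempts, S3OpMetricsSketch & m)
{
    assert(max_attempts > 0);
    const auto start = std::chrono::steady_clock::now();
    bool ok = false;
    for (int attempt = 0; attempt < max_attempts; ++attempt)
    {
        if (attempt > 0)
            ++m.retries;
        ++m.requests;
        if (op())
        {
            ok = true;
            break;
        }
        ++m.errors;
    }
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    m.total_seconds += elapsed.count();
    return ok;
}
```

Keeping errors and retries as separate counters makes it possible to tell a request that eventually succeeded after retries from one that exhausted its attempts.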
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| metrics/grafana/tiflash_summary.json | Updated Grafana dashboard with new S3 and storage metrics panels, added quantile tracking for subtask durations |
| docs/tiflash_http_api.md | Documented new /tiflash/remote/info API endpoint for fetching storage summary |
| dbms/src/Storages/S3/tests/gtest_s3gcmanager.cpp | Added test case for getStorageSummary functionality |
| dbms/src/Storages/S3/S3RandomAccessFile.h | Added destructor declaration for metrics cleanup |
| dbms/src/Storages/S3/S3RandomAccessFile.cpp | Added S3RandomAccessFile current metric tracking and error event metrics |
| dbms/src/Storages/S3/S3GCManager.h | Added S3StoreStorageSummary and S3StorageSummary structures with JSON serialization |
| dbms/src/Storages/S3/S3GCManager.cpp | Implemented getStoreStorageSummary and getS3StorageSummary methods |
| dbms/src/Storages/S3/S3Common.h | Removed unnecessary include of S3RandomAccessFile.h |
| dbms/src/Storages/S3/S3Common.cpp | Added include of S3RandomAccessFile.h where actually needed |
| dbms/src/Storages/S3/CheckpointManifestS3Set.h | Added size() method for manifest set querying |
| dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.h | Added parseStoreIds helper function declaration |
| dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.cpp | Implemented HandleHttpRequestRemoteInfo and parseStoreIds |
| dbms/src/Storages/DeltaMerge/SegmentReadTaskPool.cpp | Added current metrics tracking for read task pools and active tasks |
| dbms/src/Storages/DeltaMerge/Segment.cpp | Added metrics for place index operations (reuse, placement, statistics) |
| dbms/src/Storages/DeltaMerge/File/DMFilePackFilter.cpp | Added include for S3RandomAccessFile.h |
| dbms/src/Storages/DeltaMerge/File/DMFile.h | Removed unnecessary include of S3RandomAccessFile.h |
| dbms/src/Storages/DeltaMerge/File/ColumnStream.cpp | Added include for S3RandomAccessFile.h |
| dbms/src/Common/TiFlashMetrics.h | Defined new metrics for place index operations and S3 request error tracking |
| dbms/src/Common/ProfileEvents.cpp | Added S3IOReadError and S3IOSeekError profile events |
| dbms/src/Common/CurrentMetrics.cpp | Added DT_SegmentReadTaskPool, DT_SegmentReadTasksActive, and S3RandomAccessFile metrics |
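The table mentions a destructor added to `S3RandomAccessFile` for metrics cleanup, a current metric for open files, and error events for read and seek. A rough sketch of that general pattern follows: a gauge is incremented in the constructor and decremented in the destructor, and error counters are bumped on failure. All types here (`GaugeSketch`, `ErrorCountersSketch`, `TrackedFileSketch`) are hypothetical stand-ins, not the `CurrentMetrics` or `ProfileEvents` APIs in TiFlash, and the I/O is simulated.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical gauge standing in for a "current metric".
struct GaugeSketch
{
    std::atomic<std::int64_t> value{0};
};

// Hypothetical counters standing in for events such as S3IOReadError / S3IOSeekError.
struct ErrorCountersSketch
{
    std::atomic<std::uint64_t> read_errors{0};
    std::atomic<std::uint64_t> seek_errors{0};
};

class TrackedFileSketch
{
public:
    TrackedFileSketch(GaugeSketch & gauge_, ErrorCountersSketch & errors_)
        : gauge(gauge_)
        , errors(errors_)
    {
        // One more live file object.
        gauge.value.fetch_add(1, std::memory_order_relaxed);
    }

    // The destructor is what keeps the gauge accurate when the object goes away.
    ~TrackedFileSketch() { gauge.value.fetch_sub(1, std::memory_order_relaxed); }

    TrackedFileSketch(const TrackedFileSketch &) = delete;
    TrackedFileSketch & operator=(const TrackedFileSketch &) = delete;

    // Simulated read: returns the byte count, or -1 and records an error.
    long read(bool simulate_failure, long n)
    {
        assert(n >= 0);
        if (simulate_failure)
        {
            errors.read_errors.fetch_add(1, std::memory_order_relaxed);
            return -1;
        }
        return n;
    }

    // Simulated seek: returns the new offset, or -1 and records an error.
    long seek(bool simulate_failure, long offset)
    {
        if (simulate_failure)
        {
            errors.seek_errors.fetch_add(1, std::memory_order_relaxed);
            return -1;
        }
        return offset;
    }

private:
    GaugeSketch & gauge;
    ErrorCountersSketch & errors;
};
```

Copying is disabled in the sketch so that the gauge cannot be decremented twice for one logical file.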
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: CalvinNeo, JinheLin
/cherry-pick release-nextgen-20251011
@JaySon-Huang: new pull request created.
ref #10634

* Performance metrics
  * Metrics for delta-index placement on tiflash-compute: whether the delta index is reused or updated, and the number of rows and deletes updated
  * Metrics for errors hit by S3RandomAccessFile read/seek
  * The number of active SegmentReadTask
* Remote object storage summary
  * Add an HTTP API "http://${TIFLASH_IP}:${TIFLASH_STATUS_PORT}/tiflash/remote/info" for fetching the object storage summary from a tiflash-write node

Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: JaySon-Huang <[email protected]>
Co-authored-by: JaySon <[email protected]>
Co-authored-by: JaySon-Huang <[email protected]>
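As a small illustration of the HTTP API described above, the sketch below builds the endpoint URL from an IP and status port and parses a list of store IDs. Only the path `/tiflash/remote/info` comes from this PR. The function names are hypothetical, and the comma-separated ID format is purely an assumption: the page does not show how the real `parseStoreIds` helper receives or encodes store IDs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Builds the URL shape quoted in the commit message.
inline std::string buildRemoteInfoUrl(const std::string & ip, std::uint16_t status_port)
{
    return "http://" + ip + ":" + std::to_string(status_port) + "/tiflash/remote/info";
}

// Hypothetical parser for a comma-separated list such as "1,2,30".
// ASSUMPTION: this input format is invented for illustration only.
// Returns false (leaving `out` untouched) on empty tokens, non-digits, or overflow.
inline bool parseStoreIdListSketch(const std::string & text, std::vector<std::uint64_t> & out)
{
    const std::uint64_t max_id = std::numeric_limits<std::uint64_t>::max();
    std::vector<std::uint64_t> ids;
    std::size_t pos = 0;
    while (pos < text.size())
    {
        std::size_t end = text.find(',', pos);
        if (end == std::string::npos)
            end = text.size();
        if (end == pos)
            return false; // empty token, e.g. ",1" or "1,,2"
        std::uint64_t id = 0;
        for (std::size_t i = pos; i < end; ++i)
        {
            const char c = text[i];
            if (c < '0' || c > '9')
                return false;
            const std::uint64_t digit = static_cast<std::uint64_t>(c - '0');
            if (id > (max_id - digit) / 10)
                return false; // id * 10 + digit would overflow
            id = id * 10 + digit;
        }
        ids.push_back(id);
        pos = end + 1;
    }
    out.swap(ids);
    return true;
}
```

Rejecting malformed input outright, rather than skipping bad tokens, keeps a typo in a request from silently returning a summary for the wrong set of stores.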
What problem does this PR solve?
Issue Number: ref #10634
Problem Summary:
What is changed and how it works?
Newly added panels
Check List
Tests
Side effects
Documentation
Release note