Skip to content

Conversation

@JaySon-Huang
Copy link
Contributor

@JaySon-Huang JaySon-Huang commented Dec 24, 2025

What problem does this PR solve?

Issue Number: ref #10634

Problem Summary:

What is changed and how it works?

* Performance metrics
  * Metrics when tiflash-compute doing delta-index place, whether the delta-index are reuse or updated, the rows and deletes updated
  * Metrics when S3RandomAccessFile read/seek meet error
  * The number of active SegmentReadTask 
* Remote Object Storage summary
  * Add a http api "http://${TIFLASH_IP}:${TIFLASH_STATUS_PORT}/tiflash/remote/info" for fetching the object storage summary from tiflash-write node

New added panels

  • "Storage Read Pool & Data Sharing", panels about read tasks
image * "Disaggregated-Compute", panels about place index image * "S3", panels about S3RandomeAccessFile image

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
# Run chbenchmark on chbenchmark 8000
tiup bench ch --host 10.2.12.81 -P 8081 --warehouses 8000 run -D chbenchmark8k -T 50 -t 2 --time 60m --ignore-error --queries q1
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Dec 24, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This work-in-progress PR adds comprehensive metrics and monitoring capabilities for TiFlash's disaggregated storage architecture. The changes focus on improving observability for S3 operations, delta index placement, and segment read tasks.

Key Changes:

  • Added new HTTP API endpoint /tiflash/remote/info to fetch remote storage summary statistics
  • Introduced metrics for S3 operations (request counts, errors, retries, durations)
  • Added metrics for delta index placement operations (reuse, placement counts, row/delete statistics)
  • Added current metrics for segment read task pools and active tasks
  • Enhanced error tracking for S3 RandomAccessFile operations

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
metrics/grafana/tiflash_summary.json Updated Grafana dashboard with new S3 and storage metrics panels, added quantile tracking for subtask durations
docs/tiflash_http_api.md Documented new /tiflash/remote/info API endpoint for fetching storage summary
dbms/src/Storages/S3/tests/gtest_s3gcmanager.cpp Added test case for getStorageSummary functionality
dbms/src/Storages/S3/S3RandomAccessFile.h Added destructor declaration for metrics cleanup
dbms/src/Storages/S3/S3RandomAccessFile.cpp Added S3RandomAccessFile current metric tracking and error event metrics
dbms/src/Storages/S3/S3GCManager.h Added S3StoreStorageSummary and S3StorageSummary structures with JSON serialization
dbms/src/Storages/S3/S3GCManager.cpp Implemented getStoreStorageSummary and getS3StorageSummary methods
dbms/src/Storages/S3/S3Common.h Removed unnecessary include of S3RandomAccessFile.h
dbms/src/Storages/S3/S3Common.cpp Added include of S3RandomAccessFile.h where actually needed
dbms/src/Storages/S3/CheckpointManifestS3Set.h Added size() method for manifest set querying
dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.h Added parseStoreIds helper function declaration
dbms/src/Storages/KVStore/FFI/ProxyFFIStatusService.cpp Implemented HandleHttpRequestRemoteInfo and parseStoreIds
dbms/src/Storages/DeltaMerge/SegmentReadTaskPool.cpp Added current metrics tracking for read task pools and active tasks
dbms/src/Storages/DeltaMerge/Segment.cpp Added metrics for place index operations (reuse, placement, statistics)
dbms/src/Storages/DeltaMerge/File/DMFilePackFilter.cpp Added include for S3RandomAccessFile.h
dbms/src/Storages/DeltaMerge/File/DMFile.h Removed unnecessary include of S3RandomAccessFile.h
dbms/src/Storages/DeltaMerge/File/ColumnStream.cpp Added include for S3RandomAccessFile.h
dbms/src/Common/TiFlashMetrics.h Defined new metrics for place index operations and S3 request error tracking
dbms/src/Common/ProfileEvents.cpp Added S3IOReadError and S3IOSeekError profile events
dbms/src/Common/CurrentMetrics.cpp Added DT_SegmentReadTaskPool, DT_SegmentReadTasksActive, and S3RandomAccessFile metrics

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Signed-off-by: JaySon-Huang <[email protected]>
@JaySon-Huang JaySon-Huang changed the title [WIP] disagg: Add metrics [WIP] disagg: Add metrics about disaggreated arch Dec 27, 2025
@JaySon-Huang JaySon-Huang changed the title [WIP] disagg: Add metrics about disaggreated arch disagg: Add metrics about disaggreated arch Dec 28, 2025
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 28, 2025
@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Dec 29, 2025
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Dec 29, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CalvinNeo, JinheLin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Dec 29, 2025
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Dec 29, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-12-29 01:28:16.939818177 +0000 UTC m=+2646041.753595739: ☑️ agreed by JinheLin.
  • 2025-12-29 02:07:10.448790933 +0000 UTC m=+2648375.262568495: ☑️ agreed by CalvinNeo.

@ti-chi-bot ti-chi-bot bot merged commit 965cc8c into pingcap:master Dec 29, 2025
7 checks passed
@JaySon-Huang JaySon-Huang deleted the add_metrics branch December 29, 2025 02:14
@JaySon-Huang
Copy link
Contributor Author

/cherry-pick release-nextgen-20251011

@ti-chi-bot
Copy link
Member

@JaySon-Huang: new pull request created to branch release-nextgen-20251011: #10635.
But this PR has conflicts, please resolve them!

Details

In response to this:

/cherry-pick release-nextgen-20251011

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

ti-chi-bot pushed a commit to ti-chi-bot/tiflash that referenced this pull request Dec 29, 2025
ti-chi-bot bot pushed a commit that referenced this pull request Dec 29, 2025
ref #10634

* Performance metrics
  * Metrics when tiflash-compute doing delta-index place, whether the delta-index are reuse or updated, the rows and deletes updated
  * Metrics when S3RandomAccessFile read/seek meet error
  * The number of active SegmentReadTask 
* Remote Object Storage summary
  * Add a http api "http://${TIFLASH_IP}:${TIFLASH_STATUS_PORT}/tiflash/remote/info" for fetching the object storage summary from tiflash-write node

Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: JaySon-Huang <[email protected]>

Co-authored-by: JaySon <[email protected]>
Co-authored-by: JaySon-Huang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants