Reduce SQL Server queue length monitoring query volume #5555

Open: ramonsmits wants to merge 2 commits into master from spike/sql-queuelength-configurable-and-single-query

@ramonsmits ramonsmits commented Jun 24, 2026

Spike addressing high SQL query volume from queue-length monitoring (one IF EXISTS + max-min(RowVersion) statement per tracked queue, every 200ms).

Changes

  • Single query. Read approximate row counts for all tracked queues from sys.partitions (one query per catalog) instead of probing each queue table. No queue-table reads, no locks, only metadata visibility required. Accuracy is comparable to the existing max-min(RowVersion) estimate (both approximate; the catalog counter avoids identity-gap over-counting).
  • Concurrent pacing. The poll interval now runs concurrently with the query, so the effective cadence is max(interval, queryDuration) instead of interval + queryDuration. That additive query-time term is what let the cadence drift past the 1s monitoring bucket and starve buckets of samples, which produced the #4556 false-zero "sawtooth". Removing the drift supersedes the #4557 200ms oversampling workaround and lets the cadence return to a sane default.
  • Configurable interval via QueueLengthQueryDelayInterval connection string part (default 1s, matching the finest monitoring bucket: 1-minute history / 60 = 1s per point), following the existing ASB convention.
  • Adaptive back-off up to QueueLengthQueryMaxDelayInterval (default 10s) while every monitored queue is empty; now on by default. The base interval is always used while any queue has work, so the fix for #4556 is preserved. Set the max equal to the base to disable back-off.
  • SqlTable cleanup. Identifier properties renamed to Unquoted* to make the quoted-vs-unquoted contract explicit, converted to a primary constructor, and the duplicated per-table length-query branches collapsed into helpers.
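
The pacing change can be illustrated with a minimal sketch. This is not the PR's C# code: the function names (`poll_sequential`, `poll_concurrent`, `measure`) and the fake query are hypothetical, and Python's asyncio stands in for whatever timer mechanism the provider actually uses. It only shows why starting the interval timer before the query yields a cycle of roughly max(interval, queryDuration) rather than interval + queryDuration.

```python
import asyncio
import time


async def poll_sequential(query, interval, rounds):
    """Additive shape: run the query, then wait the full interval."""
    for _ in range(rounds):
        await query()
        await asyncio.sleep(interval)


async def poll_concurrent(query, interval, rounds):
    """Concurrent shape: start the interval timer first so it elapses while the query runs."""
    for _ in range(rounds):
        delay = asyncio.create_task(asyncio.sleep(interval))
        await query()
        await delay  # completes at once if the query outlasted the interval


async def measure(poll, query_duration, interval, rounds=5):
    """Return the average time per poll cycle, in seconds."""
    async def fake_query():
        await asyncio.sleep(query_duration)  # stand-in for the SQL round trip

    started = time.perf_counter()
    await poll(fake_query, interval, rounds)
    return (time.perf_counter() - started) / rounds


if __name__ == "__main__":
    seq = asyncio.run(measure(poll_sequential, 0.05, 0.05))
    con = asyncio.run(measure(poll_concurrent, 0.05, 0.05))
    print(f"sequential ~{seq:.3f}s per cycle, concurrent ~{con:.3f}s per cycle")
```

With a query that takes as long as the interval, the sequential loop needs about twice the interval per cycle, while the concurrent loop stays close to the interval itself.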

Permissions

The bulk query reads only catalog views, so it needs less permission than the per-queue probe it replaces:

  • Relies on standard metadata visibility — any permission the monitoring user holds on a queue table (e.g. SELECT) makes that table's row visible. No VIEW DATABASE STATE required (hence sys.partitions rather than the sys.dm_db_partition_stats DMV).
  • The previous query read table data (SELECT max([RowVersion]) … FROM <queue>), which already requires SELECT on each queue table — strictly more — so any user that worked before continues to work.
  • A queue table the user has no permission on simply returns no row and is treated as "does not exist" (same effective outcome as the old IF EXISTS → -1).
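
As a rough illustration of the behaviour described above, the sketch below pairs one possible catalog query with the "no row means does not exist" mapping. The SQL text is an assumption (the PR's actual query is not shown in this conversation), although `sys.tables`, `sys.schemas`, and `sys.partitions` with its `rows` and `index_id` columns are standard SQL Server catalog views. The `queue_lengths` helper and its tuple shapes are hypothetical.

```python
# Illustrative query text only; an assumption, not the statement used by the PR.
CATALOG_ROW_COUNT_SQL = """
SELECT s.name AS SchemaName, t.name AS TableName, SUM(p.rows) AS ApproxRows
FROM sys.tables AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
JOIN sys.partitions AS p ON p.object_id = t.object_id
WHERE p.index_id IN (0, 1)  -- heap or clustered index only, so rows are counted once
GROUP BY s.name, t.name
"""


def queue_lengths(tracked, catalog_rows):
    """Map tracked (schema, table) pairs to approximate lengths.

    catalog_rows holds (schema, table, approx_rows) tuples as returned by the
    catalog query. A tracked table with no row (missing, or not visible to the
    monitoring user) is reported as -1, mirroring the old IF EXISTS outcome.
    """
    visible = {(schema, table): rows for schema, table, rows in catalog_rows}
    return {key: visible.get(key, -1) for key in tracked}


if __name__ == "__main__":
    tracked = [("dbo", "Sales"), ("dbo", "Billing"), ("dbo", "Hidden")]
    rows = [("dbo", "Sales", 12), ("dbo", "Billing", 0), ("dbo", "Unrelated", 99)]
    print(queue_lengths(tracked, rows))
```

Tables the query returns but that are not tracked are ignored, so one result set per catalog can serve every tracked queue in it.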

Replace the per-queue length probe (one IF EXISTS + max-min(RowVersion)
statement per tracked queue, every 200ms) with a single catalog-view query
per catalog that reads approximate row counts from sys.partitions. This reads
no queue table data, takes no locks, and needs only SELECT on the queue tables.

Also make the poll interval configurable via the QueueLengthQueryDelayInterval
connection string part (default 200ms), and add optional adaptive back-off up
to QueueLengthQueryMaxDelayInterval while all monitored queues are empty
(disabled by default; base interval is always used while any queue has work,
preserving the fix for #4556).
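
The back-off rule in this commit message can be sketched as a small pure function. The conversation does not say how the delay grows towards the maximum, so the doubling step here is an assumption, and `next_delay` with its parameter names is hypothetical. What the sketch does take from the text: the base interval applies whenever any queue has work, and the delay never exceeds the configured maximum.

```python
def next_delay(current, base, maximum, all_queues_empty):
    """Return the delay (in seconds) to wait before the next poll."""
    if not all_queues_empty:
        return base  # any queue with work resets to the base interval
    # Doubling is an assumed growth policy; the cap comes from the configured maximum.
    return min(max(current, base) * 2, maximum)


if __name__ == "__main__":
    delay = 1.0
    for _ in range(5):
        delay = next_delay(delay, base=1.0, maximum=10.0, all_queues_empty=True)
        print(delay)
    print(next_delay(delay, base=1.0, maximum=10.0, all_queues_empty=False))
```

Passing a maximum equal to the base makes the function always return the base, which corresponds to back-off being switched off.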

@mauroservienti mauroservienti left a comment

One minor comment, @ramonsmits. Otherwise, it's good to go.

Comment thread src/ServiceControl.Transports.SqlServer/SqlTable.cs Outdated

@johnsimons johnsimons left a comment

LGTM

SqlTable: rename Name/Schema/Catalog to Unquoted* to make the
quoted-vs-unquoted contract explicit (review feedback), and convert to a
primary constructor. The now-dead per-table LengthQuery's duplicated
if/else is collapsed into BuildFullTableName/BuildLengthQuery helpers.

QueueLengthProvider: pace the poll interval concurrently with the query
so the effective cadence is max(interval, queryDuration) instead of
interval + queryDuration. The additive query-time term was what let the
cadence drift past the 1s monitoring bucket and starve buckets of
samples (#4556 false-zero sawtooth). With drift removed, restore sane
defaults — base 1s (matches the finest monitoring bucket) ramping to a
10s idle ceiling — superseding the #4557 200ms oversampling workaround.
Adaptive back-off is now ON by default.

@ramonsmits ramonsmits marked this pull request as ready for review June 25, 2026 11:19
3 participants