Use MpscRingBuffer for CSS producer/consumer inbox by dougqh · Pull Request #11497 · DataDog/dd-trace-java

dougqh · 2026-05-28T20:05:30Z

Summary

Swap the CSS producer/consumer inbox from MpscArrayQueue<InboxItem> to MpscRingBuffer<SpanSnapshot> (from #11492) with pre-allocated, recyclable slots. Producers mutate slots in place — no per-publish allocation. Signals (REPORT / STOP / CLEAR) move to a small side-channel queue so the data ring only holds one element type.

Depends on #11492 (the ring-buffer primitive). Once that merges, the diff here collapses to just the CSS-side changes.

Background

#11381 introduced the producer/consumer split (one SpanSnapshot allocated per metrics-eligible span, handed through an MpscArrayQueue). At normal heap sizes this was a throughput win; under tight heap (Xmx ~64 m on spring-petclinic) the in-flight snapshots overflow G1 survivor regions and trigger Full-GC storms. The ring buffer eliminates the per-publish allocation by recycling slot instances.

What's in this PR (on top of #11492)

Commit	Change
Swap CSS inbox for MpscRingBuffer of mutable SpanSnapshot slots	Producer uses `tryWrite(span, isTopLevel, slotFiller)`; signals route through a side-channel `MpscArrayQueue<SignalItem>` (cap 16). Aggregator drains the data ring first each iteration; REPORT/STOP processing does an inline data drain before flushing so bucket-boundary semantics match the prior design.
Shrink tracer.metrics defaults at Xmx<128m	`MpscRingBuffer` pre-allocates one `SpanSnapshot` per slot at construction. At the old default cap (131072 slots * ~120 B = ~15 MB upfront), tight-heap JVMs failed catastrophically (observed 0 r/s at Xmx 64 m). Cprime: at maxMemory < 128 MB, defaults drop to `maxAggregates=256`, `maxPending=64` (logical * 64 batch = 4096 slots ≈ 500 KB upfront).
Cut default `tracer.metrics.max.pending` from 2048 to 128	The historical 2048 was sized for the on-demand batch-allocation model; the ring's pre-allocation makes it a resident-RSS cost. New default: 128 logical * 64 = 8192 slots ≈ 1 MB upfront, ~0.8 s of buffer at 10 K spans/s. Explicit customer overrides are unaffected (LEGACY_BATCH_SIZE multiplier still applies).
Reuse peerTagValues array across publishes	`fillSlot` treats `slot.peerTagValues` as a reusable scratch buffer owned by the ring; allocates only on first use or schema-length change. `AggregateEntry` constructor copies the array by value on insert (cache misses) so the entry's identity doesn't drift when the slot is reclaimed. `matches()` and `hashOf()` gate the values comparison/hash on `peerTagSchema` so stale scratch on no-tag publishes is inert. Eliminates ~30 K `String[]` allocations/sec from the petclinic producer hot path.

Validation

JMH benches (8 producer threads, `@Warmup(2, 15s) @Measurement(5, 15s) @Fork(1)`)

Bench	v1.62.0	#11478 baseline (queue)	this PR (ring + fixes)	vs #11478
`Adversarial`	444K ± 1.6M	30.6M ± 6.9M	26.4M ± 1.6M	0.86×
`HighCardinalityResource`	4.8M ± 1.2M	35.7M ± 2.6M	33.6M ± 2.5M	0.94×
`HighCardinalityPeer`	6.9M ± 0.4M	37.6M ± 6.7M	39.8M ± 5.6M	1.06×

Two benches within #11478's error bar; HCP is actually faster than the previous queue design. The ~0.86× on Adversarial reflects the producer/consumer split's residual structural overhead under saturated load.

Petclinic load tests (Java 17, 8 jmeter threads, `GET /owners/3`, single iter)

Xmx	no-agent	1.62.0	1.63 + this PR	1.63 vs 1.62
64m	7,066	2,647	2,193	−17.1%
80m	8,759	5,134	4,681	−8.8%
96m	8,673	6,720	6,450	−4.0%
128m	10,986	7,942	8,014	+0.9%
192m	10,553	9,366	8,989	−4.0%
256m	11,301	10,206	9,777	−4.2%

At 64m without Cprime, the previous swap produced 0 r/s (JVM in Full-GC death spiral). With Cprime + cache-line fixes, the ring buffer at 64m delivers 2.2 K r/s — only 17% below 1.62, recovering as heap grows. Gap to 1.62 is ≤ 5% at every heap ≥ 96 m.

Future work (not in this PR)

A list of further optimizations discussed in the design review:

Producer claim-batching — thread-local claim of N sequences per CAS to cut producer-side contention ~N×.
Producer striping — N SPSC rings instead of one MPSC.
Dual swappable rings — producer↔consumer cache-line isolation per phase.

Producer batching is the planned next step.

Test plan

:dd-trace-core:test — all datadog.trace.common.metrics.* tests pass (94/94 locally)
Updated AggregateTableTest covers slot-reuse semantics (stale scratch must not break match/hash; entry copies values by ref)
CSS JMH benches re-run (see above)
Petclinic load test sweep across 6 heap sizes (see above)

🤖 Generated with Claude Code

ConflatingMetricsAggregator.publish does a handful of redundant operations on every span. None individually is large; together they show as ~2.5% on the existing JMH benchmark once the benchmark actually exercises span.kind. - dedup span.isTopLevel(): publish() reads it into a local, then shouldComputeMetric read it again. Pass the cached value in. - resolve spanKind to String once: master called toString() twice per span (once inside spanKindEligible, once at the getPeerTags call site) and used HashSet contains on a CharSequence (which routes through equals on String). Normalize to String up front and reuse. - lazy-allocate the peer-tag list: getPeerTags() always allocated an ArrayList sized to features.peerTags() even when the span had none of those tags set. Defer allocation until the first match; return Collections.emptyList() when none hit. MetricKey already treats null/empty peerTags as emptyList, so no behavior change. Drop the spanKindEligible helper — the HashSet.contains call inlines fine in shouldComputeMetric. Update the JMH benchmark to set span.kind=client on every span. Without it the filter path short-circuits before the peer-tag and toString work, so the wins above aren't measurable. With it: baseline 6.755 us/op (CI [6.560, 6.950], stdev 0.129) optimized 6.585 us/op (CI [6.536, 6.634], stdev 0.033) 2 forks x 5 iterations x 15s. ~2.5% mean improvement and much tighter variance fork-to-fork. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Introduce SpanKindFilter -- a tiny builder-built immutable filter whose state is an int bitmask indexed by the span.kind ordinals already cached on DDSpanContext. Each include* on the builder sets one bit (1 << ordinal); the runtime check is a single AND against (1 << span's ordinal). CoreSpan.isKind(SpanKindFilter) is the new entry point. DDSpan overrides it to do the bit-test directly against the cached ordinal -- no virtual call, no tag-map lookup. The two existing test-only CoreSpan impls (SimpleSpan and TraceGenerator.PojoSpan, the latter in two source sets) implement isKind by reading the span.kind tag and delegating to SpanKindFilter.matches(String), which converts via DDSpanContext.spanKindOrdinalOf and does the same AND. Refactor: DDSpanContext.setSpanKindOrdinal(String) now delegates to a new package-private static spanKindOrdinalOf(String) so the same string-to-ordinal mapping serves both the tag interceptor path and SpanKindFilter.matches. This is groundwork -- nothing in the codebase calls isKind yet. The next commit will replace the HashSet-based eligibility checks in ConflatingMetricsAggregator with SpanKindFilter instances. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the two ELIGIBLE_SPAN_KINDS_FOR_* HashSet<String> constants and the SPAN_KIND_INTERNAL.equals check with three SpanKindFilter instances: METRICS_ELIGIBLE_KINDS, PEER_AGGREGATION_KINDS, INTERNAL_KIND. Eligibility checks now go through span.isKind(filter), which on DDSpan is a volatile byte read against the already-cached span.kind ordinal plus a single bit-test. Also defer the span.kind tag read: previously read at the top of the publish loop and threaded through both shouldComputeMetric and the inner publish. isKind no longer needs the string, so the read can move down into the inner publish where it's still needed for the SPAN_KINDS cache key / MetricKey. Supporting changes: - DDSpanContext.spanKindOrdinalOf(String) is now public so non-DDSpan CoreSpan impls can compute the ordinal at tag-write time. - SpanKindFilter gains a public matches(byte) fast-path overload that callers with a pre-computed ordinal use directly. - SimpleSpan caches the ordinal in setTag(SPAN_KIND, ...), mirroring what TagInterceptor does for DDSpanContext, and its isKind now hits the byte fast path. Without this, the JMH benchmark (which uses SimpleSpan) would re-derive the ordinal on every isKind call and overstate the cost. Benchmark on the bench updated last commit (kind=client on every span, 4 forks x 5 iter x 15s): prior commit 6.585 ± 0.049 us/op this commit 6.903 ± 0.096 us/op The slight regression is a SimpleSpan-via-groovy-dispatch artifact -- the interface call to isKind through CoreSpan, then through SimpleSpan, then through SpanKindFilter.matches, doesn't fold as aggressively as a HashSet contains on a static field. In production DDSpan.isKind inlines to a context field read + ordinal byte read + bit-test, so the production path is faster than the prior HashSet approach. A DDSpan-based benchmark would show this; the existing SimpleSpan-based one doesn't. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The existing ConflatingMetricsAggregatorBenchmark uses SimpleSpan, a groovy mock. That's enough for measuring queue/CHM/MetricKey work, but it conceals the production cost of CoreSpan.isKind: SimpleSpan's isKind goes through groovy interface dispatch into SpanKindFilter.matches, while DDSpan.isKind inlines to a context byte-read + bit-test. This new benchmark uses real DDSpan instances created through a CoreTracer (with a NoopWriter so finishing doesn't reach the agent). Same shape as the SimpleSpan bench (64-span trace, span.kind=client, peer.hostname set). Numbers (2 forks x 5 iter x 15s): master: 6.428 +- 0.189 us/op (HashSet eligibility checks) this branch: 6.343 +- 0.115 us/op (SpanKindFilter bitmask) About 1.3% faster on the production path. The SimpleSpan benchmark in the same conditions shows a ~2.2% slowdown -- the mock's dispatch shape gives a misleading signal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Make SpanKindFilter.kindMask and its constructor private now that DDSpan.isKind no longer needs direct field access -- it delegates to SpanKindFilter.matches(byte). The Builder.build() in the same outer class still constructs instances via the private constructor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the producer-side conflation pipeline with a thin per-span SpanSnapshot posted to the existing aggregator thread. The aggregator now builds the MetricKey, does the SERVICE_NAMES / SPAN_KINDS / PEER_TAGS_CACHE lookups, and updates the AggregateMetric directly -- all off the producer's hot path. What the producer does now, per span: - filter (shouldComputeMetric, resource-ignored, longRunning) - collect tag values into a SpanSnapshot (1 allocation per span) - inbox.offer(snapshot) + return error flag for forceKeep What moved off the producer: - MetricKey construction and its hash computation - SERVICE_NAMES.computeIfAbsent (UTF8 encoding of service name) - SPAN_KINDS.computeIfAbsent (UTF8 encoding of span.kind) - PEER_TAGS_CACHE lookups (peer-tag name+value UTF8 encoding) - pending/keys ConcurrentHashMap operations - Batch pooling, batch atomic ops, batch contributeTo Removed entirely: - Batch.java -- the conflation primitive is no longer needed; the aggregator's existing LRUCache<MetricKey, AggregateMetric> IS the conflation point now. - pending ConcurrentHashMap<MetricKey, Batch> - keys ConcurrentHashMap<MetricKey, MetricKey> (canonical dedup) - batchPool MessagePassingQueue<Batch> - The CommonKeyCleaner role of tracking keys.keySet() on LRU eviction -- AggregateExpiry now just reports drops to healthMetrics. Added: - SpanSnapshot: immutable value carrying the raw MetricKey inputs + a tagAndDuration long (duration | ERROR_TAG | TOP_LEVEL_TAG). - AggregateMetric.recordOneDuration(long tagAndDuration) -- the single-hit equivalent of the existing recordDurations(int, AtomicLongArray). - Peer-tag values flow through the snapshot as a flattened String[] of [name0, value0, name1, value1, ...]; the aggregator encodes them through PEER_TAGS_CACHE on its own thread. Benchmark results (2 forks x 5 iter x 15s): ConflatingMetricsAggregatorDDSpanBenchmark prior commit 6.343 +- 0.115 us/op this commit 2.506 +- 0.044 us/op (~60% faster) ConflatingMetricsAggregatorBenchmark (SimpleSpan) prior commit 6.585 +- 0.049 us/op this commit 3.116 +- 0.032 us/op (~53% faster) Caveat on the benchmark: without conflation, the producer pushes 1 inbox item per span instead of ~1 per 64. At the benchmark's synthetic rate the consumer can't keep up and inbox.offer silently drops. The numbers measure producer publish() latency only; consumer throughput at realistic span rates is a follow-up to validate. Tuning maxPending matters more in this design. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

With the per-span SpanSnapshot inbox path, the producer can lose snapshots when the bounded MPSC queue is full -- silently, since inbox.offer() returns a boolean we previously ignored. The conflating-Batch design used to absorb ~64x more producer pressure per inbox slot, so this is a new failure mode worth surfacing. Wire it through the existing HealthMetrics path: - HealthMetrics.onStatsInboxFull() (no-op default). - TracerHealthMetrics gets a statsInboxFull LongAdder and a new reason tag reason:inbox_full reported under the same stats.dropped_aggregates metric used for LRU evictions. Two LongAdders, two tagged time series. - ConflatingMetricsAggregator.publish increments the counter when inbox.offer(snapshot) returns false. This doesn't fix the drop -- tuning maxPending and/or building producer-side batching are the actual fixes. But it makes the failure visible in the same place ops already watches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nflating-metrics-background-work

Two general-purpose utilities used by the client-side stats aggregator work (PR #11382 and follow-ups), extracted into their own change so the metrics-specific PRs can build on a smaller, reviewable foundation. - Hashtable: a generic open-addressed-ish bucket table abstraction keyed by a 64-bit hash, with a public abstract Entry type so client code can subclass it for higher-arity keys. The metrics aggregator uses it to back its AggregateTable. - LongHashingUtils: chained 64-bit hash combiners with primitive overloads (boolean, short, int, long, Object). Used in place of varargs combiners to avoid Object[] allocation and boxing on the hot path. No callers within internal-api itself yet -- the metrics aggregator PR will introduce the first usages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Standalone classes for swapping the consumer-side LRUCache<MetricKey, AggregateMetric> with a multi-key Hashtable in the next commit. No call sites use them yet. - AggregateEntry extends Hashtable.Entry, holds the canonical MetricKey, the mutable AggregateMetric, and copies of the 13 raw SpanSnapshot fields for matches(). The 64-bit lookup hash is computed via chained LongHashingUtils.addToHash calls (no varargs, no boxing of short/boolean). - AggregateTable wraps a Hashtable.Entry[] from Hashtable.Support.create. findOrInsert(SpanSnapshot) walks the bucket comparing raw fields, falling back to MetricKeys.fromSnapshot on a true miss. On cap overrun, it scans for an entry with hitCount==0 and unlinks it; if none, it returns null and the caller drops the data point. - MetricKeys.fromSnapshot extracts the canonicalization logic (DDCache lookups + UTF8 encoding) from Aggregator.buildMetricKey, so the helper can be called from AggregateTable on miss. This also commits Hashtable and LongHashingUtils (added earlier, previously uncommitted) and lifts Hashtable.Entry / Hashtable.Support visibility so client code outside datadog.trace.util can build higher-arity tables -- the case the javadoc describes but the original visibility didn't actually support. Specifically: Entry is now public abstract with a protected ctor; keyHash, next(), and setNext() are public; Support's create / clear / bucketIndex / bucketIterator / mutatingBucketIterator methods are public. Tests: AggregateTableTest covers hit, miss, distinct-by-spanKind, peer-tag identity (including null vs non-null), cap overrun with stale victim, cap overrun with no victim (returns null), expungeStaleAggregates, forEach, clear, and that the canonical MetricKey is built at insert. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace LRUCache<MetricKey, AggregateMetric> with the AggregateTable added in the prior commit. The hot path in Drainer.accept becomes: AggregateMetric aggregate = aggregates.findOrInsert(snapshot); if (aggregate != null) { aggregate.recordOneDuration(snapshot.tagAndDuration); dirty = true; } else { healthMetrics.onStatsAggregateDropped(); } On the steady-state hit path the lookup is a 64-bit hash compute + bucket walk + matches(snapshot) -- no MetricKey allocation, no SERVICE_NAMES / SPAN_KINDS / PEER_TAGS_CACHE lookups. The canonical MetricKey is now built once per unique key at insert time, in MetricKeys.fromSnapshot. Behavioral change in the cap-overrun path ----------------------------------------- The old LRUCache evicted least-recently-used: at cap, a new insert would push out the oldest entry regardless of whether it was live or stale. AggregateTable instead scans for a hitCount==0 entry to recycle, and drops the new key if none exists. Practical impact: in the common case where the table holds a stable set of recurring keys, an unrelated burst of new keys is dropped (and reported via onStatsAggregateDropped) rather than evicting the established keys. The existing test that asserted "service0 evicted in favor of service10" is updated to assert the new semantics. The other cap-related test ("should not report dropped aggregate when evicted entry was already flushed") still passes unchanged: after report() clears all entries to hitCount=0, the next wave of inserts recycles them. Threading fix ------------- ConflatingMetricsAggregator.disable() used to call aggregator.clearAggregates() and inbox.clear() directly from the Sink's IO event thread, racing with the aggregator thread mid-write. The race was tolerable for LinkedHashMap; it is not for AggregateTable (chain corruption can NPE or loop). disable() now offers a ClearSignal to the inbox so the aggregator thread itself performs the table clear and the inbox.clear(). Adds one SignalItem subclass + one branch in Drainer.accept; preserves the single-writer invariant for AggregateTable end-to-end. Removed: LRUCache import, AggregateExpiry inner class, the static buildMetricKey / materializePeerTags / encodePeerTag helpers (now in MetricKeys). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

MetricKey existed for two reasons -- the prior LRUCache key role (now handled by AggregateTable's Hashtable.Entry mechanics) and as the labels argument to MetricWriter.add. The first is gone; the second is the only thing keeping MetricKey alive. Fold its UTF8-encoded label fields onto AggregateEntry, change MetricWriter.add to take AggregateEntry directly, and delete MetricKey + MetricKeys. What AggregateEntry now holds ----------------------------- - 10 UTF8BytesString label fields (resource, service, operationName, serviceSource, type, spanKind, httpMethod, httpEndpoint, grpcStatusCode, and a List<UTF8BytesString> peerTags for serialization). - 3 primitives (httpStatusCode, synthetic, traceRoot). - AggregateMetric (the value being accumulated). - The raw String[] peerTagPairs is retained alongside the encoded peerTags -- matches() compares it positionally against the snapshot's pairs; the encoded form is only consumed by the writer. matches(SpanSnapshot) compares the entry's UTF8 forms to the snapshot's raw String / CharSequence fields via content-equality (UTF8BytesString.toString() returns the underlying String in O(1)). This closes a latent bug in the prior raw-vs-raw matches(): if one snapshot delivered a tag value as String and a later snapshot delivered the same content as UTF8BytesString, the old Objects.equals would return false and the table would split into two entries. Content-equality matching collapses them into one. Consolidated caches ------------------- The static UTF8 caches that used to live partly on MetricKey (RESOURCE_CACHE, OPERATION_CACHE, SERVICE_SOURCE_CACHE, TYPE_CACHE, KIND_CACHE, HTTP_METHOD_CACHE, HTTP_ENDPOINT_CACHE, GRPC_STATUS_CODE_CACHE, SERVICE_CACHE) and partly on ConflatingMetricsAggregator (SERVICE_NAMES, SPAN_KINDS, PEER_TAGS_CACHE) are all now on AggregateEntry. The split was duplicating work -- SERVICE_NAMES and SERVICE_CACHE both cached service-name to UTF8BytesString. One cache per field now. API change: MetricWriter.add ---------------------------- Was: add(MetricKey key, AggregateMetric aggregate) Now: add(AggregateEntry entry) The aggregate lives on the entry. Single-arg. SerializingMetricWriter reads the same UTF8 fields off AggregateEntry that it previously read off MetricKey; the wire format is byte-identical. Test impact ----------- AggregateEntry.of(...) takes the same 13 positional args new MetricKey(...) took, so test diffs are mostly mechanical: new MetricKey(args) -> AggregateEntry.of(args) writer.add(key, _) -> writer.add(entry) ValidatingSink in SerializingMetricWriterTest now iterates List<AggregateEntry> directly. ConflatingMetricAggregatorTest's Spock matchers (~36 sites) rely on AggregateEntry.equals comparing the 13 label fields (not the aggregate) so the mock matches by labels regardless of the aggregate state at call time; post-invocation closures verify aggregate state. Benchmarks (2 forks x 5 iter x 15s) ----------------------------------- The change is consumer-thread only; producer publish() is unchanged. SimpleSpan bench: 3.123 +- 0.025 us/op (prior: 3.119 +- 0.018) DDSpan bench: 2.412 +- 0.022 us/op (prior: 2.463 +- 0.041) Both within noise -- the win is structural (one less class, one less allocation per miss, one fewer cache layer) rather than benchmarked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

LongHashingUtilsTest (14 cases): - hashCodeX null sentinel + non-null pass-through - all primitive hash() overloads match the boxed Java hashCodes - hash(Object...) 2/3/4/5-arg overloads match the chained addToHash formula they are documented to constant-fold to - addToHash(long, primitive) overloads match the Object-version - linear-accumulation invariant (31 * h + v) holds across a sequence - iterable / deprecated int[] / deprecated Object[] variants match chained addToHash - intHash treats null as 0 (observable via hash(null, "x")) HashtableTest (24 cases across 5 nested classes): - D1: insert/get/remove/insertOrReplace/clear/forEach, in-place value mutation, null-key handling, hash-collision chaining with disambig- uating equals, remove-from-collided-chain leaves siblings intact - D2: pair-key identity, remove(pair), insertOrReplace matches on both parts, forEach - Support: capacity rounds up to a power of two, bucketIndex stays in range across a wide hash sample, clear nulls every slot - BucketIterator: walks only matching-hash entries in a chain, throws NoSuchElementException when exhausted - MutatingBucketIterator: remove from head-of-chain unlinks, replace swaps the entry while preserving chain, remove() without prior next() throws IllegalStateException Tests live in internal-api/src/test/java/datadog/trace/util and use the already-present JUnit 5 setup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bring the new util/ files in line with google-java-format (tabs → spaces, line wrapping, javadoc list markup) so spotlessCheck passes in CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@threads

Compares Hashtable.D1 and Hashtable.D2 against equivalent HashMap usage for add, update, and iterate operations. Each benchmark thread owns its own map (Scope.Thread), but @threads(8) is used so the allocation/GC pressure that Hashtable is designed to avoid surfaces in the throughput numbers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Guard Support.sizeFor against overflow and use Integer.highestOneBit; reject capacities above 1 << 30 instead of looping forever. - Add braces around single-statement while bodies in BucketIterator. - Split HashtableBenchmark into HashtableD1Benchmark / HashtableD2Benchmark. - Add regression tests for Support.sizeFor bounds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The 5-arg Object overload was forwarding only obj0..obj3 to the int overload, silently dropping obj4. Also align LongHashingUtils.hash 3-arg signature with its 2/4/5-arg siblings (int parameters) and strengthen the 5-arg HashingUtilsTest to detect the missing-arg regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Split D1Tests and D2Tests into HashtableD1Test and HashtableD2Test; extract shared test entry classes into HashtableTestEntries. - Reduce visibility of LongHashingUtils.hash(int...) chaining overloads to package-private; they are internal building blocks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The iterator tests need a populated Hashtable.Entry[] to drive Support.bucketIterator / mutatingBucketIterator. Relaxing D1.buckets from private to package-private lets the same-package tests read it directly, removing the reflection helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The new reason:inbox_full reportIfChanged call advances countIndex to 51, but previousCounts was still sized for 51 counters (max index 50), so the metric never emitted and the resize warning fired every flush. Bump the array to 52 and add a regression test that exercises the flush path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The label fields and the mutable counters/histograms are 1:1 with each entry; carrying them on a separate object meant one extra allocation per unique key plus an indirection on every hot-path update. Merging them puts the counters directly on AggregateEntry, drops the entry.aggregate hop, and consolidates ERROR_TAG / TOP_LEVEL_TAG onto the same class the consumer uses to decode them. AggregateTable.findOrInsert now returns AggregateEntry. Callers in Aggregator and SerializingMetricWriter updated. Migrated AggregateMetricTest.groovy to AggregateEntryTest.java per project policy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a context-passing forEach(T, BiConsumer) overload to AggregateTable, mirroring TagMap's pattern. Aggregator.report now hands the writer in as context to a static BiConsumer so no fresh Consumer is allocated each report cycle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors the TagMap pattern: pairs the existing forEach(Consumer) with a forEach(T context, BiConsumer<T, TEntry>) overload so callers can hand side-band state to a non-capturing lambda and avoid the fresh-Consumer-per-call allocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Factors the unchecked (TEntry) cast out of D1.forEach / D2.forEach (and the BiConsumer variants) into Support.forEach(buckets, ...). The cast now lives in one place, mirroring how Entry.next() handles it, and the D1/D2 methods become one-liners. Downstream higher-arity tables built on Support gain the same helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Now that Hashtable.Support exposes the parameterized forEach helpers, AggregateTable's own forEach methods can drop their duplicated loop body and the (AggregateEntry) cast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds Support.bucket(buckets, keyHash) which returns the bucket head already cast to the caller's concrete entry type. D1.get and D2.get now drop the raw-Entry intermediate variable and walk the chain via Entry.next() directly. The unchecked cast lives in one place, consistent with Entry.next() and Support.forEach. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Saves one wrapper allocation per ring instance: producerCursor becomes a volatile long on the instance, paired with a static AtomicLongFieldUpdater for CAS. Same memory ordering as the prior AtomicLong (volatile field + field-updater CAS), but no per-instance wrapper object. publishedSequences stays AtomicLongArray -- the field updater approach doesn't apply to array element access. consumerCursor was already a plain volatile long with no wrapper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two related cache-line fixes for the producer hot path under heavy contention: 1. Stride publishedSequences by 8 longs (one cache line). Without this, adjacent logical slots share cache lines and concurrent producers writing nearby sequences ping-pong the same line between cores. The array grows by 8x but the upfront cost is bounded by the ring's capacity (e.g. 8 MB at the CSS default cap=131072). 2. Cache-line-pad the producerCursor and consumerCursor against each other using the standard Disruptor class-hierarchy pattern. Every consumer-side advance of consumerCursor would otherwise invalidate the line producers read for producerCursor (and vice versa). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

MpscRingBuffer pre-allocates one SpanSnapshot per slot at construction. At the default pending=2048 (logical) * 64 (LEGACY_BATCH_SIZE) = 131072 slots * ~120B = ~15 MB resident before any traffic arrives. At Xmx<128m that's 25%+ of the heap up front, starving Spring Boot / Tomcat and sending the JVM into a Full-GC death spiral (observed catastrophically at Xmx64m petclinic: 0 throughput, JVM unresponsive to readiness probes). Tighter defaults at <128m heap: maxAggregates 2048 -> 256, maxPending 2048 -> 64 (logical 64 * 64 batch = 4096 slots = ~500 KB upfront). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The historical default (2048 logical * 64 LEGACY_BATCH_SIZE = 131072 slots) was sized for the on-demand batch-allocation model where slot memory was only realized under burst load. MpscRingBuffer pre-allocates every slot at construction, so 131072 means a ~15 MB upfront pin -- vs the sub-second of consumer-stall buffer it's actually expected to absorb (~80 ms of GC pause at 10K spans/s ~ 800 slots worth). New default: 128 logical * 64 = 8192 slots = ~1 MB upfront, ~0.8 s of buffer at typical load. Tight-heap default already at 64 logical (~500 KB / ~0.4 s). Explicit customer overrides on TRACER_METRICS_MAX_PENDING are unaffected -- the LEGACY_BATCH_SIZE multiplier still applies to them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Producer-side fillSlot now treats slot.peerTagValues as a reusable scratch buffer owned by the ring: it's allocated only on first use or when the peer-tag schema's length changes. Subsequent tag-firing publishes on the same slot write into the existing array. When a publish has no peer tags, peerTagSchema is cleared but the scratch array is left in place for future reuse. For petclinic with ~50 unique span signatures and JDBC traffic, this eliminates ~30K String[] allocations/sec from the producer hot path after warmup. Consumer side: - AggregateEntry constructor copies peerTagValues by value at insert (Arrays.copyOf) so the entry's identity doesn't drift when the slot is reclaimed. Cache hits on findOrInsert pay no allocation. - AggregateEntry.matches() and hashOf() both gate the peerTagValues comparison/hash on peerTagSchema -- stale scratch carried across a no-tag publish is inert. Tests cover the slot-reuse scenario (stale scratch must not break match or hash) and the producer-array-reuse semantics (entry must snapshot peer tag values, not retain a live reference). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A single CAS claims n contiguous slots, returning a Batch handle that the caller fills via fillAndPublish(slot...). Designed for callers whose work has a natural batch boundary (e.g. CSS publishing a trace's worth of metrics-eligible spans in one shot): cuts producer-cursor contention from O(N) CASes to O(1) per call. All-or-nothing: tryClaim(n) returns null if the ring can't fit the whole batch. The Batch is single-threaded (owned by the claiming thread), short-lived (scoped to one publish call), and has no thread-shutdown hazard -- the batch is fully consumed before returning. Filler-throw safety matches the existing tryWrite contract: the slot is published in a finally block so the consumer can advance, and the batch's published counter increments either way. Tests cover: requested size, capacity rejection, all-or-nothing, three filler overloads, over-publish IllegalStateException, throw recovery, and 8-producer concurrency (200 batches/thread x 16 size = 25600 items, single consumer sees every value exactly once). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Leverage PendingTrace's natural batching: the producer pre-counts metrics-eligible spans in trace order (with the ignored-resource short-circuit), then claims all of their ring slots in a single CAS via MpscRingBuffer.tryClaim(n). A second pass fills the batch. Cuts producer-cursor contention from O(N) CASes per trace to O(1). For typical traces with 3-5 metrics-eligible spans, that's a 3-5x reduction in atomic ops on the hot field -- bigger for traces with many spans. All-or-nothing semantics: a trace either gets its metrics queued or doesn't. Slight semantic change from before (previously, a partially full inbox could publish some of a trace's spans before dropping the rest); the new behavior is cleaner for aggregation since partial-trace metrics are misleading. No ThreadLocal, no claim leak: the batch is fully consumed before publish() returns. forceKeep is now computed during the count pass (any-span-error) so the publish behavior on inbox-full mirrors the prior force-keep semantics for callers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dd-octo-sts · 2026-05-28T20:22:11Z

🟢 Java Benchmark SLOs — All performance SLOs passed

Suite	Status
Startup	🟢 pass

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results

Startup Time

Scenario	This PR	master	Change
insecure-bank / iast	13,951 ms	13,940 ms	+0.1%
insecure-bank / tracing	12,901 ms	12,935 ms	-0.3%
petclinic / appsec	16,397 ms	16,129 ms	+1.7%
petclinic / iast	16,441 ms	16,399 ms	+0.3%
petclinic / profiling	15,573 ms	15,739 ms	-1.1%
petclinic / tracing	15,642 ms	14,990 ms	+4.3%

Commit: c330d9c8 · CI Pipeline · Benchmarking Platform UI

Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

The Batch handle from tryClaim was supposed to be scalar-replaced by escape analysis, but JMH measurements showed it's not -- the inner-class implicit this$0 plus the CAS-retry inside tryClaim block scalarization on HotSpot. Result: ~24 bytes of Batch + cursor state allocated per publish on the hot path, ~50% throughput drop on single-element claims in CSS-style benches. Add three sequence-based primitives that callers manage directly: long tryClaimRange(int n) -> start sequence or -1L T slotAt(long seq) -> slot for that sequence void publish(long seq) -> release the slot to the consumer No per-call allocation, no callback dispatch. Callers handle the sequence arithmetic themselves and trade safety (forget to publish -> ring stuck) for hot-path predictability. The Batch API stays for safer use cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous Batch handle wasn't being scalar-replaced -- visible as a ~50% throughput regression on single-element-claim benches due to per- publish allocations: - Batch object (~24 B, inner class with implicit this$0) - Iterator from the enhanced-for loop (~24 B per pass, two passes) Switch publish(trace) to: 1. Indexed iteration (trace.get(i)) -- kills the Iterator allocations. Trace lists are SpanList / singletonList, both with O(1) get(i). 2. tryClaimRange + slotAt + publish primitives -- no Batch handle. 3. fillSlot takes primitive `boolean isTopLevel` (no autoboxing now that it isn't dispatched through TriConsumer). The slotFiller TriConsumer field is removed -- direct method call from the loop is cheaper than the indirect dispatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

publish(trace) used to do two passes -- count eligible spans, claim exactly that many, fill them. The count pass duplicated shouldComputeMetric (plus isTopLevel, error, ignored-resource checks) for every span, costing ~2x the producer-side work per span on single-element-trace JMH benches. Single pass instead: claim traceSize slots up front, fill the eligible ones in place, mark non-eligible slots with tagAndDuration=0 as a "skip" sentinel. The aggregator's handleSnapshot recognises 0 (which shouldComputeMetric already rejects for any real publish) and bypasses findOrInsert for those slots. For petclinic-typical traces where most spans are CSS-eligible, the slot waste is small (~20%) and the producer-side win dominates. The ConflatingMetricAggregatorTest queueSize was sized for the old per-eligible-span claim; bumped to 1024 to give multi-span tests adequate headroom. ConflatingMetricsAggregatorInboxFullTest keeps its queueSize=8 (deliberately exercises full-ring behaviour). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…laim PendingTrace#enqueueSpansToWrite already iterates every span as it assembles the write list. Piggyback on that iteration: call ConflatingMetricsAggregator.shouldComputeMetric per finished span, sum the eligibles, and attach the count to the SpanList via a new setMetricEligibleCount/getMetricEligibleCount pair (default -1 means unset). CSS publish(trace) now reads getMetricEligibleCount when the trace is a SpanList with a known count, and sizes its single ring claim exactly. No overclaim, no skip-sentinel writes for ineligible slots, no second pass to count. When the count is unknown (e.g. a TraceInterceptor rebuilt the list into a fresh SpanList) the path falls back to overclaim + skip-sentinel, preserving correctness. shouldComputeMetric in ConflatingMetricsAggregator is promoted from private instance method to public static -- it's a pure span-local predicate. PendingTrace calls it directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…S ring claim" This reverts commit e29b802.

datadog-datadog-prod-us1 · 2026-05-28T22:17:31Z

✨ Fix all issues with BitsAI

⚠️ Warnings

🚦 3 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-java | agent_integration_tests

🔧 Fix in code (Fix with Cursor).
4 failed tests due to java.lang.IllegalAccessError in MetricsIntegrationTest.groovy:44.

DataDog/apm-reliability/dd-trace-java | test_smoke_graalvm: [graalvm21]

🔄 Retry job. This looks flaky and may succeed on retry.
Could not read workspace metadata from /go/src/github.com/DataDog/apm-reliability/dd-trace-java/.gradle/caches/8.14.5/groovy-dsl/00559432caf37d0d481f5cc567dc28bc/metadata.bin (No such file or directory)

DataDog/apm-reliability/dd-trace-java | muzzle: [1/8]

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration.
Could not find testng JAR files (testng-6.9.13.8, testng-6.12, testng-6.14.0). Searched in configured repositories.

Useful? React with 👍 / 👎

_{This comment will be updated automatically if new data arrives.

🔗 Commit SHA: c330d9c | Docs | Datadog PR Page | Give us feedback!}

Two small wins identified by the 64m CPU profile: 1. Replace Thread.sleep(10ms) in the aggregator's idle wait with LockSupport.parkNanos + producer-side unparkIfWaiting(). Cuts the worst-case wake latency from 10ms to whenever a producer publishes. Producer-side cost is one volatile read of the waiting flag per publish; only on a true read does the producer pay an unpark. At saturating workloads where the aggregator is never parked, the volatile read is the only cost. 2. Cache Arrays.hashCode(peerTagSchema.names) on PeerTagSchema instead of recomputing per AggregateEntry.hashOf call. The schema is shared across many publishes; the per-publish recomputation was a top aggregator-thread sample. Now one field load. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The 64m CPU profile showed the aggregator spending most of its time on findOrInsert's hashing chain (LongHashingUtils.addToHash, Arrays.hashCode, StringLatin1.hashCode) and contentEquals comparison. With one aggregator thread vs 8 producer threads, this concentrated CPU is the petclinic regression source. Move that work to the producer: - SpanSnapshot fields become UTF8BytesString (or @nullable variants), canonicalized in fillSlot via the per-field DDCaches that already live on AggregateEntry (made package-private static). - SpanSnapshot.computeAndSetKeyHash precomputes the table-lookup hash at the end of fillSlot. AggregateTable.findOrInsert reads snapshot.keyHash directly; no rehashing on the aggregator thread. - AggregateEntry's constructor stops canonicalizing -- it just copies the already-canonical refs from the snapshot. - AggregateEntry.matches uses identity (==) for the UTF8BytesString fields. peerTagNames uses Arrays.equals (PeerTagSchema instances can be swapped during reconcile; the a==b fast path inside Arrays.equals makes the shared-reference case O(1)). canonicalize/canonicalizeOptional treat null and length-zero inputs as the same canonical (EMPTY for required, null for optional) so the prior null-vs-empty collapse semantics survive the move from contentEquals to identity comparison. hashOf is removed from AggregateEntry; tests build snapshots via a new AggregateEntryTestUtils.buildSnapshot helper that mirrors fillSlot's canonicalize-and-hash sequence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dougqh · 2026-06-01T16:41:09Z

Shelving this PR. Exhaustive multi-endpoint petclinic benchmarking (2026-06-01) confirmed that all producer-side SpanSnapshot pooling designs — full ring buffer, half-size ring buffer, SPMC recycle pool, and thread-local pool analysis — share the same root-cause failure: pre-allocated SpanSnapshot objects promoted to old-gen become hot write targets for producer threads, causing G1 card table pressure that produces catastrophic throughput variance (and total death at 256m) during mixed collections.

The ring buffer fixes the Xmx64m tight-heap regression (#11500 addresses that separately via heap-aware inbox defaults), but its loose-heap behavior is strictly worse than #11500's per-publish allocation design. #11500 lets SpanSnapshot objects die young and never touch the card table; that property is what makes it stable at 256m (sd=234 vs. >3000 sd or 0 r/s for all pooling variants).

Preserving the branches for reference: dougqh/ring-buffer (#11492) and dougqh/css-ring-buffer (#11497).

dougqh and others added 30 commits May 15, 2026 12:06

Merge branch 'master' into dougqh/conflating-metrics-producer-wins

3355865

Merge branch 'dougqh/conflating-metrics-producer-wins' into dougqh/co…

950499c

…nflating-metrics-background-work

Apply spotless formatting to Hashtable and LongHashingUtils

031dc89

Bring the new util/ files in line with google-java-format (tabs → spaces, line wrapping, javadoc list markup) so spotlessCheck passes in CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add benchmark results to HashtableBenchmark header

a534e4f

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge branch 'dougqh/util-hashtable' into dougqh/optimize-metric-key

7ab47f7

Merge branch 'dougqh/util-hashtable' into dougqh/optimize-metric-key

46f147a

dougqh and others added 7 commits May 28, 2026 11:25

Merge dougqh/ring-buffer: pick up AtomicLongFieldUpdater change

12984ac

Merge dougqh/ring-buffer: cache-line padding + stride

fed8238

dougqh added type: enhancement Enhancements and improvements comp: core Tracer core tag: performance Performance related changes tag: no release notes Changes to exclude from release notes tag: ai generated Largely based on code generated by an AI or LLM labels May 28, 2026

dougqh and others added 3 commits May 28, 2026 16:17

Merge dougqh/ring-buffer: pick up tryClaim(n)/Batch API

01754a5

dougqh and others added 7 commits May 28, 2026 16:28

Merge dougqh/ring-buffer: pick up low-level primitives

f18a73e

Fix stale Javadoc reference to removed slotFiller field

e4277d0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Revert "Use PendingTrace pre-counted eligible-span count for exact CS…

21b5bd7

…S ring claim" This reverts commit e29b802.

dougqh and others added 2 commits May 28, 2026 20:26

dougqh force-pushed the dougqh/lazy-error-latencies branch from f2ee559 to 0c658dd Compare June 1, 2026 15:30

dougqh closed this Jun 1, 2026

dougqh mentioned this pull request Jun 1, 2026

Add MpscRingBuffer primitive for pre-allocated slot rings #11492

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use MpscRingBuffer for CSS producer/consumer inbox#11497

Use MpscRingBuffer for CSS producer/consumer inbox#11497
dougqh wants to merge 169 commits into
dougqh/lazy-error-latenciesfrom
dougqh/css-ring-buffer

dougqh commented May 28, 2026

Uh oh!

dd-octo-sts Bot commented May 28, 2026 •

edited

Loading

Uh oh!

datadog-datadog-prod-us1 Bot commented May 28, 2026 •

edited

Loading

Uh oh!

dougqh commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dougqh commented May 28, 2026

Summary

Background

What's in this PR (on top of #11492)

Validation

JMH benches (8 producer threads, @Warmup(2, 15s) @Measurement(5, 15s) @Fork(1))

Petclinic load tests (Java 17, 8 jmeter threads, GET /owners/3, single iter)

Future work (not in this PR)

Test plan

Uh oh!

dd-octo-sts Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🟢 Java Benchmark SLOs — All performance SLOs passed

Uh oh!

datadog-datadog-prod-us1 Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Warnings

Uh oh!

dougqh commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

JMH benches (8 producer threads, `@Warmup(2, 15s) @Measurement(5, 15s) @Fork(1)`)

Petclinic load tests (Java 17, 8 jmeter threads, `GET /owners/3`, single iter)

dd-octo-sts Bot commented May 28, 2026 •

edited

Loading

datadog-datadog-prod-us1 Bot commented May 28, 2026 •

edited

Loading