[Bug] NoSuchElementException in InMemoryDelayedDeliveryTracker.nextDeliveryTime broke message dispatching until broker was restarted #25617

@chamons

Description

Search before reporting

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

User environment

Broker Version

$ bin/pulsar version
Current version of pulsar is: 4.1.2
Git Revision 5f587c1204d13ec6b657c7eef472f8427368a9f6
Git Branch branch-4.1
Built by Lari Hotari <lhotari@apache.org> on Laris-MBP-2.lan at 2025-11-11T23:36:57+0200

Broker OS:

$ uname -a
Linux pulsar-broker-0 6.6.113+ #1 SMP Wed Jan 28 03:26:59 UTC 2026 aarch64 GNU/Linux

Broker Java:

$ java -version
openjdk version "21.0.9" 2025-10-21 LTS
OpenJDK Runtime Environment Corretto-21.0.9.11.1 (build 21.0.9+11-LTS)
OpenJDK 64-Bit Server VM Corretto-21.0.9.11.1 (build 21.0.9+11-LTS, mixed mode)

Client Library:
https://github.com/streamnative/pulsar-rs

pulsar = { version = "6.7.2", features = [
    "compression",
    "tokio-runtime",
], default-features = false }

Client OS Version - Same as Broker OS

Issue Description

We have a production broker and bookie cluster of 6 nodes each. While scaling up a new feature of our service that wrote high-volume messages to a topic with deliver_at_time mostly set within the upcoming minute, we saw this exported Pulsar metric increase over time:

sum by (topic)(pulsar_subscription_back_log_no_delayed{namespace=~"pulsar-$environment", subscription="delivery-subscription"})

[Screenshot: backlog metric increasing over time]

This was despite the fact that our consumers were actively consuming from that topic, with plenty of headroom to consume more.

After an investigation, we restarted our consumer pool to no effect, and then performed a rolling restart of the brokers. We then saw a massive spike from 42.7 thousand to 2.64 million entries on that topic, which our consumer group then started to process.

[Screenshot: backlog spike from 42.7 thousand to 2.64 million entries]

Investigating the messages now being sent, we found messages from more than a week ago, including some that we have evidence were previously processed and acked successfully.

Thankfully we had added stale-message detection to our consumer, so this was only a data-loss event and not something worse (acting on week-old messages as if they were fresh).

Investigating the logs, we found 165 instances of java.util.NoSuchElementException thrown from Long2ObjectAVLTreeMap via InMemoryDelayedDeliveryTracker (see the error message below), some of which correlated closely with increases in the "stuck" backlog or with high-volume writes to Pulsar.

There are three bugs here from my perspective:

  • Why did the broker cluster stop dispatching messages that had passed their deliver_at_time for this topic until the broker was restarted?
  • Why did restarting the broker cause multiple orders of magnitude more messages to be (re)sent to our consumers?
  • Why were the uncaught exceptions that appear to be the source of this bug reported only at info log level? Why did the broker not fail a health check or otherwise surface the problem via metrics?

This was a very serious bug for us, causing non-trivial data loss, as our messages have a QoS timing window for prompt delivery that we missed.

If an engineer had not manually noticed that a reported metric was increasing over time, this could have gone on for much longer than a week. If we had not had a stale-message detector, we could have reprocessed and resent messages from multiple days ago (which our standard Pulsar idempotency detector would have missed). The fact that this issue was discovered by a lucky glance at a dashboard, and not by failing health checks or other standard metrics, also has us concerned about suitability for our high-performance workload.

Error messages

java.util.NoSuchElementException: null
	at it.unimi.dsi.fastutil.longs.Long2ObjectAVLTreeMap.firstLongKey(Long2ObjectAVLTreeMap.java:934) ~[it.unimi.dsi-fastutil-8.5.16.jar:?]
	at org.apache.pulsar.broker.delayed.InMemoryDelayedDeliveryTracker.nextDeliveryTime(InMemoryDelayedDeliveryTracker.java:273) ~[org.apache.pulsar-pulsar-broker-4.1.2.jar:4.1.2]
	at org.apache.pulsar.broker.delayed.AbstractDelayedDeliveryTracker.updateTimer(AbstractDelayedDeliveryTracker.java:99) ~[org.apache.pulsar-pulsar-broker-4.1.2.jar:4.1.2]
	at org.apache.pulsar.broker.delayed.InMemoryDelayedDeliveryTracker.getScheduledMessages(InMemoryDelayedDeliveryTracker.java:229) ~[org.apache.pulsar-pulsar-broker-4.1.2.jar:4.1.2]
	at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.getMessagesToReplayNow(PersistentDispatcherMultipleConsumers.java:1327) ~[org.apache.pulsar-pulsar-broker-4.1.2.jar:4.1.2]
	at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.readMoreEntries(PersistentDispatcherMultipleConsumers.java:385) ~[org.apache.pulsar-pulsar-broker-4.1.2.jar:4.1.2]
	at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.lambda$readMoreEntriesAsync$4(PersistentDispatcherMultipleConsumers.java:324) ~[org.apache.pulsar-pulsar-broker-4.1.2.jar:4.1.2]
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173) ~[io.netty-netty-common-4.1.128.Final.jar:4.1.128.Final]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166) ~[io.netty-netty-common-4.1.128.Final.jar:4.1.128.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) ~[io.netty-netty-common-4.1.128.Final.jar:4.1.128.Final]
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:405) ~[io.netty-netty-transport-classes-epoll-4.1.128.Final.jar:4.1.128.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:998) ~[io.netty-netty-common-4.1.128.Final.jar:4.1.128.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty-netty-common-4.1.128.Final.jar:4.1.128.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.128.Final.jar:4.1.128.Final]
	at java.base/java.lang.Thread.run(Unknown Source) [?:?]
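
To illustrate the failure mode at the top of the trace: both fastutil's Long2ObjectAVLTreeMap.firstLongKey() (per the stack trace) and the JDK's java.util.TreeMap.firstKey() throw NoSuchElementException when the map is empty. A minimal sketch using TreeMap as a stand-in (class and variable names here are hypothetical, not Pulsar's):

```java
import java.util.NoSuchElementException;
import java.util.TreeMap;

public class EmptyMapFirstKeyDemo {
    public static void main(String[] args) {
        // Hypothetical stand-in for a time-indexed delayed-message map.
        TreeMap<Long, String> delayedByTime = new TreeMap<>();

        // With at least one entry, firstKey() returns the earliest deliver_at.
        delayedByTime.put(1_700_000_060_000L, "msg");
        System.out.println("next delivery: " + delayedByTime.firstKey());

        // Once the map is empty, the same call throws instead of returning a
        // sentinel -- the condition the stack trace shows inside nextDeliveryTime().
        delayedByTime.clear();
        try {
            delayedByTime.firstKey();
            System.out.println("no exception");
        } catch (NoSuchElementException e) {
            System.out.println("NoSuchElementException: map is empty");
        }
    }
}
```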

Reproducing the issue

We are attempting to reproduce this in a synthetic testing environment and will update the issue if we succeed.

Our average write and processing rate is on the order of 1.5k messages/second across roughly 20 topics. This feature wrote a much smaller volume (on the order of 10/second), with short but intense spikes of 4k/second or more whose deliver_at_time values were very similar or identical.

See https://github.com/chamons/pulsar-scheduled-exception-repro
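
As an aside on why this write pattern may be relevant: bursts of messages sharing one deliver_at timestamp collapse into a single key of a time-indexed map, so draining one bucket can empty the whole index in a single step, at which point an unguarded firstKey() call throws. A sketch of that shape (TreeMap as a stand-in, hypothetical names; this illustrates the pattern, not Pulsar's actual data structure):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class BurstBucketSketch {
    public static void main(String[] args) {
        // Hypothetical stand-in: deliver_at timestamp -> message ids due then.
        NavigableMap<Long, List<Integer>> byTime = new TreeMap<>();

        long deliverAt = 1_700_000_060_000L; // one shared timestamp
        for (int id = 0; id < 4_000; id++) {
            byTime.computeIfAbsent(deliverAt, k -> new ArrayList<>()).add(id);
        }
        // 4,000 messages, but only ONE key in the index.
        System.out.println("keys=" + byTime.size()
                + " messages=" + byTime.get(deliverAt).size());

        // Draining that single bucket empties the whole map at once, so a
        // follow-up firstKey() with no isEmpty() guard would throw.
        byTime.pollFirstEntry();
        System.out.println("empty after one drain: " + byTime.isEmpty());
    }
}
```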

Additional information

I found #25174 which has some of the same hallmarks:

  • A broker exception
  • That seems to cause scheduled messages to not be dispatched
  • That is resolved by restarting the broker

However, the stack trace is different.

Are you willing to submit a PR?

  • I'm willing to submit a PR!
