Skip to content

fix(tests): prevent HashedWheelTimer leak in integration tests#97

Draft
dkasimovskiy wants to merge 13 commits into
masterfrom
fix/box-integration-tests
Draft

fix(tests): prevent HashedWheelTimer leak in integration tests#97
dkasimovskiy wants to merge 13 commits into
masterfrom
fix/box-integration-tests

Conversation

@dkasimovskiy

Copy link
Copy Markdown
Contributor

The integration test base classes in tarantool-core, tarantool-schema, tarantool-jackson-mapping, and tarantool-client created a static HashedWheelTimer that was never released, producing a Netty leak warning ('LEAK: HashedWheelTimer.release() was not called before it's garbage-collected') for every test class on every CI run.

  • Promote timerService and factory to static (shared across tests in a class for efficiency) and register a JVM shutdown hook that calls timerService.stop() and cancels any pending timeouts at JVM exit. A shutdown hook is used instead of @afterall because the static fields in BaseTest are per-declaring-class and would otherwise be stopped after the first subclass, breaking subsequent test classes.
  • IProtoClientTest.testTimeoutCancel was reassigning the shared static timer; refactored to use a local HashedWheelTimer and ConnectionFactory so the cancellation behavior is still exercised without affecting other tests.
  • IProtoClientWatchersTest.checkTTVersion was throwing NPE when the TARANTOOL_VERSION environment variable was not set (the case for local runs; CI sets it). Fall back to the actual container version when the env var is unset.

I haven't forgotten about:

  • Tests
  • Changelog
  • Documentation
    • JavaDoc was written
  • Commit messages comply with the guideline
  • Cleanup the code for review. See checklist

Related issues:

dkasimovskiy and others added 13 commits June 11, 2026 16:37
The integration test base classes in tarantool-core, tarantool-schema,
tarantool-jackson-mapping, and tarantool-client created a static
HashedWheelTimer that was never released, producing a Netty leak warning
('LEAK: HashedWheelTimer.release() was not called before it's
garbage-collected') for every test class on every CI run.

- Promote timerService and factory to static (shared across tests in a
  class for efficiency) and register a JVM shutdown hook that calls
  timerService.stop() and cancels any pending timeouts at JVM exit. A
  shutdown hook is used instead of @afterall because the static fields in
  BaseTest are per-declaring-class and would otherwise be stopped after
  the first subclass, breaking subsequent test classes.
- IProtoClientTest.testTimeoutCancel was reassigning the shared static
  timer; refactored to use a local HashedWheelTimer and ConnectionFactory
  so the cancellation behavior is still exercised without affecting other
  tests.
- IProtoClientWatchersTest.checkTTVersion was throwing NPE when the
  TARANTOOL_VERSION environment variable was not set (the case for local
  runs; CI sets it). Fall back to the actual container version when the
  env var is unset.
Synchronize PoolEntry state mutations and add volatile visibility to
fix intermittent failures in ConnectionPoolReconnectsTest.

PoolEntry changes:
- Make connectFuture, lastHeartbeatEvent, heartbeatTask, reconnectTask,
  isHeartbeatStarted, isLocked volatile to ensure cross-thread visibility
  between netty IO threads and the main thread.
- Add AtomicBoolean isShutdown for idempotent shutdown() so the close
  listener fires only once even with overlapping close events.
- Synchronize lock(), unlock(), close(), shutdown(), internalConnect(),
  stopReconnectTask() to serialize state transitions.
- Make connectAfter() cancel the previous reconnect task atomically and
  increment the reconnecting counter only when no task is pending,
  preventing duplicate reconnect scheduling.
- Add double-check for connectFuture in internalConnect() to return the
  in-flight future instead of starting a new connect, eliminating the
  shutdown() vs internalConnect() race that could leave the pool with a
  closed client behind a fresh future.
- Wrap state mutations in handleConnectError() with synchronized block.

IProtoClientPoolImpl changes:
- Synchronize forEach() on connectionPoolLock to prevent
  ConcurrentModificationException when called concurrently with setGroups().

Verified: ConnectionPoolReconnectsTest passes 5/5 runs locally.
ConnectionPoolTest (12 tests) and 24 unit tests remain green.
… cycle

The previous isShutdown early-return made shutdown() a no-op on the
second invocation, which leaked the new connection opened by a
reconnect after a KILL.

Sequence with the bug:
1. Heartbeat fires KILL -> shutdown() -> client.close() (closes old
   connection), isShutdown=true, emit onConnectionClosed
2. connectAfter() schedules a reconnect (1s later)
3. Reconnect fires -> internalConnect() opens a new channel on the
   same ConnectionImpl (state CLOSED -> CONNECTING -> READY)
4. The new connection's auth/ping times out (pipe locked in heartbeat
   tests) -> handleConnectError -> shutdown() is now a no-op, so
   client.close() is NOT called on the new connection
5. connectAfter() schedules ANOTHER reconnect, leaving the new
   connection open and leaking

This caused the testReconnectsAfterInvalidationAndKill test to time
out at 60s because active connections to node-a never reached 0
after KILL, and the test was waiting for that condition.

Fix: keep the client.close() call (and stopHeartbeat/connectFuture
cleanup) on every shutdown() invocation, but guard only the
onConnectionClosed emit with the isShutdown flag. This way the
client is always closed (idempotently, since closeChannel is a no-op
if state is already CLOSED/CLOSING), but the pool's
onConnectionClosed listener fires only once per PoolEntry, preventing
double-decrement of the unavailable counter.

Verified:
- ConnectionPoolReconnectsTest: 5/5 runs pass
- ConnectionPoolTest: 5/5 runs pass (12 tests each)
- All 24 unit tests pass

ConnectionPoolHeartbeatTest is still flaky (3-5/10 pass rate) due to
a separate test-infrastructure issue: the test asserts on tarantool's
box.stat.net().CONNECTIONS.current counter, which is updated
asynchronously and is often 0-6 below the expected count when 20+
connections are created in a tight burst. This is a test-side timing
race, not a PoolEntry bug, and cannot be fixed without modifying the
test to wait for the counter to stabilize.
The ConnectionPoolHeartbeatTest assertions in BasePoolTest.getActiveConnectionsCount
were flaky because tarantool's box.stat.net().CONNECTIONS.current is updated
asynchronously by the IProto worker.

When the test creates 20+ pool entries in a tight burst (pool.get() in a loop
followed by CompletableFuture.allOf().join()), netty establishes all the TCP
connections in ~50 ms, but the IProto worker in tarantool processes them serially
with 10-30 ms per connection. The test then reads the counter before the worker
has caught up, so the returned value is 5-15 below the expected count.

In the most extreme case, even the helper's own 'tt' connection is not yet
counted in CONNECTIONS.current by the time the Lua runs (because the IProto
worker updates the counter after sending the response, and the 'tt' connection
can close before the response reaches it), producing a delta of -2 to -5
relative to the baseline and failing the equality check with messages like
'expected: <20> but was: <10>' or 'expected: <13> but was: <-2>'.

Fix: run the read inside a wait-for-stable loop. The Lua sleeps on a fiber
for 50 ms between reads, which yields the IProto worker and gives it time to
accept the pending TCP connections and update the counter. The loop exits as
soon as two consecutive reads return the same value, with a 2.5 s safety
timeout.

Verified:
- ConnectionPoolHeartbeatTest: 10/10 runs pass (previously 2-5/10)
- ConnectionPoolHeartbeatTest: 5/5 full profile runs pass (17/17 tests)
- ConnectionPoolReconnectsTest: 5/5 runs pass (no regression)
- ConnectionPoolTest: 5/5 runs pass (12/12 tests, no regression)
- 24/24 unit tests pass (no regression)
- 349/349 crud integration tests pass (no regression)
tarantoolctl/tt processes a multi-line script line by line and emits a
separate YAML document (---\n...\n) for each line. SnakeYAML's
yaml.load() rejects multi-document streams with
"expected a single document in the stream", which broke 3.x tests
that piped the wait-for-stable script via echo.

Put the script on a single line with ';' separators so the output is
a single YAML document.
…nt.close()/emit outside monitor

The previous attempt to fix race conditions in PoolEntry made shutdown(),
close() and handleConnectError() synchronized on the PoolEntry monitor while
they invoked client.close() (which acquires the ConnectionImpl monitor) and
emitted pool events. The Netty close-event path runs the reverse order:
ConnectionImpl.onChannelClose() (synchronized on ConnectionImpl) calls back
into PoolEntry.handleConnectError(), which then blocks on the PoolEntry
monitor. This produced an ABBA deadlock between the HashedWheelTimer worker
and the Netty event loop, which manifested as a permanent hang in
DistributingRoundRobinBalancerTest (the @timeout(25) SameThread variant
cannot interrupt CompletableFuture.join()).

The thread dump of the hung JVM confirmed:
  - 'pool-N-thread-1' BLOCKED in ConnectionImpl.closeChannel, holding PoolEntry
  - 'multiThreadIoEventLoopGroup-2-1' BLOCKED in PoolEntry.handleConnectError, holding ConnectionImpl

Fix: narrow the entry-monitor critical sections to field mutations only.
client.close() and emit(...) are now called outside the monitor, breaking
the lock-ordering cycle.

Also fix a race in internalConnect(): assign connectFuture to the local chain
BEFORE attaching whenComplete, so an inline failure completing before the
method returns cannot have its nulling of connectFuture in
handleConnectError() overwritten by a stale failed future.

Verified locally:
  - DistributingRoundRobinBalancerTest: 5/5 on 2.11.8, 5/5 on 3.5.0, no hangs
  - testDistributingRoundRobinStartWithStuckNodeA (reproducer): 5/5 stable
  - ConnectionPoolReconnectsTest + ConnectionPoolHeartbeatTest: 5/5 on 2.11.8, 3/3 on 3.5.0
  - IProtoClientPoolTest + ConnectionPoolTest (unit): 32/32
…d NPE on inline connect failure

internalConnect() now returns the local chain variable (cf) rather than the
connectFuture field. If client.connect() fails inline (e.g. connection
refused, exercised by ConnectionPoolTest#testConnectError), the
whenComplete callback fires reentrantly on the calling thread, which
causes handleConnectError() to clear the connectFuture field for reconnect
purposes. The local variable still references the (failed) future, so
callers always receive a non-null result and observe the failure via
ExecutionException from .get(). The field being null guarantees the next
reconnect attempt actually reconnects instead of returning a stale failed
future.

Previously internalConnect() assigned the result of whenComplete (a new
stage) to connectFuture, which inadvertently masked the reentrant nulling
in handleConnectError(). Reordering that assignment to occur before
whenComplete exposed callers to the null field, causing a NullPointerException
at IProtoClientPoolImpl.get:457 when handling inline connect failures.

Fixes the regression introduced in ae504a3 (testConnectError in CI run
27552580408 on 3.5.0) while preserving the deadlock fix and the reconnect
de-latching behaviour from that commit.

Verified locally on 3.5.0 and 2.11.8:
  - ConnectionPoolTest#testConnectError: 10/10 PASS on each version
  - ConnectionPoolTest (full): 12/12 on 3.5.0
  - ConnectionPoolReconnectsTest: 10/10 on 3.5.0
  - ConnectionPoolHeartbeatTest: 4/4 on 3.5.0
  - DistributingRoundRobinBalancerTest: 5/5 on 3.5.0
  - IProtoClientPoolTest + ConnectionPoolTest (unit): 32/32
…in ConnectionPoolReconnectsTest

box.stat.net().CONNECTIONS.current is updated asynchronously by the IProto
worker and closing the administrative net.box connection used by
executeCommandDecoded is asynchronous on the server, so a single read can
briefly observe a stale or transitional value. This made the absolute
assertion `assertEquals(count1, getActiveConnectionsCount(tt))` flaky when
timings shifted (e.g. under docker emulation locally and on the self-hosted
3.5.0 runner), producing `expected: <11> but was: <12>`.

Add a waitForActiveConnections(tt, expected) helper in BasePoolTest that
retries the assertion via the existing waitFor() loop, and use it in
ConnectionPoolReconnectsTest#testReconnectAfterNodeFailure at both
assertion sites (initial connect and post-reconnect). Other tests
(ConnectionPoolTest, ConnectionPoolHeartbeatTest) are unaffected as they
use the established baseline+delta pattern or pass in CI.

Verified locally on 3.5.0:
  - ConnectionPoolReconnectsTest: 10/10 PASS (previously flaky)
  - Full box-integration target modules: 0 failures
…efore declaring cluster ready

VshardClusterConfigurator#configure() previously declared the cluster
ready after three checks: router is up, vshard.router.bootstrap() returns
cleanly, and crud._VERSION is reachable on the router. None of these
verifies that individual storages have completed the vshard handshake
during the initial rebalance; the router can answer 'bootstrap ok' while
some storages are still in the VHANDSHAKE_NOT_COMPLETE (code 40) state,
and a CRUD request that targets such a storage fails immediately.

Observed in tests-crud-integration (3.5.0):
  TarantoolTemplateViaJavaConfigTest.testKVTemplateDeleteEntities
    -> CrudException: Failed to truncate for storage-002
       VHANDSHAKE_NOT_COMPLETE, 'Handshake with storage-002-a have not
       been completed yet'

Add a fourth readiness step: waitUntilVshardStoragesAreReady polls
vshard.router.info() until every replica in every replicaset is in
status='available' and there are no unreachable buckets, with a 120s
budget. This guarantees that any subsequent CRUD request hits a fully
handshaked storage.

Verified locally on 3.5.0:
  - TarantoolCrudClientTest: 347/347 PASS (was failing with
    VHANDSHAKE_NOT_COMPLETE in CI)
  - TarantoolTemplateViaJavaConfigTest (spring-data27, the originally
    failing test): 35/35 PASS

This is a testcontainers/infra change; it is added to the same PR as
the pooling fixups because both were investigated together after the
box-integration deadlocks were resolved.
…re declaring cluster ready

The previous readiness check (a13030a) inspected vshard.router.info() for
network-level availability of every replica, which is sufficient only to
verify that the router can ping each storage. It did not verify that the
storage has completed the vshard handshake on its own side and is ready
to serve read-write requests. As a result, a cluster could be declared
ready while some storage was still in VHANDSHAKE_NOT_COMPLETE (vshard
code 40), and the first rw call (typically a CRUD truncate issued by an
@beforeeach in spring-data31 TarantoolTemplateViaJavaConfigTest) failed
with: 'Handshake with storage-002-a have not been completed yet'.

Strengthen the readiness check: in addition to the previous invariants
(every replicaset has a master, every replica is 'available', there are
no unreachable/unknown buckets), require bucket.available_rw ==
vshard.router.bucket_count(). This guarantees that all buckets are
servable for read-write before the cluster is reported ready.

Also add VSHARD_STORAGES_STATUS_COMMAND and logVshardStoragesStatus() to
emit the actual total/available_rw/available_ro/unreachable/unknown
counts when the cluster is not yet ready, so CI logs show the progress
of the vshard rebalance.

Bump TIMEOUT_VSHARD_STORAGES_READY_IN_SECONDS from 120 to 300 to give
slower self-hosted runners a comfortable budget for the initial
rebalance to settle.

Observed in tests-crud-integration (3.5.0):
  TarantoolTemplateViaJavaConfigTest.testSimpleClient
    -> CrudException: VHANDSHAKE_NOT_COMPLETE for storage-002-a

Verified locally on 3.5.0 (the originally failing test now passes 3/3
in a stress loop; the full crud-integration module passes all but one
class which fails only due to a local amd64->arm64 emulation artifact
when the same container is restarted several times in quick succession):
  - spring-data31 TarantoolTemplateViaJavaConfigTest: 3/3 PASS
  - spring-data27 TarantoolTemplateViaJavaConfigTest: 2/3 PASS (3rd hangs
    only on local amd64->arm64 emulation; not a code defect)
  - TarantoolCrudClientTest: 347/347 PASS
  - spring-data27 full module: 357/357 PASS
… check

The readiness check from 3ad515a relied on vshard.router.info() reporting
bucket.available_rw == vshard.router.bucket_count(). This is a passive
view of the router-side bucket accounting and does not always match the
actual rw-handshaked state of each storage in the early moments of a
fresh cluster: the third CI run (run 27611921481) still failed with
'Timeout exceeded while waiting for vshard storages to complete
handshake' on spring-data32.RepositoryViaJavaConfigTest, even though
the previous data27/data31 classes had started successfully within
the same Maven reactor.

Add a second, active tier to the check: after the passive
vshard.router.info() pre-check passes, perform a real read-write call
(vshard.router.callrw(1, 'box.info', {})) wrapped in pcall. This is the
most direct signal that the storage-side handshake has actually
finished for at least one storage. A successful rw call proves that
the very first CRUD call (e.g. truncate issued by @beforeeach) will
not fail with VHANDSHAKE_NOT_COMPLETE.

The probe is intentionally kept on a single Lua line because the
testcontainers command pipe flattens newlines, which would otherwise
confuse the Tarantool Lua parser around 'end' (observed in local
verification: the multi-line form produced 'end expected near local').

Verified locally on 3.5.0:
  - spring-data31 TarantoolTemplateViaJavaConfigTest: 35/35 PASS
  - spring-data27 TarantoolTemplateViaJavaConfigTest: 35/35 PASS
… rw before declaring cluster ready"

This reverts commit 3ad515a.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant