fix(tests): prevent HashedWheelTimer leak in integration tests#97
Draft
dkasimovskiy wants to merge 13 commits into
Draft
fix(tests): prevent HashedWheelTimer leak in integration tests#97dkasimovskiy wants to merge 13 commits into
dkasimovskiy wants to merge 13 commits into
Conversation
The integration test base classes in tarantool-core, tarantool-schema,
tarantool-jackson-mapping, and tarantool-client created a static
HashedWheelTimer that was never released, producing a Netty leak warning
('LEAK: HashedWheelTimer.release() was not called before it's
garbage-collected') for every test class on every CI run.
- Promote timerService and factory to static (shared across tests in a
class for efficiency) and register a JVM shutdown hook that calls
timerService.stop() and cancels any pending timeouts at JVM exit. A
shutdown hook is used instead of @afterall because the static fields in
BaseTest are per-declaring-class and would otherwise be stopped after
the first subclass, breaking subsequent test classes.
- IProtoClientTest.testTimeoutCancel was reassigning the shared static
timer; refactored to use a local HashedWheelTimer and ConnectionFactory
so the cancellation behavior is still exercised without affecting other
tests.
- IProtoClientWatchersTest.checkTTVersion was throwing NPE when the
TARANTOOL_VERSION environment variable was not set (the case for local
runs; CI sets it). Fall back to the actual container version when the
env var is unset.
Synchronize PoolEntry state mutations and add volatile visibility to fix intermittent failures in ConnectionPoolReconnectsTest. PoolEntry changes: - Make connectFuture, lastHeartbeatEvent, heartbeatTask, reconnectTask, isHeartbeatStarted, isLocked volatile to ensure cross-thread visibility between netty IO threads and the main thread. - Add AtomicBoolean isShutdown for idempotent shutdown() so the close listener fires only once even with overlapping close events. - Synchronize lock(), unlock(), close(), shutdown(), internalConnect(), stopReconnectTask() to serialize state transitions. - Make connectAfter() cancel the previous reconnect task atomically and increment the reconnecting counter only when no task is pending, preventing duplicate reconnect scheduling. - Add double-check for connectFuture in internalConnect() to return the in-flight future instead of starting a new connect, eliminating the shutdown() vs internalConnect() race that could leave the pool with a closed client behind a fresh future. - Wrap state mutations in handleConnectError() with synchronized block. IProtoClientPoolImpl changes: - Synchronize forEach() on connectionPoolLock to prevent ConcurrentModificationException when called concurrently with setGroups(). Verified: ConnectionPoolReconnectsTest passes 5/5 runs locally. ConnectionPoolTest (12 tests) and 24 unit tests remain green.
… cycle The previous isShutdown early-return made shutdown() a no-op on the second invocation, which leaked the new connection opened by a reconnect after a KILL. Sequence with the bug: 1. Heartbeat fires KILL -> shutdown() -> client.close() (closes old connection), isShutdown=true, emit onConnectionClosed 2. connectAfter() schedules a reconnect (1s later) 3. Reconnect fires -> internalConnect() opens a new channel on the same ConnectionImpl (state CLOSED -> CONNECTING -> READY) 4. The new connection's auth/ping times out (pipe locked in heartbeat tests) -> handleConnectError -> shutdown() is now a no-op, so client.close() is NOT called on the new connection 5. connectAfter() schedules ANOTHER reconnect, leaving the new connection open and leaking This caused the testReconnectsAfterInvalidationAndKill test to time out at 60s because active connections to node-a never reached 0 after KILL, and the test was waiting for that condition. Fix: keep the client.close() call (and stopHeartbeat/connectFuture cleanup) on every shutdown() invocation, but guard only the onConnectionClosed emit with the isShutdown flag. This way the client is always closed (idempotently, since closeChannel is a no-op if state is already CLOSED/CLOSING), but the pool's onConnectionClosed listener fires only once per PoolEntry, preventing double-decrement of the unavailable counter. Verified: - ConnectionPoolReconnectsTest: 5/5 runs pass - ConnectionPoolTest: 5/5 runs pass (12 tests each) - All 24 unit tests pass ConnectionPoolHeartbeatTest is still flaky (3-5/10 pass rate) due to a separate test-infrastructure issue: the test asserts on tarantool's box.stat.net().CONNECTIONS.current counter, which is updated asynchronously and is often 0-6 below the expected count when 20+ connections are created in a tight burst. This is a test-side timing race, not a PoolEntry bug, and cannot be fixed without modifying the test to wait for the counter to stabilize.
The ConnectionPoolHeartbeatTest assertions in BasePoolTest.getActiveConnectionsCount were flaky because tarantool's box.stat.net().CONNECTIONS.current is updated asynchronously by the IProto worker. When the test creates 20+ pool entries in a tight burst (pool.get() in a loop followed by CompletableFuture.allOf().join()), netty establishes all the TCP connections in ~50 ms, but the IProto worker in tarantool processes them serially with 10-30 ms per connection. The test then reads the counter before the worker has caught up, so the returned value is 5-15 below the expected count. In the most extreme case, even the helper's own 'tt' connection is not yet counted in CONNECTIONS.current by the time the Lua runs (because the IProto worker updates the counter after sending the response, and the 'tt' connection can close before the response reaches it), producing a delta of -2 to -5 relative to the baseline and failing the equality check with messages like 'expected: <20> but was: <10>' or 'expected: <13> but was: <-2>'. Fix: run the read inside a wait-for-stable loop. The Lua sleeps on a fiber for 50 ms between reads, which yields the IProto worker and gives it time to accept the pending TCP connections and update the counter. The loop exits as soon as two consecutive reads return the same value, with a 2.5 s safety timeout. Verified: - ConnectionPoolHeartbeatTest: 10/10 runs pass (previously 2-5/10) - ConnectionPoolHeartbeatTest: 5/5 full profile runs pass (17/17 tests) - ConnectionPoolReconnectsTest: 5/5 runs pass (no regression) - ConnectionPoolTest: 5/5 runs pass (12/12 tests, no regression) - 24/24 unit tests pass (no regression) - 349/349 crud integration tests pass (no regression)
tarantoolctl/tt processes a multi-line script line by line and emits a separate YAML document (---\n...\n) for each line. SnakeYAML's yaml.load() rejects multi-document streams with "expected a single document in the stream", which broke 3.x tests that piped the wait-for-stable script via echo. Put the script on a single line with ';' separators so the output is a single YAML document.
…nt.close()/emit outside monitor The previous attempt to fix race conditions in PoolEntry made shutdown(), close() and handleConnectError() synchronized on the PoolEntry monitor while they invoked client.close() (which acquires the ConnectionImpl monitor) and emitted pool events. The Netty close-event path runs the reverse order: ConnectionImpl.onChannelClose() (synchronized on ConnectionImpl) calls back into PoolEntry.handleConnectError(), which then blocks on the PoolEntry monitor. This produced an ABBA deadlock between the HashedWheelTimer worker and the Netty event loop, which manifested as a permanent hang in DistributingRoundRobinBalancerTest (the @timeout(25) SameThread variant cannot interrupt CompletableFuture.join()). The thread dump of the hung JVM confirmed: - 'pool-N-thread-1' BLOCKED in ConnectionImpl.closeChannel, holding PoolEntry - 'multiThreadIoEventLoopGroup-2-1' BLOCKED in PoolEntry.handleConnectError, holding ConnectionImpl Fix: narrow the entry-monitor critical sections to field mutations only. client.close() and emit(...) are now called outside the monitor, breaking the lock-ordering cycle. Also fix a race in internalConnect(): assign connectFuture to the local chain BEFORE attaching whenComplete, so an inline failure completing before the method returns cannot have its nulling of connectFuture in handleConnectError() overwritten by a stale failed future. Verified locally: - DistributingRoundRobinBalancerTest: 5/5 on 2.11.8, 5/5 on 3.5.0, no hangs - testDistributingRoundRobinStartWithStuckNodeA (reproducer): 5/5 stable - ConnectionPoolReconnectsTest + ConnectionPoolHeartbeatTest: 5/5 on 2.11.8, 3/3 on 3.5.0 - IProtoClientPoolTest + ConnectionPoolTest (unit): 32/32
…d NPE on inline connect failure internalConnect() now returns the local chain variable (cf) rather than the connectFuture field. If client.connect() fails inline (e.g. connection refused, exercised by ConnectionPoolTest#testConnectError), the whenComplete callback fires reentrantly on the calling thread, which causes handleConnectError() to clear the connectFuture field for reconnect purposes. The local variable still references the (failed) future, so callers always receive a non-null result and observe the failure via ExecutionException from .get(). The field being null guarantees the next reconnect attempt actually reconnects instead of returning a stale failed future. Previously internalConnect() assigned the result of whenComplete (a new stage) to connectFuture, which inadvertently masked the reentrant nulling in handleConnectError(). Reordering that assignment to occur before whenComplete exposed callers to the null field, causing a NullPointerException at IProtoClientPoolImpl.get:457 when handling inline connect failures. Fixes the regression introduced in ae504a3 (testConnectError in CI run 27552580408 on 3.5.0) while preserving the deadlock fix and the reconnect de-latching behaviour from that commit. Verified locally on 3.5.0 and 2.11.8: - ConnectionPoolTest#testConnectError: 10/10 PASS on each version - ConnectionPoolTest (full): 12/12 on 3.5.0 - ConnectionPoolReconnectsTest: 10/10 on 3.5.0 - ConnectionPoolHeartbeatTest: 4/4 on 3.5.0 - DistributingRoundRobinBalancerTest: 5/5 on 3.5.0 - IProtoClientPoolTest + ConnectionPoolTest (unit): 32/32
…in ConnectionPoolReconnectsTest box.stat.net().CONNECTIONS.current is updated asynchronously by the IProto worker and closing the administrative net.box connection used by executeCommandDecoded is asynchronous on the server, so a single read can briefly observe a stale or transitional value. This made the absolute assertion `assertEquals(count1, getActiveConnectionsCount(tt))` flaky when timings shifted (e.g. under docker emulation locally and on the self-hosted 3.5.0 runner), producing `expected: <11> but was: <12>`. Add a waitForActiveConnections(tt, expected) helper in BasePoolTest that retries the assertion via the existing waitFor() loop, and use it in ConnectionPoolReconnectsTest#testReconnectAfterNodeFailure at both assertion sites (initial connect and post-reconnect). Other tests (ConnectionPoolTest, ConnectionPoolHeartbeatTest) are unaffected as they use the established baseline+delta pattern or pass in CI. Verified locally on 3.5.0: - ConnectionPoolReconnectsTest: 10/10 PASS (previously flaky) - Full box-integration target modules: 0 failures
…efore declaring cluster ready
VshardClusterConfigurator#configure() previously declared the cluster
ready after three checks: router is up, vshard.router.bootstrap() returns
cleanly, and crud._VERSION is reachable on the router. None of these
verifies that individual storages have completed the vshard handshake
during the initial rebalance; the router can answer 'bootstrap ok' while
some storages are still in the VHANDSHAKE_NOT_COMPLETE (code 40) state,
and a CRUD request that targets such a storage fails immediately.
Observed in tests-crud-integration (3.5.0):
TarantoolTemplateViaJavaConfigTest.testKVTemplateDeleteEntities
-> CrudException: Failed to truncate for storage-002
VHANDSHAKE_NOT_COMPLETE, 'Handshake with storage-002-a have not
been completed yet'
Add a fourth readiness step: waitUntilVshardStoragesAreReady polls
vshard.router.info() until every replica in every replicaset is in
status='available' and there are no unreachable buckets, with a 120s
budget. This guarantees that any subsequent CRUD request hits a fully
handshaked storage.
Verified locally on 3.5.0:
- TarantoolCrudClientTest: 347/347 PASS (was failing with
VHANDSHAKE_NOT_COMPLETE in CI)
- TarantoolTemplateViaJavaConfigTest (spring-data27, the originally
failing test): 35/35 PASS
This is a testcontainers/infra change; it is added to the same PR as
the pooling fixups because both were investigated together after the
box-integration deadlocks were resolved.
…re declaring cluster ready The previous readiness check (a13030a) inspected vshard.router.info() for network-level availability of every replica, which is sufficient only to verify that the router can ping each storage. It did not verify that the storage has completed the vshard handshake on its own side and is ready to serve read-write requests. As a result, a cluster could be declared ready while some storage was still in VHANDSHAKE_NOT_COMPLETE (vshard code 40), and the first rw call (typically a CRUD truncate issued by an @beforeeach in spring-data31 TarantoolTemplateViaJavaConfigTest) failed with: 'Handshake with storage-002-a have not been completed yet'. Strengthen the readiness check: in addition to the previous invariants (every replicaset has a master, every replica is 'available', there are no unreachable/unknown buckets), require bucket.available_rw == vshard.router.bucket_count(). This guarantees that all buckets are servable for read-write before the cluster is reported ready. Also add VSHARD_STORAGES_STATUS_COMMAND and logVshardStoragesStatus() to emit the actual total/available_rw/available_ro/unreachable/unknown counts when the cluster is not yet ready, so CI logs show the progress of the vshard rebalance. Bump TIMEOUT_VSHARD_STORAGES_READY_IN_SECONDS from 120 to 300 to give slower self-hosted runners a comfortable budget for the initial rebalance to settle. Observed in tests-crud-integration (3.5.0): TarantoolTemplateViaJavaConfigTest.testSimpleClient -> CrudException: VHANDSHAKE_NOT_COMPLETE for storage-002-a Verified locally on 3.5.0 (the originally failing test now passes 3/3 in a stress loop; the full crud-integration module passes all but one class which fails only due to a local amd64->arm64 emulation artifact when the same container is restarted several times in quick succession): - spring-data31 TarantoolTemplateViaJavaConfigTest: 3/3 PASS - spring-data27 TarantoolTemplateViaJavaConfigTest: 2/3 PASS (3rd hangs only on local amd64->arm64 emulation; not a code defect) - TarantoolCrudClientTest: 347/347 PASS - spring-data27 full module: 357/357 PASS
… check The readiness check from 3ad515a relied on vshard.router.info() reporting bucket.available_rw == vshard.router.bucket_count(). This is a passive view of the router-side bucket accounting and does not always match the actual rw-handshaked state of each storage in the early moments of a fresh cluster: the third CI run (run 27611921481) still failed with 'Timeout exceeded while waiting for vshard storages to complete handshake' on spring-data32.RepositoryViaJavaConfigTest, even though the previous data27/data31 classes had started successfully within the same Maven reactor. Add a second, active tier to the check: after the passive vshard.router.info() pre-check passes, perform a real read-write call (vshard.router.callrw(1, 'box.info', {})) wrapped in pcall. This is the most direct signal that the storage-side handshake has actually finished for at least one storage. A successful rw call proves that the very first CRUD call (e.g. truncate issued by @beforeeach) will not fail with VHANDSHAKE_NOT_COMPLETE. The probe is intentionally kept on a single Lua line because the testcontainers command pipe flattens newlines, which would otherwise confuse the Tarantool Lua parser around 'end' (observed in local verification: the multi-line form produced 'end expected near local'). Verified locally on 3.5.0: - spring-data31 TarantoolTemplateViaJavaConfigTest: 35/35 PASS - spring-data27 TarantoolTemplateViaJavaConfigTest: 35/35 PASS
…eadiness check" This reverts commit 101fc92.
… rw before declaring cluster ready" This reverts commit 3ad515a.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The integration test base classes in tarantool-core, tarantool-schema, tarantool-jackson-mapping, and tarantool-client created a static HashedWheelTimer that was never released, producing a Netty leak warning ('LEAK: HashedWheelTimer.release() was not called before it's garbage-collected') for every test class on every CI run.
I haven't forgotten about:
Related issues: