feat: add automatic S3 request retry with exponential backoff #1701
allanrogerr wants to merge 7 commits into minio:master
Conversation
Implements SDK-level automatic retry matching the minio-go retry-mechanism spec. All requests through BaseS3Client.executeAsync() now retry on retryable S3 error codes (InternalError, SlowDown, RequestTimeout, etc.), retryable HTTP status codes (408, 429, 499, 500, 502, 503, 504, 520), and transient IO errors (connection reset, EOF, server closed idle connection). Non-seekable request bodies (raw okhttp3.RequestBody) get a single attempt; seekable bodies (byte[], ByteBuffer, RandomAccessFile) retry up to maxRetries (default 10). Backoff is full-jitter exponential: rand(0, min(1s, 200ms*2^n)) before the nth retry, matching minio-go's DefaultRetryUnit/DefaultRetryCap. The "RetryHead" S3 code is explicitly excluded so executeHeadAsync region redirect logic is unaffected. maxRetries is configurable via builder and setMaxRetries() post-construction. Closes minio#1700.
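The full-jitter backoff formula described above, rand(0, min(1s, 200ms*2^n)), can be sketched as follows. This is an illustrative stand-in, not the SDK's actual package-private Retry class; the class and constant names are made up for the example:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of the full-jitter backoff described above; BackoffSketch is an
// illustrative stand-in, not the SDK's package-private Retry class.
public class BackoffSketch {
  static final long RETRY_UNIT_MS = 200; // minio-go DefaultRetryUnit analogue
  static final long RETRY_CAP_MS = 1000; // minio-go DefaultRetryCap analogue

  // Delay before the nth retry: rand(0, min(cap, unit * 2^n)).
  static long backoffMillis(int attempt) {
    long exponential = RETRY_UNIT_MS << Math.min(attempt, 30); // cap the shift to avoid overflow
    long bound = Math.min(RETRY_CAP_MS, exponential);
    return ThreadLocalRandom.current().nextLong(0, bound + 1); // full jitter: uniform in [0, bound]
  }

  public static void main(String[] args) {
    for (int n = 0; n < 4; n++) {
      System.out.println("retry " + n + ": sleep " + backoffMillis(n) + " ms");
    }
  }
}
```

Full jitter spreads concurrent clients uniformly across the whole backoff window, which avoids the synchronized retry bursts a fixed exponential schedule produces.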
Pull request overview
This PR introduces SDK-level automatic retry for S3 requests with full-jitter exponential backoff, aligning the Java SDK’s behavior with minio-go’s retry policy and making transient S3/HTTP/IO failures less likely to leak to callers.
Changes:
- Added a new package-private `Retry` helper for retryability classification (S3 codes, HTTP status codes, IO exceptions) and jittered exponential backoff.
- Wired a retry loop into `BaseS3Client.executeAsync()` with delayed retries via a daemon `ScheduledExecutorService`, plus a seekability-based single-attempt gate for non-replayable bodies.
- Exposed configuration via `maxRetries(...)` on builders and `setMaxRetries(...)`, and added an `InvalidResponseException.responseCode()` accessor plus a comprehensive new `RetryTest` suite.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
Show a summary per file

| File | Description |
|---|---|
| api/src/main/java/io/minio/BaseS3Client.java | Adds execute-with-retry loop, backoff scheduling, and max-retry configuration. |
| api/src/main/java/io/minio/Retry.java | Defines retryable error sets and backoff computation. |
| api/src/main/java/io/minio/MinioAsyncClient.java | Adds maxRetries(...) builder option and applies it to the built client. |
| api/src/main/java/io/minio/MinioClient.java | Exposes setMaxRetries(...) and maxRetries(...) on the synchronous client/builder. |
| api/src/main/java/io/minio/Http.java | Adds Http.S3Request.body() accessor used for retry seekability gating. |
| api/src/main/java/io/minio/errors/InvalidResponseException.java | Stores/returns the triggering HTTP response code for retry classification. |
| api/src/test/java/io/minio/RetryTest.java | Adds unit + MockWebServer integration tests covering retry classification and behavior. |
- Http.Body now captures the initial file pointer (fileOffset) when
created for RandomAccessFile bodies; toRequestBody() seeks back to
that offset before each attempt, fixing both first-attempt and retry
payload corruption when checksum computation has advanced the pointer
- BaseS3Client.createBody() explicitly restores the file pointer after
computing checksums so Http.Body captures the correct starting offset
- Retry scheduler now dispatches actual retry work (credential fetch +
request build) to ForkJoinPool.commonPool(), freeing the single
scheduler thread from head-of-line blocking by slow providers
- Replace package-private {@link Retry#MAX_RETRY} Javadoc references in
MinioClient and MinioAsyncClient with {@code 10} to satisfy doclint
- Fix RETRYABLE_HTTP_CODES closing paren indentation to match codebase
style (ImmutableSet.of() in Http.java and Signer.java)
- Add testRandomAccessFileBodyRetried integration test: verifies that
both the failed and retried PUT requests carry identical body content
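The fileOffset fix in the first bullet can be sketched as below: capture the file pointer when the body is created, and seek back to it before every write attempt, so a checksum pass (or a failed first attempt) that advanced the pointer cannot corrupt the payload. The names here are illustrative, not the SDK's actual Http.Body / toRequestBody() API:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;

// Sketch of the fileOffset fix: capture the file pointer at construction and
// seek back to it before every write, so a checksum pass (or a failed first
// attempt) that advanced the pointer cannot corrupt the payload. Names are
// illustrative, not the SDK's Http.Body / toRequestBody() API.
public class SeekableBodySketch {
  private final RandomAccessFile file;
  private final long length;
  private final long fileOffset; // captured once, when the body is created

  SeekableBodySketch(RandomAccessFile file, long length) throws IOException {
    this.file = file;
    this.length = length;
    this.fileOffset = file.getFilePointer();
  }

  // Replayable: every attempt rewinds to the captured offset first.
  void writeTo(OutputStream sink) throws IOException {
    file.seek(fileOffset);
    byte[] buffer = new byte[8192];
    long remaining = length;
    while (remaining > 0) {
      int read = file.read(buffer, 0, (int) Math.min(buffer.length, remaining));
      if (read < 0) break;
      sink.write(buffer, 0, read);
      remaining -= read;
    }
  }

  public static void main(String[] args) throws IOException {
    java.io.File f = java.io.File.createTempFile("body", ".bin");
    f.deleteOnExit();
    try (java.io.FileOutputStream out = new java.io.FileOutputStream(f)) {
      out.write("payload".getBytes(java.nio.charset.StandardCharsets.UTF_8));
    }
    try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
      SeekableBodySketch body = new SeekableBodySketch(raf, raf.length());
      java.io.ByteArrayOutputStream first = new java.io.ByteArrayOutputStream();
      body.writeTo(first); // first attempt leaves the pointer at EOF
      java.io.ByteArrayOutputStream retry = new java.io.ByteArrayOutputStream();
      body.writeTo(retry); // the retry still sends the identical payload
      System.out.println(java.util.Arrays.equals(first.toByteArray(), retry.toByteArray()));
    }
  }
}
```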
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
…he.remove

The no-retry tests for 404/403 were asserting only assertNotNull, which would pass for any exception, including NPEs. Stronger assertions now verify the ErrorResponseException type and the S3 error code/HTTP status, and the retry-exhausted test verifies InvalidResponseException with status 500.

While adding those assertions, the tests revealed a pre-existing NPE: regionCache.remove(s3request.bucket()) was called unconditionally when the S3 error code was NoSuchBucket/RetryHead, but bucket is null for operations like listBuckets(). Added a null guard so the cache invalidation is skipped when there is no bucket in the request.
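The null guard described above amounts to something like the following sketch; regionCache and invalidate() are stand-ins for the SDK's internals, not its actual API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the null guard described above; regionCache and invalidate() are
// stand-ins for the SDK's internals, not its actual API.
public class RegionCacheGuardSketch {
  static final Map<String, String> regionCache = new ConcurrentHashMap<>();

  static void invalidate(String bucket) {
    // bucket is null for bucket-less operations such as listBuckets();
    // ConcurrentHashMap.remove(null) would throw NullPointerException.
    if (bucket != null) {
      regionCache.remove(bucket);
    }
  }

  public static void main(String[] args) {
    regionCache.put("my-bucket", "us-east-1");
    invalidate(null); // no-op instead of NPE
    invalidate("my-bucket"); // removes the stale entry
    System.out.println(regionCache.isEmpty());
  }
}
```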
…gs JUA

SpotBugs JUA flags Assert.assertTrue(e instanceof T) because the assertion message hides which type was actually thrown on failure. Using a typed catch block is stricter: any unexpected exception type propagates as a test error with the full stack trace, which is more informative than a failed boolean assertion.
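The difference between the two assertion styles can be shown with a self-contained sketch; there is no SpotBugs or JUnit dependency here, and all names are made up for the example:

```java
// Self-contained illustration of the typed-catch pattern; no JUnit/SpotBugs
// dependency, and all names here are made up for the example.
public class TypedCatchSketch {
  static void failingCall() throws java.io.IOException {
    throw new java.io.IOException("simulated transient failure");
  }

  // Weak pattern: a failed boolean assertion would only say "expected true",
  // hiding which exception type was actually thrown.
  static boolean weakCheck() {
    try {
      failingCall();
      return false;
    } catch (Exception e) {
      return e instanceof java.io.IOException;
    }
  }

  // Strict pattern: only the expected type is caught; any other exception
  // propagates to the test runner with its full stack trace.
  static String strictCheck() throws Exception {
    try {
      failingCall();
      throw new AssertionError("expected IOException");
    } catch (java.io.IOException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(weakCheck());
    System.out.println(strictCheck());
  }
}
```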
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
…strengthen test assertions

Replace the shared static Random instance with ThreadLocalRandom.current() at the backoff call site and for the fan-out name generator in MinioAsyncClient, eliminating shared-state CAS contention under concurrent retry load.

Disable OkHttp's retryOnConnectionFailure in doExecuteAsync so the SDK's executeWithRetry is the sole retry policy; previously OkHttp could silently add an extra attempt on stale-connection IOExceptions, causing more total attempts than maxRetries.

Strengthen the two remaining assertNotNull catch blocks in RetryTest (testMaxRetriesOneDisablesRetry, testSetMaxRetriesPostConstruction) to typed InvalidResponseException catches with responseCode() == 500 assertions, matching the pattern applied to the other tests in the previous commit.
Reorder ThreadLocalRandom import to match alphabetical sort order and collapse the two-line OkHttpClient builder call to one line, satisfying the project's spotlessJavaCheck formatter requirements.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
If ForkJoinPool.commonPool() rejects the retry task (e.g. during JVM shutdown), the runAsync future completes exceptionally but retryFuture was never completed, causing the overall request future to hang. Wire an .exceptionally() handler on the runAsync future so any dispatch failure propagates into retryFuture immediately.
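The hang scenario and its fix can be sketched roughly as follows. This is a simplified stand-in with hypothetical names; the real loop also performs the credential fetch and request build described earlier:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ForkJoinPool;

// Sketch of the fix described above, with hypothetical names: if dispatching
// the retry work fails (e.g. pool rejection during JVM shutdown), the failure
// must be propagated into retryFuture instead of leaving it incomplete.
public class RetryDispatchSketch {
  static CompletableFuture<String> dispatchRetry(Runnable retryWork) {
    CompletableFuture<String> retryFuture = new CompletableFuture<>();
    CompletableFuture.runAsync(
            () -> {
              retryWork.run(); // stands in for credential fetch + request build
              retryFuture.complete("ok");
            },
            ForkJoinPool.commonPool())
        .exceptionally(
            t -> {
              // Without this handler, a rejected or failed dispatch would leave
              // retryFuture incomplete and the caller's future hanging forever.
              retryFuture.completeExceptionally(t);
              return null;
            });
    return retryFuture;
  }

  public static void main(String[] args) {
    System.out.println(dispatchRetry(() -> {}).join());
  }
}
```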
The feature can be simplified by
diff --git a/api/src/main/java/io/minio/Http.java b/api/src/main/java/io/minio/Http.java
index 5fa9bec2..23ef9378 100644
--- a/api/src/main/java/io/minio/Http.java
+++ b/api/src/main/java/io/minio/Http.java
@@ -52,6 +52,7 @@ import java.util.Locale;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
+import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.stream.Collectors;
@@ -65,9 +66,12 @@ import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;
+import okhttp3.Interceptor;
import okhttp3.MediaType;
import okhttp3.OkHttpClient;
import okhttp3.Protocol;
+import okhttp3.Request;
+import okhttp3.Response;
import okio.BufferedSink;
import okio.Okio;
@@ -608,6 +612,54 @@ public class Http {
client, System.getenv("SSL_CERT_FILE"), System.getenv("SSL_CERT_DIR"));
}
+ public static class StatusRetryInterceptor implements Interceptor {
+ private final int maxRetries;
+ private final Set<Integer> retryStatusCodes;
+ private final long delayMs; // This acts as our "base" delay (e.g., 100ms)
+
+ public StatusRetryInterceptor(int maxRetries, Set<Integer> retryStatusCodes, long delayMs) {
+ this.maxRetries = maxRetries;
+ this.retryStatusCodes = retryStatusCodes;
+ this.delayMs = delayMs;
+ }
+
+ @Override
+ public Response intercept(Chain chain) throws IOException {
+ okhttp3.Request request = chain.request();
+ okhttp3.Response response = chain.proceed(request);
+
+ int tryCount = 0;
+
+ while (retryStatusCodes.contains(response.code()) && tryCount < maxRetries) {
+ tryCount++;
+
+ // Critical step: close the response body to avoid socket leaks
+ response.close();
+
+ if (delayMs > 0) {
+ // 1. Calculate exponential backoff limit: base * 2^tryCount
+ // This doubles the maximum potential wait time with each retry
+ long maxBackoffLimit = delayMs * (1L << tryCount); // Using bit-shift for 2^tryCount
+
+ // 2. Apply random jitter: pick a random number between 0 and maxBackoffLimit
+ long jitteredDelay = ThreadLocalRandom.current().nextLong(0, maxBackoffLimit);
+
+ try {
+ Thread.sleep(jitteredDelay);
+ } catch (InterruptedException e) {
+ Thread.currentThread().interrupt();
+ throw new IOException("Retry interrupted", e);
+ }
+ }
+
+ // Execute the request again
+ response = chain.proceed(request);
+ }
+
+ return response;
+ }
+ }
+
/**
* Creates new HTTP client with default timeout with additional TLS certificates from
* SSL_CERT_FILE and SSL_CERT_DIR environment variables if present.
@@ -620,6 +672,19 @@ public class Http {
.writeTimeout(DEFAULT_TIMEOUT, TimeUnit.MILLISECONDS)
.readTimeout(DEFAULT_TIMEOUT, TimeUnit.MILLISECONDS)
.protocols(Arrays.asList(Protocol.HTTP_1_1))
+ .addInterceptor(
+ new StatusRetryInterceptor(
+ 5,
+ ImmutableSet.of(
+ 408, // Request Timeout
+ 429, // Too Many Requests
+ 499, // Client Closed Request (nginx)
+ 500, // Internal Server Error
+ 502, // Bad Gateway
+ 503, // Service Unavailable
+ 504, // Gateway Timeout
+ 520), // Cloudflare unknown error
+ 1000))
.build();
try {
return enableExternalCertificatesFromEnv(client);
Thank you for the suggestion — an OkHttp interceptor is an appealing simplification for the HTTP-status case. The reason the current implementation uses a CompletableFuture-based wrapper instead is that the interceptor approach has three gaps that would be regressions for this PR:
1. S3 XML error codes are not visible at the interceptor layer.
S3 errors like SlowDown, ThrottlingException, InternalError, ExpiredToken etc. arrive as HTTP 200 responses with an XML body — the status code is always 200. An interceptor keyed on response.code() never sees these; they must be classified after parsing the body in throwMinioException(), which is downstream of OkHttp.
2. Thread.sleep() inside an interceptor blocks an OkHttp dispatcher thread.
OkHttp's async dispatcher has a bounded thread pool (default 64). Sleeping inside Callback.onResponse / an interceptor for exponential backoff (up to ~31 s at 5 retries × 1 000 ms base) would hold a dispatcher thread for the full delay, starving other concurrent requests. The current design schedules the next attempt on RETRY_SCHEDULER and releases the dispatcher thread immediately.
3. Non-seekable request bodies cannot be replayed via chain.proceed(request).
RequestBody.writeTo() is called once; OkHttp does not buffer or reset it. For multipart uploads or streamed PUT bodies, a second chain.proceed would send an empty or corrupt body. The current wrapper gates retry on the body's seekability before re-issuing the request.
If you'd prefer to move the HTTP-status subset into an interceptor and keep only S3-code + IOException handling in the wrapper, I'm happy to split it that way — but I wanted to flag the above before making that change.
- S3 XML error codes are not visible at the interceptor layer.

  No S3 error is returned with a 2xx HTTP status code. If so, it is a bug on the server side. As @harshavardhana mentioned previously, we do not need to use the S3 error code; we need to use the HTTP status code instead.

- Thread.sleep() inside an interceptor blocks an OkHttp dispatcher thread.

  This is fine. If the server is busy, why send new requests and wait in another thread? A new request to the same server is blocked if maxRequestsPerHost = 5 is hit due to the retry sleep. Instead of 1 s, we need to start the sleep at 100 ms or less.

- Non-seekable request bodies cannot be replayed via chain.proceed(request).

  You don't need to. We have a retryable custom RequestBody. It works for ByteBuffer, RandomAccessFile and byte[].
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.
Description
Implements SDK-level automatic retry with full-jitter exponential backoff, matching the behaviour specified in minio-go's retry-mechanism spec. All requests through `BaseS3Client.executeAsync()` now automatically retry on retryable S3 error codes, retryable HTTP status codes, and transient network/IO errors, up to `maxRetries` attempts (default 10). The retry policy lives in a new package-private `Retry` class; the execution loop is wired into `BaseS3Client.executeWithRetry()` using a daemon `ScheduledExecutorService` for Java 8–compatible backoff delays.

Motivation and Context

The SDK had no SDK-level retry capability. Transient failures — server-side throttling (`SlowDown`, `InternalError`), gateway errors (502, 503, 504), request timeouts (408), and connection resets — propagated directly to callers, requiring every consumer to implement its own retry loop. The Go client has had this capability since its inception; this PR closes the gap for the Java SDK. It does not address #1700, since error code `BucketNotEmpty` (which maps to HTTP code 409) is not natively retryable in minio-go.

Retryable S3 codes (matching minio-go `retryableS3Codes`): `RequestError`, `RequestTimeout`, `Throttling`, `ThrottlingException`, `RequestLimitExceeded`, `RequestThrottled`, `InternalError`, `ExpiredToken`, `ExpiredTokenException`, `SlowDown`, `SlowDownWrite`, `SlowDownRead`.

Retryable HTTP status codes (matching minio-go `retryableHTTPStatusCodes`): 408, 429, 499, 500, 502, 503, 504, 520.

Retryable IO errors: all `IOException` except TLS handshake failures and HTTP/HTTPS protocol mismatch — matching minio-go's `isRequestErrorRetryable`.

Seekability gate: non-seekable bodies (`okhttp3.RequestBody`) get exactly one attempt; seekable bodies (`byte[]`, `ByteBuffer`, `RandomAccessFile`) retry up to `maxRetries` — matching minio-go's body-seeker check.

Backoff: `rand(0, min(1 s, 200 ms × 2ⁿ))` before the nth retry (full jitter, capped at 1 s), matching minio-go's `DefaultRetryUnit`/`DefaultRetryCap`/`MaxJitter` defaults.

RetryHead guard: the `"RetryHead"` S3 error code is explicitly excluded from retry so `executeHeadAsync` region-redirect logic is unaffected.

How to test this PR?

Run `./gradlew :api:check`. The 43 JUnit 4 tests in `api/src/test/java/io/minio/RetryTest.java` cover: `IOException` classification (retryable vs. SSL/protocol-mismatch), `isRetryable()` dispatch for all four exception types, `maxRetries=1` disabling retry, `setMaxRetries` post-construction, (`byte[]`), and the `InvalidResponseException.responseCode()` accessor.

OR build a local mint against the code in this branch, then run mint, e.g.

Types of changes

Checklist:
- (commit-id or PR # here)