Skip to content

feat: add automatic S3 request retry with exponential backoff#1701

Open
allanrogerr wants to merge 7 commits intominio:masterfrom
allanrogerr:feature/retry-mechanism
Open

feat: add automatic S3 request retry with exponential backoff#1701
allanrogerr wants to merge 7 commits intominio:masterfrom
allanrogerr:feature/retry-mechanism

Conversation

@allanrogerr
Copy link
Copy Markdown

@allanrogerr allanrogerr commented May 4, 2026

Description

Implements SDK-level automatic retry with full-jitter exponential backoff, matching the behaviour specified in minio-go's retry-mechanism spec. All requests through BaseS3Client.executeAsync() now automatically retry on retryable S3 error codes, retryable HTTP status codes, and transient network/IO errors up to maxRetries attempts (default 10). The retry policy lives in a new package-private Retry class; the execution loop is wired into BaseS3Client.executeWithRetry() using a daemon ScheduledExecutorService for Java 8–compatible backoff delays.

Motivation and Context

The SDK had no SDK-level retry capability. Transient failures — server-side throttling (SlowDown, InternalError), gateway errors (502, 503, 504), request timeouts (408), and connection resets — propagated directly to callers, requiring every consumer to implement its own retry loop. The Go client has had this capability since its inception; this PR closes the gap for the Java SDK. It does not address #1700 since error code BucketNotEmpty (which maps to http code 409) is not natively retriable in minio-go.
Retryable S3 codes (matching minio-go retryableS3Codes): RequestError, RequestTimeout, Throttling, ThrottlingException, RequestLimitExceeded, RequestThrottled, InternalError, ExpiredToken, ExpiredTokenException, SlowDown, SlowDownWrite, SlowDownRead.
Retryable HTTP status codes (matching minio-go retryableHTTPStatusCodes): 408, 429, 499, 500, 502, 503, 504, 520.
Retryable IO errors: all IOException except TLS handshake failures and HTTP/HTTPS protocol mismatch — matching minio-go's isRequestErrorRetryable.
Seekability gate: non-seekable bodies (okhttp3.RequestBody) get exactly one attempt; seekable bodies (byte[], ByteBuffer, RandomAccessFile) retry up to maxRetries — matching minio-go's body-seeker check.
Backoff: rand(0, min(1 s, 200 ms × 2ⁿ)) before the nth retry (full jitter, capped at 1 s), matching minio-go's DefaultRetryUnit/DefaultRetryCap/MaxJitter defaults.
RetryHead guard: the "RetryHead" S3 error code is explicitly excluded from retry so executeHeadAsync region-redirect logic is unaffected.

How to test this PR?

Run ./gradlew :api:check. The 43 JUnit 4 tests in api/src/test/java/io/minio/RetryTest.java cover:

  • all 12 retryable S3 codes,
  • all 8 retryable HTTP codes,
  • IOException classification (retryable vs. SSL/protocol-mismatch),
  • isRetryable() dispatch for all four exception types,
  • backoff bounds for attempts 0–5+, and
  • MockWebServer integration tests for
    • 500-then-success,
    • 503-then-success,
    • S3-InternalError-then-success,
    • S3-SlowDown-then-success,
    • 429/502/504-then-success,
    • no-retry-on-404/403,
    • exhausted-retries-throws-last-error,
    • maxRetries=1 disabling retry,
    • setMaxRetries post-construction,
    • three-attempt sequence,
    • seekable-body retry (PUT with byte[]),
    • path consistency across retries, and
  • the InvalidResponseException.responseCode() accessor.

OR build a local mint against the code in this branch the run mint e.g.

docker run --rm --network host -e SERVER_ENDPOINT="127.0.0.1:9000" -e ACCESS_KEY="minioadmin" -e SECRET_KEY="minioadmin" -e ENABLE_HTTPS=0 -e MINT_MODE="full" minio/mint:local minio-java
Running with
SERVER_ENDPOINT:      127.0.0.1:9000
ACCESS_KEY:           minioadmin
SECRET_KEY:           ***REDACTED***
ENABLE_HTTPS:         0
SERVER_REGION:        us-east-1
MINT_DATA_DIR:        /mint/data
MINT_MODE:            full
ENABLE_VIRTUAL_STYLE: 0
RUN_ON_FAIL:          0

To get logs, run 'docker cp :/mint/log /tmp/mint-logs'

(1/1) Running minio-java tests ... done in 46 seconds

All tests ran successfully

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Optimization (provides speedup with no functional changes)
  • Cleanup/Maintenance (no functional changes)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • Fixes a regression (If yes, please add commit-id or PR # here)
  • Unit tests added/updated
  • Internal documentation updated
  • Create a documentation update request here
  • No internode changes / version interoperability introduced

@allanrogerr allanrogerr added the test-framework Tests and test framework changes label May 4, 2026
@allanrogerr allanrogerr self-assigned this May 4, 2026
Implements SDK-level automatic retry matching the minio-go retry-mechanism
spec. All requests through BaseS3Client.executeAsync() now retry on
retryable S3 error codes (InternalError, SlowDown, RequestTimeout, etc.),
retryable HTTP status codes (408, 429, 499, 500, 502, 503, 504, 520), and
transient IO errors (connection reset, EOF, server closed idle connection).

Non-seekable request bodies (raw okhttp3.RequestBody) get a single attempt;
seekable bodies (byte[], ByteBuffer, RandomAccessFile) retry up to maxRetries
(default 10). Backoff is full-jitter exponential: rand(0, min(1s, 200ms*2^n))
before the nth retry, matching minio-go's DefaultRetryUnit/DefaultRetryCap.
The "RetryHead" S3 code is explicitly excluded so executeHeadAsync region
redirect logic is unaffected. maxRetries is configurable via builder and
setMaxRetries() post-construction.

Closes minio#1700.
@allanrogerr allanrogerr force-pushed the feature/retry-mechanism branch from 49fda52 to 9524c1f Compare May 4, 2026 11:50
@allanrogerr allanrogerr marked this pull request as ready for review May 4, 2026 13:37
@allanrogerr allanrogerr requested review from Copilot and rraulinio May 4, 2026 13:37
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces SDK-level automatic retry for S3 requests with full-jitter exponential backoff, aligning the Java SDK’s behavior with minio-go’s retry policy and making transient S3/HTTP/IO failures less likely to leak to callers.

Changes:

  • Added a new package-private Retry helper for retryability classification (S3 codes, HTTP status codes, IO exceptions) and jittered exponential backoff.
  • Wired a retry loop into BaseS3Client.executeAsync() with delayed retries via a daemon ScheduledExecutorService, plus a seekability-based single-attempt gate for non-replayable bodies.
  • Exposed configuration via maxRetries(...) on builders and setMaxRetries(...), and added an InvalidResponseException.responseCode() accessor plus a comprehensive new RetryTest suite.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
api/src/main/java/io/minio/BaseS3Client.java Adds execute-with-retry loop, backoff scheduling, and max-retry configuration.
api/src/main/java/io/minio/Retry.java Defines retryable error sets and backoff computation.
api/src/main/java/io/minio/MinioAsyncClient.java Adds maxRetries(...) builder option and applies it to the built client.
api/src/main/java/io/minio/MinioClient.java Exposes setMaxRetries(...) and maxRetries(...) on the synchronous client/builder.
api/src/main/java/io/minio/Http.java Adds Http.S3Request.body() accessor used for retry seekability gating.
api/src/main/java/io/minio/errors/InvalidResponseException.java Stores/returns the triggering HTTP response code for retry classification.
api/src/test/java/io/minio/RetryTest.java Adds unit + MockWebServer integration tests covering retry classification and behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread api/src/main/java/io/minio/BaseS3Client.java
Comment thread api/src/main/java/io/minio/BaseS3Client.java
Comment thread api/src/test/java/io/minio/RetryTest.java
Comment thread api/src/main/java/io/minio/Retry.java Outdated
Comment thread api/src/main/java/io/minio/MinioClient.java
Comment thread api/src/main/java/io/minio/MinioAsyncClient.java
- Http.Body now captures the initial file pointer (fileOffset) when
  created for RandomAccessFile bodies; toRequestBody() seeks back to
  that offset before each attempt, fixing both first-attempt and retry
  payload corruption when checksum computation has advanced the pointer
- BaseS3Client.createBody() explicitly restores the file pointer after
  computing checksums so Http.Body captures the correct starting offset
- Retry scheduler now dispatches actual retry work (credential fetch +
  request build) to ForkJoinPool.commonPool(), freeing the single
  scheduler thread from head-of-line blocking by slow providers
- Replace package-private {@link Retry#MAX_RETRY} Javadoc references in
  MinioClient and MinioAsyncClient with {@code 10} to satisfy doclint
- Fix RETRYABLE_HTTP_CODES closing paren indentation to match codebase
  style (ImmutableSet.of() in Http.java and Signer.java)
- Add testRandomAccessFileBodyRetried integration test: verifies that
  both the failed and retried PUT requests carry identical body content
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread api/src/main/java/io/minio/BaseS3Client.java
Comment thread api/src/main/java/io/minio/Http.java
Comment thread api/src/test/java/io/minio/RetryTest.java
Comment thread api/src/test/java/io/minio/RetryTest.java Outdated
Comment thread api/src/test/java/io/minio/RetryTest.java
…he.remove

The no-retry tests for 404/403 were asserting only assertNotNull, which
would pass for any exception including NPEs.  Stronger assertions now
verify ErrorResponseException type and S3 error code/HTTP status, and
the retry-exhausted test verifies InvalidResponseException with status
500.

While adding those assertions, the tests revealed a pre-existing NPE:
regionCache.remove(s3request.bucket()) was called unconditionally when
the S3 error code was NoSuchBucket/RetryHead, but bucket is null for
operations like listBuckets().  Added a null guard so the cache
invalidation is skipped when there is no bucket in the request.
@allanrogerr allanrogerr requested a review from Copilot May 5, 2026 19:12
@allanrogerr allanrogerr review requested due to automatic review settings May 5, 2026 19:13
…gs JUA

SpotBugs JUA flags Assert.assertTrue(e instanceof T) because the message
hides what type was actually thrown on failure.  Using a typed catch block
is stricter: any unexpected exception type propagates as a test error with
the full stack trace, which is more informative than a failed boolean
assertion.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread api/src/main/java/io/minio/BaseS3Client.java Outdated
Comment thread api/src/main/java/io/minio/BaseS3Client.java
Comment thread api/src/test/java/io/minio/RetryTest.java
Comment thread api/src/test/java/io/minio/RetryTest.java
…strengthen test assertions

Replace the shared static Random instance with ThreadLocalRandom.current()
at the backoff call site and for the fan-out name generator in
MinioAsyncClient — eliminating shared-state CAS contention under concurrent
retry load.

Disable OkHttp's retryOnConnectionFailure in doExecuteAsync so the SDK's
executeWithRetry is the sole retry policy; previously OkHttp could silently
add an extra attempt on stale-connection IOExceptions, causing more total
attempts than maxRetries.

Strengthen the two remaining assertNotNull catch blocks in RetryTest
(testMaxRetriesOneDisablesRetry, testSetMaxRetriesPostConstruction) to typed
InvalidResponseException catches with responseCode()==500 assertions,
matching the pattern applied to the other tests in the previous commit.
@allanrogerr allanrogerr requested a review from Copilot May 5, 2026 20:27
@allanrogerr allanrogerr review requested due to automatic review settings May 5, 2026 20:32
Reorder ThreadLocalRandom import to match alphabetical sort order and
collapse the two-line OkHttpClient builder call to one line, satisfying
the project's spotlessJavaCheck formatter requirements.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread api/src/main/java/io/minio/BaseS3Client.java Outdated
If ForkJoinPool.commonPool() rejects the retry task (e.g. during JVM
shutdown), the runAsync future completes exceptionally but retryFuture
was never completed, causing the overall request future to hang. Wire
an .exceptionally() handler on the runAsync future so any dispatch
failure propagates into retryFuture immediately.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The feature can be simplified by

diff --git a/api/src/main/java/io/minio/Http.java b/api/src/main/java/io/minio/Http.java
index 5fa9bec2..23ef9378 100644
--- a/api/src/main/java/io/minio/Http.java
+++ b/api/src/main/java/io/minio/Http.java
@@ -52,6 +52,7 @@ import java.util.Locale;
 import java.util.Map;
 import java.util.Objects;
 import java.util.Set;
+import java.util.concurrent.ThreadLocalRandom;
 import java.util.concurrent.TimeUnit;
 import java.util.regex.Matcher;
 import java.util.stream.Collectors;
@@ -65,9 +66,12 @@ import javax.net.ssl.SSLSocketFactory;
 import javax.net.ssl.TrustManager;
 import javax.net.ssl.TrustManagerFactory;
 import javax.net.ssl.X509TrustManager;
+import okhttp3.Interceptor;
 import okhttp3.MediaType;
 import okhttp3.OkHttpClient;
 import okhttp3.Protocol;
+import okhttp3.Request;
+import okhttp3.Response;
 import okio.BufferedSink;
 import okio.Okio;
 
@@ -608,6 +612,54 @@ public class Http {
         client, System.getenv("SSL_CERT_FILE"), System.getenv("SSL_CERT_DIR"));
   }
 
+  public static class StatusRetryInterceptor implements Interceptor {
+    private final int maxRetries;
+    private final Set<Integer> retryStatusCodes;
+    private final long delayMs; // This acts as our "base" delay (e.g., 100ms)
+
+    public StatusRetryInterceptor(int maxRetries, Set<Integer> retryStatusCodes, long delayMs) {
+      this.maxRetries = maxRetries;
+      this.retryStatusCodes = retryStatusCodes;
+      this.delayMs = delayMs;
+    }
+
+    @Override
+    public Response intercept(Chain chain) throws IOException {
+      okhttp3.Request request = chain.request();
+      okhttp3.Response response = chain.proceed(request);
+
+      int tryCount = 0;
+
+      while (retryStatusCodes.contains(response.code()) && tryCount < maxRetries) {
+        tryCount++;
+
+        // Critical step: close the response body to avoid socket leaks
+        response.close();
+
+        if (delayMs > 0) {
+          // 1. Calculate exponential backoff limit: base * 2^tryCount
+          // This doubles the maximum potential wait time with each retry
+          long maxBackoffLimit = delayMs * (1L << tryCount); // Using bit-shift for 2^tryCount
+
+          // 2. Apply random jitter: pick a random number between 0 and maxBackoffLimit
+          long jitteredDelay = ThreadLocalRandom.current().nextLong(0, maxBackoffLimit);
+
+          try {
+            Thread.sleep(jitteredDelay);
+          } catch (InterruptedException e) {
+            Thread.currentThread().interrupt();
+            throw new IOException("Retry interrupted", e);
+          }
+        }
+
+        // Execute the request again
+        response = chain.proceed(request);
+      }
+
+      return response;
+    }
+  }
+
   /**
    * Creates new HTTP client with default timeout with additional TLS certificates from
    * SSL_CERT_FILE and SSL_CERT_DIR environment variables if present.
@@ -620,6 +672,19 @@ public class Http {
             .writeTimeout(DEFAULT_TIMEOUT, TimeUnit.MILLISECONDS)
             .readTimeout(DEFAULT_TIMEOUT, TimeUnit.MILLISECONDS)
             .protocols(Arrays.asList(Protocol.HTTP_1_1))
+            .addInterceptor(
+                new StatusRetryInterceptor(
+                    5,
+                    ImmutableSet.of(
+                        408, // Request Timeout
+                        429, // Too Many Requests
+                        499, // Client Closed Request (nginx)
+                        500, // Internal Server Error
+                        502, // Bad Gateway
+                        503, // Service Unavailable
+                        504, // Gateway Timeout
+                        520), // Cloudflare unknown error
+                    1000))
             .build();
     try {
       return enableExternalCertificatesFromEnv(client);

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion — an OkHttp interceptor is an appealing simplification for the HTTP-status case. The reason the current implementation uses a CompletableFuture-based wrapper instead is that the interceptor approach has three gaps that would be regressions for this PR:

1. S3 XML error codes are not visible at the interceptor layer.
S3 errors like SlowDown, ThrottlingException, InternalError, ExpiredToken etc. arrive as HTTP 200 responses with an XML body — the status code is always 200. An interceptor keyed on response.code() never sees these; they must be classified after parsing the body in throwMinioException(), which is downstream of OkHttp.

2. Thread.sleep() inside an interceptor blocks an OkHttp dispatcher thread.
OkHttp's async dispatcher has a bounded thread pool (default 64). Sleeping inside Callback.onResponse / an interceptor for exponential backoff (up to ~31 s at 5 retries × 1 000 ms base) would hold a dispatcher thread for the full delay, starving other concurrent requests. The current design schedules the next attempt on RETRY_SCHEDULER and releases the dispatcher thread immediately.

3. Non-seekable request bodies cannot be replayed via chain.proceed(request).
RequestBody.writeTo() is called once; OkHttp does not buffer or reset it. For multipart uploads or streamed PUT bodies, a second chain.proceed would send an empty or corrupt body. The current wrapper gates retry on the body's seekability before re-issuing the request.

If you'd prefer to move the HTTP-status subset into an interceptor and keep only S3-code + IOException handling in the wrapper, I'm happy to split it that way — but I wanted to flag the above before making that change.

Copy link
Copy Markdown
Member

@balamurugana balamurugana May 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. S3 XML error codes are not visible at the interceptor layer.

No S3 error is returned with 2xx HTTP status code. If so, it is a bug at server side. As @harshavardhana mentioned previously we do not need to use S3 error code instead we need to use HTTP status code.

  1. Thread.sleep() inside an interceptor blocks an OkHttp dispatcher thread.

This is fine. If the server is busy, why send new requests and wait in another thread. New request to the same server is blocked if maxRequestsPerHost = 5 is hit due to retry sleep. Instead of 1s, we need to start the sleep with 100ms or lesser.

  1. Non-seekable request bodies cannot be replayed via chain.proceed(request).

You don't need to. We have retry-able custom RequestBody. It works for ByteBuffer, RandomAccessFile and byte[].

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@allanrogerr allanrogerr requested a review from balamurugana May 5, 2026 22:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test-framework Tests and test framework changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants