
feat(transport): add HTTP retry with exponential backoff#1520

Open
jpnurmi wants to merge 94 commits into master from jpnurmi/feat/http-retry

Conversation

@jpnurmi
Collaborator

@jpnurmi jpnurmi commented Feb 13, 2026

Add HTTP retry with exponential backoff for network failures, modeled after Crashpad's upload retry behavior.

Note

Adds an explicit 15s connect timeout to both curl and WinHTTP, matching the value used by Crashpad. This reduces the time the transport worker is blocked on unreachable hosts, previously ~75s on curl (OS TCP SYN retry limit) and 60s on WinHTTP.

Failed envelopes are stored as <db>/cache/<ts>-<n>-<uuid>.envelope and retried on startup after a 100ms throttle, then with exponential backoff (15min, 30min, 1h, 2h, 4h, 8h). When retries are exhausted and offline caching is enabled, envelopes are stored as <db>/cache/<uuid>.envelope instead of being discarded.

```mermaid
flowchart TD
    startup --> R{retry?}
    R -->|yes| throttle
    R -->|no| C{cache?}
    throttle -. 100ms .-> resend
    resend -->|success| C
    resend -->|fail| C2[&lt;db&gt;/cache/<br/>&lt;ts&gt;-&lt;n&gt;-&lt;uuid&gt;.envelope]
    C2 --> backoff
    backoff -. 2ⁿ×15min .-> resend
    C -->|yes| CACHE[&lt;db&gt;/cache/<br/>&lt;uuid&gt;.envelope]
    C -->|no| discard
```
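The schedule above doubles from a 15-minute base. A minimal sketch of how such a delay can be computed (constant and function names here are hypothetical, not the PR's actual identifiers):

```c
#include <stdint.h>

/* Hypothetical names for illustration; the PR's actual constants may
 * differ. Per the description, retries back off as 15min, 30min, 1h,
 * 2h, 4h, 8h — i.e. 2^n x 15min for n in [0, 6). */
#define RETRY_MAX_ATTEMPTS 6
#define RETRY_BASE_DELAY_MS (15 * 60 * 1000)

/* Returns the delay before retry attempt `n` (0-based), or 0 when
 * retries are exhausted. */
static uint64_t retry_backoff_ms(int n)
{
    if (n < 0 || n >= RETRY_MAX_ATTEMPTS) {
        return 0;
    }
    return (uint64_t)RETRY_BASE_DELAY_MS << n;
}
```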

Builds upon:

See also:

@github-actions

github-actions bot commented Feb 13, 2026

Messages
📖 Do not forget to update Sentry-docs with your feature once the pull request gets approved.

Generated by 🚫 dangerJS against 696127b

@jpnurmi jpnurmi force-pushed the jpnurmi/feat/http-retry branch 2 times, most recently from b083a57 to a264f66 Compare February 13, 2026 17:47
@jpnurmi
Collaborator Author

jpnurmi commented Feb 16, 2026

@sentry review

@jpnurmi
Collaborator Author

jpnurmi commented Feb 16, 2026

@cursor review

@jpnurmi
Collaborator Author

jpnurmi commented Feb 16, 2026

@sentry review

@jpnurmi
Collaborator Author

jpnurmi commented Feb 16, 2026

@cursor review

@jpnurmi
Collaborator Author

jpnurmi commented Feb 16, 2026

@cursor review

@jpnurmi
Collaborator Author

jpnurmi commented Feb 16, 2026

@cursor review

@cursor cursor bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

@jpnurmi jpnurmi force-pushed the jpnurmi/feat/http-retry branch 2 times, most recently from 243c880 to d6aa792 Compare February 16, 2026 19:41
@jpnurmi
Collaborator Author

jpnurmi commented Feb 16, 2026

@cursor review

@jpnurmi jpnurmi force-pushed the jpnurmi/feat/http-retry branch 2 times, most recently from fbbffb2 to abd5815 Compare February 17, 2026 08:34
@jpnurmi
Collaborator Author

jpnurmi commented Feb 17, 2026

@cursor review

@jpnurmi jpnurmi force-pushed the jpnurmi/feat/http-retry branch from f030cf9 to 22e3fc4 Compare February 17, 2026 10:24
@jpnurmi
Collaborator Author

jpnurmi commented Feb 17, 2026

@sentry review

@jpnurmi
Collaborator Author

jpnurmi commented Feb 17, 2026

@cursor review

@jpnurmi jpnurmi force-pushed the jpnurmi/feat/http-retry branch from 22e3fc4 to ffce486 Compare February 17, 2026 10:49
@jpnurmi
Collaborator Author

jpnurmi commented Feb 17, 2026

@cursor review

…_result

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

jpnurmi and others added 2 commits February 25, 2026 18:01
The retry_count >= 0 branch passed the full source filename to %.36s,
which would grab the timestamp prefix instead of the UUID for retry-
format filenames. Extract the cache name (last 45 chars) before either
branch so both use the correct UUID.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
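The fix described above relies on the cache name being a fixed-length suffix. A sketch of the idea (helper name hypothetical): retry-format filenames look like `<ts>-<n>-<uuid>.envelope`, plain cache files like `<uuid>.envelope`; the UUID (36 chars) plus `.envelope` (9 chars) is always the last 45 characters, so taking that suffix yields the UUID part for either format instead of letting `%.36s` read the timestamp prefix.

```c
#include <string.h>

/* Illustrative sketch; names and constants are assumptions, not the
 * PR's actual code. */
#define CACHE_NAME_LEN 45 /* 36-char UUID + 9-char ".envelope" */

static const char *cache_name(const char *filename)
{
    size_t len = strlen(filename);
    /* Plain-format names are exactly CACHE_NAME_LEN long; retry-format
     * names are longer, so strip their "<ts>-<n>-" prefix. */
    return len <= CACHE_NAME_LEN ? filename
                                 : filename + (len - CACHE_NAME_LEN);
}
```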
Use a mutex (sealed_lock) to serialize the SEALED check in
retry_enqueue with the SEALED set in retry_dump_queue. Store the
envelope address as uintptr_t so retry_dump_cb can skip envelopes
already written by retry_enqueue without risking accidental
dereferencing. The address is safe to compare because the task holds
a ref that keeps the envelope alive during foreach_matching.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
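The compare-by-address pattern described above can be sketched as follows (names hypothetical): the sealed envelope's address is stored as an integer, so the dump callback can test identity without ever dereferencing it. The caller is assumed to hold `sealed_lock` around both the store and the check.

```c
#include <stdint.h>

/* Illustrative sketch, not the PR's actual code. 0 means nothing is
 * sealed; otherwise the value is the address of a live envelope (kept
 * alive by a reference the enqueue task holds). */
static uintptr_t g_sealed_envelope;

static int should_skip(const void *envelope)
{
    /* Pure integer comparison: safe even though we never touch the
     * pointed-to object. */
    return g_sealed_envelope != 0
        && (uintptr_t)envelope == g_sealed_envelope;
}
```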
…move_cache

Move sentry__retry_parse_filename and sentry__retry_make_path into
sentry_database as sentry__parse_cache_filename and
sentry__run_make_cache_path. This consolidates cache filename format
knowledge in one module and replaces the fragile `src_len > 45`
heuristic in sentry__run_move_cache with proper parsing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable autofix in the Cursor dashboard.

jpnurmi and others added 2 commits February 26, 2026 09:43
The SENTRY_RETRY_ATTEMPTS constant was bumped from 5 to 6 in 81d0f68
but the public API documentation was not updated to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use a SENTRY_POLL_SHUTDOWN sentinel so that a concurrent
retry_poll_task cannot resubmit the delayed poll that shutdown
just dropped. The CAS(SCHEDULED→IDLE) in retry_poll_task is a
no-op when scheduled is SHUTDOWN, and the subsequent
CAS(IDLE→SCHEDULED) also fails.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
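The sentinel pattern described above can be sketched with C11 atomics (state and function names are hypothetical): once shutdown stores the SHUTDOWN value, both compare-and-swap operations in the poll task fail, so a concurrent task can neither reset the state nor resubmit the delayed poll that shutdown dropped.

```c
#include <stdatomic.h>

/* Illustrative sketch, not the PR's actual code. */
enum { POLL_IDLE, POLL_SCHEDULED, POLL_SHUTDOWN };
static atomic_int g_poll_state = POLL_IDLE;

static void poll_shutdown(void)
{
    atomic_store(&g_poll_state, POLL_SHUTDOWN);
}

/* Returns 1 if a new delayed poll may be submitted, 0 otherwise. */
static int poll_task_try_reschedule(void)
{
    int expected = POLL_SCHEDULED;
    /* CAS(SCHEDULED -> IDLE): a no-op when state is SHUTDOWN. */
    atomic_compare_exchange_strong(&g_poll_state, &expected, POLL_IDLE);

    expected = POLL_IDLE;
    /* CAS(IDLE -> SCHEDULED): also fails under SHUTDOWN, so the
     * dropped poll is never resubmitted. */
    return atomic_compare_exchange_strong(
        &g_poll_state, &expected, POLL_SCHEDULED);
}
```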
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable autofix in the Cursor dashboard.

@jpnurmi
Copy link
Collaborator Author

jpnurmi commented Feb 26, 2026

Sorry for the noise. I brought in some improvements from #1542, and the clankers found a tricky race in the shutdown sequence.

jpnurmi and others added 2 commits February 26, 2026 14:40
On Windows, WinHTTP TCP connect to an unreachable host takes ~2s,
which can exceed the shutdown timeout. Add a cancel_client callback
that closes just the WinHTTP request handle, unblocking the worker
thread so it can process the failure and shut down cleanly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… and worker

Use InterlockedExchangePointer to atomically swap client->request to
NULL in cancel, shutdown, and worker exit cleanup. Whichever thread wins
the swap closes the handle; the loser gets NULL and skips.

The worker also snapshots client->request into a local variable right
after WinHttpOpenRequest and uses the local for all subsequent API calls,
so it never reads NULL from the struct if cancel fires mid-function.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
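The swap-to-close pattern described above can be sketched portably with a C11 atomic exchange, which behaves like `InterlockedExchangePointer` on Windows (names here are hypothetical): whichever thread swaps the handle out first owns closing it; every other thread reads NULL and skips.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative sketch, not the PR's actual code. In the real change
 * this would hold a WinHTTP request HINTERNET. */
static _Atomic(void *) g_request;

/* Returns the handle the caller must close, or NULL if another thread
 * (cancel, shutdown, or worker exit) already took it. */
static void *take_request_handle(void)
{
    return atomic_exchange(&g_request, NULL);
}
```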
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable autofix in the Cursor dashboard.

jpnurmi and others added 2 commits February 26, 2026 17:37
…ncel

Replace the unconditional cancel_client call before bgworker_shutdown
with an on_timeout callback that fires only when the shutdown timeout
expires. This avoids aborting in-flight requests that would have
completed within the timeout, while still unblocking the worker when
it's stuck (e.g. WinHTTP connect to unreachable host).

The callback closes session/connect handles to cancel pending WinHTTP
operations, then the shutdown loop falls through to the !running check
which joins the worker thread. This ensures handle_result runs and the
retry counter is properly bumped on disk.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nd_task

The local `HINTERNET request = client->request` snapshot was only needed
for the cancel_client approach. Since shutdown_client only fires at the
timeout point, mid-function reads of client->request are safe. Keep only
the InterlockedExchangePointer in the exit block to prevent double-close.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Move sentry__atomic_store(&bgw->running, 0) from before the on_timeout
callback to the else branch (detach path). This lets the worker's
shutdown_task set running=0 naturally after finishing in-flight work,
making the dump_queue safety-net reachable if the callback fails to
unblock the worker within another 250ms cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

…iness

Tests that directly call sentry__retry_send were racing with the
transport's background retry worker polling the same cache directory.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

…e data loss

After retry_enqueue writes an envelope and stores its address in
sealed_envelope, the envelope is freed when the bgworker task
completes. If a subsequent envelope is allocated at the same address
and is still pending during shutdown, retry_dump_cb would incorrectly
skip it, losing the envelope. Clear sealed_envelope after the first
match so later iterations cannot false-match a reused address.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
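The clear-after-match fix described above can be sketched as follows (names hypothetical): after the first address match, the stored value is zeroed so a later envelope that happens to be allocated at the same reused address cannot be skipped by mistake.

```c
#include <stdint.h>

/* Illustrative sketch, not the PR's actual code. Holds the address of
 * the envelope retry_enqueue already wrote, or 0. */
static uintptr_t g_sealed_envelope;

static int dump_should_skip(const void *envelope)
{
    if (g_sealed_envelope != 0
        && (uintptr_t)envelope == g_sealed_envelope) {
        g_sealed_envelope = 0; /* consume the match: at most one skip */
        return 1;
    }
    return 0;
}
```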
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

