Robust Retry Logic for S3 Artifacts in Cross-Region Scenarios #14415

xiki-tempula · 2025-04-22T09:09:43Z

xiki-tempula
Apr 22, 2025

Hi Argo Workflows Maintainers and Community,

I initially posted a similar query in the Hera project (argoproj-labs/hera#1370), but realized this question about S3 artifact handling might be better suited for the main Argo Workflows repository.

We have a workflow that utilizes Argo Workflows for orchestrating a computationally intensive task. A key step/pod of this workflow involves:

Downloads an input artifact from an S3 bucket.
Performs significant computation using the input artifact.
Uploads the resulting output artifact back to an S3 bucket.

Our setup recently changed due to business requirements: the compute nodes running the workflow pods are in the US, while the S3 bucket storing the artifacts is now located in Europe. This geographical separation has led to increased network latency and instability, resulting in frequent S3 endpoint connection errors during the artifact download and upload phases.

We understand that Argo Workflows provides a pod-level retryStrategy. However, triggering a full pod retry (which re-runs the heavy computation) just because of a transient network error during the S3 artifact transfer is very inefficient and resource-intensive for our use case.

Our question is: Does Argo Workflows natively support, or are there plans to introduce, more granular retry logic specifically for artifact operations (particularly S3)? We are looking for a mechanism (e.g., configurable retries with backoff, circuit breakers) that can automatically retry just the S3 download or upload upon failure, without needing to restart the entire pod/container.

This would greatly improve the resilience and efficiency of workflows operating in cross-region or less reliable network environments.

Thanks for developing Argo Workflows and for any guidance or information you can provide on this feature!

Joibel · 2025-04-23T17:49:15Z

Joibel
Apr 23, 2025
Maintainer

S3 artifact download does support retries, but it's a little bit opaque. The [EXECUTOR_RETRY_*](variables https://argo-workflows.readthedocs.io/en/latest/environment-variables/#executor) should control local retries when downloading artifacts in the event of transient errors. Do they help?

1 reply

xiki-tempula Apr 24, 2025
Author

Thanks for the answer, does [EXECUTOR_RETRY_*] only retry the upload or it will run the whole pod again?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Robust Retry Logic for S3 Artifacts in Cross-Region Scenarios #14415

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Robust Retry Logic for S3 Artifacts in Cross-Region Scenarios #14415

Uh oh!

xiki-tempula Apr 22, 2025

Replies: 1 comment · 1 reply

Uh oh!

Joibel Apr 23, 2025 Maintainer

Uh oh!

xiki-tempula Apr 24, 2025 Author

xiki-tempula
Apr 22, 2025

Replies: 1 comment 1 reply

Joibel
Apr 23, 2025
Maintainer

xiki-tempula Apr 24, 2025
Author