Robust Retry Logic for S3 Artifacts in Cross-Region Scenarios #14415
Unanswered
xiki-tempula
asked this question in
Q&A
Replies: 1 comment 1 reply
-
|
S3 artifact download does support retries, but it's a little bit opaque. The [EXECUTOR_RETRY_*](variables https://argo-workflows.readthedocs.io/en/latest/environment-variables/#executor) should control local retries when downloading artifacts in the event of transient errors. Do they help? |
Beta Was this translation helpful? Give feedback.
1 reply
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
-
Hi Argo Workflows Maintainers and Community,
I initially posted a similar query in the Hera project (argoproj-labs/hera#1370), but realized this question about S3 artifact handling might be better suited for the main Argo Workflows repository.
We have a workflow that utilizes Argo Workflows for orchestrating a computationally intensive task. A key step/pod of this workflow involves:
Our setup recently changed due to business requirements: the compute nodes running the workflow pods are in the US, while the S3 bucket storing the artifacts is now located in Europe. This geographical separation has led to increased network latency and instability, resulting in frequent S3 endpoint connection errors during the artifact download and upload phases.
We understand that Argo Workflows provides a pod-level retryStrategy. However, triggering a full pod retry (which re-runs the heavy computation) just because of a transient network error during the S3 artifact transfer is very inefficient and resource-intensive for our use case.
Our question is: Does Argo Workflows natively support, or are there plans to introduce, more granular retry logic specifically for artifact operations (particularly S3)? We are looking for a mechanism (e.g., configurable retries with backoff, circuit breakers) that can automatically retry just the S3 download or upload upon failure, without needing to restart the entire pod/container.
This would greatly improve the resilience and efficiency of workflows operating in cross-region or less reliable network environments.
Thanks for developing Argo Workflows and for any guidance or information you can provide on this feature!
Beta Was this translation helpful? Give feedback.
All reactions