Multiple heavy load workflows running - how to ensure HA of Argo Setup #14978
Replies: 1 comment
The documentation states that the workflow controller can officially run in a HA setup. BUT. Technically, it's not real HA, more like a hot-spare setup. I never tried it, because the meantime-to-recovery of a failed workflow controller is neglectable and (if you ask me) does not justify having additional workflow controllers running, just for a potentially rare case of an application failure. You could separate your workflows through namespaces, and let an Argo workflows controller run in each of them. The downside would be that you will have a seperate UI (argo-server) for each namespace as well. However, as long as the number of workflows tends to be low, you shouldn't have a problem ... usually. If your workflows get too big, you might wanna take a look at offloading. By default the workflow controller will query the etcd every 10 seconds, to handle the workflows (checking pod status, starting new pods, ...), so that Kubernetes gets something to do. For comparison, I'm currently running over 600 workflows per hour (each with an average of 4-5 pods) with a single workflow controller for the entire cluster. My only bottleneck is going to be the etcd when I keep scaling this up. The workflow controller seems to be able to handle this just fine. The documentation also states that Argo should be able handle "hundreds of thousands of smaller workflows daily", but unfortunately the authors forgot to describe how the Kubernetes cluster has to look like for that. Be advised: from what I have experienced, Argo seems to expect a healthy Kubernetes cluster, and is (at least in some places) not exactly fault-tolerant towards Kubernetes problems. If there are problems with Kubernetes, Argo might start to behave funny. So, you better have a tight monitoring of your cluster (CRI, etcd, Kubelets, ... all of it), to see what's going on and to identify potential bottlenecks, especially the etcd (since Argo uses that to store the workflow states). For example: I'm seeing this in my test setup currently on a regular bases, because our IT manged to turn the test hardware into a potato. I have several workflows failing every day, because of etcd-leader-change "errors". For Argo the etcd operation just fails, and it doesn't try again or anything, and often even leaves the affected workflows in an invalid state (the workflow is failed, but Argo displays one or more pods still as "running", even if they are not), so Argo doesn't even let you "resume" such a workflow, because it thinks some pods are still active. |
Hi,
I'm new to this space. In our project, we are using an Argo CronWorkflow. It generates a workflow which runs for about 30 minutes and creates 45 pods in total: 2 steps create 20 pods each for parallel runs, and 5 other low-processing steps make up the rest. This is an ETL process, so the workflow is triggered every 40 minutes, meaning it is effectively running continuously. My data is all saved in Azure Blob, so I'm good on the data side. (Each pod requests memory: "2Gi" and cpu: "2000m", and autoscaling in Kubernetes is working fine as of now.)
I also have a statistics-generation workflow which consumes the output of the above-mentioned ETL workflow. The stats flow is also expected to run every 15 to 30 minutes, generating 20 pods minimum.
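For context, here is a rough sketch of the kind of CronWorkflow I mean. The name, image, schedule and fan-out below are placeholders, not our real manifest:

```yaml
# Hypothetical sketch of the ETL CronWorkflow described above.
# Names, image and schedule are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: etl-pipeline
spec:
  schedule: "*/40 * * * *"          # placeholder; fires at :00 and :40 each hour
  concurrencyPolicy: Forbid         # don't start a new run while one is still active
  workflowSpec:
    entrypoint: etl
    podGC:
      strategy: OnPodSuccess        # GC finished pods to reduce pressure on etcd
    ttlStrategy:
      secondsAfterCompletion: 3600  # clean up completed workflows after 1 hour
    templates:
      - name: etl
        steps:
          - - name: extract
              template: worker
              withSequence:
                count: "20"         # fan out to 20 parallel pods
      - name: worker
        container:
          image: my-etl-image:latest
          resources:
            requests:
              memory: 2Gi
              cpu: "2000m"
```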
Right now I have Argo installed in a single namespace. Can a single installation handle the above 2 runs? I also potentially expect one or 2 more similarly heavy runs to handle backfilling of delayed data.
How can I manage high availability of the Argo setup? Other than strategies like GC of pods and TTL, should I shard this setup by installing Argo in different namespaces, or can this single-namespace setup handle the above load?
Thanks,
Jaya