From f99253c9bcee851f814d64d92554af6fbdcef470 Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Tue, 12 May 2026 11:42:07 -0700 Subject: [PATCH 1/8] Explaining different outage types --- docs/cloud/rto-rpo.mdx | 165 +++++++++++++++++++++++++++++++++-------- 1 file changed, 134 insertions(+), 31 deletions(-) diff --git a/docs/cloud/rto-rpo.mdx b/docs/cloud/rto-rpo.mdx index ecb12e35c8..f65ec3700e 100644 --- a/docs/cloud/rto-rpo.mdx +++ b/docs/cloud/rto-rpo.mdx @@ -1,8 +1,8 @@ --- id: rpo-rto -title: RPO and RTO -sidebar_label: RPO and RTO -description: Understand the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) in Temporal Cloud. +title: Outages and Recovery Objectives (RTO / RPO) +sidebar_label: Outages and Recovery Objectives (RTO / RPO) +description: Understand the types of outages Temporal Cloud is designed to handle, and the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) for each. slug: /cloud/rpo-rto toc_max_heading_level: 4 keywords: @@ -11,6 +11,7 @@ keywords: - RTO - Recovery Point Objective - Recovery Time Objective + - outages tags: - Recovery Point Objective - Recovery Time Objective @@ -23,30 +24,114 @@ When a cloud outage disrupts a Namespace, Temporal Cloud takes measures to maint To help users plan for keeping critical Workflows available during a cloud outage, Temporal Cloud publishes goals for the recovery time and recovery point for each kind of outage. These goals are called the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). These objectives are complementary to Temporal Cloud's [Service Level Agreement (SLA)](/cloud/sla). -To achieve the lowest RPO and RTO, Temporal Cloud offers [High Availability](/cloud/high-availability) features that keep Workflows operational with minimal downtime. When High Availability is enabled on a Namespace, the user chooses a region to place a "replica" that will take over in the event of a failure. 
The location of the replica determines the type of replication used and the type of outages that can be handled. Multi-region Replication is when the active and replica are in different regions on the same cloud (e.g., AWS us-east-1 and AWS us-west-2). Multi-cloud Replication is when the active and replica are in different clouds (e.g., AWS and GCP). Same-region Replication is when the active and replica are in the same region. Temporal always places the active and replica in different [cells](/cloud/overview#cell-based-infrastructure). - -As Workflows progress in the active region, history events are asynchronously replicated to the replica. -Because replication is asynchronous, High Availability does not impact the latency or throughput of Workflow Executions in the active region. -If an outage hits the active region or cell, Temporal Cloud will fail over to the replica so that existing Workflow Executions will continue to run and new Workflow Executions can be started. - -The Recovery Point Objective and Recovery Time Objective for Temporal Cloud depend on the type of outage and which [High Availability](/cloud/high-availability) feature your Namespace has enabled. Temporal Cloud can only set an RPO and RTO for cases where it has the ability to mitigate the outage. Therefore, the below RPOs and RTOs apply to Namespaces that have the corresponding type of replication and have enabled Temporal-initiated failovers, which comes enabled by default. - -1. **Availability zone outage**: - 1. _Applicable Namespaces:_ All Namespaces - 2. _Goals:_ Zero RPO and near-zero RTO - 3. _More details:_ Historically, these have been the most common type of outage in the cloud. Temporal Cloud replicates every Namespace across three availability zones. The failure of a single availability zone is handled automatically by Temporal Cloud behind the scenes, with no potential for data loss, and little-to-no observable downtime to the end user. -2. **Cell outage**: - 1. 
_Applicable Namespaces:_ Namespaces with Same-region Replication, Multi-region Replication, or Multi-cloud Replication - 2. _Goals:_ 1-minute RPO and 20-minute RTO - 3. _More details:_ Temporal Cloud runs on a [cell architecture](/cloud/sla). Each cell contains the software and services necessary to host a Namespace. While unlikely, it's possible for a cell to experience a disruption due to uncaught software bugs or sub-component failures (e.g., an outage in the underlying database). -3. **Regional outage**: - 1. _Applicable Namespaces:_ Namespaces with Multi-region Replication or Multi-cloud Replication - 2. _Goals:_ 1-minute RPO and 20-minute RTO - 3. _More details:_ On [rare occasions](https://temporal.io/blog/how-devs-kept-running-during-the-aws-us-east-1-oct-20-2025), an entire region within a cloud provider will be degraded. Since Namespaces depend on the cloud provider's infrastructure, Temporal Cloud is not immune to these outages. -4. **Cloud-wide outage**: - 1. _Applicable Namespaces:_ Namespaces with Multi-cloud Replication - 2. _Goals:_ 1-minute RPO and 20-minute RTO - 3. _More details:_ An entire cloud provider has an outage across most or all regions. Since cloud providers strive to keep cloud regions de-coupled, these are the rarest outages of all. Still, they [have happened](https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW) in the past. +## Types of outages Temporal Cloud designs around + +Temporal Cloud is engineered to withstand four broad categories of cloud outage. The categories are listed below in order of how commonly they occur in the real world. For each category, Temporal has experienced the outage in production, and the corresponding Temporal Cloud features have successfully mitigated the impact for real customer Namespaces. 
+ +### Availability Zone outage + +An [Availability Zone](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-availability-zones) (AZ) is akin to an isolated datacenter managed by a cloud hyperscaler, with independent power, networking, and cooling infrastructure. Each cloud region contains multiple AZs, and an individual AZ can fail due to events such as hardware failure, power loss, or a localized network partition. + +Historically, AZ outages are the most common type of outage in the cloud, and Temporal Cloud has weathered many of them transparently to its customers. + +**Temporal Cloud feature to mitigate this outage:** Every Namespace is automatically spread across at least three Availability Zones, and any Namespace can handle a single AZ failure without disruption to end-user Temporal operations. [High Availability](/cloud/high-availability) features are _not_ required to keep Temporal Cloud operations running through an AZ outage. + +If two AZs fail simultaneously, Temporal Cloud treats the event as a [Cloud Region outage](#cloud-region-outage). In that case, Namespaces in the region may be impacted, including those using Same-region Replication (in Preview). + +:::note + +When an AZ fails, Temporal may also trigger a failover on Namespaces that have High Availability enabled, as a precaution in case the outage scope expands. You can opt out of this behavior by disabling Temporal-managed failovers on the Namespace. + +::: + +#### RTO and RPO + +When using Temporal Cloud (no additional features required): + +- **Near-zero RTO.** When a single AZ fails, the remaining two AZs continue serving requests without a failover, so end users see little to no disruption. +- **Zero RPO.** Writes to Workflow state are synchronously replicated across all three AZs before being acknowledged back to the Client, so an AZ failure cannot cause data loss. 
+ +### Cell outage + +Temporal Cloud runs on a [cell architecture](https://docs.aws.amazon.com/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/what-is-a-cell-based-architecture.html). Each cell contains the software and services necessary to host a Namespace, and components within a cell are distributed across at least three Availability Zones. Cells provide a strong unit of isolation: a problem inside one cell does not propagate to other cells. + +**Example causes:** failure of a sub-component within the cell (for example, an individual database becoming unavailable) or a software bug introduced in a new deploy to the cell. + +**Temporal Cloud feature to mitigate this outage:** [Multi-region Replication](/cloud/high-availability) (GA) and [Multi-cloud Replication](/cloud/high-availability) (GA) replicate a Namespace into another cell in a different region or different cloud provider. [Same-region Replication](/cloud/high-availability) (Preview) replicates a Namespace into another cell in the same region. With any of these features enabled, an outage that disrupts a single cell can be mitigated by failing the Namespace over to its replica. + +Cell-level disruptions occur from time to time, and Temporal's replication and failover tooling has restored affected Namespaces in real-world incidents. + +#### RTO and RPO + +When using Same-region Replication, Multi-region Replication, or Multi-cloud Replication for Temporal-managed failover: + +- **RTO under 20 minutes.** Temporal detects the disruption and fails the Namespace over to its replica cell. +- **RPO under 1 minute.** Asynchronous replication keeps the replica close to the active cell. + +Even though the RPO target is under 1 minute, data is virtually never "lost" thanks to Temporal's built-in Recovery and Conflict Resolution process, which reconciles state between the active and replica when a failover occurs. 
+ +### Cloud Region outage + +A cloud region as a whole can become degraded, with effects that span beyond any single cell or Availability Zone. + +**Example causes:** failure of a key cloud service in the region (for example, the cloud provider's DNS resolver) causing cascading failures, two or more Availability Zones failing simultaneously, or network partitions between the region and other regions. + +**Temporal Cloud feature to mitigate this outage:** [Multi-region Replication](/cloud/high-availability) and [Multi-cloud Replication](/cloud/high-availability) place the replica outside the affected region, so a Namespace can fail over and continue serving Workflows. Same-region Replication does not protect against a Cloud Region outage, since the replica resides in the same region. + +Regional outages are less common than cell or AZ outages, but they happen. During the [AWS us-east-1 incident on October 20, 2025](https://temporal.io/blog/how-devs-kept-running-during-the-aws-us-east-1-oct-20-2025), Temporal Cloud's regional failover kept customer Namespaces running. + +#### RTO and RPO + +When using Multi-region Replication or Multi-cloud Replication for Temporal-managed failover: + +- **RTO under 20 minutes.** Temporal detects the regional disruption and fails the Namespace over to its replica in another region. +- **RPO under 1 minute.** Asynchronous replication keeps the replica close to the active region. + +Even though the RPO target is under 1 minute, data is virtually never "lost" thanks to Temporal's built-in Recovery and Conflict Resolution process, which reconciles state between the active and replica when a failover occurs. + +### Cloud-wide outage + +On rare occasions, an issue affects most or all regions of a single cloud provider at once. + +**Example causes:** a software bug rolled out to every region of a cloud provider that triggers cascading failures across the provider's infrastructure. 
+ +**Temporal Cloud feature to mitigate this outage:** [Multi-cloud Replication](/cloud/high-availability) places the replica in a different cloud provider entirely, so the Namespace can fail over even when an entire cloud provider goes down. + +Cloud-wide outages are the rarest category, but they [have occurred](https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW). Multi-cloud Replication is designed to keep Namespaces running through such events. + +#### RTO and RPO + +When using Multi-cloud Replication for Temporal-managed failover: + +- **RTO under 20 minutes.** Temporal detects the cloud-wide disruption and fails the Namespace over to its replica in a different cloud provider. +- **RPO under 1 minute.** Asynchronous replication keeps the replica close to the active region, even across cloud providers. + +Even though the RPO target is under 1 minute, data is virtually never "lost" thanks to Temporal's built-in Recovery and Conflict Resolution process, which reconciles state between the active and replica when a failover occurs. + +## How High Availability replication works + +To achieve the lowest RPO and RTO, Temporal Cloud offers [High Availability](/cloud/high-availability) features that keep Workflows operational with minimal downtime. When High Availability is enabled on a Namespace, the user chooses a region to place a "replica" that will take over in the event of a failure. The location of the replica determines the type of replication used and the categories of outage it can handle: + +- **Multi-region Replication** places the active and replica in different regions on the same cloud (for example, AWS us-east-1 and AWS us-west-2). +- **Multi-cloud Replication** places the active and replica in different cloud providers (for example, AWS and GCP). +- **Same-region Replication** (Preview) places the active and replica in the same region. 
+ +Temporal always places the active and replica in different [cells](/cloud/overview#cell-based-infrastructure). + +As Workflows progress in the active region, history events are asynchronously replicated to the replica. Because replication is asynchronous, High Availability does not impact the latency or throughput of Workflow Executions in the active region. If an outage hits the active region or cell, Temporal Cloud will fail over to the replica so that existing Workflow Executions will continue to run and new Workflow Executions can be started. + +## Explaining Temporal Cloud's RTO and RPO + +The Recovery Point Objective and Recovery Time Objective for Temporal Cloud depend on the type of outage and which [High Availability](/cloud/high-availability) feature your Namespace has enabled. Temporal Cloud can only set an RPO and RTO for cases where it has the ability to mitigate the outage. Therefore, the published RPOs and RTOs apply to Namespaces that have the corresponding type of replication and have enabled Temporal-initiated failovers, which are enabled by default. + +### Summary table + +| Outage type | Applicable Namespaces | RPO | RTO | +| ---------------------- | ------------------------------------------------------------------------------ | -------------- | ---------------- | +| Availability Zone outage | All Namespaces | Zero | Near-zero | +| Cell outage | Namespaces with Same-region, Multi-region, or Multi-cloud Replication | Under 1 minute | Under 20 minutes | +| Cloud Region outage | Namespaces with Multi-region or Multi-cloud Replication | Under 1 minute | Under 20 minutes | +| Cloud-wide outage | Namespaces with Multi-cloud Replication | Under 1 minute | Under 20 minutes | Notes: @@ -64,8 +149,17 @@ Temporal highly recommends keeping Temporal-initiated failovers enabled. When Te - All Namespaces are backed up every 4 hours. 
If an outage causes data loss on a Namespace that was not protected by High Availability, then Temporal will use the backup to restore as much data as feasible. +- Temporal has internal goals and measurements for Recovery Time and Recovery Point, but does not publish the achieved Recovery Time and Recovery Point for each incident. -## Minimizing the Recovery Point +### Explaining the RPO + +:::note Temporal's Recovery Point is different from a traditional Recovery Point + +In a traditional database, data within the Recovery Point window may be permanently lost during a failover. In Temporal Cloud, that data is not lost. Cloud data stores are engineered for extreme durability (commonly 99.999999999%, or "11 nines"), so any data acknowledged by Temporal Cloud is durably persisted. After the outage resolves, Temporal's Recovery and Conflict Resolution process automatically syncs that data back into the Namespace. + +The Recovery Point Objective therefore reflects the maximum data that may be temporarily unavailable in the replica at the moment of failover, not the maximum data that could be permanently lost. + +::: Temporal has put extensive work into tools and processes that minimize the recovery point and achieve its RPO for Temporal-initiated failovers, including: @@ -83,7 +177,15 @@ Temporal recommends monitoring the replication lag and alerting should it rise t ::: -## Minimizing the Recovery Time +### Explaining the RTO + +The Recovery Time for a given incident is measured from the moment the incident begins to cause abnormal Namespace operation — for example, when unavailability or error rates rise above an acceptable level — to the moment the Namespace is restored to full functionality. + +For most incidents, the vast majority of the Recovery Time is spent detecting the incident, determining the affected boundary (a single cell, a region, or an entire cloud), and deciding to fail Namespaces over to their replicas. 
The actual time to complete the failover is usually a very small piece of the Recovery Time. + +This Recovery Time covers only the Temporal Namespace. Your application's overall Recovery Time also depends on having enough healthy Workers that can reach the Namespace and process Workflows. Maintaining sufficient Worker capacity that can reach the replica region (or replica cloud) during a failover is your responsibility. + +#### How Temporal achieves a low Recovery Time Temporal has put extensive work into tools and processes that minimize the recovery time and achieve its RTO for Temporal-initiated failovers, including: @@ -97,6 +199,8 @@ Temporal has put extensive work into tools and processes that minimize the recov - Expert engineers on-call 24/7 monitoring Temporal Cloud Namespaces and ready to assist should an outage occur. +#### How users can achieve a lower Recovery Time + To achieve the lowest possible recovery times, Temporal recommends that you: - Keep Temporal-initiated failovers enabled on your Namespace (the default) @@ -112,8 +216,7 @@ Users can trigger manual failovers on their Namespaces even if Temporal-initiate - Even if you have robust tooling to detect an outage and trigger a failover, leaving Temporal-initiated failovers enabled provides a "safety net" in case your automation misses an outage. It also gives Temporal leeway to preemptively fail over your Namespace if we detect that it may be disrupted soon, e.g., by a rolling failure that has impacted other Namespaces but not yours, yet. - -## Understanding Temporal's RTO vs. SLA +#### Comparing RTO and SLA Temporal has both a Recovery Time Objective (RTO) and a Service Level Agreement (SLA). They serve complementary purposes and apply in different situations. 
From 412b8defb87f0d4621860786c928899a8ad0cc38 Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Thu, 14 May 2026 12:09:22 -0700 Subject: [PATCH 2/8] Added blast radius to outage types --- docs/cloud/rto-rpo.mdx | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/docs/cloud/rto-rpo.mdx b/docs/cloud/rto-rpo.mdx index f65ec3700e..d11fc3c086 100644 --- a/docs/cloud/rto-rpo.mdx +++ b/docs/cloud/rto-rpo.mdx @@ -34,6 +34,8 @@ An [Availability Zone](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using Historically, AZ outages are the most common type of outage in the cloud, and Temporal Cloud has weathered many of them transparently to its customers. +**Blast Radius:** A single Availability Zone within a single cloud region. Because every Namespace's components are spread across at least three AZs, the blast radius to Temporal Cloud users is typically zero — Namespaces stay operational with little to no downtime. However, the outage will take out any Workers the user is running in that AZ. We recommend spreading Workers across multiple AZs to mitigate this. + **Temporal Cloud feature to mitigate this outage:** Every Namespace is automatically spread across at least three Availability Zones, and any Namespace can handle a single AZ failure without disruption to end-user Temporal operations. [High Availability](/cloud/high-availability) features are _not_ required to keep Temporal Cloud operations running through an AZ outage. If two AZs fail simultaneously, Temporal Cloud treats the event as a [Cloud Region outage](#cloud-region-outage). In that case, Namespaces in the region may be impacted, including those using Same-region Replication (in Preview). @@ -57,6 +59,8 @@ Temporal Cloud runs on a [cell architecture](https://docs.aws.amazon.com/wellarc **Example causes:** failure of a sub-component within the cell (for example, an individual database becoming unavailable) or a software bug introduced in a new deploy to the cell. 
+**Blast Radius:** One cell--and the Namespaces within that cell--within a single region. Even though your Workers will remain healthy, they will not be able to process Workflows because the Namespace is down. + **Temporal Cloud feature to mitigate this outage:** [Multi-region Replication](/cloud/high-availability) (GA) and [Multi-cloud Replication](/cloud/high-availability) (GA) replicate a Namespace into another cell in a different region or different cloud provider. [Same-region Replication](/cloud/high-availability) (Preview) replicates a Namespace into another cell in the same region. With any of these features enabled, an outage that disrupts a single cell can be mitigated by failing the Namespace over to its replica. Cell-level disruptions occur from time to time, and Temporal's replication and failover tooling has restored affected Namespaces in real-world incidents. @@ -76,6 +80,8 @@ A cloud region as a whole can become degraded, with effects that span beyond any **Example causes:** failure of a key cloud service in the region (for example, the cloud provider's DNS resolver) causing cascading failures, two or more Availability Zones failing simultaneously, or network partitions between the region and other regions. +**Blast Radius:** All Namespaces and Workers within a single cloud region are potentially affected. Namespaces and Workers in other regions of the same cloud — and in other clouds — are unaffected. + **Temporal Cloud feature to mitigate this outage:** [Multi-region Replication](/cloud/high-availability) and [Multi-cloud Replication](/cloud/high-availability) place the replica outside the affected region, so a Namespace can fail over and continue serving Workflows. Same-region Replication does not protect against a Cloud Region outage, since the replica resides in the same region. Regional outages are less common than cell or AZ outages, but they happen. 
During the [AWS us-east-1 incident on October 20, 2025](https://temporal.io/blog/how-devs-kept-running-during-the-aws-us-east-1-oct-20-2025), Temporal Cloud's regional failover kept customer Namespaces running. @@ -95,6 +101,8 @@ On rare occasions, an issue affects most or all regions of a single cloud provid **Example causes:** a software bug rolled out to every region of a cloud provider that triggers cascading failures across the provider's infrastructure. +**Blast Radius:** Most or all regions of a single cloud provider. Every Namespace and every Worker hosted in that cloud is potentially affected. + **Temporal Cloud feature to mitigate this outage:** [Multi-cloud Replication](/cloud/high-availability) places the replica in a different cloud provider entirely, so the Namespace can fail over even when an entire cloud provider goes down. Cloud-wide outages are the rarest category, but they [have occurred](https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW). Multi-cloud Replication is designed to keep Namespaces running through such events. From c6cf12b6b70d8f13db280cef950fa17b679b9b69 Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Thu, 14 May 2026 14:31:53 -0700 Subject: [PATCH 3/8] added SLA calculations to outage types --- docs/cloud/rto-rpo.mdx | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/docs/cloud/rto-rpo.mdx b/docs/cloud/rto-rpo.mdx index d11fc3c086..de76b4b4f3 100644 --- a/docs/cloud/rto-rpo.mdx +++ b/docs/cloud/rto-rpo.mdx @@ -38,6 +38,8 @@ Historically, AZ outages are the most common type of outage in the cloud, and Te **Temporal Cloud feature to mitigate this outage:** Every Namespace is automatically spread across at least three Availability Zones, and any Namespace can handle a single AZ failure without disruption to end-user Temporal operations. [High Availability](/cloud/high-availability) features are _not_ required to keep Temporal Cloud operations running through an AZ outage. 
+**SLA inclusion:** Included in the [SLA](/cloud/sla) calculation. Any errors during an AZ outage count toward SLA credits, since AZ resilience is within Temporal's responsibility. + If two AZs fail simultaneously, Temporal Cloud treats the event as a [Cloud Region outage](#cloud-region-outage). In that case, Namespaces in the region may be impacted, including those using Same-region Replication (in Preview). :::note @@ -63,6 +65,8 @@ Temporal Cloud runs on a [cell architecture](https://docs.aws.amazon.com/wellarc **Temporal Cloud feature to mitigate this outage:** [Multi-region Replication](/cloud/high-availability) (GA) and [Multi-cloud Replication](/cloud/high-availability) (GA) replicate a Namespace into another cell in a different region or different cloud provider. [Same-region Replication](/cloud/high-availability) (Preview) replicates a Namespace into another cell in the same region. With any of these features enabled, an outage that disrupts a single cell can be mitigated by failing the Namespace over to its replica. +**SLA inclusion:** Included in the [SLA](/cloud/sla) calculation. Any errors during a cell outage count toward SLA credits, since mitigating cell outages is within Temporal's responsibility. + Cell-level disruptions occur from time to time, and Temporal's replication and failover tooling has restored affected Namespaces in real-world incidents. #### RTO and RPO @@ -84,6 +88,10 @@ A cloud region as a whole can become degraded, with effects that span beyond any **Temporal Cloud feature to mitigate this outage:** [Multi-region Replication](/cloud/high-availability) and [Multi-cloud Replication](/cloud/high-availability) place the replica outside the affected region, so a Namespace can fail over and continue serving Workflows. Same-region Replication does not protect against a Cloud Region outage, since the replica resides in the same region. 
+**SLA inclusion:** Included in the [SLA](/cloud/sla) calculation only for Namespaces that have Multi-region Replication or Multi-cloud Replication enabled with Temporal-managed failovers — in those cases, Temporal can mitigate the outage. For Namespaces without these features, a Cloud Region outage is excluded from the SLA calculation, as it is beyond Temporal's control to mitigate. + +If two or more regions in the same cloud provider experience an outage simultaneously, Temporal Cloud treats the event as a [Cloud-wide outage](#cloud-wide-outage). + Regional outages are less common than cell or AZ outages, but they happen. During the [AWS us-east-1 incident on October 20, 2025](https://temporal.io/blog/how-devs-kept-running-during-the-aws-us-east-1-oct-20-2025), Temporal Cloud's regional failover kept customer Namespaces running. #### RTO and RPO @@ -97,14 +105,16 @@ Even though the RPO target is under 1 minute, data is virtually never "lost" tha ### Cloud-wide outage -On rare occasions, an issue affects most or all regions of a single cloud provider at once. +On rare occasions, an issue affects two or more regions of a single cloud provider at once. Any simultaneous outage of two or more regions in the same cloud provider is treated as a cloud-wide outage. -**Example causes:** a software bug rolled out to every region of a cloud provider that triggers cascading failures across the provider's infrastructure. +**Example causes:** a software bug rolled out to every region of a cloud provider that triggers cascading failures across the provider's infrastructure, or two or more regions in the same cloud experiencing independent regional outages at the same time. **Blast Radius:** Most or all regions of a single cloud provider. Every Namespace and every Worker hosted in that cloud is potentially affected. 
**Temporal Cloud feature to mitigate this outage:** [Multi-cloud Replication](/cloud/high-availability) places the replica in a different cloud provider entirely, so the Namespace can fail over even when an entire cloud provider goes down. +**SLA inclusion:** Included in the [SLA](/cloud/sla) calculation only for Namespaces that have Multi-cloud Replication enabled with Temporal-managed failovers — in those cases, Temporal can mitigate the outage. For Namespaces without this feature, a cloud-wide outage is excluded from the SLA calculation, as it is beyond Temporal's control to mitigate. + Cloud-wide outages are the rarest category, but they [have occurred](https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW). Multi-cloud Replication is designed to keep Namespaces running through such events. #### RTO and RPO From 9e288857f6fb001da5cd80d6e91dd384f28aa2f4 Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Fri, 15 May 2026 13:52:25 -0700 Subject: [PATCH 4/8] Apply suggestions from code review Co-authored-by: Lenny Chen <55669665+lennessyy@users.noreply.github.com> Co-authored-by: Kevin Woo <3469532+kevinawoo@users.noreply.github.com> Co-authored-by: Luke Knepper --- docs/cloud/rto-rpo.mdx | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/cloud/rto-rpo.mdx b/docs/cloud/rto-rpo.mdx index de76b4b4f3..c3420443d0 100644 --- a/docs/cloud/rto-rpo.mdx +++ b/docs/cloud/rto-rpo.mdx @@ -22,17 +22,17 @@ import { ToolTipTerm } from '@site/src/components'; When a cloud outage disrupts a Namespace, Temporal Cloud takes measures to maintain the Namespace's availability and data durability. The time it takes to recover from the outage is called the "recovery time." The amount of data (event histories) lost is called the "recovery point." A durable system should have a low recovery time and recovery point. 
-To help users plan for keeping critical Workflows available during a cloud outage, Temporal Cloud publishes goals for the recovery time and recovery point for each kind of outage. These goals are called the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). These objectives are complementary to Temporal Cloud's [Service Level Agreement (SLA)](/cloud/sla). +To help you plan for keeping critical Workflows available during a cloud outage, Temporal Cloud publishes goals for the recovery time and recovery point for each kind of outage. These goals are called the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). These objectives are complementary to Temporal Cloud's [Service Level Agreement (SLA)](/cloud/sla). ## Types of outages Temporal Cloud designs around -Temporal Cloud is engineered to withstand four broad categories of cloud outage. The categories are listed below in order of how commonly they occur in the real world. For each category, Temporal has experienced the outage in production, and the corresponding Temporal Cloud features have successfully mitigated the impact for real customer Namespaces. +Temporal Cloud is engineered to withstand four broad categories of cloud outages. The categories are listed below in order of how commonly they occur in the real world. For each category, Temporal has experienced the outage in production, and the corresponding Temporal Cloud features have successfully mitigated the impact for real customer Namespaces. ### Availability Zone outage An [Availability Zone](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-availability-zones) (AZ) is akin to an isolated datacenter managed by a cloud hyperscaler, with independent power, networking, and cooling infrastructure. Each cloud region contains multiple AZs, and an individual AZ can fail due to events such as hardware failure, power loss, or a localized network partition. 
-Historically, AZ outages are the most common type of outage in the cloud, and Temporal Cloud has weathered many of them transparently to its customers. +Historically, AZ outages are the most common type of outage in the cloud, and Temporal Cloud has weathered many of them transparently to our customers. **Blast Radius:** A single Availability Zone within a single cloud region. Because every Namespace's components are spread across at least three AZs, the blast radius to Temporal Cloud users is typically zero — Namespaces stay operational with little to no downtime. However, the outage will take out any Workers the user is running in that AZ. We recommend spreading Workers across multiple AZs to mitigate this. @@ -40,11 +40,11 @@ Historically, AZ outages are the most common type of outage in the cloud, and Te **SLA inclusion:** Included in the [SLA](/cloud/sla) calculation. Any errors during an AZ outage count toward SLA credits, since AZ resilience is within Temporal's responsibility. -If two AZs fail simultaneously, Temporal Cloud treats the event as a [Cloud Region outage](#cloud-region-outage). In that case, Namespaces in the region may be impacted, including those using Same-region Replication (in Preview). +If two AZs fail simultaneously, Temporal Cloud treats the event as a [Cloud Region outage](#cloud-region-outage). In that case, Namespaces in the region may be impacted, including those using [Same-region Replication](/cloud/high-availability#same-region-replication). :::note -When an AZ fails, Temporal may also trigger a failover on Namespaces that have High Availability enabled, as a precaution in case the outage scope expands. You can opt out of this behavior by disabling Temporal-managed failovers on the Namespace. +When an AZ fails, Temporal may also trigger a failover on Namespaces that have High Availability enabled, as a precaution in case the outage scope expands. 
You can opt out of this behavior by [disabling Temporal-managed failovers](cloud/high-availability/failovers#disabling-temporal-initiated) on the Namespace. ::: @@ -63,7 +63,7 @@ Temporal Cloud runs on a [cell architecture](https://docs.aws.amazon.com/wellarc **Blast Radius:** One cell--and the Namespaces within that cell--within a single region. Even though your Workers will remain healthy, they will not be able to process Workflows because the Namespace is down. -**Temporal Cloud feature to mitigate this outage:** [Multi-region Replication](/cloud/high-availability) (GA) and [Multi-cloud Replication](/cloud/high-availability) (GA) replicate a Namespace into another cell in a different region or different cloud provider. [Same-region Replication](/cloud/high-availability) (Preview) replicates a Namespace into another cell in the same region. With any of these features enabled, an outage that disrupts a single cell can be mitigated by failing the Namespace over to its replica. +**Temporal Cloud feature to mitigate this outage:** [Multi-region Replication](/cloud/high-availability) (GA) and [Multi-cloud Replication](/cloud/high-availability) (GA) replicate a Namespace into another cell in a different region or different cloud provider. [Same-region Replication](/cloud/high-availability) (Preview) replicates a Namespace into another cell within the same region. When any of these features enabled for a namespace, an outage that disrupts a single cell can be mitigated by failing the Namespace over to its replica. **SLA inclusion:** Included in the [SLA](/cloud/sla) calculation. Any errors during a cell outage count toward SLA credits, since mitigating cell outages is within Temporal's responsibility. 
From 165ad9957235178de1c909be7a8ffed4f7c59997 Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Fri, 15 May 2026 13:54:14 -0700 Subject: [PATCH 5/8] Apply suggestions from code review Co-authored-by: Lenny Chen <55669665+lennessyy@users.noreply.github.com> Co-authored-by: Luke Knepper --- docs/cloud/rto-rpo.mdx | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/cloud/rto-rpo.mdx b/docs/cloud/rto-rpo.mdx index c3420443d0..f7db72c708 100644 --- a/docs/cloud/rto-rpo.mdx +++ b/docs/cloud/rto-rpo.mdx @@ -76,13 +76,13 @@ When using Same-region Replication, Multi-region Replication, or Multi-cloud Rep - **RTO under 20 minutes.** Temporal detects the disruption and fails the Namespace over to its replica cell. - **RPO under 1 minute.** Asynchronous replication keeps the replica close to the active cell. -Even though the RPO target is under 1 minute, data is virtually never "lost" thanks to Temporal's built-in Recovery and Conflict Resolution process, which reconciles state between the active and replica when a failover occurs. +Even though the RPO target is under 1 minute, data is virtually never "lost" thanks to Temporal's built-in Recovery and Conflict Resolution process, which reconciles state between the active and replica when the outage is over. ### Cloud Region outage A cloud region as a whole can become degraded, with effects that span beyond any single cell or Availability Zone. -**Example causes:** failure of a key cloud service in the region (for example, the cloud provider's DNS resolver) causing cascading failures, two or more Availability Zones failing simultaneously, or network partitions between the region and other regions. +**Example causes:** failure of a key cloud service in the region causing cascading failures, two or more Availability Zones failing simultaneously, or network partitions between the region and other regions. **Blast Radius:** All Namespaces and Workers within a single cloud region are potentially affected. 
Namespaces and Workers in other regions of the same cloud — and in other clouds — are unaffected. @@ -155,7 +155,7 @@ Notes: - The above goals are only applicable to Namespaces that have enabled Temporal-initiated failovers, which comes enabled by default. Temporal-initiated failovers are initiated by Temporal's tooling and/or on-call engineers without user action. Users can always initiate a failover on their Namespace, even when Temporal-initiated failovers are enabled. In an outage, a user-initiated failover will not cancel out or accidentally reverse a Temporal-initiated failover. -:::note +:::tip Temporal highly recommends keeping Temporal-initiated failovers enabled. When Temporal-initiated failovers are _disabled,_ Temporal Cloud cannot set an RPO and RTO for that Namespace, because it cannot control when or if the user will trigger a failover. @@ -217,7 +217,7 @@ Temporal has put extensive work into tools and processes that minimize the recov - Expert engineers on-call 24/7 monitoring Temporal Cloud Namespaces and ready to assist should an outage occur. -#### How users can achieve a lower Recovery Time +#### Tips for a lower Recovery Time To achieve the lowest possible recovery times, Temporal recommends that you: From d0b44bee9e196f5e191a724daa15ba3fc2de0def Mon Sep 17 00:00:00 2001 From: Luke Knepper Date: Wed, 20 May 2026 16:50:49 -0700 Subject: [PATCH 6/8] Update confusion about 'isolation domains' --- docs/cloud/high-availability/failovers.mdx | 8 ++++---- docs/cloud/high-availability/index.mdx | 18 +++++++++++------- docs/glossary.md | 9 +-------- 3 files changed, 16 insertions(+), 19 deletions(-) diff --git a/docs/cloud/high-availability/failovers.mdx b/docs/cloud/high-availability/failovers.mdx index 5362f33348..f94a64c08f 100644 --- a/docs/cloud/high-availability/failovers.mdx +++ b/docs/cloud/high-availability/failovers.mdx @@ -150,7 +150,7 @@ At any time only the primary or the replica is active. 
The only exception occurs in the event of a [network partition](https://en.wikipedia.org/wiki/Network_partition), when a Network splits into separate subnetworks. Should this occur, you can promote a replica to active status. **Caution:** This temporarily makes both regions active. -After the network partition is resolved and communication between the isolation domains/regions is restored, a conflict resolution algorithm determines whether the primary or replica remains active. +After the network partition is resolved and communication between the regions is restored, a conflict resolution algorithm determines whether the primary or replica remains active. :::tip @@ -289,7 +289,7 @@ See [Returning to the primary with failbacks](#failbacks) for details on how and After any failover, whether triggered by you or by Temporal, an event appears in both the [Temporal Cloud Web UI](https://cloud.temporal.io/namespaces) (on the Namespace detail page) and in your audit logs. The audit log entry for Failover uses the `"operation": "FailoverNamespace"` event. -After failover, the replica becomes active, taking over in the isolation domain or region. +After failover, the replica becomes active, taking over from the original region. You don't need to monitor Temporal Cloud's failover response in real time. Whenever there is a failover event, Temporal Cloud [notifies you via email](/cloud/notifications#admin-notifications) @@ -405,14 +405,14 @@ Failover testing (also known as ")" can: - **Validate replicated deployments**: In multi-region setups, failover testing ensures your app can run from another region when the primary region experiences outages. - In standard setups, failover testing instead works with an isolation domain. + In Same-region Replication setups, failover testing instead works with a separate cell within the same region. This maintains high availability in mission-critical deployments. 
Manual testing confirms the failover mechanism works as expected, so your system handles incidents effectively. - **Assess replication lag**: In multi-region deployment, monitoring [replication lag](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_replication_lag_p99) between regions is crucial. Check the lag before initiating a failover to avoid rolling back Workflow progress. - This is less important when using isolation domains as failover is usually instantaneous. + This is less important with Same-region Replication, as failover is usually instantaneous. Manual testing helps you practice this critical step and understand its impact. - **Assess recovery time**: diff --git a/docs/cloud/high-availability/index.mdx b/docs/cloud/high-availability/index.mdx index 10bcd57b7c..dcdd5ac73f 100644 --- a/docs/cloud/high-availability/index.mdx +++ b/docs/cloud/high-availability/index.mdx @@ -20,8 +20,12 @@ keywords: import { ToolTipTerm, DiscoverableDisclosure, CaptionedImage } from '@site/src/components'; -Temporal Cloud's High Availability features use asynchronous across multiple to provide enhanced resilience and a 99.99% [SLA](/cloud/sla). -When you enable High Availability features, Temporal deploys your primary and its in separate isolation domains, giving you control over the location of both. This redundancy, combined with capability, enhances availability during outages. +Temporal keeps your Workflows running even when a Worker crashes. But what happens when a whole data center crashes? Or a region? + +In the cloud, outages are commonplace. An outage can bring down a whole data center, cluster, region, or cloud provider. To be durable in the cloud, Workflows and applications must handle these outages smoothly, just like Temporal handles a Worker crash. + +Temporal Cloud's High Availability features add extra reliability to Temporal Cloud Namespaces by handling cloud outages. 
Using asynchronous between multiple regions or cloud providers, combined with automatic outage detection and failover, High Availability keeps your Workflows running even during a cloud region outage. +This extra availability comes with an enhanced [SLA](/cloud/sla) of 99.99%, _including_ cloud provider outages. :::tip White paper @@ -34,7 +38,7 @@ For an in-depth guide covering everything from why you need High Availability to Even without High Availability features, Temporal Cloud provides robust reliability and a 99.9% contractual Service Level Agreement ([SLA](/cloud/sla)) guarantee against service errors. Each standard Temporal Namespace uses replication across three [Availability Zones](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-availability-zones) (AZs) to ensure high availability. -An Availability Zone is akin to an isolated datacenter managed by a cloud hyperscaler, with independent power, networking, and cooling infrastructure. +An Availability Zone is akin to an isolated data center managed by a cloud hyperscaler, with independent power, networking, and cooling infrastructure. Replication across AZs makes sure that any changes to Workflow state or History are saved in all three AZs _before_ the Temporal Service acknowledges a change back to the Client. As a result, your standard Temporal Namespace stays operational even if one of its three AZs becomes unavailable. 
@@ -44,7 +48,7 @@ However some critical use cases--such as customer-facing applications--require e ## High Availability features {#high-availability-features} -High Availability features extend Temporal Cloud's replication offering across even more disparate isolation domains: +High Availability features extend Temporal Cloud's replication across regions and cloud providers, so your Namespace keeps running even when a whole region or cloud provider goes down: | **Deployment** | **Description** | | --------------------------------------- | ---------------------------------------------------------- | @@ -53,7 +57,7 @@ High Availability features extend Temporal Cloud's replication offering across e ### Key features -- **Real-time replication** — Temporal replicates your Namespace across distant isolation domains with no performance impact to your Workers or Workflows. +- **Real-time replication** — Temporal replicates your Namespace across distant regions or cloud providers with no performance impact to your Workers or Workflows. - **Automatic failover with 20-minute RTO** — Temporal manages failover with a 20-minute [RTO](/cloud/rpo-rto). You can also [trigger failover](/cloud/high-availability/failovers) manually at any time, for example for testing. - **Transparent DNS routing** — On failover, DNS reroutes your [Namespace Endpoint](/cloud/namespaces#access-namespaces) to the active region. Requests that reach the replica are forwarded to the active region automatically. - **Sub-1-minute RPO** — In a failover during an outage, the [Recovery Point Objective](/cloud/rpo-rto) is under one minute. @@ -63,10 +67,10 @@ High Availability features extend Temporal Cloud's replication offering across e :::info Region availability You can usually choose your replica region, but the replica must be on the same continent as the primary region. -This means that a few Temporal Cloud regions do not yet support Multi-region Replication and/or Multi-cloud Replication. 
+This means that a few Temporal Cloud regions do not yet support Multi-region Replication or Multi-cloud Replication. See [Regions](/cloud/regions) for a full list of supported replica regions. -You can't enable both Multi-region Replication and Multi-cloud Replciation on the same Namepsace at the same time. +You can't enable both Multi-region Replication and Multi-cloud Replication on the same Namespace at the same time. ::: diff --git a/docs/glossary.md b/docs/glossary.md index 3786d653e2..e680202a9e 100644 --- a/docs/glossary.md +++ b/docs/glossary.md @@ -315,13 +315,6 @@ data integrity and prevent costly errors. -#### [Isolation Domain](/cloud/high-availability) - -An isolation domain is a defined area within Temporal Cloud's infrastructure. It helps contain failures and prevents -them from spreading to other parts of the system, providing redundancy and fault tolerance. - - - #### [List Filter](/list-filter) A List Filter is the SQL-like string that is provided as the parameter to an advanced Visibility List API. @@ -549,7 +542,7 @@ A Run Id is a globally unique, platform-level identifier for a Workflow Executio #### [Same-region Replication](/cloud/high-availability/enable) -Same-region Replication replicates Workflows and metadata to an isolation domain within the same region as the primary +Same-region Replication replicates Workflows and metadata to a separate cell within the same region as the primary Namespace. It provides a reliable failover mechanism while maintaining deployment simplicity. 
From eb22055c79341fd905bfe1a747cdf25e3eff1583 Mon Sep 17 00:00:00 2001 From: Lenny Chen Date: Tue, 26 May 2026 14:50:17 -0700 Subject: [PATCH 7/8] deduplicate content and clarify --- docs/cloud/rto-rpo.mdx | 410 +++++++++++++++++++++++++---------------- 1 file changed, 253 insertions(+), 157 deletions(-) diff --git a/docs/cloud/rto-rpo.mdx b/docs/cloud/rto-rpo.mdx index f7db72c708..4ee6e7a14c 100644 --- a/docs/cloud/rto-rpo.mdx +++ b/docs/cloud/rto-rpo.mdx @@ -2,7 +2,9 @@ id: rpo-rto title: Outages and Recovery Objectives (RTO / RPO) sidebar_label: Outages and Recovery Objectives (RTO / RPO) -description: Understand the types of outages Temporal Cloud is designed to handle, and the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) for each. +description: + Understand the types of outages Temporal Cloud is designed to handle, and the Recovery Point Objective (RPO) and + Recovery Time Objective (RTO) for each. slug: /cloud/rpo-rto toc_max_heading_level: 4 keywords: @@ -13,255 +15,349 @@ keywords: - Recovery Time Objective - outages tags: - - Recovery Point Objective + - Recovery Point Objective - Recovery Time Objective - Temporal Cloud --- import { ToolTipTerm } from '@site/src/components'; -When a cloud outage disrupts a Namespace, Temporal Cloud takes measures to maintain the Namespace's availability and data durability. The time it takes to recover from the outage is called the "recovery time." The amount of data (event histories) lost is called the "recovery point." A durable system should have a low recovery time and recovery point. - -To help you plan for keeping critical Workflows available during a cloud outage, Temporal Cloud publishes goals for the recovery time and recovery point for each kind of outage. These goals are called the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). These objectives are complementary to Temporal Cloud's [Service Level Agreement (SLA)](/cloud/sla). 
- -## Types of outages Temporal Cloud designs around - -Temporal Cloud is engineered to withstand four broad categories of cloud outages. The categories are listed below in order of how commonly they occur in the real world. For each category, Temporal has experienced the outage in production, and the corresponding Temporal Cloud features have successfully mitigated the impact for real customer Namespaces. +When a cloud outage disrupts a Namespace, Temporal Cloud takes measures to maintain the Namespace's availability and +data durability. The time it takes to recover from the outage is called the _recovery time_. The _recovery point_ is how +far back in time data must be recovered from after an outage. A durable system should have a low recovery time and a +recent recovery point. -### Availability Zone outage +Temporal Cloud publishes goals for the recovery time and recovery point for each kind of outage. These goals are called +the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). For details on how each is measured, see +[How RTO and RPO are measured](#how-rto-and-rpo-are-measured). These objectives are complementary to Temporal Cloud's +[Service Level Agreement (SLA)](/cloud/sla). -An [Availability Zone](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-availability-zones) (AZ) is akin to an isolated datacenter managed by a cloud hyperscaler, with independent power, networking, and cooling infrastructure. Each cloud region contains multiple AZs, and an individual AZ can fail due to events such as hardware failure, power loss, or a localized network partition. +The RTO and RPO for a Namespace depend on the type of outage and which [High Availability](/cloud/high-availability) +features the Namespace has enabled. -Historically, AZ outages are the most common type of outage in the cloud, and Temporal Cloud has weathered many of them transparently to our customers.
+## RTO and RPO summary -**Blast Radius:** A single Availability Zone within a single cloud region. Because every Namespace's components are spread across at least three AZs, the blast radius to Temporal Cloud users is typically zero — Namespaces stay operational with little to no downtime. However, the outage will take out any Workers the user is running in that AZ. We recommend spreading Workers across multiple AZs to mitigate this. +The following table summarizes the RTO and RPO targets for each type of outage. These targets apply to Namespaces that +have Temporal-initiated failovers enabled, which is the default. Temporal-initiated failovers are triggered by +Temporal's tooling and on-call engineers without user action. Users can always initiate a failover independently. In an +outage, a user-initiated failover will not cancel out or reverse a Temporal-initiated failover. -**Temporal Cloud feature to mitigate this outage:** Every Namespace is automatically spread across at least three Availability Zones, and any Namespace can handle a single AZ failure without disruption to end-user Temporal operations. [High Availability](/cloud/high-availability) features are _not_ required to keep Temporal Cloud operations running through an AZ outage. +These targets are for unplanned cloud outages and do not apply to user-initiated failovers during healthy periods, such +as DR drills. Read about [triggering a failover](/cloud/high-availability/failovers) to see how a Namespace failover +performs during healthy periods. -**SLA inclusion:** Included in the [SLA](/cloud/sla) calculation. Any errors during an AZ outage count toward SLA credits, since AZ resilience is within Temporal's responsibility. 
+| Outage type | Applicable Namespaces | RPO | RTO | +| ----------------------------------------------------- | --------------------------------------------------------------------- | -------------- | ---------------- | +| [Availability Zone outage](#availability-zone-outage) | All Namespaces | Zero | Near-zero | +| [Cell outage](#cell-outage) | Namespaces with Same-region, Multi-region, or Multi-cloud Replication | Under 1 minute | Under 20 minutes | +| [Cloud Region outage](#cloud-region-outage) | Namespaces with Multi-region or Multi-cloud Replication | Under 1 minute | Under 20 minutes | +| [Cloud-wide outage](#cloud-wide-outage) | Namespaces with Multi-cloud Replication | Under 1 minute | Under 20 minutes | -If two AZs fail simultaneously, Temporal Cloud treats the event as a [Cloud Region outage](#cloud-region-outage). In that case, Namespaces in the region may be impacted, including those using [Same-region Replication](/cloud/high-availability#same-region-replication). - -:::note +:::tip -When an AZ fails, Temporal may also trigger a failover on Namespaces that have High Availability enabled, as a precaution in case the outage scope expands. You can opt out of this behavior by [disabling Temporal-managed failovers](cloud/high-availability/failovers#disabling-temporal-initiated) on the Namespace. +Temporal highly recommends keeping Temporal-initiated failovers enabled. When Temporal-initiated failovers are +_disabled,_ Temporal Cloud cannot set an RPO and RTO for that Namespace, because it cannot control when or if the user +will trigger a failover. ::: -#### RTO and RPO +As soon as a cloud outage resolves, Temporal's on-call engineers work to restore service to Namespaces that were not +protected by High Availability. A cloud outage can leave lingering effects in Temporal's systems and applications, even +after the cloud provider restores the underlying service. An affected Namespace's outage may last longer than the cloud +provider's outage. 
-When using Temporal Cloud (no additional features required): +All Namespaces are backed up every 4 hours. If an outage causes data loss on a Namespace that was not protected by High +Availability, Temporal uses the backup to restore as much data as feasible. -- **Near-zero RTO.** When a single AZ fails, the remaining two AZs continue serving requests without a failover, so end users see little to no disruption. -- **Zero RPO.** Writes to Workflow state are synchronously replicated across all three AZs before being acknowledged back to the Client, so an AZ failure cannot cause data loss. +## Outage types and their RTO/RPO -### Cell outage +The following sections explain each type of outage in more detail, including the blast radius, Temporal Cloud features +that mitigate the outage, and whether the outage is included in the SLA calculation. -Temporal Cloud runs on a [cell architecture](https://docs.aws.amazon.com/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/what-is-a-cell-based-architecture.html). Each cell contains the software and services necessary to host a Namespace, and components within a cell are distributed across at least three Availability Zones. Cells provide a strong unit of isolation: a problem inside one cell does not propagate to other cells. +### Availability Zone outage {#availability-zone-outage} -**Example causes:** failure of a sub-component within the cell (for example, an individual database becoming unavailable) or a software bug introduced in a new deploy to the cell. +An +[Availability Zone](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-availability-zones) +(AZ) is akin to an isolated datacenter managed by a cloud hyperscaler, with independent power, networking, and cooling +infrastructure. Each cloud region contains multiple AZs, and an individual AZ can fail due to events such as hardware +failure, power loss, or a localized network partition. 
-**Blast Radius:** One cell--and the Namespaces within that cell--within a single region. Even though your Workers will remain healthy, they will not be able to process Workflows because the Namespace is down. +AZ outages are the most common type of outage, and Temporal Cloud has weathered many of them transparently. -**Temporal Cloud feature to mitigate this outage:** [Multi-region Replication](/cloud/high-availability) (GA) and [Multi-cloud Replication](/cloud/high-availability) (GA) replicate a Namespace into another cell in a different region or different cloud provider. [Same-region Replication](/cloud/high-availability) (Preview) replicates a Namespace into another cell within the same region. When any of these features enabled for a namespace, an outage that disrupts a single cell can be mitigated by failing the Namespace over to its replica. +**Blast Radius:** A single Availability Zone within a single cloud region. Because every Namespace's components are +spread across at least three AZs, the blast radius to Temporal Cloud users is typically zero — Namespaces stay +operational with little to no downtime. -**SLA inclusion:** Included in the [SLA](/cloud/sla) calculation. Any errors during a cell outage count toward SLA credits, since mitigating cell outages is within Temporal's responsibility. +:::caution -Cell-level disruptions occur from time to time, and Temporal's replication and failover tooling has restored affected Namespaces in real-world incidents. - -#### RTO and RPO - -When using Same-region Replication, Multi-region Replication, or Multi-cloud Replication for Temporal-managed failover: - -- **RTO under 20 minutes.** Temporal detects the disruption and fails the Namespace over to its replica cell. -- **RPO under 1 minute.** Asynchronous replication keeps the replica close to the active cell. 
+While Temporal Cloud can withstand single AZ outages without disruption, if you have Workers that are deployed in the +impacted AZ, those Workers may be disrupted. To mitigate this risk, Temporal recommends deploying your Workers across +multiple AZs. -Even though the RPO target is under 1 minute, data is virtually never "lost" thanks to Temporal's built-in Recovery and Conflict Resolution process, which reconciles state between the active and replica when the outage is over. - -### Cloud Region outage - -A cloud region as a whole can become degraded, with effects that span beyond any single cell or Availability Zone. +::: -**Example causes:** failure of a key cloud service in the region causing cascading failures, two or more Availability Zones failing simultaneously, or network partitions between the region and other regions. +**Mitigation:** Every Namespace is automatically spread across at least three Availability Zones, and any Namespace can +handle a single AZ failure without disruption to end-user Temporal operations. +[High Availability](/cloud/high-availability) features are _not_ required to keep Temporal Cloud operations running +through an AZ outage. -**Blast Radius:** All Namespaces and Workers within a single cloud region are potentially affected. Namespaces and Workers in other regions of the same cloud — and in other clouds — are unaffected. +**SLA inclusion:** Included in the [SLA](/cloud/sla) calculation. Any errors during an AZ outage count toward SLA +credits, since AZ resilience is within Temporal's responsibility. -**Temporal Cloud feature to mitigate this outage:** [Multi-region Replication](/cloud/high-availability) and [Multi-cloud Replication](/cloud/high-availability) place the replica outside the affected region, so a Namespace can fail over and continue serving Workflows. Same-region Replication does not protect against a Cloud Region outage, since the replica resides in the same region. 
+If two AZs fail simultaneously, Temporal Cloud treats the event as a [Cloud Region outage](#cloud-region-outage). In +that case, Namespaces in the region may be impacted, including those using +[Same-region Replication](/cloud/high-availability#same-region-replication). -**SLA inclusion:** Included in the [SLA](/cloud/sla) calculation only for Namespaces that have Multi-region Replication or Multi-cloud Replication enabled with Temporal-managed failovers — in those cases, Temporal can mitigate the outage. For Namespaces without these features, a Cloud Region outage is excluded from the SLA calculation, as it is beyond Temporal's control to mitigate. +:::info -If two or more regions in the same cloud provider experience an outage simultaneously, Temporal Cloud treats the event as a [Cloud-wide outage](#cloud-wide-outage). +When an AZ fails, Temporal may also trigger a failover on Namespaces that have High Availability enabled, as a +precaution in case the outage scope expands. You can opt out of this behavior by +[disabling Temporal-managed failovers](/cloud/high-availability/failovers#disabling-temporal-initiated) on the Namespace. -Regional outages are less common than cell or AZ outages, but they happen. During the [AWS us-east-1 incident on October 20, 2025](https://temporal.io/blog/how-devs-kept-running-during-the-aws-us-east-1-oct-20-2025), Temporal Cloud's regional failover kept customer Namespaces running. +::: #### RTO and RPO -When using Multi-region Replication or Multi-cloud Replication for Temporal-managed failover: - -- **RTO under 20 minutes.** Temporal detects the regional disruption and fails the Namespace over to its replica in another region. -- **RPO under 1 minute.** Asynchronous replication keeps the replica close to the active region.
- -Even though the RPO target is under 1 minute, data is virtually never "lost" thanks to Temporal's built-in Recovery and Conflict Resolution process, which reconciles state between the active and replica when a failover occurs. +When using Temporal Cloud (no additional features required): -### Cloud-wide outage +- **Near-zero RTO.** When a single AZ fails, the remaining two AZs continue serving requests without a failover, so end + users see little to no disruption. +- **Zero RPO.** Writes to Workflow state are synchronously replicated across all three AZs before being acknowledged + back to the Client, so an AZ failure cannot cause data loss. -On rare occasions, an issue affects two or more regions of a single cloud provider at once. Any simultaneous outage of two or more regions in the same cloud provider is treated as a cloud-wide outage. +### Cell outage {#cell-outage} -**Example causes:** a software bug rolled out to every region of a cloud provider that triggers cascading failures across the provider's infrastructure, or two or more regions in the same cloud experiencing independent regional outages at the same time. +Temporal Cloud runs on a +[cell architecture](https://docs.aws.amazon.com/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/what-is-a-cell-based-architecture.html). +Each cell contains the software and services necessary to host a Namespace, and components within a cell are distributed +across at least three Availability Zones. Cells provide a strong unit of isolation: a problem inside one cell does not +propagate to other cells. A cell outage occurs when a cell becomes degraded or unavailable, disrupting the Namespaces +hosted within it. -**Blast Radius:** Most or all regions of a single cloud provider. Every Namespace and every Worker hosted in that cloud is potentially affected. +**Blast Radius:** One cell--and the Namespaces within that cell--within a single region. 
Even though your Workers will +remain healthy, they will not be able to process Workflows because the Namespace is down. -**Temporal Cloud feature to mitigate this outage:** [Multi-cloud Replication](/cloud/high-availability) places the replica in a different cloud provider entirely, so the Namespace can fail over even when an entire cloud provider goes down. +**Mitigation:** [Multi-region Replication](/cloud/high-availability) and +[Multi-cloud Replication](/cloud/high-availability) replicate a Namespace into another cell in a different region or +different cloud provider. [Same-region Replication](/cloud/high-availability) replicates a Namespace into another cell +within the same region. When any of these features are enabled for a Namespace, an outage that disrupts a single cell +can be mitigated by failing the Namespace over to its replica. -**SLA inclusion:** Included in the [SLA](/cloud/sla) calculation only for Namespaces that have Multi-cloud Replication enabled with Temporal-managed failovers — in those cases, Temporal can mitigate the outage. For Namespaces without this feature, a cloud-wide outage is excluded from the SLA calculation, as it is beyond Temporal's control to mitigate. +**SLA inclusion:** Included in the [SLA](/cloud/sla) calculation. Any errors during a cell outage count toward SLA +credits, since mitigating cell outages is within Temporal's responsibility. -Cloud-wide outages are the rarest category, but they [have occurred](https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW). Multi-cloud Replication is designed to keep Namespaces running through such events. +Cell-level disruptions occur from time to time, and Temporal's replication and failover tooling has restored affected +Namespaces in real-world incidents. 
#### RTO and RPO -When using Multi-cloud Replication for Temporal-managed failover: - -- **RTO under 20 minutes.** Temporal detects the cloud-wide disruption and fails the Namespace over to its replica in a different cloud provider. -- **RPO under 1 minute.** Asynchronous replication keeps the replica close to the active region, even across cloud providers. - -Even though the RPO target is under 1 minute, data is virtually never "lost" thanks to Temporal's built-in Recovery and Conflict Resolution process, which reconciles state between the active and replica when a failover occurs. - -## How High Availability replication works - -To achieve the lowest RPO and RTO, Temporal Cloud offers [High Availability](/cloud/high-availability) features that keep Workflows operational with minimal downtime. When High Availability is enabled on a Namespace, the user chooses a region to place a "replica" that will take over in the event of a failure. The location of the replica determines the type of replication used and the categories of outage it can handle: - -- **Multi-region Replication** places the active and replica in different regions on the same cloud (for example, AWS us-east-1 and AWS us-west-2). -- **Multi-cloud Replication** places the active and replica in different cloud providers (for example, AWS and GCP). -- **Same-region Replication** (Preview) places the active and replica in the same region. - -Temporal always places the active and replica in different [cells](/cloud/overview#cell-based-infrastructure). - -As Workflows progress in the active region, history events are asynchronously replicated to the replica. Because replication is asynchronous, High Availability does not impact the latency or throughput of Workflow Executions in the active region. If an outage hits the active region or cell, Temporal Cloud will fail over to the replica so that existing Workflow Executions will continue to run and new Workflow Executions can be started. 
- -## Explaining Temporal Cloud's RTO and RPO +When using Same-region Replication, Multi-region Replication, or Multi-cloud Replication for Temporal-managed failover: -The Recovery Point Objective and Recovery Time Objective for Temporal Cloud depend on the type of outage and which [High Availability](/cloud/high-availability) feature your Namespace has enabled. Temporal Cloud can only set an RPO and RTO for cases where it has the ability to mitigate the outage. Therefore, the published RPOs and RTOs apply to Namespaces that have the corresponding type of replication and have enabled Temporal-initiated failovers, which comes enabled by default. +- **RTO under 20 minutes.** Temporal detects the disruption and fails the Namespace over to its replica cell. +- **RPO under 1 minute.** Asynchronous replication keeps the replica close to the active cell. -### Summary table +Even though the RPO target is under 1 minute, data loss is virtually eliminated thanks to Temporal's built-in Recovery +and Conflict Resolution process, which reconciles state between the active and replica when the outage is over. -| Outage type | Applicable Namespaces | RPO | RTO | -| ---------------------- | ------------------------------------------------------------------------------ | -------------- | ---------------- | -| Availability Zone outage | All Namespaces | Zero | Near-zero | -| Cell outage | Namespaces with Same-region, Multi-region, or Multi-cloud Replication | Under 1 minute | Under 20 minutes | -| Cloud Region outage | Namespaces with Multi-region or Multi-cloud Replication | Under 1 minute | Under 20 minutes | -| Cloud-wide outage | Namespaces with Multi-cloud Replication | Under 1 minute | Under 20 minutes | +### Cloud Region outage {#cloud-region-outage} -Notes: +A cloud region as a whole can become degraded, with effects that span beyond any single cell or Availability Zone. 
-- The above goals are only applicable to Namespaces that have enabled Temporal-initiated failovers, which comes enabled by default. Temporal-initiated failovers are initiated by Temporal's tooling and/or on-call engineers without user action. Users can always initiate a failover on their Namespace, even when Temporal-initiated failovers are enabled. In an outage, a user-initiated failover will not cancel out or accidentally reverse a Temporal-initiated failover. +**Blast Radius:** All Namespaces and Workers within a single cloud region are potentially affected. -:::tip +**Mitigation:** [Multi-region Replication](/cloud/high-availability) and +[Multi-cloud Replication](/cloud/high-availability) place the replica outside the affected region, so a Namespace can +fail over and continue serving Workflows. Same-region Replication does not protect against a Cloud Region outage, since +the replica resides in the same region. -Temporal highly recommends keeping Temporal-initiated failovers enabled. When Temporal-initiated failovers are _disabled,_ Temporal Cloud cannot set an RPO and RTO for that Namespace, because it cannot control when or if the user will trigger a failover. +**SLA inclusion:** Included in the [SLA](/cloud/sla) calculation only for Namespaces that have Multi-region Replication +or Multi-cloud Replication enabled with Temporal-managed failovers — in those cases, Temporal can mitigate the outage. +For Namespaces without these features, a Cloud Region outage is excluded from the SLA calculation, as it is beyond +Temporal's control to mitigate. -::: +If two or more regions in the same cloud provider experience an outage simultaneously, Temporal Cloud treats the event +as a [Cloud-wide outage](#cloud-wide-outage). -- The above goals are for unplanned cloud outages. They do not apply to user-initiated failovers during healthy periods (e.g., for DR drills). 
Read about [triggering a failover](/cloud/high-availability/failovers) to see how a Namespace failover should perform during healthy periods. +Regional outages are less common than cell or AZ outages, but they do happen. During the +[AWS us-east-1 incident on October 20, 2025](https://temporal.io/blog/how-devs-kept-running-during-the-aws-us-east-1-oct-20-2025), +Temporal Cloud's regional failover kept customer Namespaces running. -- As soon as a cloud outage resolves, Temporal's on-call engineers will work to restore service to Namespaces that were not protected by High Availability. A cloud outage can leave lingering effects in Temporal's systems and applications, even after the cloud provider restores the underlying service. Because of this, affected Namespaces may not be immediately available when the underlying service is restored. An affected Namespace's outage may last longer than the cloud provider's outage. +#### RTO and RPO -- All Namespaces are backed up every 4 hours. If an outage causes data loss on a Namespace that was not protected by High Availability, then Temporal will use the backup to restore as much data as feasible. +When using Multi-region Replication or Multi-cloud Replication for Temporal-managed failover: -- Temporal has internal goals and measurements for Recovery Time and Recovery Point, but does not publish the achieved Recovery Time and Recovery Point for each incident. +- **RTO under 20 minutes.** Temporal detects the regional disruption and fails the Namespace over to its replica in + another region. +- **RPO under 1 minute.** Asynchronous replication keeps the replica close to the active region. -### Explaining the RPO +Even though the RPO target is under 1 minute, data is virtually never "lost" thanks to Temporal's built-in Recovery and +Conflict Resolution process, which reconciles state between the active and replica when a failover occurs. 
-:::note Temporal's Recovery Point is different from a traditional Recovery Point +### Cloud-wide outage {#cloud-wide-outage} -In a traditional database, data within the Recovery Point window may be permanently lost during a failover. In Temporal Cloud, that data is not lost. Cloud data stores are engineered for extreme durability (commonly 99.999999999%, or "11 nines"), so any data acknowledged by Temporal Cloud is durably persisted. After the outage resolves, Temporal's Recovery and Conflict Resolution process automatically syncs that data back into the Namespace. +On rare occasions, an issue affects two or more regions of a single cloud provider at once. Any simultaneous outage of +two or more regions in the same cloud provider is treated as a cloud-wide outage. -The Recovery Point Objective therefore reflects the maximum data that may be temporarily unavailable in the replica at the moment of failover, not the maximum data that could be permanently lost. +**Example causes:** a software bug rolled out to every region of a cloud provider that triggers cascading failures +across the provider's infrastructure, or two or more regions in the same cloud experiencing independent regional outages +at the same time. -::: +**Blast Radius:** Most or all regions of a single cloud provider. Every Namespace and every Worker hosted in that cloud +is potentially affected. -Temporal has put extensive work into tools and processes that minimize the recovery point and achieve its RPO for Temporal-initiated failovers, including: +**Mitigation:** [Multi-cloud Replication](/cloud/high-availability) places the replica in a different cloud provider +entirely, so the Namespace can fail over even when an entire cloud provider goes down. -- Best-in-class [data replication technology](https://youtu.be/mULBvv83dYM?si=RDeWb3gVsEtgGM4z&t=334) that keeps the replica up to date with the active. 
+**SLA inclusion:** Included in the [SLA](/cloud/sla) calculation only for Namespaces that have Multi-cloud Replication +enabled with Temporal-managed failovers — in those cases, Temporal can mitigate the outage. For Namespaces without this +feature, a cloud-wide outage is excluded from the SLA calculation, as it is beyond Temporal's control to mitigate. -- Monitoring, alerting, and internal SLOs on the replication lag for every Temporal Cloud Namespace. +Cloud-wide outages are the rarest category, but they +[have occurred](https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW). Multi-cloud Replication is designed to +keep Namespaces running through such events. -However, user actions on a Namespace can affect the recovery point. For example, suddenly spiking into much higher throughput than a Namespace has seen before could create a period of replication lag where the replica falls behind the active. - -Temporal provides a [replication lag](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_replication_lag_p99) metric for each Namespace. This metric approximates the recovery point the Namespace would achieve in a worst case failure at that given moment. +#### RTO and RPO -:::note +When using Multi-cloud Replication for Temporal-managed failover: -Temporal recommends monitoring the replication lag and alerting should it rise too high, e.g., above 1 minute. +- **RTO under 20 minutes.** Temporal detects the cloud-wide disruption and fails the Namespace over to its replica in a + different cloud provider. +- **RPO under 1 minute.** Asynchronous replication keeps the replica close to the active region, even across cloud + providers. -::: +Even though the RPO target is under 1 minute, data is virtually never "lost" thanks to Temporal's built-in Recovery and +Conflict Resolution process, which reconciles state between the active and replica when a failover occurs. 
-### Explaining the RTO +## How RTO and RPO are measured {#how-rto-and-rpo-are-measured} -The Recovery Time for a given incident is measured from the moment the incident begins to cause abnormal Namespace operation — for example, when unavailability or error rates rise above an acceptable level — to the moment the Namespace is restored to full functionality. +Temporal Cloud achieves its RTO and RPO targets through [High Availability](/cloud/high-availability) replication. The +following sections explain how each metric is measured and what factors can affect them. -For most incidents, the vast majority of the Recovery Time is spent detecting the incident, determining the affected boundary (a single cell, a region, or an entire cloud), and deciding to fail Namespaces over to their replicas. The actual time to complete the failover is usually a very small piece of the Recovery Time. +### RPO {#how-rpo-is-measured} -This Recovery Time covers only the Temporal Namespace. Your application's overall Recovery Time also depends on having enough healthy Workers that can reach the Namespace and process Workflows. Maintaining sufficient Worker capacity that can reach the replica region (or replica cloud) during a failover is your responsibility. +Unlike a traditional database where data within the recovery point window may be permanently lost, Temporal Cloud +durably persists all acknowledged data. After an outage resolves, Temporal's Recovery and Conflict Resolution process +automatically syncs data back into the Namespace. The RPO therefore reflects the maximum data that may be _temporarily +unavailable_ in the replica at the moment of failover, not data that is permanently lost. -#### How Temporal achieves a low Recovery Time +Temporal keeps replicas up to date using +[asynchronous replication](https://youtu.be/mULBvv83dYM?si=RDeWb3gVsEtgGM4z&t=334), with monitoring, alerting, and +internal SLOs on replication lag for every Namespace. 
-Temporal has put extensive work into tools and processes that minimize the recovery time and achieve its RTO for Temporal-initiated failovers, including: +User actions on a Namespace can affect the recovery point. For example, suddenly spiking into much higher throughput +than a Namespace has seen before could create a period of replication lag where the replica falls behind the active. -- History events are replicated _asynchronously_. This ensures that the Namespace can still run workflows in the active region even if there are networking blips or outages with the replica region. +Temporal provides a +[replication lag](/cloud/metrics/openmetrics/metrics-reference#temporal_cloud_v1_replication_lag_p99) metric for each +Namespace. This metric approximates the recovery point the Namespace would achieve in a worst-case failure at that +moment. Temporal recommends monitoring the replication lag and alerting if it rises above 1 minute. -- Outages are detected automatically. We have extensive internal alerting to detect disruptions to Namespaces, and are ever improving this system. +### RTO {#how-rto-is-measured} -- Battle-tested Temporal Workflows that execute failovers of all Temporal Cloud Namespaces in a given region quickly. +The Recovery Time for a given incident is measured from the moment the incident begins to cause abnormal Namespace +operation — for example, when unavailability or error rates rise above an acceptable level — to the moment the Namespace +is restored to full functionality. -- Regular drills where we fail over our internal Namespaces to test our tooling. +For most incidents, the vast majority of the Recovery Time is spent detecting the incident, determining the affected +boundary (a single cell, a region, or an entire cloud), and deciding to fail Namespaces over to their replicas. The +actual time to complete the failover is usually a very small piece of the Recovery Time. 
-- Expert engineers on-call 24/7 monitoring Temporal Cloud Namespaces and ready to assist should an outage occur. +This Recovery Time covers only the Temporal Namespace. Your application's overall Recovery Time also depends on having +enough healthy Workers that can reach the Namespace and process Workflows. Maintaining sufficient Worker capacity that +can reach the replica region (or replica cloud) during a failover is your responsibility. You are also responsible for +failing over any other regional dependencies your application relies on, such as replicated application databases. -#### Tips for a lower Recovery Time +## Tips for a lower Recovery Time To achieve the lowest possible recovery times, Temporal recommends that you: - Keep Temporal-initiated failovers enabled on your Namespace (the default) - Invest in a process to detect outages and trigger a manual failover. -Users can trigger manual failovers on their Namespaces even if Temporal-initiated failovers are enabled. There are several benefits to combining a manual failover process with Temporal-initiated failovers: +You can trigger manual failovers on your Namespaces even if Temporal-initiated failovers are enabled. There are several +benefits to combining a manual failover process with Temporal-initiated failovers: -- You can detect outages that Temporal doesn't. In the cloud, regional outages don't affect all services equally. It's possible that Temporal--and the services it depends on--are unaffected by the outage, even while your Workers or other cloud infrastructure are disrupted. If you [monitor services in your critical path](https://sre.google/sre-book/monitoring-distributed-systems/) and alert on unusual error rates, you may catch outages before Temporal Cloud does. +- You can detect outages that Temporal doesn't. In the cloud, regional outages don't affect all services equally.
It's + possible that Temporal--and the services it depends on--are unaffected by the outage, even while your Workers or other + cloud infrastructure are disrupted. If you + [monitor services in your critical path](https://sre.google/sre-book/monitoring-distributed-systems/) and alert on + unusual error rates, you may catch outages before Temporal Cloud does. -- You can sequence your failovers in a particular order. Your cloud infrastructure probably contains more pieces than just your Temporal Namespace: Temporal Workers, compute pools, data stores, and other cloud services. If you manually fail over, you can choose the order in which these pieces switch to the replica region. You can then test that ordering with failover drills and ensure it executes smoothly without data consistency issues or bottlenecks. +- You can sequence your failovers in a particular order. Your cloud infrastructure probably contains more pieces than + just your Temporal Namespace: Temporal Workers, compute pools, data stores, and other cloud services. If you manually + fail over, you can choose the order in which these pieces switch to the replica region. You can then test that + ordering with failover drills and ensure it executes smoothly without data consistency issues or bottlenecks. -- You can proactively fail over more aggressively than Temporal. While the 20-minute RTO should be sufficient for most use cases, some may strive to hit an even lower RTO. For workloads like high frequency trading, auctions, or popular sporting events, an outage at the wrong time could cause tremendous lost revenue per minute. You can adopt a posture that fails over more eagerly than Temporal does. For example, you could trigger a manual failover at the first sign of a possible disruption, before knowing whether there's a true regional outage. +- You can proactively fail over more aggressively than Temporal. 
While the 20-minute RTO should be sufficient for most + use cases, some may strive to hit an even lower RTO. For workloads like high frequency trading, auctions, or popular + sporting events, an outage at the wrong time could cause tremendous lost revenue per minute. You can adopt a posture + that fails over more eagerly than Temporal does. For example, you could trigger a manual failover at the first sign of + a possible disruption, before knowing whether there's a true regional outage. -- Even if you have robust tooling to detect an outage and trigger a failover, leaving Temporal-initiated failovers enabled provides a "safety net" in case your automation misses an outage. It also gives Temporal leeway to preemptively fail over your Namespace if we detect that it may be disrupted soon, e.g., by a rolling failure that has impacted other Namespaces but not yours, yet. +- Even if you have robust tooling to detect an outage and trigger a failover, leaving Temporal-initiated failovers + enabled provides a "safety net" in case your automation misses an outage. It also gives Temporal leeway to + preemptively fail over your Namespace if we detect that it may be disrupted soon, e.g., by a rolling failure that has + impacted other Namespaces but not yours, yet. -#### Comparing RTO and SLA +## Comparing RTO and SLA -Temporal has both a Recovery Time Objective (RTO) and a Service Level Agreement (SLA). They serve complementary purposes and apply in different situations. +Temporal has both a Recovery Time Objective (RTO) and a Service Level Agreement (SLA). They serve complementary purposes +and apply in different situations. 
-| Aspect | RTO | SLA | -|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| What is it? | An objective, or high-priority goal, for the total time that an outage disrupts a Namespace. | A contractual agreement that sets an upper bound on the service error rate, with financial repercussions. | -| How is it measured? | The achieved recovery time is measured in terms of minutes per outage. | The achieved service error rate is measured in terms of error rate per month. | +| Aspect | RTO | SLA | +| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| What is it? 
| An objective, or high-priority goal, for the total time that an outage disrupts a Namespace. | A contractual agreement that sets an upper bound on the service error rate, with financial repercussions. | +| How is it measured? | The achieved recovery time is measured in terms of minutes per outage. | The achieved service error rate is measured in terms of error rate per month. | | How is the calculation performed? | The achieved recovery time in a given outage is the total time between when a disruption to a Namespace began and when the Namespace was restored to full functionality, either after a failover to a healthy region or after the outage has been mitigated. | Temporal measures the percentage of requests to Temporal Cloud that fail, and applies a [formula](/cloud/sla) to get the final percentage for the month. | -| Do partial degradations count? | Most outages contain periods of __partial degradation__ where some percentage of Namespace operations fail while the rest complete as normal. When they disrupt a Namespace, periods of partial degradation count in the calculation of the recovery time. | Partial degradations only partially count for the service error rate calculation. A 5-minute window with a 10% error rate would count less than a 5-minute window with a 100% error rate. | -| What is excluded? | For partial degradations, what counts as a disruption to a Namespace is subject to Temporal's expert judgment, but a good rule of thumb is a service error rate >=10%. | We exclude outages that are out of Temporal's control to mitigate, e.g., a failure of the underlying cloud provider infrastructure that affects a Namespace without High Availability and Temporal-initiated failovers enabled. If a Namespace has the relevant High Availability feature and has Temporal-initiated failovers enabled, then Temporal can act to mitigate the outage and it does usually count against the SLA. Full exclusions on the [SLA page](/cloud/sla). 
| - -The following examples illustrate the RTO and SLA calculations for different types of in a regional outage. These hypothetical Namespaces are based on actual Temporal Cloud performance in a [real-world outage](https://temporal.io/blog/how-devs-kept-running-during-the-aws-us-east-1-oct-20-2025). - -Suppose that region `middle-earth-1` experienced a cascading failure starting at 10:00:00 UTC, causing various instances and machines to fail over time. Temporal's automatic failover triggered for all Namespaces and completed at 10:15:00 UTC. - -- Namespace 0 was in the region but its cell was not affected by the outage. The only downtime it had was for a few seconds during the failover operation. It experienced a near-zero Recovery Time, and its service error rate was negligible. Graceful failover was successful, and this Namespace achieved a recovery point of 0. - -- Namespace 1_A was in the region and its cell experienced a partial degradation that caused 10% of requests to fail in the first 5 minutes, 25% in the second five minutes, and 50% in the third five minutes. Since it was significantly impacted from 10:00:00 to 10:15:00, its Recovery Time was 15 minutes. If it had no other service errors that month, then its service error rate for the month would be: ( (1 - 10%) + (1 - 25%) + (1 - 50%) + 8925 * 100% ) / 8928 = 99.990%. (Note: there are 8928 5-minute periods in a 31-day month.) Graceful failover was successful, and this Namespace achieved a recovery point of 0. - -- Namespace 1_B was in the same cell as Namespace 2_A, so it also experienced a partial degradation that caused 10% of requests to fail. However, its owner detected the outage via their own tooling and decided to manually fail over at 10:05:00. This Namespace achieved a recovery time of 5 minutes and a service error rate of ( 1 * (1 - 10%) + 8927 * 100% ) / 8928 = 99.998%. Graceful failover was successful, and this Namespace achieved a recovery point of 0. 
- -- Namespace 2_A was in the region and its cell was fully network partitioned at the start of the outage, causing 100% of requests to fail. Since it was significantly impacted from 10:00:00 to 10:15:00, its Recovery Time was 15 minutes. If it had no other service errors that month, then its service error rate for the month would be: ( 3 * (1 - 100%) + 8928 * 100% ) / 8640 5-minute periods per month = 99.97%. Because the Namespace was network partitioned, graceful failover did not succeed, and forced failover was used. The recovery point achieved was equal to the replication lag at the time of the network partition, which was a few seconds. - -- Namespace 2_B was in the region and was fully network partitioned, causing 100% of requests to fail. However, its owner detected the outage via their own tooling and decided to manually fail over at 10:05:00. This Namespace achieved a recovery time of 5 minutes and a service error rate of ( 1 5-minute periods * (1 - 100%) + 8639 5-minute periods * 100% ) / 8640 5-minute periods per month = 99.99%. Because the Namespace was network partitioned, graceful failover did not succeed, and forced failover was used. The recovery point achieved was equal to the replication lag at the time of the network partition, which was a few seconds. - -All of the above Namespaces were in the affected region and beat the 1-minute RPO. But they achieved varying recovery times and service error rates. - -- Notice how Namespace 1_A and Namespace 2_A were both automatically failed over with **the same recovery time but different service error rates**. Notice how Namespace 2_B and Namespace 1_A happen to have the **same service error rate but different recovery times**. This illustrates how RTO and SLA can differ, even in the same outage. Both are valuable tools for Temporal Cloud users to measure the availability of their Namespaces. 
- -- Notice how the Namespaces that were manually failed over (Namespace 1_B and Namespace 2_B) achieved lower recovery times than the Namespaces that were automatically failed over (Namespace 1_A and Namespace 2_A). This illustrates how **proactive, aggressive manual failover can achieve a better recovery time than automatic failover**. +| Do partial degradations count? | Most outages contain periods of **partial degradation** where some percentage of Namespace operations fail while the rest complete as normal. When they disrupt a Namespace, periods of partial degradation count in the calculation of the recovery time. | Partial degradations only partially count for the service error rate calculation. A 5-minute window with a 10% error rate would count less than a 5-minute window with a 100% error rate. | +| What is excluded? | For partial degradations, what counts as a disruption to a Namespace is subject to Temporal's expert judgment, but a good rule of thumb is a service error rate >=10%. | We exclude outages that are out of Temporal's control to mitigate, e.g., a failure of the underlying cloud provider infrastructure that affects a Namespace without High Availability and Temporal-initiated failovers enabled. If a Namespace has the relevant High Availability feature and has Temporal-initiated failovers enabled, then Temporal can act to mitigate the outage and it does usually count against the SLA. Full exclusions on the [SLA page](/cloud/sla). | + +The following examples illustrate the RTO and SLA calculations for different types of Namespaces in a regional outage. These +hypothetical Namespaces are based on actual Temporal Cloud performance in a +[real-world outage](https://temporal.io/blog/how-devs-kept-running-during-the-aws-us-east-1-oct-20-2025). + +Suppose that region `middle-earth-1` experienced a cascading failure starting at 10:00:00 UTC, causing various instances +and machines to fail over time.
Temporal's automatic failover triggered for all Namespaces and completed at 10:15:00 +UTC. + +- Namespace 0 was in the region but its cell was not affected by the outage. The only downtime it had was for a few + seconds during the failover operation. It experienced a near-zero Recovery Time, and its service error rate was + negligible. Graceful failover was successful, and this Namespace achieved a recovery point of 0. + +- Namespace 1_A was in the region and its cell experienced a partial degradation that caused 10% of requests to fail in + the first 5 minutes, 25% in the second 5 minutes, and 50% in the third 5 minutes. Since it was significantly + impacted from 10:00:00 to 10:15:00, its Recovery Time was 15 minutes. If it had no other service errors that month, + then its service error rate for the month would be: ( (1 - 10%) + (1 - 25%) + (1 - 50%) + 8925 \* 100% ) / 8928 = + 99.990%. (Note: there are 8928 5-minute periods in a 31-day month.) Graceful failover was successful, and this + Namespace achieved a recovery point of 0. + +- Namespace 1_B was in the same cell as Namespace 1_A, so it also experienced a partial degradation that caused 10% of + requests to fail. However, its owner detected the outage via their own tooling and decided to manually fail over at + 10:05:00. This Namespace achieved a recovery time of 5 minutes and a service error rate of ( 1 \* (1 - 10%) + 8927 \* + 100% ) / 8928 = 99.999%. Graceful failover was successful, and this Namespace achieved a recovery point of 0. + +- Namespace 2_A was in the region and its cell was fully network partitioned at the start of the outage, causing 100% of + requests to fail. Since it was significantly impacted from 10:00:00 to 10:15:00, its Recovery Time was 15 minutes. If + it had no other service errors that month, then its service error rate for the month would be: ( 3 \* (1 - 100%) + 8925 + \* 100% ) / 8928 = 99.97%.
Because the Namespace was network partitioned, graceful failover + did not succeed, and forced failover was used. The recovery point achieved was equal to the replication lag at the + time of the network partition, which was a few seconds. + +- Namespace 2_B was in the region and was fully network partitioned, causing 100% of requests to fail. However, its + owner detected the outage via their own tooling and decided to manually fail over at 10:05:00. This Namespace achieved + a recovery time of 5 minutes and a service error rate of ( 1 \* (1 - 100%) + 8927 \* + 100% ) / 8928 = 99.99%. Because the Namespace was network partitioned, graceful failover + did not succeed, and forced failover was used. The recovery point achieved was equal to the replication lag at the + time of the network partition, which was a few seconds. + +All of the above Namespaces were in the affected region and beat the 1-minute RPO. But they achieved varying recovery +times and service error rates. + +- Notice how Namespace 1_A and Namespace 2_A were both automatically failed over with **the same recovery time but + different service error rates**. Notice how Namespace 2_B and Namespace 1_A happen to have the **same service error + rate but different recovery times**. This illustrates how RTO and SLA can differ, even in the same outage. Both are + valuable tools for Temporal Cloud users to measure the availability of their Namespaces. + +- Notice how the Namespaces that were manually failed over (Namespace 1_B and Namespace 2_B) achieved lower recovery + times than the Namespaces that were automatically failed over (Namespace 1_A and Namespace 2_A). This illustrates how + **proactive, aggressive manual failover can achieve a better recovery time than automatic failover**.
From f7fc87caf1f605735c69956d3a555ae44b1ef934 Mon Sep 17 00:00:00 2001 From: Lenny Chen Date: Tue, 26 May 2026 15:03:47 -0700 Subject: [PATCH 8/8] link fix --- docs/cloud/rto-rpo.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/cloud/rto-rpo.mdx b/docs/cloud/rto-rpo.mdx index 4ee6e7a14c..94e6c33a4f 100644 --- a/docs/cloud/rto-rpo.mdx +++ b/docs/cloud/rto-rpo.mdx @@ -112,7 +112,7 @@ that case, Namespaces in the region may be impacted, including those using When an AZ fails, Temporal may also trigger a failover on Namespaces that have High Availability enabled, as a precaution in case the outage scope expands. You can opt out of this behavior by -[disabling Temporal-managed failovers](cloud/high-availability/failovers#disabling-temporal-initiated) on the Namespace. +[disabling Temporal-managed failovers](/cloud/high-availability/failovers#disabling-temporal-initiated) on the Namespace. :::