[SPARK-53339][CONNECT] Fix an issue which occurs when an operation in pending state is interrupted #52083

sarutak wants to merge 7 commits into apache:master
Conversation
cc: @peter-toth

cc: @dongjoon-hyun too.

Thank you for pinging me, @sarutak . I just came back from my vacation today.
| */ | ||
| def interrupt(): Boolean = { | ||
| if (eventsManager.status == ExecuteStatus.Pending) { | ||
| return false |
- According to the function description, `false` is already occupied for the status where it was already interrupted.
- For the code change, according to the state transition, can we change the status to `ExecuteStatus.Canceled` directly from the current `ExecuteStatus.Pending`, because it's not started yet? In this case, we can return `true`.
> For the code change, according to the state transition, can we change the status to `ExecuteStatus.Canceled` directly from the current `ExecuteStatus.Pending`, because it's not started yet? In this case, we can return `true`.
Actually, that was my first idea to solve this issue. But as I mentioned in the description, I found that didn't work because transitioning from Pending to Canceled causes another issue.
> According to the function description, `false` is already occupied for the status where it was already interrupted.
Hmm, if it's OK to ignore an interruption of a pending-state operation, and we need to tell "already interrupted" apart from "interruption failed" exactly, how about returning the exact interruption result rather than a boolean?
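One way to sketch the suggestion above (returning the exact interruption result rather than a boolean) is a small result enum. This is a hypothetical illustration, not Spark's actual API; `InterruptResult` and all of its cases are invented for the example:

```java
// Hypothetical sketch (not Spark's actual API): a richer result type so
// callers can tell "already interrupted" apart from "cannot interrupt yet".
enum InterruptResult {
    INTERRUPTED,          // this call won the race and interrupted the operation
    ALREADY_INTERRUPTED,  // a previous call already interrupted it
    NOT_INTERRUPTIBLE;    // e.g. the operation is still Pending

    String describe() {
        switch (this) {
            case INTERRUPTED:         return "operation interrupted";
            case ALREADY_INTERRUPTED: return "operation was already interrupted";
            default:                  return "operation cannot be interrupted in its current state";
        }
    }
}
```

With such a type, the caller no longer has to guess whether `false` means "already interrupted" or "interruption refused".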
Got it. Thank you. As long as the code and description are consistent, I'm okay with both: (1) updating the description by changing the meaning of `false`, or (2) changing the return type.
Thank you for your suggestion. I'll simply update the description.
BTW, thank you for the investigation to identify the root cause. cc @grundprinzip , @hvanhovell , @zhengruifeng to ask whether this was the intentional design of the state transition or not.
There are two things at play here: the internal state of the operation itself and the notification on the listener bus. If this patch simply ignores the interrupt on an operation in a pending state, there is a new edge case where we can never cancel this operation if it's stuck in a pending state for whatever reason. Previously, it seems that while the physical query was cancelled, only the observable operation state on the listener bus was not properly handled. I understand that there is another race condition when the interrupt happens right between the incoming request and the different posting states. I think the better solution is not to ignore the interruption, but we need to figure out how to avoid double-posting of events.
Given that this is one of the long-standing issues, we can discuss more in order to figure out the correct steps in the Apache Spark 4.1.0 timeframe, as @grundprinzip suggested above.
@dongjoon-hyun

@dongjoon-hyun So, this issue should be documented only for… Also, thank you for sharing related PRs. As far as I know, we have only one issue which blocks SPARK-48139 besides this issue, and I believe that's the last one.

@grundprinzip

@grundprinzip Gentle ping.

cc: @hvanhovell

@hvanhovell Gentle ping.

cc: @HyukjinKwon too.
…`postStarted()` and allowing Pending to Canceled/Failed transition

### What changes were proposed in this pull request?

This PR aims to solve SPARK-53339 using a different approach than #52083. The issue is that interrupting an operation in `Pending` state causes an `IllegalStateException` and leaves the operation in a broken state where subsequent interrupts never work.

The root cause is that in `SparkConnectExecutionManager#createExecuteHolderAndAttach`, there was a window between `createExecuteHolder` (which registers the operation) and `postStarted()` where the operation was registered but still in `Pending` state. If an interrupt arrived during this window:

1. `ExecuteThreadRunner#interrupt()` transitioned `state` from `notStarted` to `interrupted` via CAS
2. `ErrorUtils.handleError` was called with `isInterrupted=true`, which called `postCanceled()`
3. `postCanceled()` threw `IllegalStateException` because `Pending` was not in its allowed source statuses
4. All subsequent interrupts for the same operation failed silently because `ExecuteThreadRunner.state` was already in the terminal `interrupted` state

This issue can be reproduced by inserting `Thread.sleep(1000)` into `SparkConnectExecutionManager#createExecuteHolderAndAttach` as follows:

```
val executeHolder = createExecuteHolder(executeKey, request, sessionHolder)
try {
+ Thread.sleep(1000)
  executeHolder.eventsManager.postStarted()
  executeHolder.start()
} catch {
```

And then run the test `interrupt all - background queries, foreground interrupt` in `SparkSessionE2ESuite`:

```
$ build/sbt 'connect-client-jvm/testOnly org.apache.spark.sql.connect.SparkSessionE2ESuite -- -z "interrupt all - background queries, foreground interrupt"'
```

The fix consists of:

1. **Move `postStarted()` into `ExecuteThreadRunner#executeInternal()`** — Previously, `postStarted()` was called in `createExecuteHolderAndAttach` before `start()`, creating a window where an interrupt could race with the status transition. By moving `postStarted()` to right after the `notStarted -> started` CAS in `executeInternal()`, the status transition and the CAS are now sequenced: if interrupt wins the CAS (`notStarted -> interrupted`), `postStarted()` is never called.
2. **Allow `Pending -> Canceled` and `Pending -> Failed` transitions** — When interrupt wins the CAS before `postStarted()` is called, `ExecuteEventsManager._status` is still `Pending`. The `postCanceled()` call from `ErrorUtils.handleError` needs to transition from `Pending` to `Canceled`. Similarly, `postFailed()` needs to handle the case where `postStarted()` itself throws an exception (e.g., a session state check failure) while `_status` is still `Pending`.
3. **Remove plan validation from `postStarted()`** — `postStarted()` previously threw `UnsupportedOperationException` for unknown `OpTypeCase` values (e.g., `OPTYPE_NOT_SET`). This was an implicit validation that doesn't belong in `postStarted()`, whose responsibility is status transition and listener event firing. The `case _` branch now falls back to `request.getPlan` instead of throwing, since the `plan` variable is only used for generating the `statement` text in the listener event. Actual plan validation is handled by `executeInternal()`.
4. **Add early plan validation in `createExecuteHolderAndAttach`** — Since `postStarted()` was moved into `executeInternal()` (change 1) and no longer validates the plan (change 3), invalid plans that previously failed synchronously in `postStarted()` would now fail asynchronously inside the execution thread. This means the existing `catch` block in `createExecuteHolderAndAttach`, which calls `removeExecuteHolder` to clean up the holder, would no longer be triggered for invalid plans. To preserve this behavior, an explicit `OpTypeCase` validation is added before `start()`, ensuring that invalid plans are still caught synchronously and the holder is properly removed from the `executions` map.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added new tests. I also confirmed that the `SparkSessionE2ESuite` test mentioned above succeeded.

### Was this patch authored or co-authored using generative AI tooling?

Kiro CLI / Opus 4.6

Closes #54774 from sarutak/SPARK-53339-2.

Authored-by: Kousuke Saruta <sarutak@amazon.co.jp>
Signed-off-by: Kousuke Saruta <sarutak@apache.org>
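The sequencing argument in change 1 can be modeled with a plain `AtomicReference` CAS. The sketch below is a standalone toy model (string-valued states mirroring the names in the description), not the actual `ExecuteThreadRunner` code:

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal model of the notStarted/started/interrupted CAS described above.
// Whichever of start() or interrupt() wins the CAS from "notStarted"
// determines whether postStarted() ever runs.
class RunnerStateModel {
    private final AtomicReference<String> state = new AtomicReference<>("notStarted");
    boolean postStartedCalled = false;

    // postStarted() runs only after winning the notStarted -> started CAS.
    boolean start() {
        if (state.compareAndSet("notStarted", "started")) {
            postStartedCalled = true;
            return true;
        }
        return false;
    }

    // An interrupt succeeds only if it beats start() to the CAS.
    boolean interrupt() {
        return state.compareAndSet("notStarted", "interrupted");
    }
}
```

Because both transitions compete for the same single CAS, exactly one of them can ever observe `notStarted`, which is what makes the "interrupt wins, `postStarted()` never fires" guarantee hold.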
(cherry picked from commit 09979af)
Signed-off-by: Kousuke Saruta <sarutak@apache.org>
This issue was resolved by another PR (#54774).
What changes were proposed in this pull request?
This PR fixes an issue which occurs when an operation in pending state is interrupted. Once an operation in pending state is interrupted, that interruption and all following interruptions for the operation never work correctly.

You can easily reproduce this issue by modifying `SparkConnectExecutionManager#createExecuteHolderAndAttach` as follows, and then running the test `interrupt all - background queries, foreground interrupt` in `SparkSessionE2ESuite`. You will see the following error.

If an operation in pending state is interrupted, the interruption is handled in `ExecuteHolder#interrupt` and `ErrorUtils.handleError` is called. In `ErrorUtils#handleError`, the operation status transitions to `Canceled` by calling `executeEventsManager.postCanceled`. But `postCanceled` does not expect a transition from the pending state, so an exception is thrown and propagated to the caller of `ExecuteThreadRunner#interrupt`.

The reason all following interruptions for the same operation never work correctly is that `ExecuteThreadRunner#state` has already been changed to `interrupted` at the first call of `ExecuteThreadRunner#interrupt`, so following interruptions don't enter this loop and the method always returns `false`, causing the result of the interruption not to be recognized correctly.

The solution in this PR includes:
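The `postCanceled` rejection described above can be modeled as a simple allowed-source-status check. The sketch below is illustrative only (the status names and sets are simplified, not Spark's actual code); it shows why adding `Pending` to the allowed source statuses makes the cancel transition succeed:

```java
import java.util.Set;

// Illustrative model of the status-transition check described above; the real
// ExecuteEventsManager uses its own status enum and error classes. Before the
// fix, Pending was not an allowed source status for Canceled.
class StatusTransitionModel {
    static final Set<String> CANCELED_SOURCES_BEFORE_FIX = Set.of("Started", "Finished");
    static final Set<String> CANCELED_SOURCES_AFTER_FIX = Set.of("Pending", "Started", "Finished");

    // Transitions to Canceled, or throws if the current status is not allowed,
    // mirroring the IllegalStateException described in the root-cause analysis.
    static String postCanceled(String current, Set<String> allowed) {
        if (!allowed.contains(current)) {
            throw new IllegalStateException("olderStatus: " + current + ", newStatus: Canceled");
        }
        return "Canceled";
    }
}
```

Under the pre-fix set, canceling a `Pending` operation throws; under the post-fix set, the same call transitions cleanly to `Canceled`.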
Why are the changes needed?
Bug fix.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Add new tests.
I also confirmed that the `SparkSessionE2ESuite` test mentioned above succeeded.

Was this patch authored or co-authored using generative AI tooling?
No.