Commit 09979af
committed
[SPARK-53339][CONNECT] Fix interrupt on pending operations by moving
### What changes were proposed in this pull request?
This PR aims to solve SPARK-53339 using a different approach than #52083.
The issue is that interrupting an operation in `Pending` state causes an `IllegalStateException` and leaves the operation in a broken state where subsequent interrupts never work.
The root cause is that in `SparkConnectExecutionManager#createExecuteHolderAndAttach`, there was a window between `createExecuteHolder` (which registers the operation) and `postStarted()` where the operation was registered but still in `Pending` state. If an interrupt arrived during this window:
1. `ExecuteThreadRunner#interrupt()` transitioned `state` from `notStarted` to `interrupted` via CAS
2. `ErrorUtils.handleError` was called with `isInterrupted=true`, which called `postCanceled()`
3. `postCanceled()` threw `IllegalStateException` because `Pending` was not in its allowed source statuses
4. All subsequent interrupts for the same operation failed silently because `ExecuteThreadRunner.state` was already in the terminal `interrupted` state
This issue can be reproduced by inserting `Thread.sleep(100)` into `SparkConnectExecutionManager#createExecuteHolderAndAttach` like as follows:
```
val executeHolder = createExecuteHolder(executeKey, request, sessionHolder)
try {
+ Thread.sleep(1000)
executeHolder.eventsManager.postStarted()
executeHolder.start()
} catch {
```
And then run a test `interrupt all - background queries, foreground interrupt` in `SparkSessionE2ESuite`.
```
$ build/sbt 'connect-client-jvm/testOnly org.apache.spark.sql.connect.SparkSessionE2ESuite -- -z "interrupt all - background queries, foreground interrupt"'
```
The fix consists of:
1. **Move `postStarted()` into `ExecuteThreadRunner#executeInternal()`** — Previously, `postStarted()` was called in `createExecuteHolderAndAttach` before `start()`, creating a window where an interrupt could race with the status transition. By moving `postStarted()` to right after the `notStarted -> started` CAS in `executeInternal()`, the status transition and the CAS are now sequenced — if interrupt wins the CAS (`notStarted -> interrupted`), `postStarted()` is never called.
2. **Allow `Pending -> Canceled` and `Pending -> Failed` transitions** — When interrupt wins the CAS before `postStarted()` is called, `ExecuteEventsManager._status` is still `Pending`. The `postCanceled()` call from `ErrorUtils.handleError` needs to transition from `Pending` to `Canceled`. Similarly, `postFailed()` needs to handle the case where `postStarted()` itself throws an exception (e.g., session state check failure) while `_status` is still `Pending`.
3. **Remove plan validation from `postStarted()`** — `postStarted()` previously threw `UnsupportedOperationException` for unknown `OpTypeCase` values (e.g., `OPTYPE_NOT_SET`). This was an implicit validation that doesn't belong in `postStarted()`, whose responsibility is status transition and listener event firing. The `case _` branch now falls back to `request.getPlan` instead of throwing, since the `plan` variable is only used for generating the `statement` text in the listener event. Actual plan validation is handled by `executeInternal()`.
4. **Add early plan validation in `createExecuteHolderAndAttach`** — Since `postStarted()` was moved into `executeInternal()` (change 1) and no longer validates the plan (change 3), invalid plans that previously failed synchronously in `postStarted()` would now fail asynchronously inside the execution thread. This means the existing `catch` block in `createExecuteHolderAndAttach` — which calls `removeExecuteHolder` to clean up the holder — would no longer be triggered for invalid plans. To preserve this behavior, an explicit `OpTypeCase` validation is added before `start()`, ensuring that invalid plans are still caught synchronously and the holder is properly removed from the `executions` map.
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add new tests.
I also confirmed that `SparkSessionE2ESuite` mentioned above succeeded.
### Was this patch authored or co-authored using generative AI tooling?
Kiro CLI / Opus 4.6
Closes #54774 from sarutak/SPARK-53339-2.
Authored-by: Kousuke Saruta <sarutak@amazon.co.jp>
Signed-off-by: Kousuke Saruta <sarutak@apache.org>postStarted() and allowing Pending to Canceled/Failed transition1 parent ceba3da commit 09979af
4 files changed
Lines changed: 53 additions & 4 deletions
File tree
- sql/connect/server/src
- main/scala/org/apache/spark/sql/connect
- execution
- service
- test/scala/org/apache/spark/sql/connect/service
sql/connect/server/src/main/scala/org/apache/spark/sql/connect/execution/ExecuteThreadRunner.scala
Lines changed: 5 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
193 | 193 | | |
194 | 194 | | |
195 | 195 | | |
| 196 | + | |
| 197 | + | |
| 198 | + | |
| 199 | + | |
| 200 | + | |
196 | 201 | | |
197 | 202 | | |
198 | 203 | | |
| |||
Lines changed: 7 additions & 3 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
188 | 188 | | |
189 | 189 | | |
190 | 190 | | |
191 | | - | |
192 | | - | |
193 | | - | |
| 191 | + | |
194 | 192 | | |
195 | 193 | | |
196 | 194 | | |
| |||
248 | 246 | | |
249 | 247 | | |
250 | 248 | | |
| 249 | + | |
| 250 | + | |
251 | 251 | | |
252 | 252 | | |
| 253 | + | |
253 | 254 | | |
254 | 255 | | |
255 | 256 | | |
| |||
269 | 270 | | |
270 | 271 | | |
271 | 272 | | |
| 273 | + | |
| 274 | + | |
272 | 275 | | |
273 | 276 | | |
| 277 | + | |
274 | 278 | | |
275 | 279 | | |
276 | 280 | | |
| |||
Lines changed: 11 additions & 1 deletion
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
35 | 35 | | |
36 | 36 | | |
37 | 37 | | |
| 38 | + | |
38 | 39 | | |
39 | 40 | | |
40 | 41 | | |
| |||
191 | 192 | | |
192 | 193 | | |
193 | 194 | | |
194 | | - | |
| 195 | + | |
| 196 | + | |
| 197 | + | |
| 198 | + | |
| 199 | + | |
| 200 | + | |
| 201 | + | |
| 202 | + | |
| 203 | + | |
| 204 | + | |
195 | 205 | | |
196 | 206 | | |
197 | 207 | | |
| |||
Lines changed: 30 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
138 | 138 | | |
139 | 139 | | |
140 | 140 | | |
| 141 | + | |
| 142 | + | |
| 143 | + | |
| 144 | + | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
| 150 | + | |
| 151 | + | |
| 152 | + | |
| 153 | + | |
| 154 | + | |
| 155 | + | |
| 156 | + | |
| 157 | + | |
| 158 | + | |
| 159 | + | |
| 160 | + | |
| 161 | + | |
| 162 | + | |
| 163 | + | |
| 164 | + | |
| 165 | + | |
| 166 | + | |
| 167 | + | |
| 168 | + | |
| 169 | + | |
| 170 | + | |
141 | 171 | | |
142 | 172 | | |
143 | 173 | | |
| |||
0 commit comments