[SPARK-56837][PYTHON][TESTS] Pass ArrowBatchedUDF benchmark input_type via EvalConf#55834

Open
Yicong-Huang wants to merge 1 commit into apache:master from Yicong-Huang:SPARK-56837/fix/bench-arrow-batched-input-type


@Yicong-Huang
Contributor

What changes were proposed in this pull request?

`_ArrowBatchedBenchMixin._write_scenario` in `python/benchmarks/bench_eval_type.py` wrote the `input_type` schema JSON as a length-prefixed UTF-8 string before the UDF payload, following the old wire-protocol shape. Since SPARK-56340 (which moved the `input_type` schema into the eval conf), the worker reads `input_type` via `EvalConf` instead, so the extra prefix is parsed as the UDF count and the worker exits with a `UnicodeDecodeError` while reading the subsequent UTF-8 fields.

This PR moves the schema to `eval_conf={"input_type": schema.json()}`, matching the pattern already used by `_ArrowTableUDFBenchMixin`.
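For illustration only, here is a minimal stand-alone sketch of why the stale prefix breaks parsing. The helper, schema string, and stream layout are simplified stand-ins, not Spark's actual worker-protocol code:

```python
import io
import struct

def write_utf8(out, s):
    # Length-prefixed UTF-8 string: 4-byte big-endian length, then the bytes.
    data = s.encode("utf-8")
    out.write(struct.pack("!i", len(data)))
    out.write(data)

# A placeholder schema JSON, not the benchmark's real schema.
schema_json = '{"type":"struct","fields":[]}'

# Old writer layout: schema string first, then the UDF count.
stream = io.BytesIO()
write_utf8(stream, schema_json)
stream.write(struct.pack("!i", 1))  # num_udfs

# A reader in the post-SPARK-56340 shape expects the UDF count first
# (the schema now arrives via the eval conf), so it misreads the schema's
# byte length as the count, leaving every later field at the wrong offset.
stream.seek(0)
misread_count = struct.unpack("!i", stream.read(4))[0]
assert misread_count == len(schema_json.encode("utf-8"))  # not 1
```

Once the reader is off by the prefix, the bytes it later decodes as UTF-8 strings are arbitrary payload data, which is how the `UnicodeDecodeError` in the traceback below arises.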

Why are the changes needed?

Running any `ArrowBatchedUDFTimeBench` / `ArrowBatchedUDFPeakmemBench` ASV benchmark currently fails with:

```
File "pyspark/worker.py", line 3581, in main
    init_info = WorkerInitInfo.from_stream(infile)
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 353: invalid start byte
```

The bench file is the only mock writer for `SQL_ARROW_BATCHED_UDF` in the tree and was missed when the worker protocol changed.

Does this PR introduce any user-facing change?

No. Test-only change.

How was this patch tested?

Running both bench classes locally now succeeds. Numbers from one run:

```
=== bench_eval_type.ArrowBatchedUDFTimeBench.time_worker ===
scenario             identity_udf   stringify_udf   nullcheck_udf
sm_batch_few_col      44.3+/-0.3ms    46.9+/-0.3ms    45.0+/-0.4ms
sm_batch_many_col     112+/-0.7ms     113+/-1ms       112+/-0.5ms
lg_batch_few_col      106+/-0.7ms     113+/-2ms       106+/-0.4ms
lg_batch_many_col     448+/-1ms       449+/-0.3ms     447+/-3ms
pure_ints             157+/-1ms       162+/-1ms       156+/-2ms
pure_floats           148+/-0.2ms     170+/-1ms       149+/-2ms
pure_strings          302+/-0.5ms     305+/-3ms       295+/-0.7ms
mixed_types           226+/-0.9ms     230+/-1ms       222+/-0.9ms
```

```
=== bench_eval_type.ArrowBatchedUDFPeakmemBench.peakmem_worker ===
scenario             identity_udf   stringify_udf   nullcheck_udf
sm_batch_few_col      464M           464M            464M
sm_batch_many_col     469M           469M            469M
lg_batch_few_col      469M           470M            469M
lg_batch_many_col     509M           510M            509M
pure_ints             469M           470M            469M
pure_floats           469M           470M            469M
pure_strings          473M           473M            473M
mixed_types           471M           471M            470M
```

Run commands:

```
COLUMNS=120 asv run --bench ArrowBatchedUDFTimeBench    -a repeat=3 --python=same
COLUMNS=120 asv run --bench ArrowBatchedUDFPeakmemBench -a repeat=3 --python=same
```

Also smoke-tested all 40 benchmark classes in the file: every other class still passes, and only the two ArrowBatched classes were broken.

Was this patch authored or co-authored using generative AI tooling?

No.
