[SPARK-56837][PYTHON][TESTS] Pass ArrowBatchedUDF benchmark input_type via EvalConf#55834

Open
Yicong-Huang wants to merge 1 commit into apache:master from Yicong-Huang:SPARK-56837/fix/bench-arrow-batched-input-type


@Yicong-Huang
Contributor

What changes were proposed in this pull request?

`_ArrowBatchedBenchMixin._write_scenario` in `python/benchmarks/bench_eval_type.py` wrote the `input_type` schema JSON as a length-prefixed UTF-8 string before the UDF payload, following the old wire-protocol shape. Since SPARK-56340 (which moved the `input_type` schema into the eval conf), the worker reads `input_type` via `EvalConf` instead, so the extra prefix is parsed as the UDF count and the worker exits with a `UnicodeDecodeError` while reading the subsequent UTF-8 fields.

This PR moves the schema to `eval_conf={"input_type": schema.json()}`, matching the pattern already used by `_ArrowTableUDFBenchMixin`.
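For illustration only, here is a minimal stand-alone sketch of why the stale prefix breaks parsing. The helper, schema string, and stream layout are simplified stand-ins, not Spark's actual worker-protocol code:

```python
import io
import struct

def write_utf8(out, s):
    # Length-prefixed UTF-8 string: 4-byte big-endian length, then the bytes.
    data = s.encode("utf-8")
    out.write(struct.pack("!i", len(data)))
    out.write(data)

# A placeholder schema JSON, not the benchmark's real schema.
schema_json = '{"type":"struct","fields":[]}'

# Old writer layout: schema string first, then the UDF count.
stream = io.BytesIO()
write_utf8(stream, schema_json)
stream.write(struct.pack("!i", 1))  # num_udfs

# A reader in the post-SPARK-56340 shape expects the UDF count first
# (the schema now arrives via the eval conf), so it misreads the schema's
# byte length as the count, leaving every later field at the wrong offset.
stream.seek(0)
misread_count = struct.unpack("!i", stream.read(4))[0]
assert misread_count == len(schema_json.encode("utf-8"))  # not 1
```

Once the reader is off by the prefix, the bytes it later decodes as UTF-8 strings are arbitrary payload data, which is how the `UnicodeDecodeError` in the traceback below arises.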

Why are the changes needed?

Running any `ArrowBatchedUDFTimeBench` / `ArrowBatchedUDFPeakmemBench` ASV benchmark currently fails with:

```
File "pyspark/worker.py", line 3581, in main
    init_info = WorkerInitInfo.from_stream(infile)
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 353: invalid start byte
```

The bench file is the only mock writer for `SQL_ARROW_BATCHED_UDF` in the tree and was missed when the worker protocol changed.

Does this PR introduce any user-facing change?

No. Test-only change.

How was this patch tested?

Running both bench classes locally now succeeds. Numbers from one run:

```
=== bench_eval_type.ArrowBatchedUDFTimeBench.time_worker ===
scenario             identity_udf   stringify_udf   nullcheck_udf
sm_batch_few_col      44.3+/-0.3ms    46.9+/-0.3ms    45.0+/-0.4ms
sm_batch_many_col     112+/-0.7ms     113+/-1ms       112+/-0.5ms
lg_batch_few_col      106+/-0.7ms     113+/-2ms       106+/-0.4ms
lg_batch_many_col     448+/-1ms       449+/-0.3ms     447+/-3ms
pure_ints             157+/-1ms       162+/-1ms       156+/-2ms
pure_floats           148+/-0.2ms     170+/-1ms       149+/-2ms
pure_strings          302+/-0.5ms     305+/-3ms       295+/-0.7ms
mixed_types           226+/-0.9ms     230+/-1ms       222+/-0.9ms
```

```
=== bench_eval_type.ArrowBatchedUDFPeakmemBench.peakmem_worker ===
scenario             identity_udf   stringify_udf   nullcheck_udf
sm_batch_few_col      464M           464M            464M
sm_batch_many_col     469M           469M            469M
lg_batch_few_col      469M           470M            469M
lg_batch_many_col     509M           510M            509M
pure_ints             469M           470M            469M
pure_floats           469M           470M            469M
pure_strings          473M           473M            473M
mixed_types           471M           471M            470M
```

Run commands:

```
COLUMNS=120 asv run --bench ArrowBatchedUDFTimeBench    -a repeat=3 --python=same
COLUMNS=120 asv run --bench ArrowBatchedUDFPeakmemBench -a repeat=3 --python=same
```

Also smoke-tested all 40 benchmark classes in the file: every other class still passes, and only the two ArrowBatched classes were broken.

Was this patch authored or co-authored using generative AI tooling?

No.
