
[SPARK-56819][SQL] Add option to trim CHAR trailing spaces on read#55820

Open
llphxd wants to merge 1 commit into apache:master from llphxd:SPARK-56819-char-trim-on-read

Conversation

@llphxd
Contributor

@llphxd llphxd commented May 12, 2026

What changes were proposed in this pull request?

This PR adds a new SQL configuration, spark.sql.charTrimTrailingSpacesOnRead, to trim trailing spaces from CHAR(N) columns and fields when reading table data.
The new configuration is disabled by default, so the existing Spark behavior is preserved. When it is enabled, it takes precedence over spark.sql.readSideCharPadding.
This is intended to provide an opt-in compatibility mode for systems such as MySQL, where CHAR values are commonly returned without trailing spaces unless PAD_CHAR_TO_FULL_LENGTH is enabled.
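The interaction between the two configurations can be sketched with a small plain-Python model. This is illustrative only, not Spark code: `read_char` is a hypothetical helper, and the boolean parameters stand in for the two SQL configurations described above.

```python
# Illustrative model of the proposed read path for CHAR(n) values.
# Spark pads CHAR(n) values on write; what a reader sees then depends
# on the two configurations modeled by the boolean flags below.
def read_char(stored: str, n: int,
              trim_on_read: bool,
              read_side_padding: bool) -> str:
    if trim_on_read:                 # proposed option takes precedence
        return stored.rstrip(" ")
    if read_side_padding:            # models spark.sql.readSideCharPadding
        return stored.ljust(n)
    return stored                    # neither option: raw stored value

print(read_char("12  ", 4, trim_on_read=True,  read_side_padding=True))   # "12"
print(read_char("12  ", 4, trim_on_read=False, read_side_padding=True))   # "12  "
```

The key point the model captures is the precedence rule: when both flags are on, trimming wins, so enabling the new option is a strict behavioral override rather than a combination of the two paddings.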

Why are the changes needed?

Spark currently enforces fixed-length CHAR(N) semantics by padding CHAR values on write, and by applying read-side padding when spark.sql.readSideCharPadding is enabled.
I tested this behavior across several Spark versions with MySQL tables. In Spark 3.3.1 and 3.4.4, MySQL CHAR and VARCHAR columns were simply treated as Spark STRING, so trailing-space handling matched the old string-based behavior. In Spark 3.5.2 and 4.0.1, Spark maps MySQL character types to the stricter, more standard Spark CHAR type, which can surface behavioral differences for CHAR columns compared with older Spark versions.
This makes migration or upgrade harder for workloads that rely on the previous string-like behavior or on MySQL's default CHAR retrieval behavior, where trailing spaces are removed on read. Users may otherwise need to wrap many CHAR columns with rtrim() manually in queries.
This PR provides an opt-in configuration to make this behavior easier to control without changing Spark's default semantics.

Does this PR introduce any user-facing change?

Yes.
This PR adds a new SQL configuration:

spark.sql.charTrimTrailingSpacesOnRead

The default value is false, so existing behavior is unchanged.

When set to true, Spark trims trailing spaces from CHAR(N) columns and fields when reading table data. The option does not affect VARCHAR or STRING, and it does not change write-side CHAR/VARCHAR length checks.

Example:

SET spark.sql.charTrimTrailingSpacesOnRead=true;

CREATE TABLE t (c CHAR(4), v VARCHAR(4), s STRING) USING parquet;
INSERT INTO t VALUES ('12', '12 ', '12 ');

SELECT c, length(c), v, length(v), s, length(s) FROM t;

With the new configuration enabled, the CHAR(4) value is returned without trailing spaces, while VARCHAR and STRING remain unchanged.
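As a quick sanity check of those semantics, a plain-Python model (illustrative only, not Spark internals) gives the lengths the query above would be expected to return with the option enabled:

```python
# Expected values for the example query, assuming the proposed trimming
# applies only to CHAR: 'c' is padded to 4 on write, then trimmed on read.
c_stored = "12".ljust(4)         # write-side CHAR(4) padding -> "12  "
c_read   = c_stored.rstrip(" ")  # proposed trim on read      -> "12"
v_read   = "12 "                 # VARCHAR(4): stored and read as-is
s_read   = "12 "                 # STRING: unaffected by the option
print(len(c_read), len(v_read), len(s_read))  # 2 3 3
```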

How was this patch tested?

Added test coverage in CharVarcharTestSuite for trimming trailing spaces from CHAR columns and nested CHAR fields on read, while keeping VARCHAR and STRING unchanged.

Tested with:
./dev/scalastyle
build/sbt "sql/testOnly *CharVarcharTestSuite"

Was this patch authored or co-authored using generative AI tooling?

Assisted by ChatGPT-5.5

@llphxd
Contributor Author

llphxd commented May 12, 2026

The JIRA is ready: SPARK-56819

@llphxd
Contributor Author

llphxd commented May 12, 2026

One possible question is why this new option is needed when spark.sql.legacy.charVarcharAsString already exists.

I think the two options serve different purposes. spark.sql.legacy.charVarcharAsString disables Spark's CHAR/VARCHAR type semantics broadly by treating CHAR/VARCHAR as STRING. This restores older Spark behavior, but it also disables length checks and CHAR padding semantics, so it is a coarse-grained legacy compatibility switch.

The proposed option is narrower. It only changes the read-side representation of CHAR values by trimming trailing spaces when explicitly enabled. It does not affect VARCHAR or STRING, and it does not disable write-side CHAR/VARCHAR length checks. This allows users to keep Spark's stricter CHAR/VARCHAR type handling while opting into MySQL-compatible CHAR retrieval behavior.
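The contrast on the write side can also be sketched in plain Python. Again this is an illustrative model, not Spark internals: `write_char` and its `char_as_string` flag are hypothetical stand-ins for the write path under spark.sql.legacy.charVarcharAsString versus the proposed option.

```python
# Rough contrast between the two switches on the write path:
# the legacy flag drops CHAR semantics entirely, while the proposed
# option keeps write-side length checks and padding untouched.
def write_char(value: str, n: int, char_as_string: bool) -> str:
    if char_as_string:               # legacy: CHAR treated as STRING
        return value                 # no length check, no padding
    if len(value) > n:               # proposed option keeps the check
        raise ValueError(f"value exceeds CHAR({n})")
    return value.ljust(n)            # and keeps write-side padding

print(write_char("12", 4, char_as_string=False))     # "12  "
print(write_char("12345", 4, char_as_string=True))   # legacy allows overflow
```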

This is useful for migration/upgrade scenarios where users want to preserve standard CHAR/VARCHAR validation in Spark, but need the returned CHAR values to match MySQL's default behavior or previous string-like query results more closely.
