[SPARK-56819][SQL] Add option to trim CHAR trailing spaces on read #55820
llphxd wants to merge 1 commit
jira is ready: SPARK-56819
One possible question is why this new option is needed when I think the two options serve different purposes. The proposed option is narrower. It only changes the read-side representation of CHAR values by trimming trailing spaces when explicitly enabled. It does not affect VARCHAR or STRING, and it does not disable write-side CHAR/VARCHAR length checks. This allows users to keep Spark's stricter CHAR/VARCHAR type handling while opting into MySQL-compatible CHAR retrieval behavior. This is useful for migration/upgrade scenarios where users want to preserve standard CHAR/VARCHAR validation in Spark, but need the returned CHAR values to match MySQL's default behavior or previous string-like query results more closely.
What changes were proposed in this pull request?
This PR adds a new SQL configuration, spark.sql.charTrimTrailingSpacesOnRead, to trim trailing spaces from CHAR(N) columns and fields when reading table data.
The new configuration is disabled by default, so the existing Spark behavior is preserved. When it is enabled, it takes precedence over spark.sql.readSideCharPadding.
This is intended to provide an opt-in compatibility mode for systems such as MySQL, where CHAR values are commonly returned without trailing spaces unless PAD_CHAR_TO_FULL_LENGTH is enabled.
Why are the changes needed?
Spark currently enforces fixed-length CHAR(N) semantics by padding CHAR values on write, and by applying read-side padding when spark.sql.readSideCharPadding is enabled.
I tested this behavior across several Spark versions with MySQL tables. In Spark 3.3.1 and Spark 3.4.4, MySQL CHAR and VARCHAR columns were simply treated as Spark STRING, so trailing-space behavior was closer to the old string-based behavior. In Spark 3.5.2 and Spark 4.0.1, Spark maps MySQL character types to more standard and stricter Spark CHAR types, which can expose behavior differences for CHAR columns compared with older Spark versions.
This makes migration or upgrade harder for workloads that rely on the previous string-like behavior or on MySQL's default CHAR retrieval behavior, where trailing spaces are removed on read. Users may otherwise need to wrap many CHAR columns with rtrim() manually in queries.
This PR provides an opt-in configuration to make this behavior easier to control without changing Spark's default semantics.
Does this PR introduce any user-facing change?
Yes.
This PR adds a new SQL configuration, spark.sql.charTrimTrailingSpacesOnRead.
The default value is false, so existing behavior is unchanged.
When set to true, Spark trims trailing spaces from CHAR(N) columns and fields when reading table data. The option does not affect VARCHAR or STRING, and it does not change write-side CHAR/VARCHAR length checks.
Example:
With the new configuration enabled, the CHAR(4) value is returned without trailing spaces, while VARCHAR and STRING remain unchanged.
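The read-side rules described above can be modeled in a few lines of plain Python. This is a minimal sketch of the described semantics only; it is not Spark code and not the example from the PR. The Column class, the read_value function, and its trim_on_read and read_side_padding parameters are hypothetical names standing in for spark.sql.charTrimTrailingSpacesOnRead and spark.sql.readSideCharPadding, and the padding default of True is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    """Hypothetical column descriptor: kind is "CHAR", "VARCHAR", or "STRING"."""
    kind: str
    length: Optional[int] = None  # declared length N for CHAR(N) / VARCHAR(N)


def read_value(stored: str, col: Column,
               trim_on_read: bool = False,
               read_side_padding: bool = True) -> str:
    """Model how a stored value would be presented on read.

    trim_on_read models the proposed option (off by default).
    read_side_padding models the existing padding option (default assumed).
    """
    if col.kind != "CHAR":
        # VARCHAR and STRING are never touched by the proposed option.
        return stored
    if trim_on_read:
        # The trim option takes precedence over read-side padding.
        return stored.rstrip(" ")
    if read_side_padding:
        # Pad with spaces up to the declared CHAR(N) length.
        return stored.ljust(col.length)
    return stored


if __name__ == "__main__":
    char4 = Column("CHAR", 4)
    print(repr(read_value("ab  ", char4)))                     # padded form
    print(repr(read_value("ab  ", char4, trim_on_read=True)))  # trimmed form
```

In this model, enabling trim_on_read changes only the CHAR branch, which mirrors the statement that VARCHAR and STRING remain unchanged and that the new option wins over read-side padding when both are enabled.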
How was this patch tested?
Added test coverage in CharVarcharTestSuite for trimming trailing spaces from CHAR columns and nested CHAR fields on read, while keeping VARCHAR and STRING unchanged.
Tested with:
./dev/scalastyle
build/sbt "sql/testOnly *CharVarcharTestSuite"
Was this patch authored or co-authored using generative AI tooling?
Assisted by ChatGPT-5.5