[SPARK-55294][PYTHON] Add support for PyArrow-backed dtypes in pandas API #55783
adith-os wants to merge 1 commit into
Conversation
Hi @gaogaotiantian, could you please review the PR when you get a chance and let me know whether this approach matches your expectations, or if you had a different direction in mind? Thanks!

cc @devin-petersohn too
devin-petersohn left a comment
This feels both incomplete and too verbose. `ne` and `eq` were the only APIs updated. There's some copy/pasted code as well. I left some high-level comments.
I also haven't fully checked, but I think this will be a problem for pandas 3 and the string dtype.
```python
return types.DoubleType()
if extension_arrow_dtypes_available and isinstance(tpe, ArrowDtype):
    pyarrow_type = tpe.pyarrow_dtype
    if pa.types.is_string(pyarrow_type) or pa.types.is_large_string(pyarrow_type):
```
Can we just reuse existing functionality?
spark/python/pyspark/sql/pandas/types.py
Lines 347 to 350 in 1360e5c
```python
storage = getattr(tpe, "storage", None)
return isinstance(storage, str) and storage.startswith("pyarrow") and tpe.na_value is pd.NA
```
`StringDtype.storage` is part of the API. This feels overly defensive.
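To illustrate the reviewer's point: `storage` is a documented attribute on `pd.StringDtype`, so `getattr(..., "storage", None)` adds no safety when the object is already known to be a `StringDtype`. A minimal demonstration (using `"python"` storage so it runs without pyarrow installed):

```python
import pandas as pd

# StringDtype.storage is a public, documented attribute; "pyarrow" is the
# other accepted storage value when pyarrow is installed.
dt = pd.StringDtype(storage="python")
print(dt.storage)  # python
```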
```python
use_extension_dtypes = handle_dtype_as_extension_dtype(left.dtype) or (
    isinstance(right, IndexOpsMixin) and handle_dtype_as_extension_dtype(right.dtype)
)
use_arrow_dtypes = is_pyarrow_backed_dtype(left.dtype) or (
    isinstance(right, IndexOpsMixin) and is_pyarrow_backed_dtype(right.dtype)
)
field = left._internal.data_fields[0].copy(
    dtype=spark_type_to_pandas_dtype(
        BooleanType(),
        use_extension_dtypes=use_extension_dtypes,
        use_arrow_dtypes=use_arrow_dtypes,
    ),
    spark_type=BooleanType(),
    nullable=False,
)
left_scol = left._with_new_scol(F.lit(True), field=field)
```
This looks identical to the code block above. Should be a helper if it's this many lines IMO
Honestly, I'm a bit conservative when a new contributor makes complex changes with an LLM to support pandas 3. Our CI does not check pandas 3 (the nightly CI simply fails and cannot be used against a specific PR). I've had a few cases where people just feed my description to an LLM and generate a patch that of course passes CI because we do not test pandas 3, while the code may not even work properly. At that point, it's very difficult and time-consuming to validate the patch and the approach. We have very limited resources across many different matters. I would say we should deprioritize supporting pandas 3, especially from new contributors using LLMs, so that we can keep everything else going. In the future, when we have enough resources to deal with it, we can put more effort into it.
What changes were proposed in this pull request?
This PR adds support for PyArrow-backed dtypes (e.g., bool[pyarrow]) in PySpark's pandas API. The changes include:
Why are the changes needed?
Starting with pandas 3.0, when PyArrow is installed, pandas automatically uses PyArrow-backed dtypes for string columns. Without this PR, PySpark's pandas API loses dtype information when working with PyArrow-backed dtypes.
Does this PR introduce any user-facing change?
Yes. Users working with PyArrow-backed dtypes will now see their dtypes preserved through PySpark operations.
Before the PR:
After the PR:
How was this patch tested?
New tests in python/pyspark/pandas/tests/test_typedef.py:
Integration test in python/pyspark/pandas/tests/data_type_ops/test_string_ops.py:
Was this patch authored or co-authored using generative AI tooling?
Yes. Generated-by: GPT 5.5