GH-49255: Fix pandas deprecation warnings in Categorical tests #49271
shashbha14 wants to merge 1 commit into apache:main
…H-49255)

Replace pd.Categorical() calls whose values are not all in the specified categories list with the recommended pattern: create the Categorical first, then use .set_categories() to restrict.

Fixes deprecation warnings:
- test_category: cat_strings_with_na
- test_category_implicit_from_pandas: two Categorical instances

Fixes apache#49255
Force-pushed from 80babf4 to ab9ee88
AlenkaF left a comment:
Thanks for the PR! I would suggest looking at the use case we are testing in order to avoid using deprecated Pandas behavior to construct Categorical with NaNs. Added two comments related to this.
There is also one more case that needs an update, from the CI logs:
https://github.com/apache/arrow/actions/runs/21982380017/job/63507952455?pr=49271#step:6:6410
I need to open a separate issue for the deprecation warning related to the dataframe interchange protocol, but I am wondering whether you would be willing to also fix the UserWarning in this PR?
    v3 = [b'foo', None, b'bar', b'qux', np.nan]

    cat_strings = pd.Categorical(v1 * repeats)
    cat_strings_with_na = cat_strings.set_categories(['foo', 'bar'])

We are probably still relying on values with missing categories being silently converted to NaN here, and pandas is moving away from that, AFAIU.
As the idea in the test is to have NaN in the constructed categorical array, we could simply remove this line, as cat_strings already includes them:

    In [19]: pd.Categorical(v1 * repeats)
    Out[19]:
    ['foo', NaN, 'bar', 'qux', NaN, ..., 'foo', NaN, 'bar', 'qux', NaN]
    Length: 25
    Categories (3, str): ['bar', 'foo', 'qux']

What we can do is to add:

    v0 = ['foo', 'bar', 'qux']

and use this for cat_strings? This way we do not have to look for a workaround where we use deprecated behavior to construct NaN values due to missing categories.
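
A minimal sketch of that suggestion, for illustration only. The values of `v1` and `repeats` are assumptions inferred from the session output above (25 elements, three categories), and reusing the names `cat_strings` and `cat_strings_with_na` this way is one reading of the comment, not the final test code:

```python
import numpy as np
import pandas as pd

# Assumed inputs: v1 and repeats are inferred from the IPython output above,
# v0 is the NaN-free list proposed in the review comment.
v0 = ['foo', 'bar', 'qux']
v1 = ['foo', None, 'bar', 'qux', np.nan]
repeats = 5

# No categories argument is passed, so every non-null value becomes a
# category and only the real nulls (None / np.nan) end up as NaN.
cat_strings = pd.Categorical(v0 * repeats)
cat_strings_with_na = pd.Categorical(v1 * repeats)

print(cat_strings.isna().sum())             # number of missing entries without nulls in the input
print(cat_strings_with_na.isna().sum())     # number of missing entries with nulls in the input
print(list(cat_strings_with_na.categories))
```

Because the missing values come from nulls in the input rather than from values outside the category list, neither constructor call depends on out-of-category values being converted to NaN.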
    pd.Categorical(['a', 'b', 'c'], categories=['a', 'b']),
    pd.Categorical(['a', 'b', 'c'], categories=['a', 'b'],
                   ordered=True)
    base.set_categories(['a', 'b']),
    base.set_categories(['a', 'b']).as_ordered(),

I think it would be simpler to do:

    pd.Categorical(['a', 'b', None], categories=['a', 'b'])

This should be the same use case as reported in #19704.
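
As a rough illustration of why this form is simpler: the missing entry is an explicit `None`, so every non-null value is already one of the listed categories. The variable names below are hypothetical and not taken from the test file:

```python
import pandas as pd

# Every non-null value ('a', 'b') is in the categories list; the missing
# entry is an explicit None rather than an out-of-category value like 'c'.
cat = pd.Categorical(['a', 'b', None], categories=['a', 'b'])
cat_ordered = pd.Categorical(['a', 'b', None], categories=['a', 'b'],
                             ordered=True)

# Missing values are stored with code -1.
print(cat.codes.tolist())
print(list(cat.categories), cat_ordered.ordered)
```

The resulting array has the same categories and a missing value in the same position as the original `['a', 'b', 'c']` construction, which appears to be all the test needs.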
Fixes the pandas deprecation warnings we're seeing in the test suite.
What was happening
pandas started warning when you create a Categorical with values that aren't in the categories list. We had a few places in the tests doing this:

- test_category: creating cat_strings_with_na with categories ['foo', 'bar'] while the data includes 'qux'
- test_category_implicit_from_pandas: two places creating Categoricals with ['a', 'b', 'c'] but only allowing ['a', 'b'] in categories

What I changed

Instead of passing categories directly to pd.Categorical(), I:

- create the Categorical first, without restricting the categories
- use .set_categories() to restrict it to what we want

This is the recommended way to do it and avoids the warnings.
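
A small sketch of the two-step pattern described here, using the `['a', 'b', 'c']` data from test_category_implicit_from_pandas. The names `restricted` and `restricted_ordered` are hypothetical, and whether this stays warning-free depends on the installed pandas version:

```python
import pandas as pd

# Step 1: build the Categorical without a categories argument, so all
# three values become categories.
base = pd.Categorical(['a', 'b', 'c'])

# Step 2: restrict the categories afterwards; values whose category was
# dropped ('c') become missing (code -1).
restricted = base.set_categories(['a', 'b'])
restricted_ordered = base.set_categories(['a', 'b']).as_ordered()

print(restricted.codes.tolist())
print(list(restricted.categories), restricted_ordered.ordered)
```

Note that `set_categories()` still turns the dropped value into NaN, so the test data ends up the same as with the old constructor call; only the route to it changes.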
Testing
Fixes #49255