GH-49255: Fix pandas deprecation warnings in Categorical tests #49271

Open
shashbha14 wants to merge 1 commit into apache:main from shashbha14:gh-49255-fix-pandas-deprecation


@shashbha14 shashbha14 commented Feb 13, 2026

Fixes the pandas deprecation warnings we're seeing in the test suite.

What was happening

Pandas started warning when you create a Categorical with values that aren't in the categories list. We had a few places in the tests doing this:

  • test_category: Creating cat_strings_with_na with categories ['foo', 'bar'] but the data includes 'qux'
  • test_category_implicit_from_pandas: Two places creating Categoricals with ['a', 'b', 'c'] but only allowing ['a', 'b'] in categories

What I changed

Instead of passing categories directly to pd.Categorical(), I:

  1. Create the Categorical first with all the values
  2. Then use .set_categories() to restrict it to what we want

This is the recommended way to do it and avoids the warnings.
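
A minimal sketch of the two-step pattern described above, using the ['a', 'b', 'c'] values from test_category_implicit_from_pandas. The variable names other than base are illustrative and not taken from the test file.

```python
import pandas as pd

# Step 1: build the Categorical from the raw values; pandas infers
# the categories ['a', 'b', 'c'] from the data.
base = pd.Categorical(['a', 'b', 'c'])

# Step 2: restrict the categories afterwards. Values whose category
# was dropped ('c' here) become missing and get the code -1.
restricted = base.set_categories(['a', 'b'])
restricted_ordered = restricted.as_ordered()

print(list(restricted.categories))   # ['a', 'b']
print(restricted.codes.tolist())     # [0, 1, -1]
print(restricted_ordered.ordered)    # True
```

Note that the dropped value still ends up as NaN, so this pattern changes how the missing value is produced rather than whether it is produced.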

Testing

  • Tests still pass (functionality unchanged)
  • No more deprecation warnings
  • No linter errors

Fixes #49255

Comment thread docs/source/python/ipc.rst Outdated
@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting review" label Feb 13, 2026
…H-49255)

Replace pd.Categorical() calls that specify categories containing
values not in the categories list with the recommended pattern:
create the Categorical first, then use .set_categories() to restrict.

Fixes deprecation warnings:
- test_category: cat_strings_with_na
- test_category_implicit_from_pandas: two Categorical instances

Fixes apache#49255
@shashbha14 shashbha14 force-pushed the gh-49255-fix-pandas-deprecation branch from 80babf4 to ab9ee88 Compare February 13, 2026 09:51
@github-actions github-actions bot added the "awaiting change review" label and removed the "Component: Documentation" and "awaiting changes" labels Feb 13, 2026
@AlenkaF AlenkaF (Member) left a comment

Thanks for the PR! I would suggest revisiting the use case we are testing, so that we avoid relying on deprecated pandas behavior to construct a Categorical with NaNs. I have added two comments about this.

There is also one more case that needs updating, from the CI logs:
https://github.com/apache/arrow/actions/runs/21982380017/job/63507952455?pr=49271#step:6:6410

I need to open a separate issue for the deprecation warning related to the dataframe interchange protocol, but would you be willing to also fix the UserWarning in this PR?

v3 = [b'foo', None, b'bar', b'qux', np.nan]

cat_strings = pd.Categorical(v1 * repeats)
cat_strings_with_na = cat_strings.set_categories(['foo', 'bar'])

We are probably still relying on missing categories being silently converted to NaN here, and pandas is moving away from that, AFAIU.

Since the idea of the test is to have NaN in the constructed categorical array, we could simply remove this line, because cat_strings already includes them:

In [19]: pd.Categorical(v1 * repeats)
    ...: 
Out[19]: 
['foo', NaN, 'bar', 'qux', NaN, ..., 'foo', NaN, 'bar', 'qux', NaN]
Length: 25
Categories (3, str): ['bar', 'foo', 'qux']

What we could do is add:

v0 = ['foo', 'bar', 'qux']

and use it for cat_strings. This way we do not need a workaround that relies on deprecated behavior to produce NaN values from missing categories.
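
A rough sketch of how that suggestion could look. The definition of v1 and the value of repeats are not shown in this thread; they are assumptions inferred from the output above (length 25, with None and NaN entries) and from the bytes variant v3.

```python
import numpy as np
import pandas as pd

repeats = 5                                # assumed: 25 values / 5 per list
v0 = ['foo', 'bar', 'qux']                 # proposed: no missing values
v1 = ['foo', None, 'bar', 'qux', np.nan]   # assumed definition of v1

# Categorical without missing values.
cat_strings = pd.Categorical(v0 * repeats)

# Categorical with missing values: the nulls come straight from the
# data, so no category has to be dropped to produce them.
cat_strings_with_na = pd.Categorical(v1 * repeats)

print(list(cat_strings_with_na.categories))
print(int(cat_strings_with_na.isna().sum()))
```

With this setup both arrays keep all three categories, and only cat_strings_with_na contains nulls.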

Comment on lines -3100 to +3106
-    pd.Categorical(['a', 'b', 'c'], categories=['a', 'b']),
-    pd.Categorical(['a', 'b', 'c'], categories=['a', 'b'],
-                   ordered=True)
+    base.set_categories(['a', 'b']),
+    base.set_categories(['a', 'b']).as_ordered(),

I think it would be simpler to do:

pd.Categorical(['a', 'b', None], categories=['a', 'b'])

This should be the same use case as the one reported in #19704.
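
To illustrate the suggestion, here is a small sketch comparing the explicit-None construction with the set_categories version from the diff. The variable names are illustrative only.

```python
import pandas as pd

# Missing value is stated explicitly in the data, so every non-null
# value is one of the declared categories.
explicit = pd.Categorical(['a', 'b', None], categories=['a', 'b'])
explicit_ordered = pd.Categorical(['a', 'b', None], categories=['a', 'b'],
                                  ordered=True)

# For comparison: the version that drops the 'c' category afterwards.
via_set_categories = pd.Categorical(['a', 'b', 'c']).set_categories(['a', 'b'])

# Both encode the data identically: codes 0, 1 and -1 for the null.
same_codes = explicit.codes.tolist() == via_set_categories.codes.tolist()
print(explicit.codes.tolist(), same_codes)
```

Both forms give the test the same encoded input, but the explicit None does not depend on out-of-category values being turned into NaN.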


Development

Successfully merging this pull request may close these issues.

[Python] Fix DeprecationWarnings in PyArrow tests