Skip to content

GH-49410: [C++] Fix if_else null-scalar fast paths for sliced BaseBinary arrays#49443

Open
Ebraam-Ashraf wants to merge 4 commits intoapache:mainfrom
Ebraam-Ashraf:GH-49410-fix-if-else-sliced-string-chunks
Open

GH-49410: [C++] Fix if_else null-scalar fast paths for sliced BaseBinary arrays#49443
Ebraam-Ashraf wants to merge 4 commits intoapache:mainfrom
Ebraam-Ashraf:GH-49410-fix-if-else-sliced-string-chunks

Conversation

@Ebraam-Ashraf
Copy link

@Ebraam-Ashraf Ebraam-Ashraf commented Mar 3, 2026

Rationale for this change

if_else with a null scalar and a sliced BaseBinary array (offset != 0) produces invalid output. The ASA and AAS shortcut paths in scalar_if_else.cc copy offsets without adjusting for the slice offset, and copy data from byte 0 instead of data + offsets[0].

What changes are included in this PR?

Fix (scalar_if_else.cc): In both the ASA and AAS fast paths:

  • Replace the single memcpy for the offsets buffer with a loop that normalizes each offset by subtracting offsets[0] as a base, so output offsets always start at 0.
  • Change the data copy to start from data + base instead of data, so only the bytes belonging to the slice are copied.

Regression test (scalar_if_else_test.cc): Added TYPED_TEST(TestIfElseBaseBinary, IfElseBaseBinarySliced) covering utf8, binary, large_utf8, and large_binary types, testing both the ASA path and AAS path with a sliced array.

Are these changes tested?

Yes. The new IfElseBaseBinarySliced typed test reproduces the bug and passes after the fix. All 416 existing tests continue to pass.

This PR contains a "Critical Fix". The bug causes incorrect/invalid data to be produced when if_else is called with a null scalar and a sliced BaseBinary array.

@github-actions
Copy link

github-actions bot commented Mar 3, 2026

⚠️ GitHub issue #49410 has been automatically assigned in GitHub to PR creator.

@Ebraam-Ashraf
Copy link
Author

hi @kou
thanks for your guidance
as requested here's the test that reproduces the bug
Looking forward to any feedback also I'll add the fix in a follow up commit once you confirm the test is fine

{ArrayFromJSON(int64(), "[-1]"), ArrayFromJSON(int32(), "[0]")}));
}

TEST_F(TestIfElseKernel, IfElseBaseBinarySlicedChunk) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this TestIfEleseKernel test to the location where other TestIfEleseKernel tests exist instead of adding it after the TestChooseKernel test?

Can we use better name than IfElseBaseBinarySlicedChunk like other existing test names?

}

TEST_F(TestIfElseKernel, IfElseBaseBinarySlicedChunk) {
for (auto type : {utf8(), binary(), large_utf8(), large_binary()}) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to test all these types?

If so, TYPED_TEST(TestIfElseBaseBinary, ...) may be better than TEST_F(TestifEleseKernel, ...).

Comment on lines +3787 to +3809
auto cond_asa = ArrayFromJSON(boolean(), "[true, false, false]");
ASSERT_OK_AND_ASSIGN(auto result_asa,
CallFunction("if_else", {cond_asa, MakeNullScalar(type), chunk1}));
ASSERT_OK(result_asa.make_array()->ValidateFull());
AssertArraysEqual(*ArrayFromJSON(type, R"([null, "x", "x"])"),
*result_asa.make_array(), true);

auto cond_aas = ArrayFromJSON(boolean(), "[false, true, true]");
ASSERT_OK_AND_ASSIGN(auto result_aas,
CallFunction("if_else", {cond_aas, chunk1, MakeNullScalar(type)}));
ASSERT_OK(result_aas.make_array()->ValidateFull());
AssertArraysEqual(*ArrayFromJSON(type, R"([null, "x", "x"])"),
*result_aas.make_array(), true);

auto arr1 = std::make_shared<ChunkedArray>(ArrayVector{chunk0, chunk1});
auto mask = *CallFunction("is_null", {arr1});
ASSERT_OK_AND_ASSIGN(auto arr2_datum,
CallFunction("if_else", {Datum(true), *Concatenate(arr1->chunks()), arr1}));
ASSERT_OK(arr2_datum.chunked_array()->ValidateFull());
ASSERT_OK_AND_ASSIGN(auto arr3_datum,
CallFunction("if_else", {mask, MakeNullScalar(type), arr2_datum}));
ASSERT_OK(arr3_datum.chunked_array()->ValidateFull());
AssertDatumsEqual(Datum(arr1), arr3_datum);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can reproduce this problem without chunked array, right?

If so, we don't need to test chunked array.

@Ebraam-Ashraf
Copy link
Author

thanks for the feedback @kou

I switched from TEST_F(TestIfElseKernel, ...) to TYPED_TEST(TestIfElseBaseBinary, IfElseBaseBinarySliced)
and moved the test to its right place
also removed the chunked array portion

[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (125 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] TestIfElseBaseBinary/0.IfElseBaseBinarySliced, where TypeParam = arrow::BinaryType
[  FAILED  ] TestIfElseBaseBinary/1.IfElseBaseBinarySliced, where TypeParam = arrow::LargeBinaryType
[  FAILED  ] TestIfElseBaseBinary/2.IfElseBaseBinarySliced, where TypeParam = arrow::StringType
[  FAILED  ] TestIfElseBaseBinary/3.IfElseBaseBinarySliced, where TypeParam = arrow::LargeStringType

Offset invariant failure: offset for slot 2 out of bounds: 3 > 2

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

Could you push a fix for this problem to this branch?

Comment on lines +614 to +615
auto full_arr = ArrayFromJSON(type, R"([null, "x", "x", null, "x", "x"])");
auto sliced = full_arr->Slice(3);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we simplify test data?

Suggested change
auto full_arr = ArrayFromJSON(type, R"([null, "x", "x", null, "x", "x"])");
auto sliced = full_arr->Slice(3);
auto full_arr = ArrayFromJSON(type, R"(["not used", null, "x", "x"])");
auto sliced = full_arr->Slice(1);

@Ebraam-Ashraf
Copy link
Author

Tests before fix

[   RUN    ] TestIfElseBaseBinary/0.IfElseBaseBinarySliced
[  FAILED  ] TestIfElseBaseBinary/0.IfElseBaseBinarySliced (20 ms)
[  FAILED  ] TestIfElseBaseBinary/1.IfElseBaseBinarySliced (1 ms)
[  FAILED  ] TestIfElseBaseBinary/2.IfElseBaseBinarySliced (1 ms)
[  FAILED  ] TestIfElseBaseBinary/3.IfElseBaseBinarySliced (0 ms)
[  PASSED  ] 0 tests.
[  FAILED  ] 4 tests.

Invalid: Offset invariant failure: offset for slot 1 out of bounds: 8 > 2

What the fix does

Replaced the single memcpy for the offsets buffer with a loop that subtracts offsets[0] as a base from each value
so the output offsets are always normalized to start at 0.

Then changed the data copy to start from data + base instead of data
so only the bytes belonging to the slice are copied.

The same two changes are applied symmetrically in both the ASA and AAS paths.

Tests after fix

[==========] 416 tests from 132 test suites ran. (31991 ms total)
[  PASSED  ] 416 tests.

@Ebraam-Ashraf Ebraam-Ashraf requested a review from kou March 6, 2026 07:13
@Ebraam-Ashraf
Copy link
Author

@kou

@kou
Copy link
Member

kou commented Mar 7, 2026

Could you fix the lint failure and update the PR title and description?

@Ebraam-Ashraf Ebraam-Ashraf changed the title GH-49410: [C++] Add regression test for if_else with sliced BaseBinary chunks GH-49410: [C++] Fix if_else null-scalar fast paths for sliced BaseBinary arrays Mar 7, 2026
@Ebraam-Ashraf
Copy link
Author

thanks @kou
I really appreciate your patience and guidance throughout this PR
looking forward to your review 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants