Skip to content

Conversation

@tadeja
Copy link
Contributor

@tadeja tadeja commented Nov 26, 2025

Rationale for this change

To close or discuss #48167
inverse_permutation and scatter functions got implemented via #44393, PR #44394.

What changes are included in this PR?

Python tests for scatter, inverse_permutation kernels and bindings for InversePermutationOptions and ScatterOptions.

Are these changes tested?

Yes, tests added in test_compute.py.

Are there any user-facing changes?

Bindings for InversePermutationOptions and ScatterOptions are added.

This PR includes breaking changes to public APIs.

Options InversePermutationOptions changed from accepting parameter
std::shared_ptr<DataType> output_type = NULLPTR
to
std::optional<std::shared_ptr<DataType>> output_type = std::nullopt

@github-actions
Copy link

⚠️ GitHub issue #48167 has been automatically assigned in GitHub to PR creator.

@tadeja
Copy link
Contributor Author

tadeja commented Nov 26, 2025

The background of inverse_permutation and scatter functions is nicely explained under this umbrella issue

Does it make sense to add Python wrappers/bindings for InversePermutationOptions and ScatterOptions already at this point in time? (Or at all?) What would you say, @felipecrv, @zanmato1984 ?

Currently both InversePermutationOptions and ScatterOptions do not seem to get accepted by inverse_permutation and scatter as demonstrated in added tests of test_compute.py.

@github-actions
Copy link

⚠️ GitHub issue #48167 has been automatically assigned in GitHub to PR creator.

1 similar comment
@github-actions
Copy link

⚠️ GitHub issue #48167 has been automatically assigned in GitHub to PR creator.

@zanmato1984
Copy link
Contributor

Hi @tadeja , thanks for working on this. Appreciate it. I'm glad to see we are adding python bindings for scatter and inverse_permutation. To your questions:

Does it make sense to add Python wrappers/bindings for InversePermutationOptions and ScatterOptions already at this point in time? (Or at all?) What would you say, @felipecrv, @zanmato1984 ?

I think it does make sense to have them.

Currently both InversePermutationOptions and ScatterOptions do not seem to get accepted by inverse_permutation and scatter as demonstrated in added tests of test_compute.py.

I'm not sure what you mean. The underlying C++ APIs will accept those options. So I would expect no problem for their python bindings. Are you saying that they are not working?

@tadeja tadeja force-pushed the bindings_for_inverse_permutation_and_scatter branch from a3b6fb1 to 47e3e8c Compare November 27, 2025 22:05
@tadeja
Copy link
Contributor Author

tadeja commented Nov 28, 2025

Thank you for the input, @zanmato1984, that helped.

I'm not sure what you mean. The underlying C++ APIs will accept those options. So I would expect no problem for their python bindings. Are you saying that they are not working?

It looks like Python bindings got fixed by adding option class names to FunctionDoc in vector_swizzle.cc.

Now test_option_class_equality is failing by default with
pyarrow.lib.ArrowInvalid: Could not serialize field output_type of options type InversePermutationOptions: shared_ptr<DataType> is nullptr
unless I pass output_type there, e.g. pc.InversePermutationOptions(output_type=pa.int32()).

If output_type is NULLPTR in C++ (which happens if output_type is None in Python) then InversePermutationOptions can't be serialized.
Any advice on the approach for fixing this? (Should Python users always specify output_type, or perhaps output_type be optional in C++? Or is it fine to leave it as it is?)

See example below:

import pyarrow.compute as pc
pc.InversePermutationOptions()

>>> InversePermutationOptions(max_index=-1, output_type=<NULLPTR>)
import pyarrow.compute as pc
pc.InversePermutationOptions().serialize()

>>> Traceback (most recent call last):
  File "<python-input-3>", line 1, in <module>
    pc.InversePermutationOptions().serialize()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "pyarrow/_compute.pyx", line 623, in pyarrow._compute.FunctionOptions.serialize
    shared_ptr[CBuffer] c_buf = GetResultValue(res)
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
    raise convert_status(status)
pyarrow.lib.ArrowInvalid: Could not serialize field output_type of options type InversePermutationOptions: shared_ptr<DataType> is nullptr

cc @pitrou

@tadeja tadeja marked this pull request as ready for review November 28, 2025 17:31
@zanmato1984
Copy link
Contributor

Thanks @tadeja for the further explanation! I now see the problem. Let me check the code and get back to you soon.

@zanmato1984
Copy link
Contributor

zanmato1984 commented Dec 4, 2025

It looks like Python bindings got fixed by adding option class names to FunctionDoc in vector_swizzle.cc.

Nice catch! I forgot to declare the option type in function doc when I was adding these functions, and didn't realize it affects python binding until you fixed it. Really neat finding, thank you!

If output_type is NULLPTR in C++ (which happens if output_type is None in Python) then InversePermutationOptions can't be serialized. Any advice on the approach for fixing this? (Should Python users always specify output_type, or perhaps output_type be optional in C++? Or is it fine to leave it as it is?)

I want to suggest the following changes to allow std::shared_ptr<DataType> being null, when doing function options serde:

static inline Result<std::shared_ptr<Scalar>> GenericToScalar(
const std::shared_ptr<DataType>& value) {
if (!value) {
return Status::Invalid("shared_ptr<DataType> is nullptr");
}
return MakeNullScalar(value);
}

static inline Result<std::shared_ptr<Scalar>> GenericToScalar(
    const std::shared_ptr<DataType>& value) {
  if (!value) {
-    return Status::Invalid("shared_ptr<DataType> is nullptr");
+    return std::make_shared<NullScalar>();
  }
  return MakeNullScalar(value);
}

and

template <typename T>
static inline enable_if_same_result<T, std::shared_ptr<DataType>> GenericFromScalar(
const std::shared_ptr<Scalar>& value) {
return value->type;
}

template <typename T>
static inline enable_if_same_result<T, std::shared_ptr<DataType>> GenericFromScalar(
    const std::shared_ptr<Scalar>& value) {
+  if (value->type->id() == Type::NA) {
+    return std::shared_ptr<NullType>();
+  }
  return value->type;
}

But I'll need to consult @pitrou if this is appropriate.

@tadeja
Copy link
Contributor Author

tadeja commented Dec 4, 2025

@zanmato1984 I've included your suggested changes - 208fd24
Checks are passing so far. Now waiting for feedback from @pitrou

@zanmato1984
Copy link
Contributor

Hi @pitrou , do you think the changes to data type pointer serde make sense? Thanks.

@pitrou
Copy link
Member

pitrou commented Dec 8, 2025

Hi @pitrou , do you think the changes to data type pointer serde make sense? Thanks.

That will have annoying side-effects, for example an explicit null DataType will be deserialized as a null pointer.

We should try to find another way of serializing a null pointer as a scalar, though I'm not sure how.

@pitrou
Copy link
Member

pitrou commented Dec 8, 2025

How about this instead:

diff --git a/cpp/src/arrow/compute/function_internal.h b/cpp/src/arrow/compute/function_internal.h
index 9d8928466b..96bdbacf04 100644
--- a/cpp/src/arrow/compute/function_internal.h
+++ b/cpp/src/arrow/compute/function_internal.h
@@ -347,7 +347,8 @@ static inline Result<std::shared_ptr<Scalar>> GenericToScalar(
 static inline Result<std::shared_ptr<Scalar>> GenericToScalar(
     const std::shared_ptr<DataType>& value) {
   if (!value) {
-    return Status::Invalid("shared_ptr<DataType> is nullptr");
+    // An omitted DataType is serialized as boolean true
+    return MakeScalar(boolean(), true);
   }
   return MakeNullScalar(value);
 }
@@ -448,6 +449,10 @@ static inline enable_if_same_result<T, SortKey> GenericFromScalar(
 template <typename T>
 static inline enable_if_same_result<T, std::shared_ptr<DataType>> GenericFromScalar(
     const std::shared_ptr<Scalar>& value) {
+  if (value->type->id() == Type::BOOL && value->is_valid) {
+    // Boolean true represents an omitted DataType (nullptr)
+    return nullptr;
+  }
   return value->type;
 }

Also, should add a test in expression_test.cc (see SerializationRoundTrips)

@zanmato1984
Copy link
Contributor

Hi @pitrou , do you think the changes to data type pointer serde make sense? Thanks.

That will have annoying side-effects, for example an explicit null DataType will be deserialized as a null pointer.

We should try to find another way of serializing a null pointer as a scalar, though I'm not sure how.

Some alternative approaches in mind, both changing the C++ code:

  • Changing the type of output_type from std::shared_ptr<DataType> to std::optional<std::shared_ptr<DataType>>;
  • Or if we think std::optional<std::shared_ptr<DataType>> is problematic, we can in turn change output_type to std::optional<Type::type>.

@pitrou what do you think?

@zanmato1984
Copy link
Contributor

How about this instead:

diff --git a/cpp/src/arrow/compute/function_internal.h b/cpp/src/arrow/compute/function_internal.h
index 9d8928466b..96bdbacf04 100644
--- a/cpp/src/arrow/compute/function_internal.h
+++ b/cpp/src/arrow/compute/function_internal.h
@@ -347,7 +347,8 @@ static inline Result<std::shared_ptr<Scalar>> GenericToScalar(
 static inline Result<std::shared_ptr<Scalar>> GenericToScalar(
     const std::shared_ptr<DataType>& value) {
   if (!value) {
-    return Status::Invalid("shared_ptr<DataType> is nullptr");
+    // An omitted DataType is serialized as boolean true
+    return MakeScalar(boolean(), true);
   }
   return MakeNullScalar(value);
 }
@@ -448,6 +449,10 @@ static inline enable_if_same_result<T, SortKey> GenericFromScalar(
 template <typename T>
 static inline enable_if_same_result<T, std::shared_ptr<DataType>> GenericFromScalar(
     const std::shared_ptr<Scalar>& value) {
+  if (value->type->id() == Type::BOOL && value->is_valid) {
+    // Boolean true represents an omitted DataType (nullptr)
+    return nullptr;
+  }
   return value->type;
 }

Also, should add a test in expression_test.cc (see SerializationRoundTrips)

Feels a bit hacky imho.

@pitrou
Copy link
Member

pitrou commented Dec 8, 2025

Feels a bit hacky imho.

Definitely (though less fragile than the solution in the current PR :-)).

@pitrou
Copy link
Member

pitrou commented Dec 8, 2025

Some alternative approaches in mind, both changing the C++ code:

  • Changing the type of output_type from std::shared_ptr<DataType> to std::optional<std::shared_ptr<DataType>>;
  • Or if we think std::optional<std::shared_ptr<DataType>> is problematic, we can in turn change output_type to std::optional<Type::type>.

I would be ok with either, but that's an API breakage in both cases.

I think @bkietz contributed the original serialization code, perhaps we would like to chime in?

@tadeja
Copy link
Contributor Author

tadeja commented Dec 18, 2025

See my attempt here as per one of the mentioned alternative approaches:

  • Changing the type of output_type from std::shared_ptr<DataType> to std::optional<std::shared_ptr<DataType>>

Please check, @zanmato1984, @pitrou, what do you say?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Dec 19, 2025
@zanmato1984
Copy link
Contributor

See my attempt here as per one of the mentioned alternative approaches:

  • Changing the type of output_type from std::shared_ptr<DataType> to std::optional<std::shared_ptr<DataType>>

Please check, @zanmato1984, @pitrou, what do you say?

I'm OK with this approach. Since @bkietz is not joining us, shall we move on with this approach, @pitrou ?

@pitrou
Copy link
Member

pitrou commented Dec 19, 2025

I'm OK with this approach. Since @bkietz is not joining us, shall we move on with this approach, @pitrou ?

Let's do it!

Copy link
Contributor

@zanmato1984 zanmato1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@zanmato1984
Copy link
Contributor

I've approved this PR. Let's wait for some while to hear from other reviewers. Then I'll merge it.

Thanks a lot @tadeja for working on this!

@tadeja tadeja changed the title GH-48167: [Python][Compute] Add python bindings for scatter, inverse_permutation GH-48167: [Python][C++][Compute] Add python bindings for scatter, inverse_permutation Dec 23, 2025
@zanmato1984
Copy link
Contributor

@github-actions crossbow submit -g cpp -g python

@github-actions
Copy link

Revision: 45ba764

Submitted crossbow builds: ursacomputing/crossbow @ actions-fa19036110

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-1.3.4-numpy-1.21.2 GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.12-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.12-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.13-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.13-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.14 GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-cuda-python-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-42-cpp GitHub Actions
test-fedora-42-python-3 GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

@zanmato1984
Copy link
Contributor

I've made a trivial commit to merely adjust the code order.

Now CI is good. The failures are unrelated. I'm merging now.

@zanmato1984 zanmato1984 merged commit 7de2f61 into apache:main Dec 29, 2025
46 checks passed
@zanmato1984 zanmato1984 removed the awaiting committer review Awaiting committer review label Dec 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants