
.. currentmodule:: pyarrow

Developing PyArrow

Coding Style

We follow a PEP8-like coding style similar to that of the pandas project. To fix style issues, use the pre-commit command:

$ pre-commit run --show-diff-on-failure --color=always --all-files python

Unit Testing

Our unit test suite is written with pytest. After building the project, you can run the unit tests like so:

$ pushd arrow/python
$ python -m pytest pyarrow
$ popd

Package requirements to run the unit tests are found in requirements-test.txt and can be installed if needed with pip install -r requirements-test.txt.

If you get import errors for pyarrow._lib or another PyArrow module when trying to run the tests, run python -m pytest arrow/python/pyarrow and check whether the editable version of pyarrow was installed correctly.
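One way to check which copy of a package Python would import is to ask the import system for its location. The following is a minimal sketch using only the standard library; the helper names ``locate_package`` and ``is_imported_from`` are hypothetical and not part of PyArrow.

```python
import importlib.util
from pathlib import Path


def locate_package(name):
    """Return the directory a top-level package would be imported from.

    Returns None if the package cannot be found or has no file location
    (for example, a built-in module).
    """
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.has_location or spec.origin is None:
        return None
    return Path(spec.origin).resolve().parent


def is_imported_from(name, checkout_dir):
    """Return True if the package lives at or below ``checkout_dir``."""
    location = locate_package(name)
    if location is None:
        return False
    checkout = Path(checkout_dir).resolve()
    return location == checkout or checkout in location.parents


if __name__ == "__main__":
    # With a working editable install, this is expected to point into the
    # source checkout (arrow/python/pyarrow) rather than site-packages.
    print("pyarrow resolves to:", locate_package("pyarrow"))
```

If the printed path is not inside your checkout, or is ``None``, the editable install is probably not the one being picked up.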

The test suite defines a number of custom command-line options; for example, some tests are disabled by default and have to be switched on with a flag. To see all the options, run

$ python -m pytest pyarrow --help

and look for the "custom options" section.

Note

There are a few low-level tests written directly in C++. These tests are implemented in pyarrow/src/arrow/python/python_test.cc, but they are also wrapped in a pytest-based test module run automatically as part of the PyArrow test suite.

Test Groups

We have many tests that are grouped together using pytest marks. Some of these are disabled by default. To enable a test group, pass --$GROUP_NAME, e.g. --parquet. To disable a test group, prepend disable-, for example --disable-parquet. To run only the unit tests for a particular group, prepend only- instead, for example --only-parquet.

The test groups currently include:

dataset: Apache Arrow Dataset tests
flight: Flight RPC tests
gandiva: tests for Gandiva expression compiler (uses LLVM)
hdfs: tests that use libhdfs to access the Hadoop filesystem
hypothesis: tests that use the hypothesis module for generating random test cases. Note that --hypothesis doesn't work due to a quirk with pytest, so you have to pass --enable-hypothesis
large_memory: Tests requiring a large amount of system RAM
orc: Apache ORC tests
parquet: Apache Parquet tests
s3: Tests for Amazon S3
tensorflow: Tests that involve TensorFlow
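To make the flag semantics concrete, here is a small sketch of how the ``--GROUP``, ``--enable-GROUP``, ``--disable-GROUP`` and ``--only-GROUP`` options could combine into a set of enabled groups. It is an illustration only, not PyArrow's actual conftest logic: the function ``resolve_groups`` is hypothetical, and the default values in the usage example are assumptions.

```python
def resolve_groups(flags, defaults):
    """Return a dict mapping group name -> enabled, given CLI-style flags.

    ``defaults`` maps each known group to its default state. Unknown
    flags are ignored. Any ``--only-GROUP`` flag overrides everything
    else and leaves just the named groups enabled.
    """
    enabled = dict(defaults)  # copy so the caller's defaults stay untouched
    only = set()
    for flag in flags:
        if not flag.startswith("--"):
            continue
        name = flag[2:]
        if name.startswith("disable-") and name[len("disable-"):] in enabled:
            enabled[name[len("disable-"):]] = False
        elif name.startswith("enable-") and name[len("enable-"):] in enabled:
            enabled[name[len("enable-"):]] = True
        elif name.startswith("only-") and name[len("only-"):] in enabled:
            only.add(name[len("only-"):])
        elif name in enabled:
            enabled[name] = True
    if only:
        return {group: group in only for group in enabled}
    return enabled


if __name__ == "__main__":
    # Assumed defaults, chosen for illustration only.
    defaults = {"parquet": True, "orc": True, "large_memory": False}
    print(resolve_groups(["--disable-parquet"], defaults))
    print(resolve_groups(["--only-parquet"], defaults))
```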

Doctest

We use doctest to check that docstring examples are up-to-date and correct. You can run the same checks locally:

$ pushd arrow/python
$ python -m pytest --doctest-modules
$ python -m pytest --doctest-modules path/to/module.py # checking single file
$ popd

for .py files or

$ pushd arrow/python
$ python -m pytest --doctest-cython
$ python -m pytest --doctest-cython path/to/module.pyx # checking single file
$ popd

for .pyx and .pxi files. In this case you will also need to install the pytest-cython plugin.
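For readers unfamiliar with doctest, the sketch below shows the kind of docstring example these commands collect, and runs it with the standard library's ``doctest`` module instead of pytest. The function ``clamp`` and the helper ``run_examples`` are hypothetical and exist only for this illustration.

```python
import doctest


def clamp(value, low, high):
    """Limit ``value`` to the closed interval [low, high].

    Examples
    --------
    >>> clamp(5, 0, 10)
    5
    >>> clamp(-3, 0, 10)
    0
    >>> clamp(42, 0, 10)
    10
    """
    return max(low, min(value, high))


def run_examples(obj):
    """Run the doctest examples in ``obj``'s docstring and return the results."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # module=False skips module lookup; globs supplies the names the
    # examples need in order to execute.
    for test in finder.find(obj, module=False, globs={obj.__name__: obj}):
        runner.run(test)
    return runner.summarize(verbose=False)


if __name__ == "__main__":
    results = run_examples(clamp)
    print("failed:", results.failed, "attempted:", results.attempted)
```

Each ``>>>`` line is executed and its output is compared with the text that follows, so an outdated example shows up as a failure.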

Testing Documentation Examples

Documentation examples in .rst files under docs/source/python/ use doctest syntax and can be tested locally using:

$ pushd arrow/python
$ pytest --doctest-glob="*.rst" docs/source/python/file.rst # checking single file
$ pytest --doctest-glob="*.rst" docs/source/python # checking entire directory
$ popd

The examples use standard doctest syntax with >>> for Python prompts and ... for continuation lines. The conftest.py fixture automatically handles temporary directory setup for examples that create files.
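As an illustration of the syntax, the sketch below writes a small, made-up .rst snippet to a temporary directory and checks it with the standard library's ``doctest.testfile``. This only approximates what ``pytest --doctest-glob`` does, and it does not use the conftest.py fixture mentioned above; the snippet contents and the helper ``check_rst_text`` are hypothetical.

```python
import doctest
import tempfile
from pathlib import Path

# A made-up documentation fragment: ">>>" starts a statement and "..."
# continues it on the next line.
RST_EXAMPLE = """\
Building a list
---------------

.. code-block:: python

   >>> values = [n * n
   ...           for n in range(4)]
   >>> values
   [0, 1, 4, 9]
"""


def check_rst_text(text):
    """Write ``text`` to a temporary .rst file and run its doctest examples."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "example.rst"
        path.write_text(text, encoding="utf-8")
        return doctest.testfile(
            str(path),
            module_relative=False,  # treat the path as a plain filesystem path
            verbose=False,
            encoding="utf-8",
        )


if __name__ == "__main__":
    results = check_rst_text(RST_EXAMPLE)
    print("failed:", results.failed, "attempted:", results.attempted)
```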

Debugging

Debug build

Since PyArrow depends on the Arrow C++ libraries, debugging can frequently involve crossing between Python and C++ shared libraries. For the best experience, make sure you've built both Arrow C++ (-DCMAKE_BUILD_TYPE=Debug) and PyArrow (export PYARROW_BUILD_TYPE=debug) in debug mode.
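Before starting a debugging session, it can help to confirm which build type the loaded Arrow C++ library reports. The sketch below assumes that ``pyarrow.cpp_build_info`` exposes a ``build_type`` field, which may not hold for every version, so the lookup is guarded and returns ``None`` when the information is unavailable. The helper name ``cpp_build_type`` is hypothetical.

```python
def cpp_build_type():
    """Return the Arrow C++ build type reported by pyarrow, or None.

    None means pyarrow is not importable or does not expose the
    build information assumed here.
    """
    try:
        import pyarrow as pa
    except ImportError:
        return None
    info = getattr(pa, "cpp_build_info", None)
    return getattr(info, "build_type", None)


if __name__ == "__main__":
    # For a debug build, a value such as "debug" is expected here.
    print("Arrow C++ build type:", cpp_build_type())
```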

Using gdb on Linux

To debug the C++ libraries with gdb while running the Python unit tests, first start pytest with gdb:

$ gdb --args python -m pytest pyarrow/tests/test_to_run.py -k $TEST_TO_MATCH

To set a breakpoint, use the same gdb syntax that you would when debugging a C++ program, for example:

(gdb) b src/arrow/python/arrow_to_pandas.cc:1874
No source file named src/arrow/python/arrow_to_pandas.cc.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (src/arrow/python/arrow_to_pandas.cc:1874) pending.

.. seealso::

   The :ref:`GDB extension for Arrow C++ <cpp_gdb_extension>`.

Similarly, use lldb when debugging on macOS.

Benchmarking

For running the benchmarks, see :ref:`python-benchmarks`.
