gh-151289: Add a wide int fast path for add/sub by KRRT7 · Pull Request #151290 · python/cpython

KRRT7 · 2026-06-10T22:31:46Z

gh-151289: Add a wide int fast path for add/sub

This adds a separate fast path for exact PyLong add/sub operands that fit in signed 64-bit integers, while preserving the existing compact-int specialization.

This keeps the compact-int hot path unchanged and avoids broad opcode churn there, while allowing wide exact ints to bypass the slower generic long arithmetic path.

Performance: representative interpreter-only results with JIT disabled:

add_wide:
sub_wide:
add_compact/sub_compact:

Related issue:

Optimize int add/sub for wide exact ints #151289

…declaration

Add inline infrastructure to pycore_long.h for the upcoming wide int addition fast path:

- _PY_LONG_MAX_DIGITS_FOR_INT64: macro for the maximum digit count that can still fit in int64_t (2 on 30-bit builds, 5 on 15-bit)
- _PyLong_FitsInt64(): cheap tag-based check; fast-paths compact and small-digit ints before inspecting the boundary digit
- _PyLong_CheckExactAndFitsInt64(): exact-type + fits-int64 guard for use in specialization guards
- _PyLong_TryAsInt64Exact(): no-exception int64 extraction; special-cases the ndigits==2/30-bit path for the common case
- PyAPI_FUNC declaration for _PyCompactLong_AddWide()
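To make the "no-exception int64 extraction" idea concrete, here is a minimal standalone sketch. It is not the PR's `_PyLong_TryAsInt64Exact`; the function name, the `SKETCH_SHIFT` constant, and the plain digit-array layout (little-endian 30-bit digits plus a separate sign flag) are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout (hypothetical): little-endian array of 30-bit digits. */
#define SKETCH_SHIFT 30

/* On success, store the value in *out and return true.
   Return false, without raising anything, if it does not fit in int64_t. */
static bool
sketch_try_as_int64(const uint32_t *digits, size_t ndigits, bool negative,
                    int64_t *out)
{
    uint64_t mag = 0;
    /* Accumulate the magnitude from the most significant digit down. */
    for (size_t i = ndigits; i-- > 0; ) {
        if ((mag >> (64 - SKETCH_SHIFT)) != 0) {
            return false;   /* the next shift would drop high bits */
        }
        mag = (mag << SKETCH_SHIFT) | digits[i];
    }
    if (negative) {
        /* INT64_MIN has magnitude INT64_MAX + 1, which cannot be negated
           as an int64_t, so it gets its own branch. */
        if (mag > (uint64_t)INT64_MAX + 1) {
            return false;
        }
        *out = (mag == (uint64_t)INT64_MAX + 1) ? INT64_MIN : -(int64_t)mag;
    }
    else {
        if (mag > (uint64_t)INT64_MAX) {
            return false;
        }
        *out = (int64_t)mag;
    }
    return true;
}
```

The caller treats a `false` return as "take the generic path", so no error state has to be set or cleared on the fast path.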

Add three new micro-ops and update the BINARY_OP_ADD_INT macro to use them, replacing the compact-only path:

- _GUARD_TOS_INT_WIDE / _GUARD_NOS_INT_WIDE: type guards that accept any exact int fitting in int64_t (via _PyLong_CheckExactAndFitsInt64)
- _BINARY_OP_ADD_INT_WIDE: calls _PyCompactLong_AddWide; EXIT_IF on int64 overflow (deopt), ERROR_IF on OOM

The existing _GUARD_TOS_INT / _GUARD_NOS_INT compact guards are kept unchanged; they are still used by BINARY_OP_SUBTRACT_INT, BINARY_OP_MULTIPLY_INT, COMPARE_OP_INT, and all subscr ops.

Regenerate: generated_cases.c.h, executor_cases.c.h, optimizer_cases.c.h, pycore_opcode_metadata.h, pycore_uop_ids.h, pycore_uop_metadata.h, test_cases.c.h

Change the add specialization condition from _PyLong_CheckExactAndCompact to _PyLong_CheckExactAndFitsInt64 so that exact int operands in the full int64 range (not just compact/single-digit values) are specialized to BINARY_OP_ADD_INT. Subtract and multiply retain their compact-only conditions.

BINARY_OP_ADD_INT now specializes for non-compact int64-range operands (e.g. 10_000_000_000). Update the test accordingly:

- Assert BINARY_OP_ADD_INT is used for wide int add
- Keep the assertions that BINARY_OP_SUBTRACT_INT and BINARY_OP_MULTIPLY_INT are not used for non-compact ints

…Exact

Verify that _PyLong_TryAsInt64Exact correctly handles INT64_MIN (abs_val == INT64_MAX + 1 with negative sign), INT64_MAX, and that values outside the int64 range gracefully fall back to the slow path.

Non-compact (2-digit) int results previously bypassed the freelist and called PyObject_Malloc directly. Add an `ints2` freelist alongside the existing `ints` (1-digit) freelist.

- `long_alloc(2)` checks `ints2` before `PyObject_Malloc`
- `_PyLong_ExactDealloc` and `long_dealloc` recycle exact 2-digit ints to `ints2` instead of immediately freeing them
- `_PyObject_ClearFreeLists` clears `ints2` the same way as `ints`
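The alloc/recycle/clear pattern described here can be sketched as a capped, singly linked freelist for one block size. This is a generic illustration, not CPython's freelist implementation: the `sketch_*` names and the cap of 100 are hypothetical, and every block is assumed to be at least `sizeof(void *)` bytes so the link pointer can live inside a parked block.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_FREELIST_MAX 100   /* hypothetical cap on parked blocks */

/* One freelist serves exactly one block size (e.g. "2-digit ints"). */
typedef struct {
    void *head;     /* most recently parked block, or NULL */
    size_t count;   /* number of parked blocks */
} sketch_freelist;

/* Reuse a parked block if there is one; otherwise fall back to malloc. */
static void *
sketch_alloc(sketch_freelist *fl, size_t size)
{
    if (fl->head != NULL) {
        void *block = fl->head;
        fl->head = *(void **)block;   /* pop: the link is stored in the block */
        fl->count--;
        return block;
    }
    return malloc(size);
}

/* Park the block for reuse unless the list is full; then really free it. */
static void
sketch_free(sketch_freelist *fl, void *block)
{
    if (fl->count < SKETCH_FREELIST_MAX) {
        *(void **)block = fl->head;   /* push */
        fl->head = block;
        fl->count++;
        return;
    }
    free(block);
}

/* Release every parked block, as a "clear freelists" hook would. */
static void
sketch_clear(sketch_freelist *fl)
{
    while (fl->head != NULL) {
        void *block = fl->head;
        fl->head = *(void **)block;
        free(block);
    }
    fl->count = 0;
}
```

The cap bounds how much memory stays parked; anything beyond it goes straight back to the allocator.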

Extends the ints2 freelist pattern to 3-digit objects, which cover the range [2^60, 2^63-1] (positive) and [-2^63, -2^60] (negative) on 30-bit builds, including INT64_MAX, INT64_MIN, and nanosecond-precision timestamps. Also fuses the two _PyLong_IsCompact + _PyLong_DigitCount checks in long_dealloc under a single PyLong_CheckExact branch.

Benchmark (5M ops, 30-bit build):

- 2-digit+2-digit -> 3-digit result: 19.6 ns -> 17.0 ns (-13%)
- 3-digit+compact -> 3-digit result: 18.3 ns -> 15.4 ns (-16%)
- INT64_MAX + 0: 18.2 ns -> 15.9 ns (-13%)
- INT64_MIN + 0: 18.1 ns -> 16.2 ns (-10%)
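The digit-count ranges quoted above follow from simple arithmetic on the magnitude. The helper below is a hypothetical illustration (not a CPython function) that counts how many `shift`-bit digits a magnitude needs, which is one way to check that magnitudes from 2^60 up to 2^63 take three 30-bit digits.

```c
#include <stdint.h>

/* Number of `shift`-bit digits needed to hold `mag` (0 for mag == 0).
   `shift` is assumed to be 15 or 30, matching the two build flavours. */
static unsigned
sketch_digit_count(uint64_t mag, unsigned shift)
{
    unsigned n = 0;
    while (mag != 0) {
        mag >>= shift;
        n++;
    }
    return n;
}
```

For example, `sketch_digit_count((uint64_t)INT64_MAX, 30)` yields 3, while `sketch_digit_count((1ULL << 60) - 1, 30)` yields 2.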

…T-free

- Remove the dead `_BINARY_OP_ADD_INT` micro-op (no longer referenced by the macro); remove its abstract op from optimizer_bytecodes.c.
- Annotate `_GUARD_TOS_INT_WIDE`, `_GUARD_NOS_INT_WIDE`, and `_BINARY_OP_ADD_INT_WIDE` as `tier1`-only so the JIT executor and optimizer generator skip them entirely. The JIT defers to tier 1 for any `BINARY_OP_ADD_INT` trace; no new JIT code paths are introduced.
- Add a compact fast-path to `_PyCompactLong_AddWide` so compact-only int addition retains its original `medium_value` cost and avoids the int64-extraction overhead.
- Use `__builtin_add_overflow` in `_Py_i64_add_overflow` on GCC/Clang (single instruction on x86-64 / ARM64).
- Peel the last loop iteration in `_PyLong_TryAsInt64Exact` to hoist the max-digit overflow-guard out of the inner loop body.
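An overflow-checked int64 add along the lines of the `__builtin_add_overflow` bullet could look like the sketch below. The name `sketch_i64_add_overflow` and the portable fallback branch are assumptions for illustration; the PR's actual `_Py_i64_add_overflow` is not shown on this page.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true if a + b overflows int64_t.
   When there is no overflow, *result holds the sum. */
static inline bool
sketch_i64_add_overflow(int64_t a, int64_t b, int64_t *result)
{
#if defined(__GNUC__) || defined(__clang__)
    /* Compiler builtin: reports overflow without undefined behaviour. */
    return __builtin_add_overflow(a, b, result);
#else
    /* Portable fallback: test the bounds before adding, because signed
       overflow itself is undefined behaviour in C. */
    if ((b > 0 && a > INT64_MAX - b) || (b < 0 && a < INT64_MIN - b)) {
        return true;
    }
    *result = a + b;
    return false;
#endif
}
```

A caller on the fast path would treat a `true` return as the signal to leave the specialized path and fall back to generic long arithmetic.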

Change the subtract specialization condition to accept exact ints in the full int64 range, matching the widened add path while keeping multiply compact-only.

skirpichev

As I said in the issue thread, I'm not sure this is worth the added code complexity.

Other than that, a few remarks:

- You should probably split this PR into several. For instance, the separate freelists addition looks unrelated.
- I don't think you should add the benchmark script to the sources. Just include this code in the PR description, for example.

KRRT7 · 2026-06-11T07:37:44Z

> For instance, the separate freelists addition looks unrelated

Sure, I'll revert and update the numbers, though I think @peendebak mentioned he had an independent freelist PR somewhere, so I'll just drop mine entirely.

> I don't think you should add the benchmark script to the sources. Just include this code in the PR description, for example.

Sure, though it's standard practice to have a tests/benchmarks/ dir (see pytest-benchmark and projects that adopt it).

skirpichev · 2026-06-11T07:40:06Z

Do not click the "Update branch" button without a good reason because it notifies everyone watching the PR that there are new changes, when there are not, and it uses up limited CI resources.

eendebakpt · 2026-06-11T08:41:17Z

                specialize(instr, BINARY_OP_ADD_INT);
                return;
            }
+            if (_PyLong_CheckExactAndFitsInt64(lhs) && _PyLong_CheckExactAndFitsInt64(rhs)) {


The performance gain in the PR is partly due to having specialized ops, and partly due to the special int64 arithmetic. What is the gain if we only do the int64 arithmetic (with a fast path in long_add)?

KRRT7 added 11 commits June 10, 2026 19:10

test(longobject): add INT64_MIN boundary tests for _PyLong_TryAsInt64…

5b69f64


perf(specialize): widen BINARY_OP_SUBTRACT_INT to full int64 range

81713ed


perf(longobject): keep wide int helper local

a4b3e95

perf(longobject): add wide int fast path

8d2d3c9

KRRT7 requested review from Fidget-Spinner, ZeroIntensity, ericsnowcurrently, markshannon, savannahostrowski and tomasr8 as code owners June 10, 2026 22:31

bedevere-app Bot added the awaiting review label Jun 10, 2026

bedevere-app Bot mentioned this pull request Jun 10, 2026

Optimize int add/sub for wide exact ints #151289

Open


Merge remote-tracking branch 'upstream/main' into wide-int-accel

d8b9f3f


KRRT7 added 5 commits June 10, 2026 17:41

Misc/NEWS: add blurb for wide int fast path

05023f4

regen opcode cases for wide int fast path

c1a95ef

perf(longobject): restore JIT optimizer cases for wide ints

540d96c

test: make wide int benchmark more stable

0f42443

test: make wide int benchmark import-safe

4864750

skirpichev reviewed Jun 11, 2026


Merge branch 'main' into wide-int-accel

24a31a1

cleanup: drop ints2/ints3 freelists and benchmark script

d361dc9

eendebakpt reviewed Jun 11, 2026


KRRT7 wants to merge 19 commits into python:main from KRRT7:wide-int-accel

3 participants
