gh-151289: Add a wide int fast path for add/sub #151290
…declaration

Add inline infrastructure to pycore_long.h for the upcoming wide int addition fast path:

- _PY_LONG_MAX_DIGITS_FOR_INT64: macro for the maximum digit count that can still fit in int64_t (2 on 30-bit builds, 5 on 15-bit)
- _PyLong_FitsInt64(): cheap tag-based check; fast-paths compact and small-digit ints before inspecting the boundary digit
- _PyLong_CheckExactAndFitsInt64(): exact-type + fits-int64 guard for use in specialization guards
- _PyLong_TryAsInt64Exact(): no-exception int64 extraction; special-cases the ndigits==2/30-bit path for the common case
- PyAPI_FUNC declaration for _PyCompactLong_AddWide()
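The "inspect the boundary digit" idea above can be sketched outside CPython. This is a minimal illustration, not the PR's code: `toy_long` and `toy_fits_int64` are hypothetical names, and the layout (a sign plus little-endian 30-bit digits) is an assumption standing in for the real PyLong representation. On such a layout, any value of at most two digits has magnitude below 2^60 and always fits, while a three-digit value fits only when its top digit is small enough.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a PyLong: sign plus little-endian 30-bit digits. */
typedef struct {
    int sign;             /* -1, 0, or +1 */
    size_t ndigits;       /* digits in use, no leading zero digit */
    uint32_t digits[4];   /* each digit is < 2**30 (assumed 30-bit build) */
} toy_long;

static bool
toy_fits_int64(const toy_long *v)
{
    assert(v->ndigits <= 4);
    if (v->ndigits <= 2) {
        return true;      /* |v| < 2**60, always representable */
    }
    if (v->ndigits > 3) {
        return false;     /* |v| >= 2**90 */
    }
    /* Three digits: |v| = d2 * 2**60 + d1 * 2**30 + d0. */
    uint32_t top = v->digits[2];
    if (top < 8) {
        return true;      /* |v| <= 2**63 - 1 */
    }
    /* The only remaining candidate is -2**63 (INT64_MIN). */
    return top == 8 && v->digits[1] == 0 && v->digits[0] == 0
           && v->sign < 0;
}
```

Most values are decided by the digit count alone; only three-digit values pay for a look at the top digit.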
Add three new micro-ops and update the BINARY_OP_ADD_INT macro to use them, replacing the compact-only path:

- _GUARD_TOS_INT_WIDE / _GUARD_NOS_INT_WIDE: type guards that accept any exact int fitting in int64_t (via _PyLong_CheckExactAndFitsInt64)
- _BINARY_OP_ADD_INT_WIDE: calls _PyCompactLong_AddWide; EXIT_IF on int64 overflow (deopt), ERROR_IF on OOM

The existing _GUARD_TOS_INT / _GUARD_NOS_INT compact guards are kept unchanged; they are still used by BINARY_OP_SUBTRACT_INT, BINARY_OP_MULTIPLY_INT, COMPARE_OP_INT, and all subscr ops.

Regenerate: generated_cases.c.h, executor_cases.c.h, optimizer_cases.c.h, pycore_opcode_metadata.h, pycore_uop_ids.h, pycore_uop_metadata.h, test_cases.c.h
Change the add specialization condition from _PyLong_CheckExactAndCompact to _PyLong_CheckExactAndFitsInt64 so that exact int operands in the full int64 range (not just compact/single-digit values) are specialized to BINARY_OP_ADD_INT. Subtract and multiply retain their compact-only conditions.
BINARY_OP_ADD_INT now specializes for non-compact int64-range operands (e.g. 10_000_000_000). Update the test accordingly:

- Assert BINARY_OP_ADD_INT is used for wide int add
- Keep the assertions that BINARY_OP_SUBTRACT_INT and BINARY_OP_MULTIPLY_INT are not used for non-compact ints
…Exact

Verify that _PyLong_TryAsInt64Exact correctly handles INT64_MIN (abs_val == INT64_MAX + 1 with negative sign), INT64_MAX, and that values outside the int64 range gracefully fall back to the slow path.
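The INT64_MIN edge case is easiest to see when the magnitude is accumulated in an unsigned 64-bit value before the sign is applied. The following is a hedged sketch of that approach, not the PR's implementation: `toy_try_as_int64` is a hypothetical helper, and it assumes little-endian 30-bit digits with a separate sign.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sign is -1, 0, or +1; digits are little-endian,
 * 30 bits each. Returns false (no error set) when the value is out of range. */
static bool
toy_try_as_int64(int sign, const uint32_t *digits, size_t ndigits,
                 int64_t *out)
{
    assert(out != NULL);
    if (ndigits > 3) {
        return false;                 /* magnitude >= 2**90 */
    }
    if (ndigits == 3 && digits[2] > 8) {
        return false;                 /* too large; also keeps the shifts
                                         below from overflowing uint64_t */
    }
    uint64_t abs_val = 0;
    for (size_t i = ndigits; i-- > 0; ) {
        abs_val = (abs_val << 30) | digits[i];
    }
    const uint64_t min_magnitude = (uint64_t)INT64_MAX + 1;  /* 2**63 */
    if (sign < 0) {
        if (abs_val > min_magnitude) {
            return false;
        }
        /* -(int64_t)abs_val would overflow for 2**63, so map it directly. */
        *out = (abs_val == min_magnitude) ? INT64_MIN : -(int64_t)abs_val;
        return true;
    }
    if (abs_val > (uint64_t)INT64_MAX) {
        return false;
    }
    *out = (int64_t)abs_val;
    return true;
}
```

With this layout, INT64_MIN is the digit sequence `{0, 0, 8}` with a negative sign; the same digits with a positive sign are one past INT64_MAX and are rejected.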
Non-compact (2-digit) int results previously bypassed the freelist and called PyObject_Malloc directly. Add an `ints2` freelist alongside the existing `ints` (1-digit) freelist.

- `long_alloc(2)` checks `ints2` before `PyObject_Malloc`
- `_PyLong_ExactDealloc` and `long_dealloc` recycle exact 2-digit ints to `ints2` instead of immediately freeing them
- `_PyObject_ClearFreeLists` clears `ints2` the same way as `ints`
Extends the ints2 freelist pattern to 3-digit objects, which cover the range [2^60, 2^63-1] (positive) and [-2^63, -2^60] (negative) on 30-bit builds, including INT64_MAX, INT64_MIN, and nanosecond-precision timestamps. Also fuses the two _PyLong_IsCompact + _PyLong_DigitCount checks in long_dealloc under a single PyLong_CheckExact branch.

Benchmark (5M ops, 30-bit build):

- 2-digit+2-digit -> 3-digit result: 19.6 ns -> 17.0 ns (-13%)
- 3-digit+compact -> 3-digit result: 18.3 ns -> 15.4 ns (-16%)
- INT64_MAX + 0: 18.2 ns -> 15.9 ns (-13%)
- INT64_MIN + 0: 18.1 ns -> 16.2 ns (-10%)
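The per-digit-count freelist pattern described in these two commits can be illustrated with a small standalone sketch. Everything here is hypothetical: `toy_int`, `toy_int_alloc`, `toy_int_dealloc`, `toy_clear_freelists`, and the cap of 100 are illustrative choices, and the sketch uses plain `malloc`/`free` and a dedicated link field rather than CPython's allocator and freelist machinery.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_FREELIST_MAX 100        /* illustrative cap, not CPython's value */
#define TOY_MAX_CACHED_DIGITS 3     /* cache 1-, 2-, and 3-digit objects */

typedef struct toy_int {
    struct toy_int *next_free;      /* link, used only while cached */
    size_t ndigits;
    uint32_t digits[];              /* flexible array member */
} toy_int;

typedef struct {
    toy_int *head;
    size_t count;
} toy_freelist;

/* One list per digit count; index 0 is unused. */
static toy_freelist toy_lists[TOY_MAX_CACHED_DIGITS + 1];

static toy_int *
toy_int_alloc(size_t ndigits)
{
    if (ndigits >= 1 && ndigits <= TOY_MAX_CACHED_DIGITS) {
        toy_freelist *fl = &toy_lists[ndigits];
        if (fl->head != NULL) {     /* reuse a cached object of this size */
            toy_int *v = fl->head;
            fl->head = v->next_free;
            fl->count--;
            v->next_free = NULL;
            return v;
        }
    }
    toy_int *v = malloc(sizeof(toy_int) + ndigits * sizeof(uint32_t));
    if (v != NULL) {
        v->next_free = NULL;
        v->ndigits = ndigits;
    }
    return v;
}

static void
toy_int_dealloc(toy_int *v)
{
    assert(v != NULL);
    size_t n = v->ndigits;
    if (n >= 1 && n <= TOY_MAX_CACHED_DIGITS
        && toy_lists[n].count < TOY_FREELIST_MAX) {
        v->next_free = toy_lists[n].head;   /* recycle instead of freeing */
        toy_lists[n].head = v;
        toy_lists[n].count++;
        return;
    }
    free(v);
}

static void
toy_clear_freelists(void)
{
    for (size_t n = 1; n <= TOY_MAX_CACHED_DIGITS; n++) {
        while (toy_lists[n].head != NULL) {
            toy_int *v = toy_lists[n].head;
            toy_lists[n].head = v->next_free;
            free(v);
        }
        toy_lists[n].count = 0;
    }
}
```

Keeping one list per size means a recycled object never needs resizing, which is why the pattern extends naturally from one digit to two and three.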
…T-free

- Remove the dead `_BINARY_OP_ADD_INT` micro-op (no longer referenced by the macro); remove its abstract op from optimizer_bytecodes.c.
- Annotate `_GUARD_TOS_INT_WIDE`, `_GUARD_NOS_INT_WIDE`, and `_BINARY_OP_ADD_INT_WIDE` as `tier1`-only so the JIT executor and optimizer generator skip them entirely. The JIT defers to tier 1 for any `BINARY_OP_ADD_INT` trace; no new JIT code paths are introduced.
- Add a compact fast-path to `_PyCompactLong_AddWide` so compact-only int addition retains its original `medium_value` cost and avoids the int64-extraction overhead.
- Use `__builtin_add_overflow` in `_Py_i64_add_overflow` on GCC/Clang (single instruction on x86-64 / ARM64).
- Peel the last loop iteration in `_PyLong_TryAsInt64Exact` to hoist the max-digit overflow-guard out of the inner loop body.
Change the subtract specialization condition to accept exact ints in the full int64 range, matching the widened add path while keeping multiply compact-only.
skirpichev left a comment
As I said in the issue thread, I'm not sure this is worth the code complications.
Other than this, a few remarks:
- You should probably split this PR into several. For instance, the separate freelist addition looks unrelated.
- I don't think you should add the benchmark script to the sources. Just include this code in the PR description, for example.
Sure, I'll revert and update the numbers, though I think @peendebak mentioned he had an independent freelist PR somewhere, so I'll just drop mine entirely.
Sure, though it's standard practice to have a …
        specialize(instr, BINARY_OP_ADD_INT);
        return;
    }
    if (_PyLong_CheckExactAndFitsInt64(lhs) && _PyLong_CheckExactAndFitsInt64(rhs)) {
The performance gain in the PR is partly due to having specialized ops, and partly due to the special int64 arithmetic. What is the gain if we only do the int64 arithmetic (with a fast path in long_add)?
gh-151289: Add a wide int fast path for add/sub
This adds a separate fast path for exact PyLong add/sub operands that fit in signed 64-bit integers, while preserving the existing compact-int specialization.
This keeps the compact-int hot path unchanged and avoids broad opcode churn there, while allowing wide exact ints to bypass the slower generic long arithmetic path.
Performance: representative interpreter-only results with JIT disabled:
Related issue: