[codex] Add native pHash deduplication acceleration by vinay553 · Pull Request #464 · scaleapi/nucleus-python-client

vinay553 · 2026-06-08T22:14:27Z

Summary

Adds native C acceleration for the local deduplicate_by_phash utility and releases it as 0.18.5.

The public SDK API is unchanged. When nucleus._native_dedup is available, deduplicate_by_phash sends the sorted pHash values to native code and reconstructs the existing LocalDeduplicationResult from the returned kept indexes. If the extension is unavailable, the existing pure-Python implementation is still used.

Native Threshold Routing

Native code now handles every valid threshold:

0 <= threshold <= 11: chunked Hamming index
12 <= threshold <= 63: native linear scan using uint64_t XOR + popcount
threshold == 64: keep-first fast path

The threshold-index cap moves from 10 to 11 because floor(11 / 6) == 1, so it has the same one-bit chunk-radius behavior as threshold 10.

Packaging / CI Notes

This adds a Poetry build script backed by setuptools.Extension. The extension is marked optional=True, so environments that cannot compile the C module can still build/install the SDK and use the Python fallback.

The CircleCI tag-release workflow now builds release artifacts in separate jobs and publishes the combined dist/ directory:

source distribution
Linux x86_64 wheels for CPython 3.10 through 3.14
macOS universal2 wheels for CPython 3.10 through 3.14
Windows amd64 wheels for CPython 3.10 through 3.14

Each wheel job uses cibuildwheel and runs a smoke test that imports nucleus._native_dedup and verifies an indexed threshold result. The final pypi_publish job attaches the wheel/sdist workspace and uses the existing PyPI credentials context before running poetry publish.

Validation

NUCLEUS_PYTEST_API_KEY=dummy poetry run pytest tests/test_local_deduplication.py -q
NUCLEUS_PYTEST_API_KEY=dummy poetry run pre-commit run --all-files
poetry build
poetry run python -m pip wheel --no-deps --no-cache-dir -w /tmp/nucleus-native-0185-sdist-wheel dist/scale_nucleus-0.18.5.tar.gz
Installed local wheel and smoke-tested thresholds 0, 10, 11, 12, and 64 through the public API, plus direct native threshold 12 output.
Parsed .circleci/config.yml locally with PyYAML.

Greptile Summary

This PR adds a native C extension (nucleus._native_dedup) to accelerate pHash deduplication and releases it as version 0.18.6. The public Python API is unchanged; when the compiled extension is present, all thresholds (0–64) are handled natively, with a pure-Python fallback otherwise. A new multi-job CircleCI pipeline builds platform wheels (Linux x86_64, macOS universal2, Windows amd64) for Python 3.10–3.14 using cibuildwheel and publishes them alongside the sdist.

C extension (nucleus/_native_dedup.c): implements three paths — a two-partition chunked Hamming index for thresholds 0–11, a linear XOR+popcount scan for 12–63, and a keep-first fast path for threshold 64. Memory management and Python C API error handling are correct throughout.
Python integration (nucleus/local_deduplication.py): wraps the import in try/except ImportError, passes sorted phash_value integers to the native function, and reconstructs _DeduplicationRecord results from the returned index list.
CI (.circleci/config.yml): splits the old single pypi_publish job into build_sdist, three platform wheel builders, and a final pypi_publish that attaches the combined workspace before uploading.

Confidence Score: 5/5

Safe to merge. The C extension is mathematically correct, memory management is clean, and the Python fallback is properly guarded.

The two-partition chunked Hamming index holds its pigeonhole invariant for all handled thresholds (0–11), the linear scan and keep-first paths are straightforward, and every allocation/cleanup path in the C code is covered. The Python integration correctly wraps the import, passes sorted integer hashes, and reconstructs results from returned indexes. The CIBW_ environment-variable prefix is used correctly across all three platform wheel jobs.

.circleci/config.yml — the wheel jobs could be made to depend on build_sdist to avoid wasting CI minutes when the PyPI version check fails, but this does not affect correctness.

Important Files Changed

Filename	Overview
nucleus/_native_dedup.c	New C extension: three-path deduplication (chunked Hamming index, linear scan, keep-first). Memory allocation, error propagation, and cleanup are all correct; pigeonhole proof holds for chunk_radius guarantees.
nucleus/local_deduplication.py	Adds _deduplicate_with_native helper and bumps LOCAL_INDEX_MAX_THRESHOLD from 10 to 11 (consistent with C code). Native path is correctly short-circuited before the Python dispatch chain.
.circleci/config.yml	Adds four new jobs (build_sdist, three platform wheel builders, reorganised pypi_publish). CIBW_ prefix is used correctly in all three wheel jobs. Minor: wheel builds start in parallel with build_sdist, so they will run even if the sdist validation step fails.
build.py	Poetry build hook wires the C extension through setuptools with -O3 / -std=c11 on non-Windows and optional=True for graceful fallback.
tests/test_local_deduplication.py	New parametrised tests cover the native path via monkeypatching and, when the extension is built, directly validate index (10, 11) and linear-scan (12) thresholds.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[deduplicate_by_phash called] --> B[validate threshold]
    B --> C[normalize & sort records by phash_value, stable_id]
    C --> D{_NATIVE_DEDUP available?}
    D -- Yes --> E[_deduplicate_with_native]
    E --> F{threshold routing in C}
    F -- threshold == 64 --> G[deduplicate_keep_first]
    F -- 0 to 11 --> H[deduplicate_with_index two-partition chunked Hamming]
    F -- 12 to 63 --> I[deduplicate_with_linear_scan XOR + popcount]
    G & H & I --> J[return kept_indexes list]
    J --> K[reconstruct records from indexes]
    D -- No --> L{Python fallback dispatch}
    L -- threshold == 0 --> M[_deduplicate_exact]
    L -- threshold == 64 --> N[keep first record]
    L -- 1 to 11 --> O[_deduplicate_with_index Python HammingIndex]
    L -- 12 to 63 --> P[_deduplicate_with_linear_scan]
    K & M & N & O & P --> Q[build LocalDeduplicationResult]

Comments Outside Diff (1)

.circleci/config.yml, line 58-63 (link)

Wrong cibuildwheel environment variable prefix

All cibuildwheel configuration variables use the CIBW_ prefix (e.g., CIBW_BUILD, CIBW_SKIP, CIBW_ARCHS_LINUX, CIBW_TEST_COMMAND), not CIB_. With the wrong prefix the variables are silently ignored: cibuildwheel will build its default set of targets (all detected Python versions, including PyPy, and all default architectures), the musllinux skip won't apply, and the smoke test won't run. This affects all three wheel jobs (build_linux_wheels, build_macos_wheels, build_windows_wheels).

Replace every CIB_ prefix with CIBW_ across all three wheel jobs.

Prompt To Fix With AI

This is a comment left during a code review.
Path: .circleci/config.yml
Line: 58-63

Comment:
**Wrong `cibuildwheel` environment variable prefix**

All `cibuildwheel` configuration variables use the `CIBW_` prefix (e.g., `CIBW_BUILD`, `CIBW_SKIP`, `CIBW_ARCHS_LINUX`, `CIBW_TEST_COMMAND`), not `CIB_`. With the wrong prefix the variables are silently ignored: `cibuildwheel` will build its default set of targets (all detected Python versions, including PyPy, and all default architectures), the musllinux skip won't apply, and the smoke test won't run. This affects all three wheel jobs (`build_linux_wheels`, `build_macos_wheels`, `build_windows_wheels`).

Replace every `CIB_` prefix with `CIBW_` across all three wheel jobs.

How can I resolve this? If you propose a fix, please make it concise.

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
.circleci/config.yml:295-322
**Wheel builds run even when sdist validation fails**

`build_linux_wheels`, `build_macos_wheels`, and `build_windows_wheels` all require only `build_test`, not `build_sdist`. The `build_sdist` job performs the "Validate SDK Version Increment" check that exits early if the version already exists on PyPI — but since the three wheel jobs start in parallel with `build_sdist`, they will spin up and consume CI time even when that validation would have blocked the release. `pypi_publish` requires all four, so publishing is still blocked, but the wheel-build minutes are wasted.

Adding `build_sdist` to each wheel job's `requires` list would short-circuit the pipeline on a validation failure before wheel builds start.

_{Reviews (4): Last reviewed commit: "Address native dedup review nits" | Re-trigger Greptile}

vinay553

The naive implementation (that we use for large thresholds) is hopefully straightforward. The index implementation is a little complicated from an algo perspective, so I would try to ignore the algo itself (assume the algorithm works) and focus on making sure this is a correct implementation of the algorithm. At a basic mechanical level we are:

Storing an index of every chunk of 11 bits of each pHash.
Storing an index of every chunk of 11 bits of each pHash shifted by 8 bits.
For each new pHash, collect a list of all pHashes whose bits differ by the new one by less than 2 for ANY chunk
Among the list we got in the previous step narrow it down to those whose bits differ by the new one by less than 2 for ANY chunk in the rotated forms.
Doing a full 64 bit check among the remaining candidates.

These steps above aren't that complicated, but because you're not familiar with C there is a lot of boilerplate that may look confusing. I have added inline comments to explain this boiler plate but definitely lmk if something doesn't make sense.

vinay553 · 2026-06-09T17:09:54Z

+    Py_ssize_t size;
+    Py_ssize_t capacity;


These are bookkeeping variables needed to create any kind of dynamically sized list in python. Size is the current size of the list, capacity is the amount of memory allocated for this list, and values is actually just a pointer to the first value in the list. Subsequent elements in a list are stored as neighbors in the array so you can get the 5th element of the list by doing *(values + 5). C has a shortcut where you can do values[5] just like other languages, but putting this here to explain why the struct looks like this. Size and capacity need to be tracked in order to add new elements to the end and potentially re-allocate the memory for the array if the array doesn't fit in its memory slot anymore.

In general in this PR there are a few times you'll see something like *x, x_size, x_capacity. These are the boiler plate variables needed to essentially keep a dynamically sized array whose name is x, you should interpret them as "we are storing a list as a variable named x" in cases where we don't know the size of the list ahead of time.

This makes sense. My old C classes starting to come back to me lol. One small thing - you said "needed to create any kind of dynamically sized list in python" - do you mean "in c"?

vinay553 · 2026-06-09T17:25:13Z

+    int count = 0;
+    while (value != 0) {
+        value &= value - 1;
+        count++;
+    }
+    return count;


Neat and fast way of counting the number of 1s in a 64 bit number. Basically (number & (number - 1)) will drop the last set bit of that number. So for example:
13: 1101
13 & 12 = 1101 & 1100 = 1100 (12)
12 & 11 = 1100 & 1011 = 1000 (8)
8 & 7 = 1000 & 0111 = 0000 (0)

It's easy to see why the last bit gets dropped when you do (x & (x-1)) for odd numbers, and if you think about it a bit it's not hard to see why it also works for even numbers

It took 3 iterations to get to 0 so we know there's 3 bits that are 1 in 13. This is useful because finding the bitwise difference between 2 pHashes (a and b) is the same as finding the number of 1s in (a ^ b), a^b is very fast, and this technique is also fast.

vinay553 · 2026-06-09T17:26:23Z

+}
+
+static uint64_t rotate_right_64(uint64_t value, int bits) {
+    return (value >> bits) | (value << (64 - bits));


value >> bits — low bits fall off; upper bits shift down
value << (64 - bits) — those dropped bits move to the top
| — combine into one rotated value

vinay553 · 2026-06-09T17:33:59Z

+typedef struct {
+    Bucket *buckets;
+    Py_ssize_t bucket_count;
+} ChunkIndex;


List of a list of buckets. Bucket_count is either 2^10 or 2^11 depending on whether the chunk corresponds to a chunk size of 10 or 11 (see below comment for why)

vinay553 · 2026-06-09T17:41:55Z

+typedef struct {
+    int threshold;
+    int chunk_radius;
+    uint64_t *hashes;


Hashes are a list of kept hashes (see top comment to understand how *hashes is essentially List[int] hashes).

vinay553 · 2026-06-09T18:20:36Z

+
+        for (bit_index = 0; bit_index < CHUNK_BITS[chunk_index];
+             bit_index++) {
+            variant_value = chunk_value ^ (((uint64_t)1) << bit_index);


Bitwise operator way of perturbing each bit in case the radius > 0 (radius can only 0 or 1)

vinay553 · 2026-06-09T18:22:18Z

+    free(kept_indexes);
+    return result;
+}
+


All the code above is to support this indexing algo which is fast for low thresholds. Everything below is to handle other cases.

vinay553 · 2026-06-09T18:26:30Z

+static PyObject *deduplicate_with_linear_scan(
+    const uint64_t *phashes, Py_ssize_t input_count, int threshold
+) {
+    Py_ssize_t *kept_indexes = NULL;
+    Py_ssize_t i;


This is the "naive" way of doing deduplication where you just compare all new elements to everything kept. Lmk if something doesn't makes sense

vinay553 · 2026-06-09T18:27:01Z

+static PyObject *deduplicate_keep_first(Py_ssize_t input_count) {
+    PyObject *result;
+    PyObject *index_obj;
+


This is a "special case" when the threshold is 64 where we just return the first element.

vinay553 · 2026-06-09T19:53:32Z

+    return result;
+}
+
+static PyObject *native_deduplicate_phashes(


vinay553 · 2026-06-09T20:12:19Z

@greptile-apps review this pr

vinay553 · 2026-06-09T20:12:57Z

+        int("1" * 64, 2),
+    ]
+
+    assert _native_dedup.deduplicate_phashes(phashes, 10) == [0, 2, 3]


Explicitly testing the _native_dedup.

edwinpav

Overall looks great, just some curiosity/nit comments. Also the greptile comments in the pr summary seem valid - about adding tests for the new index threshold of 11.

edwinpav · 2026-06-12T21:42:09Z

+    Py_ssize_t size;
+    Py_ssize_t capacity;


This makes sense. My old C classes starting to come back to me lol. One small thing - you said "needed to create any kind of dynamically sized list in python" - do you mean "in c"?

edwinpav · 2026-06-12T21:44:28Z

+typedef enum {
+    CANDIDATE_MARK_UNMARKED = 0,   /* default; not involved in this query */
+    CANDIDATE_MARK_CANDIDATE = 1,  /* partition 0 hit; check in partition 1 */
+    CANDIDATE_MARK_REJECTED = 2,   /* partition 1 checked; not within threshold */
+} CandidateMark;


nit: why aren't the enum values just UNMARKED, CANDIDATE, REJECTED if the name of the enum is already CandidateMark? Is the current form standard C syntax?

Hm it's sort of just C convention, but these are really just numbers, you can use them directly like CANDIDATE_MARK_UNMARKED, not CandidateMark.CANDIDATE_MARK_UNMARKED.

edwinpav · 2026-06-12T21:48:28Z

+    int count = 0;
+    while (value != 0) {
+        value &= value - 1;
+        count++;
+    }
+    return count;


edwinpav · 2026-06-12T21:54:58Z

+    Py_ssize_t *touched_indexes;
+    Py_ssize_t touched_count;
+    Py_ssize_t touched_capacity;
+    ChunkIndex indexes[PARTITION_COUNT][CHUNK_COUNT];


So is my understanding correct? "indexes[0][2] slot 41 holding [235, 876] means "kept hashes #235 and #876 both have value 41 in chunk 2 of the un-rotated hash.""

edwinpav · 2026-06-12T21:59:28Z

+    return (int)__popcnt64(value);
+#elif defined(__GNUC__) || defined(__clang__)
+    return __builtin_popcountll(value);
+#else


nit: maybe add that this is Kernighan's algorithm as a comment: // Kernighan's algorithm

edwinpav · 2026-06-15T15:08:44Z

-  pypi_publish:
+  build_sdist:
    docker:
      - image: cimg/python:3.10


nit: should this be the same as line 14? - image: python:3.10-bullseye

nah it doesn't matter, the previous tests used the debian bullseye image so I just stuck to that convention

edwinpav · 2026-06-15T15:27:11Z

+    native_unique_records = _deduplicate_with_native(records, threshold)
+    if native_unique_records is not None:
+        unique_records = native_unique_records
+    elif threshold == 0:


So if the native c dedup throws at runtime, we want it to error out and not fallback to python, right?

Yeah, c only really throws for OOM which would be an issue in python as well

edwinpav · 2026-06-15T15:28:30Z

+from collections.abc import Iterable
+
+def deduplicate_phashes(
+    phashes: Iterable[int], threshold: int


seems like the c code uses PySequence_Fast, would Sequence[int] be better for type of phashes?

PySequence_Fast kind of a misnomer, it has nothing to do with Sequences. Since Sequences are a more constrained type than Iterable, I'm gonna leave it as is. you could pass any kind of iterable here and we'd be able to convert it to a PySequence_Fast object which is more permissive than a Sequence.

edwinpav · 2026-06-15T15:30:13Z

+    int is_duplicate;
+    PyObject *result = NULL;
+
+    memset(&index, 0, sizeof(index));


Doesn't this happen inside of hamming_index_init already? This might be duplicated logic?

edwinpav · 2026-06-15T15:38:23Z

+    Py_ssize_t i;
+    int threshold;
+
+    (void)self;


instead of this, can you have the arg be PyObject *Py_UNUSED(self)?

Replace magic 0/1/2 literals with a documented CandidateMark enum so Hamming-index query state is easier to follow. Co-authored-by: Cursor <cursoragent@cursor.com>

vinay553 force-pushed the codex/native-dedup-acceleration branch 2 times, most recently from 8b46947 to 268e9f6 Compare June 8, 2026 22:25

vinay553 marked this pull request as ready for review June 8, 2026 22:38

vinay553 force-pushed the codex/native-dedup-acceleration branch from 268e9f6 to f054278 Compare June 8, 2026 22:40

vinay553 commented Jun 9, 2026

View reviewed changes

vinay553 force-pushed the codex/native-dedup-acceleration branch from f054278 to 846c5a6 Compare June 9, 2026 20:06

vinay553 requested a review from edwinpav June 9, 2026 20:12

vinay553 commented Jun 9, 2026

View reviewed changes

edwinpav approved these changes Jun 15, 2026

View reviewed changes

vinay553 and others added 3 commits June 15, 2026 16:37

Add native pHash deduplication acceleration

e61de6b

Name candidate mark states in native dedup index.

539909f

Replace magic 0/1/2 literals with a documented CandidateMark enum so Hamming-index query state is easier to follow. Co-authored-by: Cursor <cursoragent@cursor.com>

Address native dedup review nits

fc377e0

vinay553 force-pushed the codex/native-dedup-acceleration branch from 253f2a3 to fc377e0 Compare June 15, 2026 20:38

vinay553 merged commit 6d4c2a7 into master Jun 17, 2026
9 checks passed

vinay553 deleted the codex/native-dedup-acceleration branch June 17, 2026 13:46

Conversation

vinay553 commented Jun 8, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Native Threshold Routing

Packaging / CI Notes

Validation

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Comments Outside Diff (1)

Uh oh!

vinay553 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vinay553 Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vinay553 commented Jun 9, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

edwinpav left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vinay553 commented Jun 8, 2026 •

edited by greptile-apps Bot

Loading

vinay553 left a comment •

edited

Loading

vinay553 Jun 9, 2026 •

edited

Loading