Skip to content

[codex] Add native pHash deduplication acceleration#464

Merged
vinay553 merged 3 commits into
masterfrom
codex/native-dedup-acceleration
Jun 17, 2026
Merged

[codex] Add native pHash deduplication acceleration#464
vinay553 merged 3 commits into
masterfrom
codex/native-dedup-acceleration

Conversation

@vinay553

@vinay553 vinay553 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds native C acceleration for the local deduplicate_by_phash utility and releases it as 0.18.5.

The public SDK API is unchanged. When nucleus._native_dedup is available, deduplicate_by_phash sends the sorted pHash values to native code and reconstructs the existing LocalDeduplicationResult from the returned kept indexes. If the extension is unavailable, the existing pure-Python implementation is still used.

Native Threshold Routing

Native code now handles every valid threshold:

  • 0 <= threshold <= 11: chunked Hamming index
  • 12 <= threshold <= 63: native linear scan using uint64_t XOR + popcount
  • threshold == 64: keep-first fast path

The threshold-index cap moves from 10 to 11 because floor(11 / 6) == 1, so it has the same one-bit chunk-radius behavior as threshold 10.

Packaging / CI Notes

This adds a Poetry build script backed by setuptools.Extension. The extension is marked optional=True, so environments that cannot compile the C module can still build/install the SDK and use the Python fallback.

The CircleCI tag-release workflow now builds release artifacts in separate jobs and publishes the combined dist/ directory:

  • source distribution
  • Linux x86_64 wheels for CPython 3.10 through 3.14
  • macOS universal2 wheels for CPython 3.10 through 3.14
  • Windows amd64 wheels for CPython 3.10 through 3.14

Each wheel job uses cibuildwheel and runs a smoke test that imports nucleus._native_dedup and verifies an indexed threshold result. The final pypi_publish job attaches the wheel/sdist workspace and uses the existing PyPI credentials context before running poetry publish.

Validation

  • NUCLEUS_PYTEST_API_KEY=dummy poetry run pytest tests/test_local_deduplication.py -q
  • NUCLEUS_PYTEST_API_KEY=dummy poetry run pre-commit run --all-files
  • poetry build
  • poetry run python -m pip wheel --no-deps --no-cache-dir -w /tmp/nucleus-native-0185-sdist-wheel dist/scale_nucleus-0.18.5.tar.gz
  • Installed local wheel and smoke-tested thresholds 0, 10, 11, 12, and 64 through the public API, plus direct native threshold 12 output.
  • Parsed .circleci/config.yml locally with PyYAML.

Greptile Summary

This PR adds a native C extension (nucleus._native_dedup) to accelerate pHash deduplication and releases it as version 0.18.6. The public Python API is unchanged; when the compiled extension is present, all thresholds (0–64) are handled natively, with a pure-Python fallback otherwise. A new multi-job CircleCI pipeline builds platform wheels (Linux x86_64, macOS universal2, Windows amd64) for Python 3.10–3.14 using cibuildwheel and publishes them alongside the sdist.

  • C extension (nucleus/_native_dedup.c): implements three paths — a two-partition chunked Hamming index for thresholds 0–11, a linear XOR+popcount scan for 12–63, and a keep-first fast path for threshold 64. Memory management and Python C API error handling are correct throughout.
  • Python integration (nucleus/local_deduplication.py): wraps the import in try/except ImportError, passes sorted phash_value integers to the native function, and reconstructs _DeduplicationRecord results from the returned index list.
  • CI (.circleci/config.yml): splits the old single pypi_publish job into build_sdist, three platform wheel builders, and a final pypi_publish that attaches the combined workspace before uploading.

Confidence Score: 5/5

Safe to merge. The C extension is mathematically correct, memory management is clean, and the Python fallback is properly guarded.

The two-partition chunked Hamming index holds its pigeonhole invariant for all handled thresholds (0–11), the linear scan and keep-first paths are straightforward, and every allocation/cleanup path in the C code is covered. The Python integration correctly wraps the import, passes sorted integer hashes, and reconstructs results from returned indexes. The CIBW_ environment-variable prefix is used correctly across all three platform wheel jobs.

.circleci/config.yml — the wheel jobs could be made to depend on build_sdist to avoid wasting CI minutes when the PyPI version check fails, but this does not affect correctness.

Important Files Changed

Filename Overview
nucleus/_native_dedup.c New C extension: three-path deduplication (chunked Hamming index, linear scan, keep-first). Memory allocation, error propagation, and cleanup are all correct; pigeonhole proof holds for chunk_radius guarantees.
nucleus/local_deduplication.py Adds _deduplicate_with_native helper and bumps LOCAL_INDEX_MAX_THRESHOLD from 10 to 11 (consistent with C code). Native path is correctly short-circuited before the Python dispatch chain.
.circleci/config.yml Adds four new jobs (build_sdist, three platform wheel builders, reorganised pypi_publish). CIBW_ prefix is used correctly in all three wheel jobs. Minor: wheel builds start in parallel with build_sdist, so they will run even if the sdist validation step fails.
build.py Poetry build hook wires the C extension through setuptools with -O3 / -std=c11 on non-Windows and optional=True for graceful fallback.
tests/test_local_deduplication.py New parametrised tests cover the native path via monkeypatching and, when the extension is built, directly validate index (10, 11) and linear-scan (12) thresholds.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[deduplicate_by_phash called] --> B[validate threshold]
    B --> C[normalize & sort records by phash_value, stable_id]
    C --> D{_NATIVE_DEDUP available?}
    D -- Yes --> E[_deduplicate_with_native]
    E --> F{threshold routing in C}
    F -- threshold == 64 --> G[deduplicate_keep_first]
    F -- 0 to 11 --> H[deduplicate_with_index two-partition chunked Hamming]
    F -- 12 to 63 --> I[deduplicate_with_linear_scan XOR + popcount]
    G & H & I --> J[return kept_indexes list]
    J --> K[reconstruct records from indexes]
    D -- No --> L{Python fallback dispatch}
    L -- threshold == 0 --> M[_deduplicate_exact]
    L -- threshold == 64 --> N[keep first record]
    L -- 1 to 11 --> O[_deduplicate_with_index Python HammingIndex]
    L -- 12 to 63 --> P[_deduplicate_with_linear_scan]
    K & M & N & O & P --> Q[build LocalDeduplicationResult]
Loading

Comments Outside Diff (1)

  1. .circleci/config.yml, line 58-63 (link)

    P1 Wrong cibuildwheel environment variable prefix

    All cibuildwheel configuration variables use the CIBW_ prefix (e.g., CIBW_BUILD, CIBW_SKIP, CIBW_ARCHS_LINUX, CIBW_TEST_COMMAND), not CIB_. With the wrong prefix the variables are silently ignored: cibuildwheel will build its default set of targets (all detected Python versions, including PyPy, and all default architectures), the musllinux skip won't apply, and the smoke test won't run. This affects all three wheel jobs (build_linux_wheels, build_macos_wheels, build_windows_wheels).

    Replace every CIB_ prefix with CIBW_ across all three wheel jobs.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: .circleci/config.yml
    Line: 58-63
    
    Comment:
    **Wrong `cibuildwheel` environment variable prefix**
    
    All `cibuildwheel` configuration variables use the `CIBW_` prefix (e.g., `CIBW_BUILD`, `CIBW_SKIP`, `CIBW_ARCHS_LINUX`, `CIBW_TEST_COMMAND`), not `CIB_`. With the wrong prefix the variables are silently ignored: `cibuildwheel` will build its default set of targets (all detected Python versions, including PyPy, and all default architectures), the musllinux skip won't apply, and the smoke test won't run. This affects all three wheel jobs (`build_linux_wheels`, `build_macos_wheels`, `build_windows_wheels`).
    
    Replace every `CIB_` prefix with `CIBW_` across all three wheel jobs.
    
    How can I resolve this? If you propose a fix, please make it concise.

    Fix in Cursor Fix in Claude Code Fix in Codex

Fix All in Cursor Fix All in Claude Code Fix All in Codex

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
.circleci/config.yml:295-322
**Wheel builds run even when sdist validation fails**

`build_linux_wheels`, `build_macos_wheels`, and `build_windows_wheels` all require only `build_test`, not `build_sdist`. The `build_sdist` job performs the "Validate SDK Version Increment" check that exits early if the version already exists on PyPI — but since the three wheel jobs start in parallel with `build_sdist`, they will spin up and consume CI time even when that validation would have blocked the release. `pypi_publish` requires all four, so publishing is still blocked, but the wheel-build minutes are wasted.

Adding `build_sdist` to each wheel job's `requires` list would short-circuit the pipeline on a validation failure before wheel builds start.

Reviews (4): Last reviewed commit: "Address native dedup review nits" | Re-trigger Greptile

@vinay553 vinay553 force-pushed the codex/native-dedup-acceleration branch 2 times, most recently from 8b46947 to 268e9f6 Compare June 8, 2026 22:25
@vinay553 vinay553 marked this pull request as ready for review June 8, 2026 22:38
@vinay553 vinay553 force-pushed the codex/native-dedup-acceleration branch from 268e9f6 to f054278 Compare June 8, 2026 22:40

@vinay553 vinay553 left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naive implementation (that we use for large thresholds) is hopefully straightforward. The index implementation is a little complicated from an algo perspective, so I would try to ignore the algo itself (assume the algorithm works) and focus on making sure this is a correct implementation of the algorithm. At a basic mechanical level we are:

  1. Storing an index of every chunk of 11 bits of each pHash.
  2. Storing an index of every chunk of 11 bits of each pHash shifted by 8 bits.
  3. For each new pHash, collect a list of all pHashes whose bits differ by the new one by less than 2 for ANY chunk
  4. Among the list we got in the previous step narrow it down to those whose bits differ by the new one by less than 2 for ANY chunk in the rotated forms.
  5. Doing a full 64 bit check among the remaining candidates.

These steps above aren't that complicated, but because you're not familiar with C there is a lot of boilerplate that may look confusing. I have added inline comments to explain this boiler plate but definitely lmk if something doesn't make sense.

Comment thread nucleus/_native_dedup.c
Comment on lines +22 to +23
Py_ssize_t size;
Py_ssize_t capacity;

@vinay553 vinay553 Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are bookkeeping variables needed to create any kind of dynamically sized list in python. Size is the current size of the list, capacity is the amount of memory allocated for this list, and values is actually just a pointer to the first value in the list. Subsequent elements in a list are stored as neighbors in the array so you can get the 5th element of the list by doing *(values + 5). C has a shortcut where you can do values[5] just like other languages, but putting this here to explain why the struct looks like this. Size and capacity need to be tracked in order to add new elements to the end and potentially re-allocate the memory for the array if the array doesn't fit in its memory slot anymore.

In general in this PR there are a few times you'll see something like *x, x_size, x_capacity. These are the boiler plate variables needed to essentially keep a dynamically sized array whose name is x, you should interpret them as "we are storing a list as a variable named x" in cases where we don't know the size of the list ahead of time.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. My old C classes starting to come back to me lol. One small thing - you said "needed to create any kind of dynamically sized list in python" - do you mean "in c"?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup sorry

Comment thread nucleus/_native_dedup.c
Comment on lines +50 to +55
int count = 0;
while (value != 0) {
value &= value - 1;
count++;
}
return count;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat and fast way of counting the number of 1s in a 64 bit number. Basically (number & (number - 1)) will drop the last set bit of that number. So for example:
13: 1101
13 & 12 = 1101 & 1100 = 1100 (12)
12 & 11 = 1100 & 1011 = 1000 (8)
8 & 7 = 1000 & 0111 = 0000 (0)

It's easy to see why the last bit gets dropped when you do (x & (x-1)) for odd numbers, and if you think about it a bit it's not hard to see why it also works for even numbers

It took 3 iterations to get to 0 so we know there's 3 bits that are 1 in 13. This is useful because finding the bitwise difference between 2 pHashes (a and b) is the same as finding the number of 1s in (a ^ b), a^b is very fast, and this technique is also fast.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woah

Comment thread nucleus/_native_dedup.c
}

static uint64_t rotate_right_64(uint64_t value, int bits) {
return (value >> bits) | (value << (64 - bits));

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value >> bits — low bits fall off; upper bits shift down
value << (64 - bits) — those dropped bits move to the top
| — combine into one rotated value

Comment thread nucleus/_native_dedup.c
typedef struct {
Bucket *buckets;
Py_ssize_t bucket_count;
} ChunkIndex;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List of a list of buckets. Bucket_count is either 2^10 or 2^11 depending on whether the chunk corresponds to a chunk size of 10 or 11 (see below comment for why)

Comment thread nucleus/_native_dedup.c
typedef struct {
int threshold;
int chunk_radius;
uint64_t *hashes;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hashes are a list of kept hashes (see top comment to understand how *hashes is essentially List[int] hashes).

Comment thread nucleus/_native_dedup.c

for (bit_index = 0; bit_index < CHUNK_BITS[chunk_index];
bit_index++) {
variant_value = chunk_value ^ (((uint64_t)1) << bit_index);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bitwise operator way of perturbing each bit in case the radius > 0 (radius can only 0 or 1)

Comment thread nucleus/_native_dedup.c
free(kept_indexes);
return result;
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the code above is to support this indexing algo which is fast for low thresholds. Everything below is to handle other cases.

Comment thread nucleus/_native_dedup.c
Comment on lines +442 to +446
static PyObject *deduplicate_with_linear_scan(
const uint64_t *phashes, Py_ssize_t input_count, int threshold
) {
Py_ssize_t *kept_indexes = NULL;
Py_ssize_t i;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the "naive" way of doing deduplication where you just compare all new elements to everything kept. Lmk if something doesn't makes sense

Comment thread nucleus/_native_dedup.c
Comment on lines +488 to +491
static PyObject *deduplicate_keep_first(Py_ssize_t input_count) {
PyObject *result;
PyObject *index_obj;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a "special case" when the threshold is 64 where we just return the first element.

Comment thread nucleus/_native_dedup.c
return result;
}

static PyObject *native_deduplicate_phashes(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

entrypoint

@vinay553 vinay553 force-pushed the codex/native-dedup-acceleration branch from f054278 to 846c5a6 Compare June 9, 2026 20:06
@vinay553

vinay553 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

@greptile-apps review this pr

@vinay553 vinay553 requested a review from edwinpav June 9, 2026 20:12
Comment thread tests/test_local_deduplication.py Outdated
int("1" * 64, 2),
]

assert _native_dedup.deduplicate_phashes(phashes, 10) == [0, 2, 3]

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explicitly testing the _native_dedup.

@edwinpav edwinpav left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks great, just some curiosity/nit comments. Also the greptile comments in the pr summary seem valid - about adding tests for the new index threshold of 11.

Comment thread nucleus/_native_dedup.c
Comment on lines +22 to +23
Py_ssize_t size;
Py_ssize_t capacity;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. My old C classes starting to come back to me lol. One small thing - you said "needed to create any kind of dynamically sized list in python" - do you mean "in c"?

Comment thread nucleus/_native_dedup.c
Comment on lines +20 to +24
typedef enum {
CANDIDATE_MARK_UNMARKED = 0, /* default; not involved in this query */
CANDIDATE_MARK_CANDIDATE = 1, /* partition 0 hit; check in partition 1 */
CANDIDATE_MARK_REJECTED = 2, /* partition 1 checked; not within threshold */
} CandidateMark;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why aren't the enum values just UNMARKED, CANDIDATE, REJECTED if the name of the enum is already CandidateMark? Is the current form standard C syntax?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm it's sort of just C convention, but these are really just numbers, you can use them directly like CANDIDATE_MARK_UNMARKED, not CandidateMark.CANDIDATE_MARK_UNMARKED.

Comment thread nucleus/_native_dedup.c
Comment on lines +50 to +55
int count = 0;
while (value != 0) {
value &= value - 1;
count++;
}
return count;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woah

Comment thread nucleus/_native_dedup.c
Py_ssize_t *touched_indexes;
Py_ssize_t touched_count;
Py_ssize_t touched_capacity;
ChunkIndex indexes[PARTITION_COUNT][CHUNK_COUNT];

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is my understanding correct? "indexes[0][2] slot 41 holding [235, 876] means "kept hashes #235 and #876 both have value 41 in chunk 2 of the un-rotated hash.""

Comment thread nucleus/_native_dedup.c
return (int)__popcnt64(value);
#elif defined(__GNUC__) || defined(__clang__)
return __builtin_popcountll(value);
#else

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe add that this is Kernighan's algorithm as a comment: // Kernighan's algorithm

Comment thread .circleci/config.yml
pypi_publish:
build_sdist:
docker:
- image: cimg/python:3.10

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should this be the same as line 14? - image: python:3.10-bullseye

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nah it doesn't matter, the previous tests used the debian bullseye image so I just stuck to that convention

Comment on lines +213 to +216
native_unique_records = _deduplicate_with_native(records, threshold)
if native_unique_records is not None:
unique_records = native_unique_records
elif threshold == 0:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if the native c dedup throws at runtime, we want it to error out and not fallback to python, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, c only really throws for OOM which would be an issue in python as well

Comment thread nucleus/_native_dedup.pyi
from collections.abc import Iterable

def deduplicate_phashes(
phashes: Iterable[int], threshold: int

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like the c code uses PySequence_Fast, would Sequence[int] be better for type of phashes?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PySequence_Fast kind of a misnomer, it has nothing to do with Sequences. Since Sequences are a more constrained type than Iterable, I'm gonna leave it as is. you could pass any kind of iterable here and we'd be able to convert it to a PySequence_Fast object which is more permissive than a Sequence.

Comment thread nucleus/_native_dedup.c Outdated
int is_duplicate;
PyObject *result = NULL;

memset(&index, 0, sizeof(index));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this happen inside of hamming_index_init already? This might be duplicated logic?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point

Comment thread nucleus/_native_dedup.c Outdated
Py_ssize_t i;
int threshold;

(void)self;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of this, can you have the arg be PyObject *Py_UNUSED(self)?

vinay553 and others added 3 commits June 15, 2026 16:37
Replace magic 0/1/2 literals with a documented CandidateMark enum so Hamming-index query state is easier to follow.

Co-authored-by: Cursor <cursoragent@cursor.com>
@vinay553 vinay553 force-pushed the codex/native-dedup-acceleration branch from 253f2a3 to fc377e0 Compare June 15, 2026 20:38
@vinay553 vinay553 merged commit 6d4c2a7 into master Jun 17, 2026
9 checks passed
@vinay553 vinay553 deleted the codex/native-dedup-acceleration branch June 17, 2026 13:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants