[codex] Add native pHash deduplication acceleration#464
Conversation
8b46947 to
268e9f6
Compare
268e9f6 to
f054278
Compare
There was a problem hiding this comment.
The naive implementation (that we use for large thresholds) is hopefully straightforward. The index implementation is a little complicated from an algo perspective, so I would try to ignore the algo itself (assume the algorithm works) and focus on making sure this is a correct implementation of the algorithm. At a basic mechanical level we are:
- Storing an index of every chunk of 11 bits of each pHash.
- Storing an index of every chunk of 11 bits of each pHash shifted by 8 bits.
- For each new pHash, collect a list of all pHashes whose bits differ by the new one by less than 2 for ANY chunk
- Among the list we got in the previous step narrow it down to those whose bits differ by the new one by less than 2 for ANY chunk in the rotated forms.
- Doing a full 64 bit check among the remaining candidates.
These steps above aren't that complicated, but because you're not familiar with C there is a lot of boilerplate that may look confusing. I have added inline comments to explain this boiler plate but definitely lmk if something doesn't make sense.
| Py_ssize_t size; | ||
| Py_ssize_t capacity; |
There was a problem hiding this comment.
These are bookkeeping variables needed to create any kind of dynamically sized list in python. Size is the current size of the list, capacity is the amount of memory allocated for this list, and values is actually just a pointer to the first value in the list. Subsequent elements in a list are stored as neighbors in the array so you can get the 5th element of the list by doing *(values + 5). C has a shortcut where you can do values[5] just like other languages, but putting this here to explain why the struct looks like this. Size and capacity need to be tracked in order to add new elements to the end and potentially re-allocate the memory for the array if the array doesn't fit in its memory slot anymore.
In general in this PR there are a few times you'll see something like *x, x_size, x_capacity. These are the boiler plate variables needed to essentially keep a dynamically sized array whose name is x, you should interpret them as "we are storing a list as a variable named x" in cases where we don't know the size of the list ahead of time.
There was a problem hiding this comment.
This makes sense. My old C classes starting to come back to me lol. One small thing - you said "needed to create any kind of dynamically sized list in python" - do you mean "in c"?
| int count = 0; | ||
| while (value != 0) { | ||
| value &= value - 1; | ||
| count++; | ||
| } | ||
| return count; |
There was a problem hiding this comment.
Neat and fast way of counting the number of 1s in a 64 bit number. Basically (number & (number - 1)) will drop the last set bit of that number. So for example:
13: 1101
13 & 12 = 1101 & 1100 = 1100 (12)
12 & 11 = 1100 & 1011 = 1000 (8)
8 & 7 = 1000 & 0111 = 0000 (0)
It's easy to see why the last bit gets dropped when you do (x & (x-1)) for odd numbers, and if you think about it a bit it's not hard to see why it also works for even numbers
It took 3 iterations to get to 0 so we know there's 3 bits that are 1 in 13. This is useful because finding the bitwise difference between 2 pHashes (a and b) is the same as finding the number of 1s in (a ^ b), a^b is very fast, and this technique is also fast.
| } | ||
|
|
||
| static uint64_t rotate_right_64(uint64_t value, int bits) { | ||
| return (value >> bits) | (value << (64 - bits)); |
There was a problem hiding this comment.
value >> bits — low bits fall off; upper bits shift down
value << (64 - bits) — those dropped bits move to the top
| — combine into one rotated value
| typedef struct { | ||
| Bucket *buckets; | ||
| Py_ssize_t bucket_count; | ||
| } ChunkIndex; |
There was a problem hiding this comment.
List of a list of buckets. Bucket_count is either 2^10 or 2^11 depending on whether the chunk corresponds to a chunk size of 10 or 11 (see below comment for why)
| typedef struct { | ||
| int threshold; | ||
| int chunk_radius; | ||
| uint64_t *hashes; |
There was a problem hiding this comment.
Hashes are a list of kept hashes (see top comment to understand how *hashes is essentially List[int] hashes).
|
|
||
| for (bit_index = 0; bit_index < CHUNK_BITS[chunk_index]; | ||
| bit_index++) { | ||
| variant_value = chunk_value ^ (((uint64_t)1) << bit_index); |
There was a problem hiding this comment.
Bitwise operator way of perturbing each bit in case the radius > 0 (radius can only 0 or 1)
| free(kept_indexes); | ||
| return result; | ||
| } | ||
|
|
There was a problem hiding this comment.
All the code above is to support this indexing algo which is fast for low thresholds. Everything below is to handle other cases.
| static PyObject *deduplicate_with_linear_scan( | ||
| const uint64_t *phashes, Py_ssize_t input_count, int threshold | ||
| ) { | ||
| Py_ssize_t *kept_indexes = NULL; | ||
| Py_ssize_t i; |
There was a problem hiding this comment.
This is the "naive" way of doing deduplication where you just compare all new elements to everything kept. Lmk if something doesn't makes sense
| static PyObject *deduplicate_keep_first(Py_ssize_t input_count) { | ||
| PyObject *result; | ||
| PyObject *index_obj; | ||
|
|
There was a problem hiding this comment.
This is a "special case" when the threshold is 64 where we just return the first element.
| return result; | ||
| } | ||
|
|
||
| static PyObject *native_deduplicate_phashes( |
f054278 to
846c5a6
Compare
|
@greptile-apps review this pr |
| int("1" * 64, 2), | ||
| ] | ||
|
|
||
| assert _native_dedup.deduplicate_phashes(phashes, 10) == [0, 2, 3] |
There was a problem hiding this comment.
Explicitly testing the _native_dedup.
edwinpav
left a comment
There was a problem hiding this comment.
Overall looks great, just some curiosity/nit comments. Also the greptile comments in the pr summary seem valid - about adding tests for the new index threshold of 11.
| Py_ssize_t size; | ||
| Py_ssize_t capacity; |
There was a problem hiding this comment.
This makes sense. My old C classes starting to come back to me lol. One small thing - you said "needed to create any kind of dynamically sized list in python" - do you mean "in c"?
| typedef enum { | ||
| CANDIDATE_MARK_UNMARKED = 0, /* default; not involved in this query */ | ||
| CANDIDATE_MARK_CANDIDATE = 1, /* partition 0 hit; check in partition 1 */ | ||
| CANDIDATE_MARK_REJECTED = 2, /* partition 1 checked; not within threshold */ | ||
| } CandidateMark; |
There was a problem hiding this comment.
nit: why aren't the enum values just UNMARKED, CANDIDATE, REJECTED if the name of the enum is already CandidateMark? Is the current form standard C syntax?
There was a problem hiding this comment.
Hm it's sort of just C convention, but these are really just numbers, you can use them directly like CANDIDATE_MARK_UNMARKED, not CandidateMark.CANDIDATE_MARK_UNMARKED.
| int count = 0; | ||
| while (value != 0) { | ||
| value &= value - 1; | ||
| count++; | ||
| } | ||
| return count; |
| Py_ssize_t *touched_indexes; | ||
| Py_ssize_t touched_count; | ||
| Py_ssize_t touched_capacity; | ||
| ChunkIndex indexes[PARTITION_COUNT][CHUNK_COUNT]; |
There was a problem hiding this comment.
So is my understanding correct? "indexes[0][2] slot 41 holding [235, 876] means "kept hashes #235 and #876 both have value 41 in chunk 2 of the un-rotated hash.""
| return (int)__popcnt64(value); | ||
| #elif defined(__GNUC__) || defined(__clang__) | ||
| return __builtin_popcountll(value); | ||
| #else |
There was a problem hiding this comment.
nit: maybe add that this is Kernighan's algorithm as a comment: // Kernighan's algorithm
| pypi_publish: | ||
| build_sdist: | ||
| docker: | ||
| - image: cimg/python:3.10 |
There was a problem hiding this comment.
nit: should this be the same as line 14? - image: python:3.10-bullseye
There was a problem hiding this comment.
nah it doesn't matter, the previous tests used the debian bullseye image so I just stuck to that convention
| native_unique_records = _deduplicate_with_native(records, threshold) | ||
| if native_unique_records is not None: | ||
| unique_records = native_unique_records | ||
| elif threshold == 0: |
There was a problem hiding this comment.
So if the native c dedup throws at runtime, we want it to error out and not fallback to python, right?
There was a problem hiding this comment.
Yeah, c only really throws for OOM which would be an issue in python as well
| from collections.abc import Iterable | ||
|
|
||
| def deduplicate_phashes( | ||
| phashes: Iterable[int], threshold: int |
There was a problem hiding this comment.
seems like the c code uses PySequence_Fast, would Sequence[int] be better for type of phashes?
There was a problem hiding this comment.
PySequence_Fast kind of a misnomer, it has nothing to do with Sequences. Since Sequences are a more constrained type than Iterable, I'm gonna leave it as is. you could pass any kind of iterable here and we'd be able to convert it to a PySequence_Fast object which is more permissive than a Sequence.
| int is_duplicate; | ||
| PyObject *result = NULL; | ||
|
|
||
| memset(&index, 0, sizeof(index)); |
There was a problem hiding this comment.
Doesn't this happen inside of hamming_index_init already? This might be duplicated logic?
| Py_ssize_t i; | ||
| int threshold; | ||
|
|
||
| (void)self; |
There was a problem hiding this comment.
instead of this, can you have the arg be PyObject *Py_UNUSED(self)?
Replace magic 0/1/2 literals with a documented CandidateMark enum so Hamming-index query state is easier to follow. Co-authored-by: Cursor <cursoragent@cursor.com>
253f2a3 to
fc377e0
Compare
Summary
Adds native C acceleration for the local
deduplicate_by_phashutility and releases it as0.18.5.The public SDK API is unchanged. When
nucleus._native_dedupis available,deduplicate_by_phashsends the sorted pHash values to native code and reconstructs the existingLocalDeduplicationResultfrom the returned kept indexes. If the extension is unavailable, the existing pure-Python implementation is still used.Native Threshold Routing
Native code now handles every valid threshold:
0 <= threshold <= 11: chunked Hamming index12 <= threshold <= 63: native linear scan usinguint64_tXOR + popcountthreshold == 64: keep-first fast pathThe threshold-index cap moves from
10to11becausefloor(11 / 6) == 1, so it has the same one-bit chunk-radius behavior as threshold10.Packaging / CI Notes
This adds a Poetry build script backed by
setuptools.Extension. The extension is markedoptional=True, so environments that cannot compile the C module can still build/install the SDK and use the Python fallback.The CircleCI tag-release workflow now builds release artifacts in separate jobs and publishes the combined
dist/directory:x86_64wheels for CPython 3.10 through 3.14universal2wheels for CPython 3.10 through 3.14amd64wheels for CPython 3.10 through 3.14Each wheel job uses
cibuildwheeland runs a smoke test that importsnucleus._native_dedupand verifies an indexed threshold result. The finalpypi_publishjob attaches the wheel/sdist workspace and uses the existing PyPI credentials context before runningpoetry publish.Validation
NUCLEUS_PYTEST_API_KEY=dummy poetry run pytest tests/test_local_deduplication.py -qNUCLEUS_PYTEST_API_KEY=dummy poetry run pre-commit run --all-filespoetry buildpoetry run python -m pip wheel --no-deps --no-cache-dir -w /tmp/nucleus-native-0185-sdist-wheel dist/scale_nucleus-0.18.5.tar.gz0,10,11,12, and64through the public API, plus direct native threshold12output..circleci/config.ymllocally with PyYAML.Greptile Summary
This PR adds a native C extension (
nucleus._native_dedup) to accelerate pHash deduplication and releases it as version0.18.6. The public Python API is unchanged; when the compiled extension is present, all thresholds (0–64) are handled natively, with a pure-Python fallback otherwise. A new multi-job CircleCI pipeline builds platform wheels (Linuxx86_64, macOSuniversal2, Windowsamd64) for Python 3.10–3.14 usingcibuildwheeland publishes them alongside the sdist.nucleus/_native_dedup.c): implements three paths — a two-partition chunked Hamming index for thresholds 0–11, a linear XOR+popcount scan for 12–63, and a keep-first fast path for threshold 64. Memory management and Python C API error handling are correct throughout.nucleus/local_deduplication.py): wraps the import intry/except ImportError, passes sortedphash_valueintegers to the native function, and reconstructs_DeduplicationRecordresults from the returned index list..circleci/config.yml): splits the old singlepypi_publishjob intobuild_sdist, three platform wheel builders, and a finalpypi_publishthat attaches the combined workspace before uploading.Confidence Score: 5/5
Safe to merge. The C extension is mathematically correct, memory management is clean, and the Python fallback is properly guarded.
The two-partition chunked Hamming index holds its pigeonhole invariant for all handled thresholds (0–11), the linear scan and keep-first paths are straightforward, and every allocation/cleanup path in the C code is covered. The Python integration correctly wraps the import, passes sorted integer hashes, and reconstructs results from returned indexes. The CIBW_ environment-variable prefix is used correctly across all three platform wheel jobs.
.circleci/config.yml — the wheel jobs could be made to depend on build_sdist to avoid wasting CI minutes when the PyPI version check fails, but this does not affect correctness.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[deduplicate_by_phash called] --> B[validate threshold] B --> C[normalize & sort records by phash_value, stable_id] C --> D{_NATIVE_DEDUP available?} D -- Yes --> E[_deduplicate_with_native] E --> F{threshold routing in C} F -- threshold == 64 --> G[deduplicate_keep_first] F -- 0 to 11 --> H[deduplicate_with_index two-partition chunked Hamming] F -- 12 to 63 --> I[deduplicate_with_linear_scan XOR + popcount] G & H & I --> J[return kept_indexes list] J --> K[reconstruct records from indexes] D -- No --> L{Python fallback dispatch} L -- threshold == 0 --> M[_deduplicate_exact] L -- threshold == 64 --> N[keep first record] L -- 1 to 11 --> O[_deduplicate_with_index Python HammingIndex] L -- 12 to 63 --> P[_deduplicate_with_linear_scan] K & M & N & O & P --> Q[build LocalDeduplicationResult]Comments Outside Diff (1)
.circleci/config.yml, line 58-63 (link)cibuildwheelenvironment variable prefixAll
cibuildwheelconfiguration variables use theCIBW_prefix (e.g.,CIBW_BUILD,CIBW_SKIP,CIBW_ARCHS_LINUX,CIBW_TEST_COMMAND), notCIB_. With the wrong prefix the variables are silently ignored:cibuildwheelwill build its default set of targets (all detected Python versions, including PyPy, and all default architectures), the musllinux skip won't apply, and the smoke test won't run. This affects all three wheel jobs (build_linux_wheels,build_macos_wheels,build_windows_wheels).Replace every
CIB_prefix withCIBW_across all three wheel jobs.Prompt To Fix With AI
Prompt To Fix All With AI
Reviews (4): Last reviewed commit: "Address native dedup review nits" | Re-trigger Greptile