Added content-defined chunking#209
Conversation
# Conflicts: # primitives/src/lib.rs # provider-node/Cargo.toml # provider-node/src/api.rs # provider-node/tests/s3_integration.rs
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
| expected: query.data_root.clone(), | ||
| actual: "invalid hex".to_string(), | ||
| })?; | ||
| let data_root = H256::from_slice(&root_bytes); |
There was a problem hiding this comment.
H256::from_slice panics if root_bytes.len() != 32. hex_decode above validates hex characters but not length, so GET /content?data_root=abcd (2 bytes) panics → 500. Add a root_bytes.len() == 32 check returning Error::InvalidHash. See a similar pattern at other places too - maybe a shared parse_h256() helper would fix all of them?
| let chunk_size = storage_primitives::DEFAULT_CHUNK_SIZE as u64; | ||
| let start_chunk = query.offset / chunk_size; | ||
| let end_chunk = (query.offset + query.length).div_ceil(chunk_size); |
There was a problem hiding this comment.
This is the most important one. Now that S3/FS uploads are always CDC, /read's offset / DEFAULT_CHUNK_SIZE math is wrong for essentially every new root, and get_chunk_at_index still returns a chunk - so /read silently serves misaligned bytes with a 200 instead of erroring. Please add a guard that makes /read reject non-fixed roots before merge (the variable-size implementation can stay deferred, but silent corruption shouldn't ship).
| let chunks = state.storage.collect_chunks(data_root); | ||
| if chunks.is_empty() && data_root != H256::zero() { |
There was a problem hiding this comment.
get_content reassembles the entire object into one in-memory Vec. Unlike /read (paginated by offset/length), this is whole-file, so memory scales with max object size × concurrent requests. Fine for now, but worth a tracking note to stream the body (axum::body::Body from a stream) before large objects are in play.
There was a problem hiding this comment.
I agree we should stream it. It's fine if we add it here or in a follow-up PR but we need it
|
|
||
| **Unchanged.** The MMR is still leaf-per-chunk and chunk-size-agnostic — each leaf is `blake2_256(chunk_bytes)`. The Merkle proof model is unchanged. The S3/FS GET handlers still reassemble via `collect_chunks(data_root)`, which walks the tree by hash and doesn't assume any chunk size. Content addressing and per-bucket isolation are unchanged. | ||
|
|
||
| The `GET /read?data_root=&offset=&length=` byte-range endpoint **still assumes fixed-size** (its arithmetic computes `chunk_index = offset / DEFAULT_CHUNK_SIZE`). It is therefore unsafe to call on CDC-chunked roots. For whole-file historical fetch, use the new `GET /content?data_root=` endpoint, which walks the manifest. Updating `/read` for variable-size byte-range reads is deferred until a caller actually needs it. |
There was a problem hiding this comment.
Since this PR makes uploads always-CDC, /read is now wrong for the common case, not a rare one - please reframe from "deferred" to "guard now," and add a line that roots don't record their chunking strategy (so all offset math is globally unsafe until they do).
| | Insert/delete dedup | ❌ Cascades downstream | ✅ Only chunks straddling the edit change | | ||
| | In-place overwrite dedup | ✅ Works | ✅ Works | | ||
| | Append dedup | ✅ Works (only trailing chunk changes) | ✅ Works | | ||
| | Best for | Binary blobs with fixed-offset fields; embedded use where every byte counts | Text, structured data, anything with shifting edits | |
There was a problem hiding this comment.
Worth noting files < 64 KiB are always a single chunk and get no edit-dedup - the "best for text/structured data" framing oversells the small-file case.
|
|
||
| The provider stores file bytes as content-addressed chunks: each chunk's key is `blake2_256(bytes)`, so two PUTs that produce a byte-identical chunk get stored once. That dedup is automatic — *if* the chunker actually produces identical chunks across edits. | ||
|
|
||
| With a fixed-size chunker, it doesn't. Inserting a single byte at the start of a file shifts every downstream byte by one, so every chunk has different bytes, so every chunk has a different hash. v2's PUT effectively re-uploads the entire file. |
There was a problem hiding this comment.
I didn't understand what v2 meant at first. After reading I realized it was the new blob. It might be better to say "the new blob"
|
|
||
| **Changed.** Chunks are now variable-size (between 64 KiB and 1 MiB). The provider's S3 (`PUT /s3/.../object`) and FS (`PUT /fs/.../file`) handlers chunk via `chunk_cdc_borrowed`. The Rust client SDK's `ChunkingStrategy::ContentDefined` arm uses the same chunker (previously a TODO that fell back to fixed-256K). | ||
|
|
||
| **Unchanged.** The MMR is still leaf-per-chunk and chunk-size-agnostic — each leaf is `blake2_256(chunk_bytes)`. The Merkle proof model is unchanged. The S3/FS GET handlers still reassemble via `collect_chunks(data_root)`, which walks the tree by hash and doesn't assume any chunk size. Content addressing and per-bucket isolation are unchanged. |
There was a problem hiding this comment.
The MMR is chunk-size-agnostic but you still need to stick to one chunking strategy right? I'd mention that here
| | Append dedup | ✅ Works (only trailing chunk changes) | ✅ Works | | ||
| | Best for | Binary blobs with fixed-offset fields; embedded use where every byte counts | Text, structured data, anything with shifting edits | | ||
|
|
||
| `ChunkingStrategy::Fixed(usize)` remains the SDK default; CDC is opt-in via `ChunkingStrategy::ContentDefined`. The provider's HTTP upload endpoints (S3 / FS) always use CDC since their callers don't pick a strategy. |
There was a problem hiding this comment.
Then shouldn't CDC be the default for the SDK as well?
| } | ||
|
|
||
| #[test] | ||
| fn cdc_size_distribution_within_bounds() { |
There was a problem hiding this comment.
Fixed-size chunking would also pass this test. Can we add another one testing that the sizes of contiguous chunks are different?
| use scale_info::TypeInfo; | ||
| use sp_core::H256; | ||
|
|
||
| #[cfg(feature = "std")] |
There was a problem hiding this comment.
Why do we need std here? alloc is not std
| [dev-dependencies] | ||
| tokio = { workspace = true, features = ["macros", "rt-multi-thread", "test-util", "time"] } | ||
| serde_json = { workspace = true } | ||
| rand = "0.8" |
There was a problem hiding this comment.
Get it from the workspace
| let chunks = state.storage.collect_chunks(data_root); | ||
| if chunks.is_empty() && data_root != H256::zero() { |
There was a problem hiding this comment.
I agree we should stream it. It's fine if we add it here or in a follow-up PR but we need it
Added CDC to allow efficient mutability