Added content-defined chunking by antkve · Pull Request #209 · paritytech/web3-storage

antkve · 2026-06-17T08:11:42Z

Added CDC to allow efficient mutability

Replace fixed-size chunking with FastCDC in the provider's S3 and FS upload paths so mid-file edits reuse most chunks via content-addressed dedup.
Add GET /content?data_root= for CID-keyed whole-file reassembly (existing /read assumes fixed-size chunks and can't serve CDC roots).

# Conflicts: # primitives/src/lib.rs # provider-node/Cargo.toml # provider-node/src/api.rs # provider-node/tests/s3_integration.rs

socket-security · 2026-06-17T10:23:49Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff	Package	Supply Chain Security	Vulnerability	Quality	Maintenance	License
	cargo/fastcdc@3.2.1

View full report

mudigal · 2026-06-23T07:06:21Z

+        expected: query.data_root.clone(),
+        actual: "invalid hex".to_string(),
+    })?;
+    let data_root = H256::from_slice(&root_bytes);


H256::from_slice panics if root_bytes.len() != 32. hex_decode above validates hex characters but not length, so GET /content?data_root=abcd (2 bytes) panics → 500. Add a root_bytes.len() == 32 check returning Error::InvalidHash. See a similar pattern at other places too - maybe a shared parse_h256() helper would fix all of them?

mudigal · 2026-06-23T07:10:37Z

    let chunk_size = storage_primitives::DEFAULT_CHUNK_SIZE as u64;
    let start_chunk = query.offset / chunk_size;
    let end_chunk = (query.offset + query.length).div_ceil(chunk_size);


This is the most important one. Now that S3/FS uploads are always CDC, /read's offset / DEFAULT_CHUNK_SIZE math is wrong for essentially every new root, and get_chunk_at_index still returns a chunk - so /read silently serves misaligned bytes with a 200 instead of erroring. Please add a guard that makes /read reject non-fixed roots before merge (the variable-size implementation can stay deferred, but silent corruption shouldn't ship).

mudigal · 2026-06-23T07:11:56Z

+    let chunks = state.storage.collect_chunks(data_root);
+    if chunks.is_empty() && data_root != H256::zero() {


get_content reassembles the entire object into one in-memory Vec. Unlike /read (paginated by offset/length), this is whole-file, so memory scales with max object size × concurrent requests. Fine for now, but worth a tracking note to stream the body (axum::body::Body from a stream) before large objects are in play.

I agree we should stream it. It's fine if we add it here or in a follow-up PR but we need it

mudigal · 2026-06-23T07:13:48Z

+
+**Unchanged.** The MMR is still leaf-per-chunk and chunk-size-agnostic — each leaf is `blake2_256(chunk_bytes)`. The Merkle proof model is unchanged. The S3/FS GET handlers still reassemble via `collect_chunks(data_root)`, which walks the tree by hash and doesn't assume any chunk size. Content addressing and per-bucket isolation are unchanged.
+
+The `GET /read?data_root=&offset=&length=` byte-range endpoint **still assumes fixed-size** (its arithmetic computes `chunk_index = offset / DEFAULT_CHUNK_SIZE`). It is therefore unsafe to call on CDC-chunked roots. For whole-file historical fetch, use the new `GET /content?data_root=` endpoint, which walks the manifest. Updating `/read` for variable-size byte-range reads is deferred until a caller actually needs it.


Since this PR makes uploads always-CDC, /read is now wrong for the common case, not a rare one - please reframe from "deferred" to "guard now," and add a line that roots don't record their chunking strategy (so all offset math is globally unsafe until they do).

mudigal · 2026-06-23T07:15:31Z

+| Insert/delete dedup | ❌ Cascades downstream | ✅ Only chunks straddling the edit change |
+| In-place overwrite dedup | ✅ Works | ✅ Works |
+| Append dedup | ✅ Works (only trailing chunk changes) | ✅ Works |
+| Best for | Binary blobs with fixed-offset fields; embedded use where every byte counts | Text, structured data, anything with shifting edits |


Worth noting files < 64 KiB are always a single chunk and get no edit-dedup - the "best for text/structured data" framing oversells the small-file case.

franciscoaguirre · 2026-06-23T01:07:58Z

+
+The provider stores file bytes as content-addressed chunks: each chunk's key is `blake2_256(bytes)`, so two PUTs that produce a byte-identical chunk get stored once. That dedup is automatic — *if* the chunker actually produces identical chunks across edits.
+
+With a fixed-size chunker, it doesn't. Inserting a single byte at the start of a file shifts every downstream byte by one, so every chunk has different bytes, so every chunk has a different hash. v2's PUT effectively re-uploads the entire file.


I didn't understand what v2 meant at first. After reading I realized it was the new blob. It might be better to say "the new blob"

franciscoaguirre · 2026-06-23T15:01:37Z

+
+**Changed.** Chunks are now variable-size (between 64 KiB and 1 MiB). The provider's S3 (`PUT /s3/.../object`) and FS (`PUT /fs/.../file`) handlers chunk via `chunk_cdc_borrowed`. The Rust client SDK's `ChunkingStrategy::ContentDefined` arm uses the same chunker (previously a TODO that fell back to fixed-256K).
+
+**Unchanged.** The MMR is still leaf-per-chunk and chunk-size-agnostic — each leaf is `blake2_256(chunk_bytes)`. The Merkle proof model is unchanged. The S3/FS GET handlers still reassemble via `collect_chunks(data_root)`, which walks the tree by hash and doesn't assume any chunk size. Content addressing and per-bucket isolation are unchanged.


The MMR is chunk-size-agnostic but you still need to stick to one chunking strategy right? I'd mention that here

franciscoaguirre · 2026-06-23T15:04:35Z

+| Append dedup | ✅ Works (only trailing chunk changes) | ✅ Works |
+| Best for | Binary blobs with fixed-offset fields; embedded use where every byte counts | Text, structured data, anything with shifting edits |
+
+`ChunkingStrategy::Fixed(usize)` remains the SDK default; CDC is opt-in via `ChunkingStrategy::ContentDefined`. The provider's HTTP upload endpoints (S3 / FS) always use CDC since their callers don't pick a strategy.


Then shouldn't CDC be the default for the SDK as well?

franciscoaguirre · 2026-06-23T15:18:03Z

+    }
+
+    #[test]
+    fn cdc_size_distribution_within_bounds() {


Fixed-size chunking would also pass this test. Can we add another one testing that the sizes of contiguous chunks are different?

franciscoaguirre · 2026-06-23T15:18:36Z

 use scale_info::TypeInfo;
 use sp_core::H256;

+#[cfg(feature = "std")]


Why do we need std here? alloc is not std

franciscoaguirre · 2026-06-23T15:23:41Z

 [dev-dependencies]
 tokio = { workspace = true, features = ["macros", "rt-multi-thread", "test-util", "time"] }
 serde_json = { workspace = true }
+rand = "0.8"


Get it from the workspace

franciscoaguirre · 2026-06-23T15:24:28Z

+    let chunks = state.storage.collect_chunks(data_root);
+    if chunks.is_empty() && data_root != H256::zero() {


I agree we should stream it. It's fine if we add it here or in a follow-up PR but we need it

…ntent

Added content-defined chunking

7b306e8

antkve linked an issue Jun 17, 2026 that may be closed by this pull request

Content-defined Chunking #156

Open

Merge remote-tracking branch 'origin/dev' into ak-add-cdc

e63c424

# Conflicts: # primitives/src/lib.rs # provider-node/Cargo.toml # provider-node/src/api.rs # provider-node/tests/s3_integration.rs

Merge branch 'dev' into ak-add-cdc

641a1bc

bkontur reviewed Jun 19, 2026

View reviewed changes

Comment thread primitives/Cargo.toml Outdated

bkontur reviewed Jun 19, 2026

View reviewed changes

Comment thread primitives/src/chunking.rs

bkontur reviewed Jun 19, 2026

View reviewed changes

Comment thread provider-node/src/api.rs Outdated

mudigal requested changes Jun 23, 2026

View reviewed changes

franciscoaguirre reviewed Jun 23, 2026

View reviewed changes

antkve added 2 commits June 24, 2026 11:06

Address CDC review: workspace rand dep, SPDX header, no unwrap in /co…

9e985b5

…ntent

Added guard for rejecting non-fixed roots in /read, updated CDC .md

ce2e6c6

		let chunks = state.storage.collect_chunks(data_root);
		if chunks.is_empty() && data_root != H256::zero() {


		Unchanged. The MMR is still leaf-per-chunk and chunk-size-agnostic — each leaf is `blake2_256(chunk_bytes)`. The Merkle proof model is unchanged. The S3/FS GET handlers still reassemble via `collect_chunks(data_root)`, which walks the tree by hash and doesn't assume any chunk size. Content addressing and per-bucket isolation are unchanged.

		The `GET /read?data_root=&offset=&length=` byte-range endpoint still assumes fixed-size (its arithmetic computes `chunk_index = offset / DEFAULT_CHUNK_SIZE`). It is therefore unsafe to call on CDC-chunked roots. For whole-file historical fetch, use the new `GET /content?data_root=` endpoint, which walks the manifest. Updating `/read` for variable-size byte-range reads is deferred until a caller actually needs it.


		The provider stores file bytes as content-addressed chunks: each chunk's key is `blake2_256(bytes)`, so two PUTs that produce a byte-identical chunk get stored once. That dedup is automatic — if the chunker actually produces identical chunks across edits.

		With a fixed-size chunker, it doesn't. Inserting a single byte at the start of a file shifts every downstream byte by one, so every chunk has different bytes, so every chunk has a different hash. v2's PUT effectively re-uploads the entire file.


		Changed. Chunks are now variable-size (between 64 KiB and 1 MiB). The provider's S3 (`PUT /s3/.../object`) and FS (`PUT /fs/.../file`) handlers chunk via `chunk_cdc_borrowed`. The Rust client SDK's `ChunkingStrategy::ContentDefined` arm uses the same chunker (previously a TODO that fell back to fixed-256K).

		Unchanged. The MMR is still leaf-per-chunk and chunk-size-agnostic — each leaf is `blake2_256(chunk_bytes)`. The Merkle proof model is unchanged. The S3/FS GET handlers still reassemble via `collect_chunks(data_root)`, which walks the tree by hash and doesn't assume any chunk size. Content addressing and per-bucket isolation are unchanged.

Uh oh!

Conversation

antkve commented Jun 17, 2026

Uh oh!

socket-security Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

socket-security Bot commented Jun 17, 2026 •

edited

Loading