Support sharding through config and raster_write_kwargs #1106

Open
melonora wants to merge 44 commits into scverse:main from melonora:support_sharding

@melonora melonora commented Apr 14, 2026

Collaborator

This PR adds the following:

  • Passing kwargs for zarr.create_array directly as raster_write_kwargs to io functions such as .write and .write_element. This also makes it possible to write sharded arrays. Support for anndata sharding will be added in a follow-up PR.
  • Proper docstrings for the new raster_write_kwargs argument.
  • An extension of the current config to include raster_chunks and raster_shards. The config can now be stored in a default location or a custom location. Additionally, environment variables can be set to temporarily override the values.
  • Adding zarrs as a dependency and enabling its codec pipeline by default to allow faster io when writing shards. Whether we should do this or make it opt-in for advanced users is a discussion point.
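
To illustrate the layering described in the config bullet (defaults, then a stored config file, then environment variables), here is a rough sketch. It is not the code in this PR: the `SPATIALDATA_` prefix, the JSON file format, the `"256,256"` value syntax and the function names are all assumptions made for the example; only the keys raster_chunks and raster_shards come from the description above.

```python
import json
import os
from pathlib import Path

# Keys taken from the PR description; everything else here is hypothetical.
DEFAULTS = {"raster_chunks": None, "raster_shards": None}
ENV_PREFIX = "SPATIALDATA_"  # assumed prefix, not taken from the PR


def _parse_shape(text):
    """Parse a string such as '256,256' into the tuple (256, 256)."""
    return tuple(int(part) for part in text.split(","))


def load_config(path=None, environ=None):
    """Resolve settings in order: defaults, then config file, then environment."""
    environ = os.environ if environ is None else environ
    config = dict(DEFAULTS)

    # A config file (default or custom location) overrides the defaults.
    if path is not None and Path(path).is_file():
        stored = json.loads(Path(path).read_text())
        for key in DEFAULTS:
            if stored.get(key) is not None:
                config[key] = tuple(stored[key])

    # Environment variables override the file, for the current process only.
    for key in DEFAULTS:
        value = environ.get(ENV_PREFIX + key.upper())
        if value:
            config[key] = _parse_shape(value)
    return config
```

With this layout, setting the (assumed) variable `SPATIALDATA_RASTER_SHARDS=1024,1024` for a single run would change the shard shape without touching the stored config.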

Additional changes

  • The minimum supported version of dask is 2026.3.0. Only this version provides the api in such a way that you don't risk zarr format 2 being written into a zarr v3 group and vice versa. It also includes the setting that prevents the collapse of partitions of dask dataframes after reading from parquet.

melonora commented Apr 14, 2026

Collaborator Author

Failing atm because the required ome-zarr version has not been released yet. You can test locally with ome-zarr-py from main.

Also, I still need to add support for zarrs to improve the speed of shard io.

@codecov

codecov Bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 95.08197% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.50%. Comparing base (68dade6) to head (1f49846).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/spatialdata/_core/_utils.py | 90.00% | 2 Missing ⚠️ |
| src/spatialdata/_core/spatialdata.py | 95.83% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1106      +/-   ##
==========================================
+ Coverage   92.44%   92.50%   +0.05%     
==========================================
  Files          51       51              
  Lines        7811     7857      +46     
==========================================
+ Hits         7221     7268      +47     
+ Misses        590      589       -1     

| Files with missing lines | Coverage Δ |
|---|---|
| src/spatialdata/_io/io_raster.py | 90.90% <100.00%> (+1.65%) ⬆️ |
| src/spatialdata/_utils.py | 86.25% <100.00%> (+0.72%) ⬆️ |
| src/spatialdata/_core/spatialdata.py | 93.93% <95.83%> (+0.07%) ⬆️ |
| src/spatialdata/_core/_utils.py | 97.05% <90.00%> (-2.95%) ⬇️ |

The reason for only supporting these versions is that they provide the proper use of the zarr api inside dask and also
the possibility for setting the tune optimization. The latter is required to prevent errors due to collapsing dask partitions
when reading data back in from parquet.
@Mr-Milk

Mr-Milk commented Apr 15, 2026

Should we also allow the control of sharding for anndata?

@melonora

Collaborator Author

Yes, but not as part of this PR. I will adjust the config though to accommodate.

Comment thread src/spatialdata/config.py Outdated
Comment thread src/spatialdata/config.py Outdated
Comment thread src/spatialdata/config.py Outdated
Comment thread src/spatialdata/config.py Outdated
Comment thread src/spatialdata/config.py Outdated
Comment thread tests/utils/test_config.py Outdated
Comment thread tests/utils/test_config.py Outdated
@LucaMarconato

LucaMarconato commented May 12, 2026

Member

I haven't made it a default dependency for anndata but it will be enabled by default if discovered.

  • @melonora I would do the same as @ilan-gold is doing in anndata and therefore drop the dependency (we can add it as an optional dependency).

assert arr.shards == write_shards

other_arr = zarr.open_group(path / zarr_subpath / other_name, mode="r")["s0"]
assert other_arr.chunks == base_chunks

Member

We could add an explicit check here that shards is None (?).

Collaborator Author

I don't think it is necessary to add it here. The other writing tests basically don't specify it; this test is specifically for when it is specified. If it were somehow still specified, you would get permission errors straight away as the chunks and shards would not match.

Member

Not necessary, but I think it's still interesting to check that it is either None or the same value as for chunks, and that it doesn't change over releases. Could you please add it?
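
A minimal sketch of such a check, assuming the array exposes `.chunks` and `.shards` as in the test excerpt above. The helper name `assert_unsharded` is hypothetical and not part of the PR:

```python
def assert_unsharded(chunks, shards):
    """Fail unless shards is None or identical to the chunk shape."""
    if shards is not None and tuple(shards) != tuple(chunks):
        raise AssertionError(
            f"expected shards to be None or equal to chunks {tuple(chunks)}, "
            f"got {tuple(shards)}"
        )


# Plain tuples stand in for other_arr.chunks and other_arr.shards.
assert_unsharded((64, 64), None)
assert_unsharded((64, 64), (64, 64))
```

In the test this would be called as `assert_unsharded(other_arr.chunks, other_arr.shards)`, which would flag a change in the default across releases.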

Comment thread tests/io/test_readwrite.py
@LucaMarconato

LucaMarconato commented May 12, 2026

Member

Looks good in general. There are a few decisions to be made, but once done most of the requested changes can be one-shot by an agent.

Comment thread pyproject.toml Outdated

@LucaMarconato LucaMarconato left a comment

Member

Changes look good so far.

Comment thread src/spatialdata/_utils.py
zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}) if find_spec("zarrs") else nullcontext(),
warnings.catch_warnings() if find_spec("zarrs") else nullcontext(),
):
if find_spec("zarrs"):

Member

Please add this comment from @ilan-gold

The warning is there in case zarrs doesn't support the store type you passed in to read_zarr, could use a comment sure
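
For reference, the "enable only if discovered" pattern in the excerpt above could also be written with an `ExitStack`, which leaves a natural place for that comment. This is a sketch under assumptions, not the code in this PR: the function name `optional_zarrs_pipeline` is hypothetical, and the warning filter applied inside the block is not visible in the excerpt, so none is set here.

```python
import warnings
from contextlib import ExitStack, contextmanager
from importlib.util import find_spec


@contextmanager
def optional_zarrs_pipeline():
    """Use the zarrs codec pipeline if zarrs is installed, else do nothing."""
    enabled = find_spec("zarrs") is not None
    with ExitStack() as stack:
        if enabled:
            import zarr  # imported lazily; only needed when zarrs is present

            # Same config key and value as in the excerpt above.
            stack.enter_context(
                zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})
            )
            # zarrs may warn when it does not support the store type passed
            # to read_zarr. Any filter configured inside this block is undone
            # on exit; which filter to apply is left open in this sketch.
            stack.enter_context(warnings.catch_warnings())
        yield enabled


with optional_zarrs_pipeline() as using_zarrs:
    # Reading or writing would happen here, with or without zarrs.
    pass
```

The yielded flag lets callers or tests see which path was taken without probing for zarrs again.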

@LucaMarconato LucaMarconato left a comment

Member

@melonora thanks for the changes. I re-reviewed all the conversations (open and closed) and checked the new code changes. The most crucial things (zarrs, settings, dependencies) are resolved. There are still conversations open (I closed a few and reopened one); could you please address the remaining comments?

After this I will do a fast review; we can merge soon.

A list where each dictionary defines the storage options for one scale of a multiscale raster element.

Important Notes
- The available key–value pairs in these dictionaries depend on the Zarr format used for writing.

I find it rather hard to know what is or isn't allowed here, as the argument type is very loose and the correct parameters depend on what zarr will take. So this is hard for users in interactive mode to get right, and also hard for dependents in non-interactive mode to determine whether they have well-formed storage options.

If we don't want (or need) to expose all zarr options to the user right now, we could for now just expose something like:

    def write_element(
        ...
        raster_write_args: Mapping[str, RasterWriteArgs] | Sequence[RasterWriteArgs] | RasterWriteArgs
        ...
    ):
        ...

    class RasterWriteArgs:
        shards: ShardsLike | None = None

    # or maybe this, if we want to allow raw dicts

    class RasterWriteArgs(TypedDict):
        shards: ShardsLike | None = None

Personally, I also don't like being able to pass in dict or list or dict of dicts, and would rather just have a single way to pass in those parameters.

6 participants