Support sharding through config and raster_write_kwargs #1106
- Added additional settings
- Allow environment variables that overwrite config
|
Failing atm because the required ome-zarr version has not been released yet. You can test locally with ome-zarr-py from main. Also, I still need to add support for zarrs to improve the speed of shard IO.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1106 +/- ##
==========================================
+ Coverage 92.44% 92.50% +0.05%
==========================================
Files 51 51
Lines 7811 7857 +46
==========================================
+ Hits 7221 7268 +47
+ Misses 590 589 -1
|
The reason for supporting only these versions is that they use the zarr API properly inside dask and also make it possible to set the tune optimization. The latter is required to prevent errors caused by collapsing dask partitions when reading data back in from parquet.
|
Should we also allow the control of sharding for anndata?
|
Yes, but not as part of this PR. I will adjust the config to accommodate it, though.
|
assert arr.shards == write_shards

other_arr = zarr.open_group(path / zarr_subpath / other_name, mode="r")["s0"]
assert other_arr.chunks == base_chunks
we could add an explicit check that shards are None (?) here.
I don't think it is necessary to add that here. The other writing tests basically don't specify shards; this test is specifically for the case where they are specified. If shards were somehow still set, you would get permission errors straight away, as the chunks and shards would not match.
Not necessary, but I think it's still interesting to check that it is either None or the same value as for chunks, and that it doesn't change over releases. Could you please add it?
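A check along the lines requested here could look roughly like the sketch below. `assert_unsharded` is a hypothetical helper name, and a `SimpleNamespace` stands in for the opened zarr array so the snippet runs without a store on disk. It assumes only that the array exposes `chunks` and `shards` attributes, as in the test excerpt.

```python
from types import SimpleNamespace


def assert_unsharded(arr) -> None:
    """Assert that an array written without explicit shards is effectively unsharded.

    Accepts either no shards at all or shards equal to the chunk shape.
    """
    shards = getattr(arr, "shards", None)
    assert shards is None or tuple(shards) == tuple(arr.chunks), (
        f"unexpected shards {shards!r} for chunks {arr.chunks!r}"
    )


# Stand-ins for arrays opened from the written store (hypothetical values).
assert_unsharded(SimpleNamespace(chunks=(256, 256), shards=None))
assert_unsharded(SimpleNamespace(chunks=(256, 256), shards=(256, 256)))
```

Accepting both outcomes keeps the test stable if a future release reports the shard shape as the chunk shape instead of `None`, while still failing if real sharding is applied by accident.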
|
Looks good in general. There are a few decisions to be made, but once done most of the requested changes can be one-shot by an agent.
chore: fix typo in docstrings
…to support_sharding
bump ome_zarr remove distributed add platformdirs
LucaMarconato
left a comment
Changes look good so far.
We will opt for scverse-misc settings in a follow-up PR.
    zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}) if find_spec("zarrs") else nullcontext(),
    warnings.catch_warnings() if find_spec("zarrs") else nullcontext(),
):
    if find_spec("zarrs"):
Please add this comment from @ilan-gold
The warning is there in case zarrs doesn't support the store type you passed in to read_zarr, could use a comment sure
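For readers following this thread, the pattern under discussion (enable the zarrs codec pipeline only when zarrs is installed, and handle its warning) can be sketched with an `ExitStack` instead of repeated conditional expressions. This is an illustration, not the PR's code: `call_with_optional_zarrs` is a hypothetical name, and ignoring all warnings is an assumption made because the thread does not name the warning category.

```python
import warnings
from contextlib import ExitStack
from importlib.util import find_spec


def call_with_optional_zarrs(func, *args, **kwargs):
    """Call ``func`` with the zarrs codec pipeline enabled when zarrs is installed."""
    with ExitStack() as stack:
        if find_spec("zarrs") is not None:
            import zarr  # imported lazily; only needed on the zarrs path

            # Config key and pipeline path are taken from the diff in this thread.
            stack.enter_context(
                zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})
            )
            # zarrs may warn when it does not support the store type passed in.
            # Silencing every warning is an assumption of this sketch; a filter
            # on the actual warning category would be narrower and safer.
            stack.enter_context(warnings.catch_warnings())
            warnings.simplefilter("ignore")
        return func(*args, **kwargs)
```

Without zarrs installed the stack stays empty and `func` runs under the default zarr configuration, which mirrors the `nullcontext()` branches in the excerpt.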
@melonora thanks for the changes. I re-reviewed all the conversations (open and closed) and checked the new code changes. The most crucial things (zarrs, settings, dependencies) are resolved. There are still conversations open (I closed a few and reopened one); could you please address the remaining comments?
After this I will do a fast review; we can merge soon.
A list where each dictionary defines the storage options for one scale of a multiscale raster element.

Important Notes
- The available key–value pairs in these dictionaries depend on the Zarr format used for writing.
I find it rather hard to know what is or isn't allowed here, as the argument type is very loose and the correct parameters depend on what zarr will take. So this is hard for users in interactive mode to get right, and also hard for dependents in non-interactive mode to determine whether they have well-formed storage options.
If we don't want (or need) to expose all zarr options to the user right now, we could for now just expose something like
def write_element(
    ...
    raster_write_args: Mapping[str, RasterWriteArgs] | Sequence[RasterWriteArgs] | RasterWriteArgs
    ...
):
    ...

class RasterWriteArgs:
    shards: ShardsLike | None = None

# or maybe this, if we want to allow raw dicts
# (TypedDict fields cannot carry defaults; total=False makes the key optional)
class RasterWriteArgs(TypedDict, total=False):
    shards: ShardsLike | None
Personally, I also don't like being able to pass in a dict, a list, or a dict of dicts, and would rather just have a single way to pass in those parameters.
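To make the suggestion concrete, here is a minimal runnable sketch of the typed-container idea as a frozen dataclass. Everything beyond the `RasterWriteArgs` name and its `shards` field is an assumption for illustration: the `ShardsLike` alias is a stand-in (its real definition is not shown in this thread), and `from_mapping` is a hypothetical helper showing how a dependent could check that raw options are well formed.

```python
from __future__ import annotations

from collections.abc import Mapping
from dataclasses import dataclass, fields
from typing import Any

# Hypothetical alias; the real ShardsLike type is not shown in the thread.
ShardsLike = tuple[int, ...]


@dataclass(frozen=True)
class RasterWriteArgs:
    """Typed container for raster write options (only ``shards`` for now)."""

    shards: ShardsLike | None = None

    def __post_init__(self) -> None:
        # Reject obviously malformed shard shapes early.
        if self.shards is not None and not all(
            isinstance(s, int) and s > 0 for s in self.shards
        ):
            raise ValueError(f"shards must be positive integers, got {self.shards!r}")

    @classmethod
    def from_mapping(cls, options: Mapping[str, Any]) -> RasterWriteArgs:
        """Build from a raw dict, rejecting keys that are not declared fields."""
        allowed = {f.name for f in fields(cls)}
        unknown = set(options) - allowed
        if unknown:
            raise TypeError(f"unknown raster write options: {sorted(unknown)}")
        return cls(**options)
```

With a container like this, a misspelled key such as `"shard"` fails at construction time with a clear message instead of surfacing later inside zarr.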
This PR adds the following:
- raster_write_kwargs for io functions like .write and .write_element. This also adds the ability to write sharded arrays. Support for anndata sharding is to be added in a follow-up PR.
- raster_write_kwargs argument.
- raster_chunks and raster_shards. The config can now be stored in a default location or a custom location. Additionally, environment variables can be set to temporarily override the values.

Additional changes
@LucaMarconato
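The settings behaviour described in the PR summary (defaults, a stored config file, temporary environment overrides) could be sketched as below. This is not the PR's implementation: the `MYLIB_` prefix, the JSON file format, and the `load_settings` function are all hypothetical, chosen only to show the precedence order. Only the `raster_chunks` and `raster_shards` setting names come from the description.

```python
from __future__ import annotations

import json
import os
from collections.abc import Mapping
from pathlib import Path

ENV_PREFIX = "MYLIB_"  # hypothetical prefix; the real variable names are not given
DEFAULTS = {"raster_chunks": None, "raster_shards": None}


def load_settings(
    config_path: Path | None = None,
    environ: Mapping[str, str] | None = None,
) -> dict:
    """Resolve settings with precedence: defaults < config file < environment."""
    environ = os.environ if environ is None else environ
    settings = dict(DEFAULTS)

    # Layer 2: values stored in a config file (JSON is an assumption here).
    if config_path is not None and config_path.is_file():
        stored = json.loads(config_path.read_text())
        settings.update({k: v for k, v in stored.items() if k in DEFAULTS})

    # Layer 3: environment variables temporarily override stored values.
    for key in DEFAULTS:
        raw = environ.get(ENV_PREFIX + key.upper())
        if raw is not None:
            settings[key] = json.loads(raw)

    # JSON lists become tuples so they can be used as chunk or shard shapes.
    return {k: tuple(v) if isinstance(v, list) else v for k, v in settings.items()}
```

For example, setting `MYLIB_RASTER_SHARDS="[1024, 1024]"` for a single run would change the shard shape without touching the stored config file.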