Developer API¶
If you want to write a new parser to create virtual references pointing to a custom file format, you will need to use VirtualiZarr's internal classes. See the page on custom parsers for more information.
Manifests¶
VirtualiZarr uses these classes to store virtual references internally. See the page on data structures for more information.
virtualizarr.manifests.ChunkManifest ¶
In-memory representation of a single Zarr chunk manifest.
Stores the manifest internally as numpy arrays, so the most efficient way to create this object is via the .from_arrays constructor classmethod.
The manifest can be converted to or from a dictionary which looks like this
{
"0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
"0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
"0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
"0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}
using the __init__() and dict() methods, so users of this class can think of the manifest as if it were a dict mapping zarr chunk keys to byte ranges.
(See the chunk manifest SPEC proposal in zarr-developers/zarr-specs#287.)
Validation is done when this object is instantiated, and this class is immutable, so it's not possible to have a ChunkManifest object that does not represent a valid grid of chunks.
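The relationship between chunk keys and positions in the chunk grid can be sketched in plain Python. The helper below is purely illustrative and is not part of VirtualiZarr's API:

```python
# Illustrative only: a hypothetical helper showing how zarr chunk keys
# relate to positions in the chunk grid (not part of VirtualiZarr's API).
entries = {
    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
    "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
    "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}

def grid_shape(keys, separator="."):
    """Infer the chunk-grid shape from a set of chunk keys."""
    indices = [tuple(int(i) for i in k.split(separator)) for k in keys]
    return tuple(max(ix) + 1 for ix in zip(*indices))

print(grid_shape(entries))  # (1, 2, 2)
```

This is the kind of grid consistency that ChunkManifest validates at instantiation time.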
nbytes property ¶
nbytes: int
Size required to hold these references in memory in bytes.
Note this is not the size of the referenced chunks if they were actually loaded into memory, this is only the size of the pointers to the chunk locations. If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
ndim_chunk_grid property ¶
ndim_chunk_grid: int
Number of dimensions in the chunk grid.
Not the same as the dimension of an array backed by this chunk manifest.
shape_chunk_grid property ¶
Number of separate chunks along each dimension.
Not the same as the shape of an array backed by this chunk manifest.
__init__ ¶
__init__(
entries: dict,
shape: tuple[int, ...] | None = None,
separator: ChunkKeySeparator = ".",
) -> None
Create a ChunkManifest from a dictionary mapping zarr chunk keys to byte ranges.
Parameters:
- entries (dict) – Chunk keys and byte range information, as a dictionary of the form
{
"0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
"0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
"0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
"0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}
- separator (ChunkKeySeparator, default: '.') – The chunk key separator, as specified by the array's chunk_key_encoding metadata. Either "." (the default here, and zarr v2's default separator) or "/" (zarr v3's default separator).
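The effect of the separator choice can be sketched in plain Python; the same chunk grid is expressed with "." keys versus "/" keys (illustration only):

```python
# Hypothetical illustration of the `separator` parameter: the same chunk
# grid expressed with "." chunk keys versus "/" chunk keys.
dot_entries = {
    "0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
}
slash_entries = {k.replace(".", "/"): v for k, v in dot_entries.items()}
print(list(slash_entries))  # ['0/0', '0/1']
```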
dict ¶
dict() -> ChunkDict
Convert the entire manifest to a nested dictionary.
The returned dict will be of the form
{
"0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
"0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
"0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
"0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}
Entries whose path is an empty string will be interpreted as missing chunks and omitted from the dictionary.
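The missing-chunk convention can be sketched as a simple filter (plain dicts stand in for the manifest; illustration only):

```python
# Sketch of the missing-chunk convention: entries with an empty path
# represent missing chunks and are omitted from the dict form.
raw = {
    "0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.1": {"path": "", "offset": 0, "length": 0},  # missing chunk
}
present = {k: v for k, v in raw.items() if v["path"] != ""}
print(list(present))  # ['0.0']
```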
elementwise_eq ¶
elementwise_eq(other: 'ChunkManifest') -> ndarray
Return boolean array where True means that chunk entry matches.
from_arrays classmethod ¶
from_arrays(
*,
paths: ndarray[Any, StringDType],
offsets: ndarray[Any, dtype[uint64]],
lengths: ndarray[Any, dtype[uint64]],
validate_paths: bool = True,
) -> "ChunkManifest"
Create manifest directly from numpy arrays containing the path and byte range information.
Useful if you want to avoid the memory overhead of creating an intermediate dictionary first, as these 3 arrays are what will be used internally to store the references.
Parameters:
- paths (ndarray[Any, StringDType]) – Array containing the paths to the chunks.
- offsets (ndarray[Any, dtype[uint64]]) – Array containing the byte offsets of the chunks.
- lengths (ndarray[Any, dtype[uint64]]) – Array containing the byte lengths of the chunks.
- validate_paths (bool, default: True) – Check that entries in the manifest are valid paths (e.g. that local paths are absolute, not relative). Set to False to skip validation for performance reasons.
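The three-parallel-arrays layout can be sketched with numpy. Note this is only an illustration of the layout: the real from_arrays expects the paths in a numpy StringDType array, for which a plain list stands in here:

```python
import numpy as np

# Sketch of the three parallel arrays backing a manifest, one element
# per chunk in the grid. VirtualiZarr's real `from_arrays` expects the
# paths in a numpy StringDType array; a plain list stands in for it.
paths = ["s3://bucket/foo.nc"] * 4
offsets = np.array([100, 200, 300, 400], dtype=np.uint64)
lengths = np.array([100, 100, 100, 100], dtype=np.uint64)

# Reshape to the 2x2 chunk grid so element [i, j] describes chunk "i.j".
offsets = offsets.reshape(2, 2)
lengths = lengths.reshape(2, 2)
print(offsets[0, 1], lengths[0, 1])  # 200 100
```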
get_entry ¶
Look up a chunk entry by grid indices. Returns None for missing chunks (empty path).
iter_nonempty_paths ¶
Yield all non-empty paths in the manifest.
iter_refs ¶
Yield (grid_indices, chunk_entry) for every non-missing chunk.
rename_paths ¶
Rename paths to chunks in this manifest.
Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.
Parameters:
- new (str | Callable[[str], str]) – New path to use for all chunks, either as a string, or as a function which accepts and returns strings.
Returns:
- manifest –
See Also
ManifestArray.rename_paths
Examples:
Rename paths to reflect moving the referenced files from local storage to an S3 bucket.
>>> def local_to_s3_url(old_local_path: str) -> str:
... from pathlib import Path
...
... new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
... filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> manifest.rename_paths(local_to_s3_url)
virtualizarr.manifests.ManifestArray ¶
Virtualized array representation of the chunk data in a single Zarr Array.
Supports concatenation / stacking, but only if the two arrays to be concatenated have the same codecs.
Cannot be directly altered.
Implements a subset of the array API standard such that it can be wrapped by xarray. Doesn't store the zarr array name, zattrs or ARRAY_DIMENSIONS, as those can instead be stored on a wrapping xarray object.
nbytes_virtual property ¶
nbytes_virtual: int
The total number of bytes required to hold these virtual references in memory.
Notes
This is not the size of the referenced array if it were actually loaded into memory (use .nbytes),
this is only the size of the pointers to the chunk locations.
If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
__array_function__ ¶
__array_function__(func, types, args, kwargs) -> Any
Hook to teach this class what to do if np.concat etc. is called on it.
Use this instead of __array_namespace__ so that we don't make promises we can't keep.
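The __array_function__ protocol that this hook relies on can be demonstrated with a minimal duck array. The class below is purely illustrative (it is not how ManifestArray is implemented), but it shows how a class can intercept numpy functions like np.concatenate without supporting the full numpy API:

```python
import numpy as np

# Minimal sketch of the NEP 18 __array_function__ protocol: intercept
# np.concatenate for our own type while rejecting everything else.
class Tagged:
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        if func is np.concatenate:
            arrays = [a.data for a in args[0]]
            return Tagged(np.concatenate(arrays, **kwargs))
        return NotImplemented  # everything else is unsupported

a, b = Tagged([1, 2]), Tagged([3])
c = np.concatenate([a, b])  # dispatches to Tagged.__array_function__
print(c.data)  # [1 2 3]
```

Returning NotImplemented for unsupported functions is what lets the class avoid "making promises it can't keep".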
__array_ufunc__ ¶
__array_ufunc__(ufunc, method, *inputs, **kwargs) -> Any
We have to define this in order to convince xarray that this class is a duckarray, even though we will never support ufuncs.
__eq__ ¶
Element-wise equality checking.
Returns a numpy array of booleans.
__getitem__ ¶
__getitem__(key: T_Indexer) -> ManifestArray
Perform numpy-style indexing on this ManifestArray.
Only supports limited indexing, because in general you cannot slice inside of a compressed chunk. Mainly required because Xarray uses this instead of expand_dims (by passing None) and often will index with a no-op.
Could potentially support indexing with slices aligned along chunk boundaries, but currently does not.
Parameters:
- key (T_Indexer) –
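The distinction between indexers that are cheap at the manifest level and those that would require slicing inside chunks can be sketched with a hypothetical helper (not part of VirtualiZarr's API):

```python
# Hypothetical sketch: full slices are a no-op and None only inserts a
# new axis (as xarray's expand_dims does), so both can be handled
# without reading chunk data; partial slices cannot.
def is_cheap_indexer(key, ndim):
    key = key if isinstance(key, tuple) else (key,)
    non_none = [k for k in key if k is not None]
    return len(non_none) == ndim and all(k == slice(None) for k in non_none)

print(is_cheap_indexer((slice(None), None, slice(None)), ndim=2))  # True
print(is_cheap_indexer((slice(0, 3),), ndim=1))  # False
```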
__init__ ¶
__init__(
metadata: ArrayV3Metadata | dict, chunkmanifest: dict | ChunkManifest
) -> None
Create a ManifestArray directly from the metadata of a zarr array and the manifest of chunks.
Parameters:
- metadata (dict or ArrayV3Metadata) –
- chunkmanifest (dict or ChunkManifest) –
astype ¶
astype(dtype: dtype, /, *, copy: bool = True) -> ManifestArray
Cannot change the dtype, but needed because xarray will call this even when it's a no-op.
rename_paths ¶
rename_paths(new: str | Callable[[str], str]) -> ManifestArray
Rename paths to chunks in this array's manifest.
Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.
Parameters:
- new (str | Callable[[str], str]) – New path to use for all chunks, either as a string, or as a function which accepts and returns strings.
Returns:
See Also
ChunkManifest.rename_paths
Examples:
Rename paths to reflect moving the referenced files from local storage to an S3 bucket.
>>> def local_to_s3_url(old_local_path: str) -> str:
... from pathlib import Path
...
... new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
... filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> marr.rename_paths(local_to_s3_url)
virtualizarr.manifests.ManifestGroup ¶
Bases: Mapping[str, 'ManifestArray | ManifestGroup']
Immutable representation of a single virtual zarr group.
__init__ ¶
__init__(
arrays: Mapping[str, ManifestArray] | None = None,
groups: Mapping[str, ManifestGroup] | None = None,
attributes: dict | None = None,
) -> None
Create a ManifestGroup containing ManifestArrays and/or sub-groups, as well as any group-level metadata.
Parameters:
- arrays (Mapping[str, ManifestArray], default: None) – ManifestArray objects to represent virtual zarr arrays.
- groups (Mapping[str, ManifestGroup], default: None) – ManifestGroup objects to represent virtual zarr sub-groups.
- attributes (dict, default: None) – Zarr attributes to add as zarr group metadata.
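Because ManifestGroup is a Mapping of names to arrays or sub-groups, a group hierarchy can be walked like any nested mapping. Plain dicts stand in for ManifestArray and ManifestGroup below (illustration only):

```python
# Sketch of walking a nested group hierarchy; dicts stand in for
# ManifestGroup and the string "array" for ManifestArray.
root = {
    "air": "array",
    "sub": {"temperature": "array", "deeper": {"pressure": "array"}},
}

def array_paths(group, prefix=""):
    """Yield the full path of every array in a nested group mapping."""
    for name, value in group.items():
        path = f"{prefix}/{name}"
        if isinstance(value, dict):
            yield from array_paths(value, path)
        else:
            yield path

print(sorted(array_paths(root)))
# ['/air', '/sub/deeper/pressure', '/sub/temperature']
```

This kind of recursive walk is what to_virtual_datasets performs over sub-groups.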
to_virtual_dataset ¶
to_virtual_dataset() -> Dataset
Create a "virtual" xarray.Dataset containing the contents of one zarr group.
All variables in the returned Dataset will be "virtual", i.e. they will wrap ManifestArray objects.
to_virtual_datasets ¶
Create a dictionary containing virtual datasets for all the sub-groups of a ManifestGroup. All the variables in the datasets will be "virtual", i.e., they will wrap ManifestArray objects.
It is convenient to have a to_virtual_datasets function separate from to_virtual_datatree so that it can be called recursively without needing to use DataTree.to_dict() and DataTree.from_dict() repeatedly.
to_virtual_datatree ¶
to_virtual_datatree() -> DataTree
Create a "virtual" xarray.DataTree containing the contents of one zarr group.
All variables in the returned DataTree will be "virtual", i.e. they will wrap ManifestArray objects.
virtualizarr.manifests.ManifestStore ¶
Bases: Store
A read-only Zarr store that uses obstore to read data from inside arbitrary files on AWS, GCP, Azure, or a local filesystem.
Requests from the Zarr API are redirected using the ManifestGroup containing multiple ManifestArrays, allowing virtual access to underlying data in other file formats.
Parameters:
- group (ManifestGroup) – Root group of the store. Contains group metadata, ManifestArrays, and any subgroups.
- registry (ObjectStoreRegistry, default: None) – ObjectStoreRegistry that maps the URL scheme and netloc to ObjectStore instances, allowing ManifestStores to read from different ObjectStore instances.
Warnings
ManifestStore is experimental and subject to API changes without notice. Please raise an issue with any comments/concerns about the store.
__init__ ¶
__init__(
group: ManifestGroup, *, registry: ObjectStoreRegistry | None = None
) -> None
Instantiate a new ManifestStore.
Parameters:
- group (ManifestGroup) – ManifestGroup containing group metadata and mapping variable names to ManifestArrays.
- registry (ObjectStoreRegistry | None, default: None) – A registry mapping the URL scheme and netloc to ObjectStore instances, allowing ManifestStores to read from different ObjectStore instances.
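The registry lookup idea can be sketched with the standard library: map the (scheme, netloc) of each chunk URL to the store that should serve it. Strings stand in for ObjectStore instances below; this is an illustration of the concept, not the real registry API:

```python
from urllib.parse import urlparse

# Sketch of registry resolution: (scheme, netloc) of a chunk URL
# selects the object store. Strings stand in for ObjectStore instances.
registry = {
    ("s3", "bucket-a"): "s3-store-a",
    ("file", ""): "local-store",
}

def resolve(url):
    parsed = urlparse(url)
    return registry[(parsed.scheme, parsed.netloc)]

print(resolve("s3://bucket-a/foo.nc"))  # s3-store-a
print(resolve("file:///tmp/bar.nc"))    # local-store
```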
to_virtual_dataset ¶
to_virtual_dataset(
group="",
loadable_variables: Iterable[str] | None = None,
decode_times: bool | None = None,
) -> "xr.Dataset"
Create a "virtual" xarray.Dataset containing the contents of one zarr group.
Will ignore the contents of any other groups in the store.
Requires xarray.
Parameters:
- group (str, default: '') – Group within the store to convert to a virtual Dataset.
- loadable_variables (Iterable[str] | None, default: None) – Variables in the data source to load as Dask/NumPy arrays instead of as virtual arrays.
- decode_times (bool | None, default: None) – Bool that is passed into xarray.open_dataset. Allows time to be decoded into a datetime object.
Returns:
- vds (Dataset) –
to_virtual_datatree ¶
to_virtual_datatree(
group="",
*,
loadable_variables: Iterable[str] | None = None,
decode_times: bool | None = None,
) -> "xr.DataTree"
Create a "virtual" xarray.DataTree containing the contents of a zarr group. Default is the root group and all sub-groups.
Will ignore the contents of any other groups in the store.
Requires xarray.
Parameters:
- group (str, default: '') – Group to convert to a virtual DataTree.
- loadable_variables (Iterable[str] | None, default: None) – Variables in the data source to load as Dask/NumPy arrays instead of as virtual arrays.
- decode_times (bool | None, default: None) – Bool that is passed into xarray.open_dataset. Allows time to be decoded into a datetime object.
Returns:
- vdt (DataTree) –
Registry¶
Note
virtualizarr.registry.ObjectStoreRegistry has been deprecated. Please use obspec_utils.registry.ObjectStoreRegistry instead.
Array API¶
VirtualiZarr's virtualizarr.manifests.ManifestArray objects support a limited subset of the Python Array API standard in virtualizarr.manifests.array_api.
virtualizarr.manifests.array_api.concatenate ¶
concatenate(
arrays: tuple[ManifestArray, ...] | list[ManifestArray],
/,
*,
axis: int | None = 0,
) -> ManifestArray
Concatenate ManifestArrays by merging their chunk manifests.
The signature of this function is array API compliant, so that it can be called by xarray.concat.
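What "merging chunk manifests" means for concatenation along an axis can be sketched in plain Python: the second array's chunk keys are shifted by the first array's chunk-grid extent along that axis. Plain dicts stand in for manifests (illustration only, not the real implementation):

```python
# Sketch of manifest merging for concatenation: shift the second
# manifest's chunk indices along the concat axis, then merge the dicts.
def concat_manifests(m1, m2, axis=0, sep="."):
    offset = 1 + max(int(k.split(sep)[axis]) for k in m1)
    merged = dict(m1)
    for key, entry in m2.items():
        ix = [int(i) for i in key.split(sep)]
        ix[axis] += offset
        merged[sep.join(map(str, ix))] = entry
    return merged

m1 = {"0.0": {"path": "a.nc", "offset": 0, "length": 10}}
m2 = {"0.0": {"path": "b.nc", "offset": 0, "length": 10}}
print(sorted(concat_manifests(m1, m2)))  # ['0.0', '1.0']
```

No chunk data is read or moved; only the references are relabelled.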
virtualizarr.manifests.array_api.stack ¶
stack(
arrays: tuple[ManifestArray, ...] | list[ManifestArray], /, *, axis: int = 0
) -> ManifestArray
Stack ManifestArrays by merging their chunk manifests.
The signature of this function is array API compliant, so that it can be called by xarray.stack.
virtualizarr.manifests.array_api.expand_dims ¶
expand_dims(x: ManifestArray, /, *, axis: int = 0) -> ManifestArray
Expands the shape of an array by inserting a new axis (dimension) of size one at the position specified by axis.
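At the manifest level, inserting a size-1 axis amounts to inserting a "0" index into every chunk key. The sketch below uses plain dicts in place of the real manifest (illustration only):

```python
# Sketch of expand_dims on a chunk manifest: a new size-1 axis means
# every chunk key gains a "0" index at the given position.
def expand_keys(manifest, axis=0, sep="."):
    out = {}
    for key, entry in manifest.items():
        ix = key.split(sep)
        ix.insert(axis, "0")
        out[sep.join(ix)] = entry
    return out

m = {"0.1": {"path": "a.nc", "offset": 0, "length": 10}}
print(list(expand_keys(m, axis=1)))  # ['0.0.1']
```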
virtualizarr.manifests.array_api.broadcast_to ¶
broadcast_to(x: ManifestArray, /, shape: tuple[int, ...]) -> ManifestArray
Broadcasts a ManifestArray to a specified shape, by either adjusting chunk keys or copying chunk manifest entries.
Parallelization¶
Parallelizing virtual reference generation can be done using a number of parallel execution frameworks. Advanced users may want to call one of these executors directly. See the docs page on Scaling.
virtualizarr.parallel.SerialExecutor ¶
Bases: Executor
A custom Executor that runs tasks sequentially, mimicking the concurrent.futures.Executor interface. Useful as a default and for debugging.
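The pattern can be illustrated with a minimal serial executor that mimics the concurrent.futures.Executor interface: submit() runs the task immediately and wraps the result in an already-resolved Future. This is a sketch of the pattern, not the actual SerialExecutor implementation:

```python
from concurrent.futures import Future

# Minimal serial executor: same interface shape as
# concurrent.futures.Executor, but every task runs eagerly in-process.
class MiniSerialExecutor:
    def submit(self, fn, *args, **kwargs):
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        return future

    def map(self, fn, *iterables, **_ignored):
        # timeout/chunksize/buffersize have no effect in serial execution
        return (fn(*args) for args in zip(*iterables))

ex = MiniSerialExecutor()
print(ex.submit(pow, 2, 10).result())     # 1024
print(list(ex.map(pow, [2, 3], [3, 2])))  # [8, 9]
```

Because the interface matches, a serial executor like this can be swapped for a parallel one without changing calling code, which is what makes it useful as a default and for debugging.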
map ¶
map(
fn: Callable[..., T],
*iterables: Iterable[Any],
timeout: float | None = None,
chunksize: int = 1,
buffersize: int | None = None,
) -> Iterator[T]
Execute a function over an iterable sequentially.
Parameters:
- fn (Callable[..., T]) – Function to apply to each item.
- *iterables (Iterable[Any], default: ()) – Iterables to process.
- timeout (float | None, default: None) – Optional timeout (ignored in serial execution).
- chunksize (int, default: 1) – Ignored in serial execution.
- buffersize (int | None, default: None) – Ignored in serial execution (added in Python 3.14).
Returns:
- Generator of results.
submit ¶
Submit a callable to be executed.
Unlike parallel executors, this runs the task immediately and sequentially.
Parameters:
- fn (Callable[..., T]) – The callable to execute.
- *args (Any, default: ()) – Positional arguments for the callable.
- **kwargs (Any, default: {}) – Keyword arguments for the callable.
Returns:
- A Future representing the result of the execution.
virtualizarr.parallel.DaskDelayedExecutor ¶
Bases: Executor
An Executor that uses dask.delayed for parallel computation.
This executor mimics the concurrent.futures.Executor interface but uses Dask's delayed computation model.
map ¶
map(
fn: Callable[..., T],
*iterables: Iterable[Any],
timeout: float | None = None,
chunksize: int = 1,
buffersize: int | None = None,
) -> Iterator[T]
Apply a function to an iterable using dask.delayed.
Parameters:
- fn (Callable[..., T]) – Function to apply to each item.
- *iterables (Iterable[Any], default: ()) – Iterables to process.
- timeout (float | None, default: None) – Optional timeout (ignored in Dask execution).
- chunksize (int, default: 1) – Ignored in Dask execution.
- buffersize (int | None, default: None) – Ignored in Dask execution (added in Python 3.14).
Returns:
- Generator of results.
submit ¶
Submit a task to be computed with dask.delayed.
Parameters:
- fn (Callable[..., T]) – The callable to execute.
- *args (Any, default: ()) – Positional arguments for the callable.
- **kwargs (Any, default: {}) – Keyword arguments for the callable.
Returns:
- A Future representing the result of the execution.
virtualizarr.parallel.LithopsEagerFunctionExecutor ¶
Bases: Executor
Lithops-based function executor that follows the concurrent.futures.Executor API.
Only required because lithops doesn't follow the concurrent.futures.Executor API. See lithops-cloud/lithops#1427.
compatible_callable ¶
Bases: Generic[P, T]
Wraps a callable to make it fully compatible with Lithops.
This wrapper deals with 2 oddities in Lithops:
- Use of functools.partial, which Lithops fails to recognize as being callable. This is likely due to the builtin partial class using slots, which causes Lithops to not recognize the __call__ method as a method. See lithops-cloud/lithops#1428.
- Use of generic function wrappers that define generic args and kwargs parameters. In this case, because of the way Lithops inspects function signatures to determine how to pass arguments, it does not properly "spread" arguments as normally expected. Instead, it collects all positional arguments and associates them with the keyword argument "args", which is utterly unhelpful.
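The functools.partial workaround can be sketched as follows: wrap the partial in a plain function so that frameworks which inspect callables (as Lithops does) see an ordinary function with a normal __call__. This is an illustration of the idea, not the actual compatible_callable implementation:

```python
import functools

# Sketch: hide a functools.partial behind a plain function so that
# signature-inspecting frameworks see an ordinary callable.
def as_plain_function(p):
    def wrapper(*args, **kwargs):
        return p(*args, **kwargs)
    return wrapper

add_five = functools.partial(lambda x, y: x + y, 5)
fn = as_plain_function(add_five)
print(fn(3))  # 8
```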
map ¶
map(
fn: Callable[..., T],
*iterables: Iterable[Any],
timeout: float | None = None,
chunksize: int = 1,
buffersize: int | None = None,
) -> Iterator[T]
Apply a function to an iterable using lithops.
Only needed because lithops.executors.FunctionExecutor.map returns futures, unlike concurrent.futures.Executor.map.
Parameters:
- fn (Callable[..., T]) – Function to apply to each item.
- *iterables (Iterable[Any], default: ()) – Iterables to process.
- timeout (float | None, default: None) – Optional timeout (ignored in Lithops execution).
- chunksize (int, default: 1) – Ignored in Lithops execution.
- buffersize (int | None, default: None) – Ignored in Lithops execution (added in Python 3.14).
Returns:
- Generator of results.
shutdown ¶
Shutdown the executor.
Parameters:
- wait (bool, default: True) – Whether to wait for pending futures.
submit ¶
Submit a task to be computed using lithops.
Parameters:
- fn (Callable[..., T]) – The callable to execute.
- *args (Any, default: ()) – Positional arguments for the callable.
- **kwargs (Any, default: {}) – Keyword arguments for the callable.
Returns:
- A concurrent.futures.Future representing the result of the execution.