Developer API

If you want to write a new parser to create virtual references pointing to a custom file format, you will need to use VirtualiZarr's internal classes. See the page on custom parsers for more information.

Manifests

VirtualiZarr uses these classes to store virtual references internally. See the page on data structures for more information.

virtualizarr.manifests.ChunkManifest

In-memory representation of a single Zarr chunk manifest.

Stores the manifest internally as numpy arrays, so the most efficient way to create this object is via the .from_arrays constructor classmethod.

The manifest can be converted to or from a dictionary which looks like this:

{
    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
    "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
    "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}

using the __init__() and dict() methods, so users of this class can think of the manifest as if it were a dict mapping Zarr chunk keys to byte ranges.

(See the chunk manifest SPEC proposal in zarr-developers/zarr-specs#287.)

Validation is done when this object is instantiated, and this class is immutable, so it's not possible to have a ChunkManifest object that does not represent a valid grid of chunks.

nbytes property

nbytes: int

Size in bytes required to hold these references in memory.

Note this is not the size of the referenced chunks if they were actually loaded into memory, this is only the size of the pointers to the chunk locations. If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
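As a rough illustration of the difference (ignoring the path strings, whose cost depends on numpy's string storage), each reference needs only two 8-byte integers per chunk, while each loaded chunk needs its full decompressed size:

```python
import numpy as np

# Hypothetical 10 x 10 x 10 chunk grid, each chunk 1 MB when loaded.
n_chunks = 1000
chunk_nbytes = 1_000_000

# Per chunk, the manifest stores a byte offset and a byte length (uint64 each).
offsets = np.zeros(n_chunks, dtype=np.uint64)
lengths = np.full(n_chunks, chunk_nbytes, dtype=np.uint64)

reference_nbytes = offsets.nbytes + lengths.nbytes
loaded_nbytes = n_chunks * chunk_nbytes

print(reference_nbytes)  # 16000 (16 bytes per chunk, plus path storage)
print(loaded_nbytes)     # 1000000000
```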

ndim_chunk_grid property

ndim_chunk_grid: int

Number of dimensions in the chunk grid.

Not the same as the dimension of an array backed by this chunk manifest.

shape_chunk_grid property

shape_chunk_grid: tuple[int, ...]

Number of separate chunks along each dimension.

Not the same as the shape of an array backed by this chunk manifest.
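The distinction matters because the chunk grid counts chunks, not elements. Assuming a regular chunk grid, its shape is the ceiling division of the array shape by the chunk shape; a hypothetical helper (not part of the API) makes this concrete:

```python
from math import ceil


def chunk_grid_shape(
    array_shape: tuple[int, ...], chunk_shape: tuple[int, ...]
) -> tuple[int, ...]:
    """Number of separate chunks along each dimension (ceiling division)."""
    return tuple(ceil(a / c) for a, c in zip(array_shape, chunk_shape))


# A 100 x 100 array with 30 x 50 chunks needs a 4 x 2 grid of chunks.
print(chunk_grid_shape((100, 100), (30, 50)))  # (4, 2)
```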

__eq__

__eq__(other: Any) -> bool

Two manifests are equal if all of their entries are identical.

__init__

__init__(
    entries: dict,
    shape: tuple[int, ...] | None = None,
    separator: ChunkKeySeparator = ".",
) -> None

Create a ChunkManifest from a dictionary mapping zarr chunk keys to byte ranges.

Parameters:

  • entries (dict) –

    Chunk keys and byte range information, as a dictionary of the form

    {
        "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
        "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
        "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
        "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
    }
    
  • separator (ChunkKeySeparator, default: '.' ) –

    The chunk key separator, as specified by the array's chunk_key_encoding metadata. Either "." (the default for Zarr v2 encoding) or "/" (the default for Zarr v3 encoding).

dict

dict() -> ChunkDict

Convert the entire manifest to a nested dictionary.

The returned dict will be of the form

{
    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
    "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
    "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}

Entries whose path is an empty string will be interpreted as missing chunks and omitted from the dictionary.
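The empty path acts as a sentinel for missing chunks. A small sketch of the omission rule described above (hypothetical code, mirroring the documented behaviour rather than the implementation):

```python
raw = {
    "0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.1": {"path": "", "offset": 0, "length": 0},  # missing chunk
}

# Entries with an empty path are dropped, as dict() does.
present = {key: entry for key, entry in raw.items() if entry["path"]}
print(list(present))  # ['0.0']
```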

elementwise_eq

elementwise_eq(other: 'ChunkManifest') -> ndarray

Return boolean array where True means that chunk entry matches.

from_arrays classmethod

from_arrays(
    *,
    paths: ndarray[Any, StringDType],
    offsets: ndarray[Any, dtype[uint64]],
    lengths: ndarray[Any, dtype[uint64]],
    validate_paths: bool = True,
) -> "ChunkManifest"

Create manifest directly from numpy arrays containing the path and byte range information.

Useful if you want to avoid the memory overhead of creating an intermediate dictionary first, as these 3 arrays are what will be used internally to store the references.

Parameters:

  • paths (ndarray[Any, StringDType]) –

    Array containing the paths to the chunks

  • offsets (ndarray[Any, dtype[uint64]]) –

    Array containing the byte offsets of the chunks

  • lengths (ndarray[Any, dtype[uint64]]) –

    Array containing the byte lengths of the chunks

  • validate_paths (bool, default: True ) –

    Check that entries in the manifest are valid paths (e.g. that local paths are absolute not relative). Set to False to skip validation for performance reasons.
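The three arrays are parallel: element [i, j, ...] of each array describes chunk "i.j....". A sketch of building them with numpy and of their correspondence to the dict form (paths here use numpy's default fixed-width string dtype for illustration, whereas the real signature expects StringDType; calling from_arrays itself requires virtualizarr):

```python
import numpy as np

# 2 x 2 chunk grid; element [i, j] of each array describes chunk "i.j".
paths = np.full((2, 2), "s3://bucket/foo.nc")
offsets = np.array([[100, 200], [300, 400]], dtype=np.uint64)
lengths = np.full((2, 2), 100, dtype=np.uint64)

# Equivalent dict form, as accepted by ChunkManifest.__init__:
entries = {
    f"{i}.{j}": {
        "path": str(paths[i, j]),
        "offset": int(offsets[i, j]),
        "length": int(lengths[i, j]),
    }
    for i in range(2)
    for j in range(2)
}
print(entries["0.1"])  # {'path': 's3://bucket/foo.nc', 'offset': 200, 'length': 100}
```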

get_entry

get_entry(indices: tuple[int, ...]) -> ChunkEntry | None

Look up a chunk entry by grid indices. Returns None for missing chunks (empty path).

iter_nonempty_paths

iter_nonempty_paths() -> Iterator[str]

Yield all non-empty paths in the manifest.

iter_refs

iter_refs() -> Iterator[tuple[tuple[int, ...], ChunkEntry]]

Yield (grid_indices, chunk_entry) for every non-missing chunk.

rename_paths

rename_paths(new: str | Callable[[str], str]) -> 'ChunkManifest'

Rename paths to chunks in this manifest.

Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.

Parameters:

  • new (str | Callable[[str], str]) –

    New path to use for all chunks, either as a string, or as a function which accepts and returns strings.

Returns:

  • manifest ( ChunkManifest ) –

See Also

ManifestArray.rename_paths

Examples:

Rename paths to reflect moving the referenced files from local storage to an S3 bucket.

>>> def local_to_s3_url(old_local_path: str) -> str:
...     from pathlib import Path
...
...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
...     filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> manifest.rename_paths(local_to_s3_url)

virtualizarr.manifests.ManifestArray

Virtualized array representation of the chunk data in a single Zarr Array.

Supports concatenation / stacking, but only if the two arrays to be concatenated have the same codecs.

Cannot be directly altered.

Implements a subset of the array API standard such that it can be wrapped by xarray. Doesn't store the zarr array name, zattrs or ARRAY_DIMENSIONS, as those can instead be stored on a wrapping xarray object.

chunks property

chunks: tuple[int, ...]

Size of each chunk, in number of elements along each dimension.

dtype property

dtype: dtype

The native dtype of the data (typically a numpy dtype)

nbytes_virtual property

nbytes_virtual: int

The total number of bytes required to hold these virtual references in memory.

Notes

This is not the size of the referenced array if it were actually loaded into memory (use .nbytes), this is only the size of the pointers to the chunk locations. If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.

shape property

shape: tuple[int, ...]

Array shape by number of elements along each dimension.

__array_function__

__array_function__(func, types, args, kwargs) -> Any

Hook to teach this class what to do if np.concat etc. is called on it.

Use this instead of __array_namespace__ so that we don't make promises we can't keep.

__array_ufunc__

__array_ufunc__(ufunc, method, *inputs, **kwargs) -> Any

We have to define this in order to convince xarray that this class is a duckarray, even though we will never support ufuncs.

__eq__

__eq__(other: Union[int, float, bool, ndarray, ManifestArray]) -> ndarray

Element-wise equality checking.

Returns a numpy array of booleans.

__getitem__

__getitem__(key: T_Indexer) -> ManifestArray

Perform numpy-style indexing on this ManifestArray.

Only supports limited indexing, because in general you cannot slice inside of a compressed chunk. Mainly required because Xarray uses this instead of expand_dims (by passing None) and often indexes with a no-op.

Could potentially support indexing with slices aligned along chunk boundaries, but currently does not.

Parameters:

  • key (T_Indexer) –
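For instance, the keys xarray most commonly passes are full slices (a no-op) and None (insert a new axis). A hypothetical check for whether a tuple key leaves an array untouched, illustrating the no-op case:

```python
def is_noop_indexer(key: tuple, ndim: int) -> bool:
    """True if `key` selects the whole array unchanged: one full slice per dimension."""
    full = slice(None)
    return len(key) == ndim and all(k == full for k in key)


print(is_noop_indexer((slice(None), slice(None)), ndim=2))  # True
print(is_noop_indexer((slice(None), None), ndim=2))         # False (inserts an axis)
```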

__init__

__init__(
    metadata: ArrayV3Metadata | dict, chunkmanifest: dict | ChunkManifest
) -> None

Create a ManifestArray directly from the metadata of a zarr array and the manifest of chunks.

astype

astype(dtype: dtype, /, *, copy: bool = True) -> ManifestArray

Cannot change the dtype, but needed because xarray will call this even when it's a no-op.

rename_paths

rename_paths(new: str | Callable[[str], str]) -> ManifestArray

Rename paths to chunks in this array's manifest.

Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.

Parameters:

  • new (str | Callable[[str], str]) –

    New path to use for all chunks, either as a string, or as a function which accepts and returns strings.

See Also

ChunkManifest.rename_paths

Examples:

Rename paths to reflect moving the referenced files from local storage to an S3 bucket.

>>> def local_to_s3_url(old_local_path: str) -> str:
...     from pathlib import Path
...
...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
...     filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> marr.rename_paths(local_to_s3_url)

to_virtual_variable

to_virtual_variable() -> Variable

Create a "virtual" xarray.Variable containing the contents of one zarr array.

The returned variable will be "virtual", i.e. it will wrap a single ManifestArray object.

virtualizarr.manifests.ManifestGroup

Bases: Mapping[str, 'ManifestArray | ManifestGroup']

Immutable representation of a single virtual zarr group.

arrays property

ManifestArrays contained in this group.

groups property

groups: dict[str, 'ManifestGroup']

Subgroups contained in this group.

metadata property

metadata: GroupMetadata

Zarr group metadata.

__getitem__

__getitem__(path: str) -> 'ManifestArray | ManifestGroup'

Obtain a group member.

__init__

__init__(
    arrays: Mapping[str, ManifestArray] | None = None,
    groups: Mapping[str, ManifestGroup] | None = None,
    attributes: dict | None = None,
) -> None

Create a ManifestGroup containing ManifestArrays and/or sub-groups, as well as any group-level metadata.

to_virtual_dataset

to_virtual_dataset() -> Dataset

Create a "virtual" xarray.Dataset containing the contents of one zarr group.

All variables in the returned Dataset will be "virtual", i.e. they will wrap ManifestArray objects.

to_virtual_datasets

to_virtual_datasets() -> Mapping[str, Dataset]

Create a dictionary containing virtual datasets for all the sub-groups of a ManifestGroup. All the variables in the datasets will be "virtual", i.e., they will wrap ManifestArray objects.

It is convenient to have a to_virtual_datasets function separate from to_virtual_datatree so that it can be called recursively without needing to use DataTree.to_dict() and DataTree.from_dict() repeatedly.

to_virtual_datatree

to_virtual_datatree() -> DataTree

Create a "virtual" xarray.DataTree containing the contents of one zarr group.

All variables in the returned DataTree will be "virtual", i.e. they will wrap ManifestArray objects.

virtualizarr.manifests.ManifestStore

Bases: Store

A read-only Zarr store that uses obstore to read data from inside arbitrary files on AWS, GCP, Azure, or a local filesystem.

Requests from the Zarr API are redirected using the ManifestGroup, which contains multiple ManifestArrays, allowing the store to interface virtually with underlying data in other file formats.

Warnings

ManifestStore is experimental and subject to API changes without notice. Please raise an issue with any comments/concerns about the store.

__init__

__init__(
    group: ManifestGroup, *, registry: ObjectStoreRegistry | None = None
) -> None

Instantiate a new ManifestStore.

to_virtual_dataset

to_virtual_dataset(
    group="",
    loadable_variables: Iterable[str] | None = None,
    decode_times: bool | None = None,
) -> "xr.Dataset"

Create a "virtual" xarray.Dataset containing the contents of one zarr group.

Will ignore the contents of any other groups in the store.

Requires xarray.

Parameters:

  • group (str, default: '' ) –
  • loadable_variables (Iterable[str], default: None ) –

Returns:

  • vds ( Dataset ) –

to_virtual_datatree

to_virtual_datatree(
    group="",
    *,
    loadable_variables: Iterable[str] | None = None,
    decode_times: bool | None = None,
) -> "xr.DataTree"

Create a "virtual" xarray.DataTree containing the contents of a zarr group. Default is the root group and all sub-groups.

Will ignore the contents of any other groups in the store.

Requires xarray.

Parameters:

  • group (str, default: '' ) –

    Group to convert to a virtual DataTree.
  • loadable_variables (Iterable[str] | None, default: None ) –

    Variables in the data source to load as Dask/NumPy arrays instead of as virtual arrays.

  • decode_times (bool | None, default: None ) –

    Bool that is passed into xarray.open_dataset. Allows time to be decoded into a datetime object.

Returns:

  • vdt ( DataTree ) –

Registry

Note

virtualizarr.registry.ObjectStoreRegistry has been deprecated. Please use obspec_utils.registry.ObjectStoreRegistry instead.

Array API

VirtualiZarr's virtualizarr.manifests.ManifestArray objects support a limited subset of the Python Array API standard in virtualizarr.manifests.array_api.

virtualizarr.manifests.array_api.concatenate

concatenate(
    arrays: tuple[ManifestArray, ...] | list[ManifestArray],
    /,
    *,
    axis: int | None = 0,
) -> ManifestArray

Concatenate ManifestArrays by merging their chunk manifests.

The signature of this function is array API compliant, so that it can be called by xarray.concat.
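Conceptually, merging manifests along an axis means shifting the chunk indices of the later arrays by the chunk-grid size of the earlier ones. A pure-Python sketch of that remapping on the dict form (not the actual implementation, which operates on the internal numpy arrays):

```python
def shift_keys(manifest: dict, offset: int, axis: int = 0) -> dict:
    """Shift the chunk index along `axis` by `offset` in every chunk key."""
    out = {}
    for key, entry in manifest.items():
        indices = [int(i) for i in key.split(".")]
        indices[axis] += offset
        out[".".join(map(str, indices))] = entry
    return out


a = {"0.0": {"path": "a.nc", "offset": 0, "length": 10}}
b = {"0.0": {"path": "b.nc", "offset": 0, "length": 10}}

# Concatenating along axis 0: b's chunk grid starts after a's (grid size 1 here).
merged = {**a, **shift_keys(b, offset=1)}
print(sorted(merged))  # ['0.0', '1.0']
```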

virtualizarr.manifests.array_api.stack

stack(
    arrays: tuple[ManifestArray, ...] | list[ManifestArray], /, *, axis: int = 0
) -> ManifestArray

Stack ManifestArrays by merging their chunk manifests.

The signature of this function is array API compliant, so that it can be called by xarray.stack.

virtualizarr.manifests.array_api.expand_dims

expand_dims(x: ManifestArray, /, *, axis: int = 0) -> ManifestArray

Expands the shape of an array by inserting a new axis (dimension) of size one at the position specified by axis.

virtualizarr.manifests.array_api.broadcast_to

broadcast_to(x: ManifestArray, /, shape: tuple[int, ...]) -> ManifestArray

Broadcasts a ManifestArray to a specified shape, by either adjusting chunk keys or copying chunk manifest entries.

Parallelization

Parallelizing virtual reference generation can be done using a number of parallel execution frameworks. Advanced users may want to call one of these executors directly. See the docs page on Scaling.

virtualizarr.parallel.SerialExecutor

Bases: Executor

A custom Executor that runs tasks sequentially, mimicking the concurrent.futures.Executor interface. Useful as a default and for debugging.
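To see what "mimicking the concurrent.futures.Executor interface" means, here is a minimal serial executor sketch built purely on the stdlib (not VirtualiZarr's implementation): overriding submit to run the task immediately is enough, since the base class derives map from submit:

```python
from concurrent.futures import Executor, Future
from typing import Any, Callable, TypeVar

T = TypeVar("T")


class MinimalSerialExecutor(Executor):
    """Runs every submitted task immediately in the calling thread."""

    def submit(self, fn: Callable[..., T], /, *args: Any, **kwargs: Any) -> Future[T]:
        future: Future[T] = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        return future


with MinimalSerialExecutor() as executor:
    results = list(executor.map(lambda x: x * 2, [1, 2, 3]))
print(results)  # [2, 4, 6]
```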

map

map(
    fn: Callable[..., T],
    *iterables: Iterable[Any],
    timeout: float | None = None,
    chunksize: int = 1,
    buffersize: int | None = None,
) -> Iterator[T]

Execute a function over an iterable sequentially.

Parameters:

  • fn (Callable[..., T]) –

    Function to apply to each item

  • *iterables (Iterable[Any], default: () ) –

    Iterables to process

  • timeout (float | None, default: None ) –

    Optional timeout (ignored in serial execution)

  • chunksize (int, default: 1 ) –

    Ignored in serial execution

  • buffersize (int | None, default: None ) –

    Ignored in serial execution (added in Python 3.14)

Returns:

  • Generator of results

submit

submit(fn: Callable[..., T], /, *args: Any, **kwargs: Any) -> Future[T]

Submit a callable to be executed.

Unlike parallel executors, this runs the task immediately and sequentially.

Parameters:

  • fn (Callable[..., T]) –

    The callable to execute

  • *args (Any, default: () ) –

    Positional arguments for the callable

  • **kwargs (Any, default: {} ) –

    Keyword arguments for the callable

Returns:

  • A Future representing the result of the execution

virtualizarr.parallel.DaskDelayedExecutor

Bases: Executor

An Executor that uses dask.delayed for parallel computation.

This executor mimics the concurrent.futures.Executor interface but uses Dask's delayed computation model.

__init__

__init__() -> None

Initialize the Dask Delayed Executor.

map

map(
    fn: Callable[..., T],
    *iterables: Iterable[Any],
    timeout: float | None = None,
    chunksize: int = 1,
    buffersize: int | None = None,
) -> Iterator[T]

Apply a function to an iterable using dask.delayed.

Parameters:

  • fn (Callable[..., T]) –

    Function to apply to each item

  • *iterables (Iterable[Any], default: () ) –

    Iterables to process

  • timeout (float | None, default: None ) –

    Optional timeout (ignored in Dask execution)

  • chunksize (int, default: 1 ) –

    Ignored in Dask execution

  • buffersize (int | None, default: None ) –

    Ignored in Dask execution (added in Python 3.14)

Returns:

  • Generator of results

submit

submit(fn: Callable[..., T], /, *args: Any, **kwargs: Any) -> Future[T]

Submit a task to be computed with dask.delayed.

Parameters:

  • fn (Callable[..., T]) –

    The callable to execute

  • *args (Any, default: () ) –

    Positional arguments for the callable

  • **kwargs (Any, default: {} ) –

    Keyword arguments for the callable

Returns:

  • A Future representing the result of the execution

virtualizarr.parallel.LithopsEagerFunctionExecutor

Bases: Executor

Lithops-based function executor that follows the concurrent.futures.Executor API.

Only required because lithops doesn't follow the concurrent.futures.Executor API. See lithops-cloud/lithops#1427.

compatible_callable

Bases: Generic[P, T]

Wraps a callable to make it fully compatible with Lithops.

This wrapper deals with 2 oddities in Lithops:

  1. Use of functools.partial, which Lithops fails to recognize as being callable. This is likely due to the builtin partial class using slots, which causes Lithops to not recognize the __call__ method as a method. See lithops-cloud/lithops#1428.
  2. Use of generic function wrappers that define generic args and kwargs parameters. In this case, because of the way Lithops inspects function signatures to determine how to pass arguments, it does not properly "spread" arguments as normally expected. Instead, it collects all positional arguments and associates them with the keyword argument "args", which is utterly unhelpful.
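The first oddity can be worked around by hiding the partial behind a plain function whose signature Lithops can inspect normally, which is roughly the idea behind compatible_callable (a standalone sketch, not the actual wrapper):

```python
from functools import partial


def add(a: int, b: int) -> int:
    return a + b


add_five = partial(add, 5)


# Lithops may fail to recognize `add_five` as callable (see
# lithops-cloud/lithops#1428), so forward through an ordinary function.
def unwrapped(*args, **kwargs):
    return add_five(*args, **kwargs)


print(unwrapped(3))  # 8
```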

map

map(
    fn: Callable[..., T],
    *iterables: Iterable[Any],
    timeout: float | None = None,
    chunksize: int = 1,
    buffersize: int | None = None,
) -> Iterator[T]

Apply a function to an iterable using lithops.

Only needed because lithops.executors.FunctionExecutor.map returns futures, unlike concurrent.futures.Executor.map.

Parameters:

  • fn (Callable[..., T]) –

    Function to apply to each item

  • *iterables (Iterable[Any], default: () ) –

    Iterables to process

  • timeout (float | None, default: None ) –

    Optional timeout (ignored in Lithops execution)

  • chunksize (int, default: 1 ) –

    Ignored in Lithops execution

  • buffersize (int | None, default: None ) –

    Ignored in Lithops execution (added in Python 3.14)

Returns:

  • Generator of results

shutdown

shutdown(wait: bool = True, *, cancel_futures: bool = False) -> None

Shutdown the executor.

Parameters:

  • wait (bool, default: True ) –

    Whether to wait for pending futures.

submit

submit(fn: Callable[..., T], /, *args: Any, **kwargs: Any) -> Future[T]

Submit a task to be computed using lithops.

Parameters:

  • fn (Callable[..., T]) –

    The callable to execute

  • *args (Any, default: () ) –

    Positional arguments for the callable

  • **kwargs (Any, default: {} ) –

    Keyword arguments for the callable

Returns:

  • A concurrent.futures.Future representing the result of the execution