Developer API¶
If you want to write a new parser to create virtual references pointing to a custom file format, you will need to use VirtualiZarr's internal classes. See the page on custom parsers for more information.
Manifests¶
VirtualiZarr uses these classes to store virtual references internally. See the page on data structures for more information.
virtualizarr.manifests.ChunkManifest ¶
In-memory representation of a single Zarr chunk manifest.
Stores the manifest internally as numpy arrays, so the most efficient way to create this object is via the .from_arrays constructor classmethod.
The manifest can be converted to or from a dictionary which looks like this
{
"0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
"0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
"0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
"0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}
using the __init__() and dict() methods, so users of this class can think of the manifest as if it were a dict mapping zarr chunk keys to byte ranges.
(See the chunk manifest SPEC proposal in zarr-developers/zarr-specs#287.)
Validation is done when this object is instantiated, and this class is immutable, so it's not possible to have a ChunkManifest object that does not represent a valid grid of chunks.
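The relationship between chunk keys and positions in the chunk grid can be sketched in plain Python. The helper below is purely illustrative and is not part of VirtualiZarr's API:

```python
# Illustrative only: a hypothetical helper showing how zarr chunk keys
# relate to positions in the chunk grid (not part of VirtualiZarr's API).
entries = {
    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
    "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
    "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}

def grid_shape(keys, separator="."):
    """Infer the chunk-grid shape from a set of chunk keys."""
    indices = [tuple(int(i) for i in k.split(separator)) for k in keys]
    return tuple(max(ix) + 1 for ix in zip(*indices))

print(grid_shape(entries))  # (1, 2, 2)
```

This is the kind of grid consistency that ChunkManifest validates at instantiation time.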
nbytes property ¶
nbytes: int
Size required to hold these references in memory in bytes.
Note this is not the size of the referenced chunks if they were actually loaded into memory, this is only the size of the pointers to the chunk locations. If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
ndim_chunk_grid property ¶
ndim_chunk_grid: int
Number of dimensions in the chunk grid.
Not the same as the dimension of an array backed by this chunk manifest.
shape_chunk_grid property ¶
Number of separate chunks along each dimension.
Not the same as the shape of an array backed by this chunk manifest.
__init__ ¶
__init__(
entries: dict,
shape: tuple[int, ...] | None = None,
separator: ChunkKeySeparator = ".",
) -> None
Create a ChunkManifest from a dictionary mapping zarr chunk keys to byte ranges.
Parameters:
- entries (dict) – Chunk keys and byte range information, as a dictionary of the form
{
"0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
"0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
"0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
"0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}
- separator (ChunkKeySeparator, default: '.') – The chunk key separator, as specified by the array's chunk_key_encoding metadata. Either "." (the default here, and zarr v2's default separator) or "/" (zarr v3's default separator).
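The effect of the separator choice can be sketched in plain Python; the same chunk grid is expressed with "." keys versus "/" keys (illustration only):

```python
# Hypothetical illustration of the `separator` parameter: the same chunk
# grid expressed with "." chunk keys versus "/" chunk keys.
dot_entries = {
    "0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
}
slash_entries = {k.replace(".", "/"): v for k, v in dot_entries.items()}
print(list(slash_entries))  # ['0/0', '0/1']
```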
dict ¶
dict() -> ChunkDict
Convert the entire manifest to a nested dictionary.
The returned dict will be of the form
{
"0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
"0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
"0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
"0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
}
Entries whose path is an empty string will be interpreted as missing chunks and omitted from the dictionary.
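The missing-chunk convention can be sketched as a simple filter (plain dicts stand in for the manifest; illustration only):

```python
# Sketch of the missing-chunk convention: entries with an empty path
# represent missing chunks and are omitted from the dict form.
raw = {
    "0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.1": {"path": "", "offset": 0, "length": 0},  # missing chunk
}
present = {k: v for k, v in raw.items() if v["path"] != ""}
print(list(present))  # ['0.0']
```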
elementwise_eq ¶
elementwise_eq(other: 'ChunkManifest') -> ndarray
Return boolean array where True means that chunk entry matches.
from_arrays classmethod ¶
from_arrays(
*,
paths: ndarray[Any, StringDType],
offsets: ndarray[Any, dtype[uint64]],
lengths: ndarray[Any, dtype[uint64]],
validate_paths: bool = True,
) -> "ChunkManifest"
Create manifest directly from numpy arrays containing the path and byte range information.
Useful if you want to avoid the memory overhead of creating an intermediate dictionary first, as these 3 arrays are what will be used internally to store the references.
Parameters:
- paths (ndarray[Any, StringDType]) – Array containing the paths to the chunks.
- offsets (ndarray[Any, dtype[uint64]]) – Array containing the byte offsets of the chunks.
- lengths (ndarray[Any, dtype[uint64]]) – Array containing the byte lengths of the chunks.
- validate_paths (bool, default: True) – Check that entries in the manifest are valid paths (e.g. that local paths are absolute, not relative). Set to False to skip validation for performance reasons.
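The three-parallel-arrays layout can be sketched with numpy. Note this is only an illustration of the layout: the real from_arrays expects the paths in a numpy StringDType array, for which a plain list stands in here:

```python
import numpy as np

# Sketch of the three parallel arrays backing a manifest, one element
# per chunk in the grid. VirtualiZarr's real `from_arrays` expects the
# paths in a numpy StringDType array; a plain list stands in for it.
paths = ["s3://bucket/foo.nc"] * 4
offsets = np.array([100, 200, 300, 400], dtype=np.uint64)
lengths = np.array([100, 100, 100, 100], dtype=np.uint64)

# Reshape to the 2x2 chunk grid so element [i, j] describes chunk "i.j".
offsets = offsets.reshape(2, 2)
lengths = lengths.reshape(2, 2)
print(offsets[0, 1], lengths[0, 1])  # 200 100
```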
get_entry ¶
Look up a chunk entry by grid indices. Returns None for missing chunks (empty path).
iter_nonempty_paths ¶
Yield all non-empty paths in the manifest.
iter_refs ¶
Yield (grid_indices, chunk_entry) for every non-missing chunk.
rename_paths ¶
Rename paths to chunks in this manifest.
Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.
Parameters:
- new (str | Callable[[str], str]) – New path to use for all chunks, either as a string, or as a function which accepts and returns strings.
Returns:
- manifest –
See Also
ManifestArray.rename_paths
Examples:
Rename paths to reflect moving the referenced files from local storage to an S3 bucket.
>>> def local_to_s3_url(old_local_path: str) -> str:
... from pathlib import Path
...
... new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
... filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> manifest.rename_paths(local_to_s3_url)
virtualizarr.manifests.ManifestArray ¶
Virtualized array representation of the chunk data in a single Zarr Array.
Supports concatenation / stacking, but only if the two arrays to be concatenated have the same codecs.
Cannot be directly altered.
Implements a subset of the array API standard such that it can be wrapped by xarray. Doesn't store the zarr array name, zattrs or ARRAY_DIMENSIONS, as those can instead be stored on a wrapping xarray object.
nbytes_virtual property ¶
nbytes_virtual: int
The total number of bytes required to hold these virtual references in memory.
Notes
This is not the size of the referenced array if it were actually loaded into memory (use .nbytes),
this is only the size of the pointers to the chunk locations.
If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
__array_function__ ¶
__array_function__(func, types, args, kwargs) -> Any
Hook to teach this class what to do if np.concat etc. is called on it.
Use this instead of __array_namespace__ so that we don't make promises we can't keep.
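The __array_function__ protocol that this hook relies on can be demonstrated with a minimal duck array. The class below is purely illustrative (it is not how ManifestArray is implemented), but it shows how a class can intercept numpy functions like np.concatenate without supporting the full numpy API:

```python
import numpy as np

# Minimal sketch of the NEP 18 __array_function__ protocol: intercept
# np.concatenate for our own type while rejecting everything else.
class Tagged:
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        if func is np.concatenate:
            arrays = [a.data for a in args[0]]
            return Tagged(np.concatenate(arrays, **kwargs))
        return NotImplemented  # everything else is unsupported

a, b = Tagged([1, 2]), Tagged([3])
c = np.concatenate([a, b])  # dispatches to Tagged.__array_function__
print(c.data)  # [1 2 3]
```

Returning NotImplemented for unsupported functions is what lets the class avoid "making promises it can't keep".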
__array_ufunc__ ¶
__array_ufunc__(ufunc, method, *inputs, **kwargs) -> Any
We have to define this in order to convince xarray that this class is a duckarray, even though we will never support ufuncs.
__eq__ ¶
Element-wise equality checking.
Returns a numpy array of booleans.
__getitem__ ¶
__getitem__(key: T_Indexer) -> ManifestArray
Perform numpy-style indexing on this ManifestArray.
Only supports limited indexing, because in general you cannot slice inside of a compressed chunk. Mainly required because Xarray uses this instead of expand_dims (by passing None) and often will index with a no-op.
Could potentially support indexing with slices aligned along chunk boundaries, but currently does not.
Parameters:
- key (T_Indexer) –
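The distinction between indexers that are cheap at the manifest level and those that would require slicing inside chunks can be sketched with a hypothetical helper (not part of VirtualiZarr's API):

```python
# Hypothetical sketch: full slices are a no-op and None only inserts a
# new axis (as xarray's expand_dims does), so both can be handled
# without reading chunk data; partial slices cannot.
def is_cheap_indexer(key, ndim):
    key = key if isinstance(key, tuple) else (key,)
    non_none = [k for k in key if k is not None]
    return len(non_none) == ndim and all(k == slice(None) for k in non_none)

print(is_cheap_indexer((slice(None), None, slice(None)), ndim=2))  # True
print(is_cheap_indexer((slice(0, 3),), ndim=1))  # False
```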
__init__ ¶
__init__(
metadata: ArrayV3Metadata | dict, chunkmanifest: dict | ChunkManifest
) -> None
Create a ManifestArray directly from the metadata of a zarr array and the manifest of chunks.
Parameters:
- metadata (dict or ArrayV3Metadata) –
- chunkmanifest (dict or ChunkManifest) –
astype ¶
astype(dtype: dtype, /, *, copy: bool = True) -> ManifestArray
Cannot change the dtype, but needed because xarray will call this even when it's a no-op.
rename_paths ¶
rename_paths(new: str | Callable[[str], str]) -> ManifestArray
Rename paths to chunks in this array's manifest.
Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.
Parameters:
- new (str | Callable[[str], str]) – New path to use for all chunks, either as a string, or as a function which accepts and returns strings.
Returns:
See Also
ChunkManifest.rename_paths
Examples:
Rename paths to reflect moving the referenced files from local storage to an S3 bucket.
>>> def local_to_s3_url(old_local_path: str) -> str:
... from pathlib import Path
...
... new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
... filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> marr.rename_paths(local_to_s3_url)
virtualizarr.manifests.ManifestGroup ¶
Bases: Mapping[str, 'ManifestArray | ManifestGroup']
Immutable representation of a single virtual zarr group.
__init__ ¶
__init__(
arrays: Mapping[str, ManifestArray] | None = None,
groups: Mapping[str, ManifestGroup] | None = None,
attributes: dict | None = None,
) -> None
Create a ManifestGroup containing ManifestArrays and/or sub-groups, as well as any group-level metadata.
Parameters:
- arrays (Mapping[str, ManifestArray], default: None) – ManifestArray objects to represent virtual zarr arrays.
- groups (Mapping[str, ManifestGroup], default: None) – ManifestGroup objects to represent virtual zarr sub-groups.
- attributes (dict, default: None) – Zarr attributes to add as zarr group metadata.
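Because ManifestGroup is a Mapping of names to arrays or sub-groups, a group hierarchy can be walked like any nested mapping. Plain dicts stand in for ManifestArray and ManifestGroup below (illustration only):

```python
# Sketch of walking a nested group hierarchy; dicts stand in for
# ManifestGroup and the string "array" for ManifestArray.
root = {
    "air": "array",
    "sub": {"temperature": "array", "deeper": {"pressure": "array"}},
}

def array_paths(group, prefix=""):
    """Yield the full path of every array in a nested group mapping."""
    for name, value in group.items():
        path = f"{prefix}/{name}"
        if isinstance(value, dict):
            yield from array_paths(value, path)
        else:
            yield path

print(sorted(array_paths(root)))
# ['/air', '/sub/deeper/pressure', '/sub/temperature']
```

This kind of recursive walk is what to_virtual_datasets performs over sub-groups.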
to_virtual_dataset ¶
to_virtual_dataset() -> Dataset
Create a "virtual" xarray.Dataset containing the contents of one zarr group.
All variables in the returned Dataset will be "virtual", i.e. they will wrap ManifestArray objects.
to_virtual_datasets ¶
Create a dictionary containing virtual datasets for all the sub-groups of a ManifestGroup. All the variables in the datasets will be "virtual", i.e., they will wrap ManifestArray objects.
It is convenient to have a to_virtual_datasets function separate from to_virtual_datatree so that it can be called recursively without needing to use DataTree.to_dict() and DataTree.from_dict() repeatedly.
to_virtual_datatree ¶
to_virtual_datatree() -> DataTree
Create a "virtual" xarray.DataTree containing the contents of one zarr group.
All variables in the returned DataTree will be "virtual", i.e. they will wrap ManifestArray objects.
virtualizarr.manifests.ManifestStore ¶
Bases: Store
A read-only Zarr store that uses obstore to read data from inside arbitrary files on AWS, GCP, Azure, or a local filesystem.
Requests from the Zarr API are redirected using the ManifestGroup containing multiple ManifestArrays, allowing virtual access to underlying data in other file formats.
Parameters:
- group (ManifestGroup) – Root group of the store. Contains group metadata, ManifestArrays, and any subgroups.
- registry (ObjectStoreRegistry, default: None) – ObjectStoreRegistry that maps the URL scheme and netloc to ObjectStore instances, allowing ManifestStores to read from different ObjectStore instances.
Warnings
ManifestStore is experimental and subject to API changes without notice. Please raise an issue with any comments/concerns about the store.
__init__ ¶
__init__(
group: ManifestGroup, *, registry: ObjectStoreRegistry | None = None
) -> None
Instantiate a new ManifestStore.
Parameters:
- group (ManifestGroup) – ManifestGroup containing group metadata and mapping variable names to ManifestArrays.
- registry (ObjectStoreRegistry | None, default: None) – A registry mapping the URL scheme and netloc to ObjectStore instances, allowing ManifestStores to read from different ObjectStore instances.
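The registry lookup idea can be sketched with the standard library: map the (scheme, netloc) of each chunk URL to the store that should serve it. Strings stand in for ObjectStore instances below; this is an illustration of the concept, not the real registry API:

```python
from urllib.parse import urlparse

# Sketch of registry resolution: (scheme, netloc) of a chunk URL
# selects the object store. Strings stand in for ObjectStore instances.
registry = {
    ("s3", "bucket-a"): "s3-store-a",
    ("file", ""): "local-store",
}

def resolve(url):
    parsed = urlparse(url)
    return registry[(parsed.scheme, parsed.netloc)]

print(resolve("s3://bucket-a/foo.nc"))  # s3-store-a
print(resolve("file:///tmp/bar.nc"))    # local-store
```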
to_virtual_dataset ¶
to_virtual_dataset(
group="",
loadable_variables: Iterable[str] | None = None,
decode_times: bool | None = None,
) -> "xr.Dataset"
Create a "virtual" xarray.Dataset containing the contents of one zarr group.
Will ignore the contents of any other groups in the store.
Requires xarray.
Parameters:
- group (str, default: '') – Group within the store to convert to a virtual Dataset.
- loadable_variables (Iterable[str] | None, default: None) – Variables in the data source to load as Dask/NumPy arrays instead of as virtual arrays.
- decode_times (bool | None, default: None) – Bool that is passed into xarray.open_dataset. Allows time to be decoded into a datetime object.
Returns:
- vds (Dataset) –
to_virtual_datatree ¶
to_virtual_datatree(
group="",
*,
loadable_variables: Iterable[str] | None = None,
decode_times: bool | None = None,
) -> "xr.DataTree"
Create a "virtual" xarray.DataTree containing the contents of a zarr group. Default is the root group and all sub-groups.
Will ignore the contents of any other groups in the store.
Requires xarray.
Parameters:
- group (str, default: '') – Group to convert to a virtual DataTree.
- loadable_variables (Iterable[str] | None, default: None) – Variables in the data source to load as Dask/NumPy arrays instead of as virtual arrays.
- decode_times (bool | None, default: None) – Bool that is passed into xarray.open_dataset. Allows time to be decoded into a datetime object.
Returns:
- vdt (DataTree) –
Registry¶
Note
virtualizarr.registry.ObjectStoreRegistry has been deprecated. Please use obspec_utils.registry.ObjectStoreRegistry instead.
Array API¶
VirtualiZarr's virtualizarr.manifests.ManifestArray objects support a limited subset of the Python Array API standard in virtualizarr.manifests.array_api.
virtualizarr.manifests.array_api.concatenate ¶
concatenate(
arrays: tuple[ManifestArray, ...] | list[ManifestArray],
/,
*,
axis: int | None = 0,
) -> ManifestArray
Concatenate ManifestArrays by merging their chunk manifests.
The signature of this function is array API compliant, so that it can be called by xarray.concat.
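What "merging chunk manifests" means for concatenation along an axis can be sketched in plain Python: the second array's chunk keys are shifted by the first array's chunk-grid extent along that axis. Plain dicts stand in for manifests (illustration only, not the real implementation):

```python
# Sketch of manifest merging for concatenation: shift the second
# manifest's chunk indices along the concat axis, then merge the dicts.
def concat_manifests(m1, m2, axis=0, sep="."):
    offset = 1 + max(int(k.split(sep)[axis]) for k in m1)
    merged = dict(m1)
    for key, entry in m2.items():
        ix = [int(i) for i in key.split(sep)]
        ix[axis] += offset
        merged[sep.join(map(str, ix))] = entry
    return merged

m1 = {"0.0": {"path": "a.nc", "offset": 0, "length": 10}}
m2 = {"0.0": {"path": "b.nc", "offset": 0, "length": 10}}
print(sorted(concat_manifests(m1, m2)))  # ['0.0', '1.0']
```

No chunk data is read or moved; only the references are relabelled.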
virtualizarr.manifests.array_api.stack ¶
stack(
arrays: tuple[ManifestArray, ...] | list[ManifestArray], /, *, axis: int = 0
) -> ManifestArray
Stack ManifestArrays by merging their chunk manifests.
The signature of this function is array API compliant, so that it can be called by xarray.stack.
virtualizarr.manifests.array_api.expand_dims ¶
expand_dims(x: ManifestArray, /, *, axis: int = 0) -> ManifestArray
Expands the shape of an array by inserting a new axis (dimension) of size one at the position specified by axis.
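At the manifest level, inserting a size-1 axis amounts to inserting a "0" index into every chunk key. The sketch below uses plain dicts in place of the real manifest (illustration only):

```python
# Sketch of expand_dims on a chunk manifest: a new size-1 axis means
# every chunk key gains a "0" index at the given position.
def expand_keys(manifest, axis=0, sep="."):
    out = {}
    for key, entry in manifest.items():
        ix = key.split(sep)
        ix.insert(axis, "0")
        out[sep.join(ix)] = entry
    return out

m = {"0.1": {"path": "a.nc", "offset": 0, "length": 10}}
print(list(expand_keys(m, axis=1)))  # ['0.0.1']
```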
virtualizarr.manifests.array_api.broadcast_to ¶
broadcast_to(x: ManifestArray, /, shape: tuple[int, ...]) -> ManifestArray
Broadcasts a ManifestArray to a specified shape, by either adjusting chunk keys or copying chunk manifest entries.
Parallelization¶
Parallelizing virtual reference generation can be done using a number of parallel execution frameworks. Advanced users may want to call one of these executors directly. See the docs page on Scaling.
virtualizarr.parallel.SerialExecutor ¶
Bases: Executor
A custom Executor that runs tasks sequentially, mimicking the concurrent.futures.Executor interface. Useful as a default and for debugging.
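The pattern can be illustrated with a minimal serial executor that mimics the concurrent.futures.Executor interface: submit() runs the task immediately and wraps the result in an already-resolved Future. This is a sketch of the pattern, not the actual SerialExecutor implementation:

```python
from concurrent.futures import Future

# Minimal serial executor: same interface shape as
# concurrent.futures.Executor, but every task runs eagerly in-process.
class MiniSerialExecutor:
    def submit(self, fn, *args, **kwargs):
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        return future

    def map(self, fn, *iterables, **_ignored):
        # timeout/chunksize/buffersize have no effect in serial execution
        return (fn(*args) for args in zip(*iterables))

ex = MiniSerialExecutor()
print(ex.submit(pow, 2, 10).result())     # 1024
print(list(ex.map(pow, [2, 3], [3, 2])))  # [8, 9]
```

Because the interface matches, a serial executor like this can be swapped for a parallel one without changing calling code, which is what makes it useful as a default and for debugging.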
map ¶
map(
fn: Callable[..., T],
*iterables: Iterable[Any],
timeout: float | None = None,
chunksize: int = 1,
buffersize: int | None = None,
) -> Iterator[T]
Execute a function over an iterable sequentially.
Parameters:
- fn (Callable[..., T]) – Function to apply to each item.
- *iterables (Iterable[Any], default: ()) – Iterables to process.
- timeout (float | None, default: None) – Optional timeout (ignored in serial execution).
- chunksize (int, default: 1) – Ignored in serial execution.
- buffersize (int | None, default: None) – Ignored in serial execution (added in Python 3.14).
Returns:
- Generator of results.
submit ¶
Submit a callable to be executed.
Unlike parallel executors, this runs the task immediately and sequentially.
Parameters:
- fn (Callable[..., T]) – The callable to execute.
- *args (Any, default: ()) – Positional arguments for the callable.
- **kwargs (Any, default: {}) – Keyword arguments for the callable.
Returns:
- A Future representing the result of the execution.
virtualizarr.parallel.DaskDelayedExecutor ¶
Bases: Executor
An Executor that uses dask.delayed for parallel computation.
This executor mimics the concurrent.futures.Executor interface but uses Dask's delayed computation model.
map ¶
map(
fn: Callable[..., T],
*iterables: Iterable[Any],
timeout: float | None = None,
chunksize: int = 1,
buffersize: int | None = None,
) -> Iterator[T]
Apply a function to an iterable using dask.delayed.
Parameters:
- fn (Callable[..., T]) – Function to apply to each item.
- *iterables (Iterable[Any], default: ()) – Iterables to process.
- timeout (float | None, default: None) – Optional timeout (ignored in Dask execution).
- chunksize (int, default: 1) – Ignored in Dask execution.
- buffersize (int | None, default: None) – Ignored in Dask execution (added in Python 3.14).
Returns:
- Generator of results.
submit ¶
Submit a task to be computed with dask.delayed.
Parameters:
- fn (Callable[..., T]) – The callable to execute.
- *args (Any, default: ()) – Positional arguments for the callable.
- **kwargs (Any, default: {}) – Keyword arguments for the callable.
Returns:
- A Future representing the result of the execution.
virtualizarr.parallel.LithopsEagerFunctionExecutor ¶
Bases: Executor
Lithops-based function executor that follows the concurrent.futures.Executor API.
Only required because lithops doesn't follow the concurrent.futures.Executor API. See lithops-cloud/lithops#1427.
compatible_callable ¶
Bases: Generic[P, T]
Wraps a callable to make it fully compatible with Lithops.
This wrapper deals with 2 oddities in Lithops:
- Use of functools.partial, which Lithops fails to recognize as being callable. This is likely due to the builtin partial class using slots, which causes Lithops to not recognize the __call__ method as a method. See lithops-cloud/lithops#1428.
- Use of generic function wrappers that define generic args and kwargs parameters. In this case, because of the way Lithops inspects function signatures to determine how to pass arguments, it does not properly "spread" arguments as normally expected. Instead, it collects all positional arguments and associates them with the keyword argument "args", which is utterly unhelpful.
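The functools.partial workaround can be sketched as follows: wrap the partial in a plain function so that frameworks which inspect callables (as Lithops does) see an ordinary function with a normal __call__. This is an illustration of the idea, not the actual compatible_callable implementation:

```python
import functools

# Sketch: hide a functools.partial behind a plain function so that
# signature-inspecting frameworks see an ordinary callable.
def as_plain_function(p):
    def wrapper(*args, **kwargs):
        return p(*args, **kwargs)
    return wrapper

add_five = functools.partial(lambda x, y: x + y, 5)
fn = as_plain_function(add_five)
print(fn(3))  # 8
```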
map ¶
map(
fn: Callable[..., T],
*iterables: Iterable[Any],
timeout: float | None = None,
chunksize: int = 1,
buffersize: int | None = None,
) -> Iterator[T]
Apply a function to an iterable using lithops.
Only needed because lithops.executors.FunctionExecutor.map returns futures, unlike concurrent.futures.Executor.map.
Parameters:
- fn (Callable[..., T]) – Function to apply to each item.
- *iterables (Iterable[Any], default: ()) – Iterables to process.
- timeout (float | None, default: None) – Optional timeout (ignored in Lithops execution).
- chunksize (int, default: 1) – Ignored in Lithops execution.
- buffersize (int | None, default: None) – Ignored in Lithops execution (added in Python 3.14).
Returns:
- Generator of results.
shutdown ¶
Shutdown the executor.
Parameters:
- wait (bool, default: True) – Whether to wait for pending futures.
submit ¶
Submit a task to be computed using lithops.
Parameters:
- fn (Callable[..., T]) – The callable to execute.
- *args (Any, default: ()) – Positional arguments for the callable.
- **kwargs (Any, default: {}) – Keyword arguments for the callable.
Returns:
- A concurrent.futures.Future representing the result of the execution.