Virtual Datasets

VirtualiZarr has a small API surface because most of the complexity is handled by xarray functions such as xarray.concat and xarray.merge. Users can rely on xarray for every step apart from reading and serializing virtual references.

Reading

virtualizarr.open_virtual_dataset

open_virtual_dataset(
    url: str,
    registry: ObjectStoreRegistry,
    parser: Parser,
    drop_variables: Iterable[str] | None = None,
    loadable_variables: Iterable[str] | None = None,
    decode_times: bool | None = None,
) -> Dataset

Open an archival data source as an xarray.Dataset wrapping virtualized zarr arrays.

No data variables will be loaded unless specified in the loadable_variables kwarg (in which case they will open as lazily indexed arrays using xarray's standard lazy indexing classes). Coordinate variables are loaded by default following xarray's behavior.

Returns:

  • vds

    An xarray.Dataset containing virtual chunk references for all variables not included in loadable_variables and normal lazily indexed arrays for each variable in loadable_variables.
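
The split between virtual and loaded variables can be sketched in plain Python. This is a minimal sketch with hypothetical variable names; it ignores xarray's default loading of coordinate and index variables, and the real logic lives inside open_virtual_dataset:

```python
# Sketch: how drop_variables and loadable_variables partition a dataset's
# variables into virtual (manifest-backed) and loaded (lazily indexed) sets.
def partition_variables(all_vars, loadable_variables=None, drop_variables=None):
    loadable = set(loadable_variables or [])
    dropped = set(drop_variables or [])
    kept = [v for v in all_vars if v not in dropped]
    virtual = [v for v in kept if v not in loadable]
    loaded = [v for v in kept if v in loadable]
    return virtual, loaded

# "lon" is dropped entirely; "time" is loaded; the rest stay virtual.
virtual, loaded = partition_variables(
    ["air", "time", "lat", "lon"],
    loadable_variables=["time"],
    drop_variables=["lon"],
)
```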

virtualizarr.open_virtual_mfdataset

open_virtual_mfdataset(
    urls: str
    | PathLike
    | Sequence[str | PathLike]
    | NestedSequence[str | PathLike],
    registry: ObjectStoreRegistry,
    parser: Parser,
    concat_dim: str
    | DataArray
    | Index
    | Sequence[str]
    | Sequence[DataArray]
    | Sequence[Index]
    | None = None,
    compat: "CompatOptions" = "no_conflicts",
    preprocess: Callable[[Dataset], Dataset] | None = None,
    data_vars: Literal["all", "minimal", "different"] | list[str] = "all",
    coords="different",
    combine: Literal["by_coords", "nested"] = "by_coords",
    parallel: Literal["dask", "lithops", False] | type[Executor] = False,
    join: "JoinOptions" = "outer",
    attrs_file: str | PathLike | None = None,
    combine_attrs: "CombineAttrsOptions" = "override",
    **kwargs,
) -> Dataset

Open multiple data sources as a single virtual dataset.

This function is explicitly modelled after xarray.open_mfdataset, and works in the same way.

If combine='by_coords' then the function combine_by_coords is used to combine the datasets into one before returning the result, and if combine='nested' then combine_nested is used. The urls must be structured according to which combining function is used, the details of which are given in the documentation for combine_by_coords and combine_nested. By default combine='by_coords' will be used. Global attributes from the attrs_file are used for the combined dataset.
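
The required structure of urls for each combine mode can be sketched with plain lists (hypothetical filenames, not real objects):

```python
# combine="by_coords": a flat list; the join order is inferred from each
# dataset's coordinate values, so the ordering here does not matter.
flat_urls = [
    "s3://bucket/2021.nc",
    "s3://bucket/2020.nc",
    "s3://bucket/2022.nc",
]

# combine="nested" with concat_dim=["ensemble", "time"]: a nested list
# whose nesting depth matches len(concat_dim), outermost dimension first.
nested_urls = [
    ["s3://bucket/run1/2020.nc", "s3://bucket/run1/2021.nc"],
    ["s3://bucket/run2/2020.nc", "s3://bucket/run2/2021.nc"],
]
```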

Returns:

  • vds

    An xarray.Dataset containing virtual chunk references for all variables not included in loadable_variables and normal lazily indexed arrays for each variable in loadable_variables.

Notes

The results of opening each virtual dataset in parallel are sent back to the client process, so must not be too large. See the docs page on Scaling.
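
The pattern behind parallel opening can be illustrated with standard-library executors. This is a stand-in sketch: open_one is a trivial placeholder for open_virtual_dataset, and only the shape of the workflow (open in parallel, ship small reference objects back, combine on the client) is real:

```python
from concurrent.futures import ThreadPoolExecutor

urls = ["a.nc", "b.nc", "c.nc"]

def open_one(url):
    # Placeholder for open_virtual_dataset(url, ...): each worker returns
    # only small chunk references, not the chunk data itself.
    return {"url": url, "nrefs": 4}

with ThreadPoolExecutor() as pool:
    virtual_datasets = list(pool.map(open_one, urls))

# The reference objects are combined on the client process, which is why
# they must stay small.
total_refs = sum(vds["nrefs"] for vds in virtual_datasets)
```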

virtualizarr.open_virtual_datatree

open_virtual_datatree(
    url: str,
    registry: ObjectStoreRegistry,
    parser: Parser,
    *,
    loadable_variables: Iterable[str] | None = None,
    decode_times: bool | None = None,
) -> DataTree

Open an archival data source as an xarray.DataTree wrapping virtualized zarr arrays.

See the loadable_variables kwarg for a description of which data variables are loaded vs. virtualized.

Returns:

  • vdt

    An xarray.DataTree containing virtual chunk references for all variables.

Examples:

Virtualize a Cloud Optimized GeoTIFF (COG) using virtual_tiff.VirtualTIFF:

from obstore.store import S3Store

from virtualizarr import open_virtual_datatree
from obspec_utils.registry import ObjectStoreRegistry
from virtual_tiff import VirtualTIFF

# Access a public Sentinel-2 COG from AWS
store = S3Store("sentinel-cogs", region="us-west-2", skip_signature=True)
registry = ObjectStoreRegistry({"s3://sentinel-cogs/": store})
url = "s3://sentinel-cogs/sentinel-s2-l2a-cogs/12/S/UF/2022/6/S2B_12SUF_20220609_0_L2A/B04.tif"
parser = VirtualTIFF(ifd_layout="nested")

with open_virtual_datatree(url=url, parser=parser, registry=registry) as vdt:
    print(vdt)

Virtualize a NetCDF4 file using the virtualizarr.parsers.HDFParser:

from obstore.store import HTTPStore

from virtualizarr import open_virtual_datatree
from virtualizarr.parsers import HDFParser
from obspec_utils.registry import ObjectStoreRegistry

base = "https://github.com"
url = f"{base}/pydata/xarray-data/raw/refs/heads/master/precipitation.nc4"

store = HTTPStore(base)

parser = HDFParser()
registry = ObjectStoreRegistry({base: store})

vdt = open_virtual_datatree(url=url, registry=registry, parser=parser)
print(vdt)

Prevent loading any variables from any group (by default the coordinate variables "time", "lat", and "lon" are loaded):

vdt = open_virtual_datatree(
    url=url,
    registry=registry,
    parser=parser,
    loadable_variables=[],
)

Drop variables from a specific group after opening:

vdt = open_virtual_datatree(
    url=url,
    registry=registry,
    parser=parser,
)
vdt["/observed"] = vdt["/observed"].to_dataset().drop_vars(["lon"])

Information

virtualizarr.accessor.VirtualiZarrDatasetAccessor.nbytes property

nbytes: int

Size required to hold these references in memory in bytes.

Note that this is not the size the referenced chunks would occupy if actually loaded into memory; it is only the size of the pointers to the chunk locations. If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.

In-memory (loadable) variables are included in the total using xarray's normal .nbytes method.
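
A back-of-envelope comparison makes the scale difference concrete. The per-reference sizes below are assumptions for illustration (actual manifest storage varies):

```python
# Assumed size of one chunk reference: a path string plus two integers.
path_bytes = 80      # assumed average URL length
offset_bytes = 8     # uint64 byte offset into the file
length_bytes = 8     # uint64 chunk length
reference_nbytes = path_bytes + offset_bytes + length_bytes

# The chunk the reference points to.
chunk_nbytes = 1_000_000  # a 1 MB chunk

# The reference is several orders of magnitude smaller than the data.
ratio = chunk_nbytes / reference_nbytes
```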

Renaming paths

virtualizarr.accessor.VirtualiZarrDatasetAccessor.rename_paths

rename_paths(new: str | Callable[[str], str]) -> Dataset

Rename paths to chunks in every ManifestArray in this dataset.

Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.

Parameters:

  • new (str | Callable[[str], str]) –

    New path to use for all chunks, either as a string, or as a function which accepts and returns strings.

Returns:

  • Dataset

See Also

virtualizarr.ManifestArray.rename_paths

virtualizarr.ChunkManifest.rename_paths

Examples:

Rename paths to reflect moving the referenced files from local storage to an S3 bucket.

>>> def local_to_s3_url(old_local_path: str) -> str:
...     from pathlib import Path
...
...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
...     filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> ds.vz.rename_paths(local_to_s3_url)
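
The semantics of rename_paths can be sketched on a toy manifest. The dict layout here is a hypothetical stand-in (the real ChunkManifest stores paths, offsets, and lengths per chunk key in array form), but the string-vs-callable behaviour matches the signature above:

```python
# Toy model of renaming chunk paths: a string replaces every path,
# a callable maps old paths to new ones; offsets and lengths are untouched.
def rename_paths(manifest, new):
    make = (lambda _old: new) if isinstance(new, str) else new
    return {
        key: {**entry, "path": make(entry["path"])}
        for key, entry in manifest.items()
    }

manifest = {
    "0.0": {"path": "file:///data/a.nc", "offset": 0, "length": 100},
    "0.1": {"path": "file:///data/a.nc", "offset": 100, "length": 100},
}

# String form: every chunk now points at the same new path.
renamed = rename_paths(manifest, "s3://my_bucket/a.nc")

# Callable form: rewrite only the prefix, keeping filenames.
prefixed = rename_paths(
    manifest, lambda p: p.replace("file:///data/", "s3://my_bucket/")
)
```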