Virtual Datasets¶
VirtualiZarr has a small API surface, because most of the complexity is handled by xarray functions like xarray.concat and xarray.merge. Users can use xarray for every step apart from reading and serializing virtual references.
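For example, once two virtual datasets are open, combining them is ordinary xarray. A minimal sketch, assuming vds1 and vds2 are already-opened virtual datasets covering consecutive time ranges:

import xarray as xr

# Concatenating virtual datasets only manipulates chunk references;
# no chunk data is read or copied. coords="minimal" and
# compat="override" stop xarray from trying to compare the values of
# the virtual variables, which cannot be loaded.
combined_vds = xr.concat(
    [vds1, vds2], dim="time", coords="minimal", compat="override"
)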
Reading¶
virtualizarr.open_virtual_dataset ¶
open_virtual_dataset(
url: str,
registry: ObjectStoreRegistry,
parser: Parser,
drop_variables: Iterable[str] | None = None,
loadable_variables: Iterable[str] | None = None,
decode_times: bool | None = None,
) -> Dataset
Open an archival data source as an xarray.Dataset wrapping virtualized zarr arrays.
No data variables will be loaded unless specified in the loadable_variables kwarg (in which case they will open as lazily indexed arrays using xarray's standard lazy indexing classes).
Coordinate variables are loaded by default following xarray's behavior.
Parameters:
- url (str) – The URL of the data source to virtualize. The URL should include a scheme. For example:
  - url="file:///Users/my-name/Documents/my-project/my-data.nc" for a local data source.
  - url="s3://my-bucket/my-project/my-data.nc" for a remote data source on an S3-compatible cloud.
- registry (ObjectStoreRegistry) – An ObjectStoreRegistry for resolving urls and reading data.
- parser (Parser) – A parser to use for the given data source. For example:
  - virtualizarr.parsers.HDFParser for virtualizing NetCDF4 or HDF5 files.
  - virtualizarr.parsers.FITSParser for virtualizing FITS files.
  - virtualizarr.parsers.NetCDF3Parser for virtualizing NetCDF3 files.
  - virtualizarr.parsers.DMRPPParser for virtualizing DMR++ files.
  - virtualizarr.parsers.KerchunkJSONParser for re-opening Kerchunk JSONs.
  - virtualizarr.parsers.KerchunkParquetParser for re-opening Kerchunk Parquets.
  - virtualizarr.parsers.ZarrParser for virtualizing Zarr stores.
  - virtual_tiff.VirtualTIFF for virtualizing TIFFs.
- drop_variables (Iterable[str] | None, default: None) – Variables in the data source to drop before returning.
- loadable_variables (Iterable[str] | None, default: None) – Variables in the data source to load as Dask/NumPy arrays instead of as virtual arrays.
- decode_times (bool | None, default: None) – Bool that is passed into xarray.open_dataset. Allows times to be decoded into datetime objects.
Returns:
- vds – An xarray.Dataset containing virtual chunk references for all variables not included in loadable_variables, and normal lazily indexed arrays for each variable in loadable_variables.
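A minimal sketch of a single-file open, assuming a local NetCDF4 file at the hypothetical path from the url example above (the "file://" registry prefix is assumed to mirror the s3/https prefixes used elsewhere on this page):

from obstore.store import LocalStore
from obspec_utils.registry import ObjectStoreRegistry
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser

# Hypothetical local file, matching the url example above.
url = "file:///Users/my-name/Documents/my-project/my-data.nc"
registry = ObjectStoreRegistry({"file://": LocalStore()})

# Variables listed in loadable_variables are read into memory
# (assuming the file has a "time" variable); everything else stays
# as virtual chunk references.
vds = open_virtual_dataset(
    url=url,
    registry=registry,
    parser=HDFParser(),
    loadable_variables=["time"],
)
print(vds)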
virtualizarr.open_virtual_mfdataset ¶
open_virtual_mfdataset(
urls: str
| PathLike
| Sequence[str | PathLike]
| NestedSequence[str | PathLike],
registry: ObjectStoreRegistry,
parser: Parser,
concat_dim: str
| DataArray
| Index
| Sequence[str]
| Sequence[DataArray]
| Sequence[Index]
| None = None,
compat: "CompatOptions" = "no_conflicts",
preprocess: Callable[[Dataset], Dataset] | None = None,
data_vars: Literal["all", "minimal", "different"] | list[str] = "all",
coords="different",
combine: Literal["by_coords", "nested"] = "by_coords",
parallel: Literal["dask", "lithops", False] | type[Executor] = False,
join: "JoinOptions" = "outer",
attrs_file: str | PathLike | None = None,
combine_attrs: "CombineAttrsOptions" = "override",
**kwargs,
) -> Dataset
Open multiple data sources as a single virtual dataset.
This function is explicitly modelled after xarray.open_mfdataset, and works in the same way.
If combine='by_coords' then the function combine_by_coords is used to combine
the datasets into one before returning the result, and if combine='nested' then
combine_nested is used. The urls must be structured according to which
combining function is used, the details of which are given in the documentation for
combine_by_coords and combine_nested. By default combine='by_coords'
will be used. Global attributes from the attrs_file are used
for the combined dataset.
Parameters:
- urls (str | PathLike | Sequence[str | PathLike] | NestedSequence[str | PathLike]) – Same as in virtualizarr.open_virtual_dataset.
- registry (ObjectStoreRegistry) – An ObjectStoreRegistry for resolving urls and reading data.
- concat_dim (str | DataArray | Index | Sequence[str] | Sequence[DataArray] | Sequence[Index] | None, default: None) – Same as in xarray.open_mfdataset.
- compat (CompatOptions, default: 'no_conflicts') – Same as in xarray.open_mfdataset.
- preprocess (Callable[[Dataset], Dataset] | None, default: None) – Same as in xarray.open_mfdataset.
- data_vars (Literal['all', 'minimal', 'different'] | list[str], default: 'all') – Same as in xarray.open_mfdataset.
- coords – Same as in xarray.open_mfdataset.
- combine (Literal['by_coords', 'nested'], default: 'by_coords') – Same as in xarray.open_mfdataset.
- parallel ("dask", "lithops", False, or a concurrent.futures.Executor subclass, default: False) – Whether the open and preprocess steps will be performed in parallel using lithops, dask.delayed, or any executor compatible with the concurrent.futures interface, or in serial. The default of False executes these steps in serial.
- join (JoinOptions, default: 'outer') – Same as in xarray.open_mfdataset.
- attrs_file (str | PathLike | None, default: None) – Same as in xarray.open_mfdataset.
- combine_attrs (CombineAttrsOptions, default: 'override') – Same as in xarray.open_mfdataset.
- **kwargs (optional) – Additional arguments passed on to virtualizarr.open_virtual_dataset; see its documentation for an overview of the possible options.
Returns:
- vds – An xarray.Dataset containing virtual chunk references for all variables not included in loadable_variables, and normal lazily indexed arrays for each variable in loadable_variables.
Notes
The results of opening each virtual dataset in parallel are sent back to the client process, so they must not be too large. See the docs page on Scaling.
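As a sketch, assuming a hypothetical bucket holding one file per year, each with a shared "time" coordinate:

from obstore.store import S3Store
from obspec_utils.registry import ObjectStoreRegistry
from virtualizarr import open_virtual_mfdataset
from virtualizarr.parsers import HDFParser

# Hypothetical bucket and file names, for illustration only.
store = S3Store("my-bucket", region="us-east-1", skip_signature=True)
registry = ObjectStoreRegistry({"s3://my-bucket/": store})
urls = [f"s3://my-bucket/my-project/data_{year}.nc" for year in (2023, 2024)]

# combine="by_coords" stitches the files along their shared "time"
# coordinate; parallel="dask" would instead open each file in
# parallel via dask.delayed.
vds = open_virtual_mfdataset(
    urls,
    registry=registry,
    parser=HDFParser(),
    combine="by_coords",
)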
virtualizarr.open_virtual_datatree ¶
open_virtual_datatree(
url: str,
registry: ObjectStoreRegistry,
parser: Parser,
*,
loadable_variables: Iterable[str] | None = None,
decode_times: bool | None = None,
) -> DataTree
Open an archival data source as an xarray.DataTree wrapping virtualized zarr arrays.
See the loadable_variables kwarg for a description of which data variables are loaded vs.
virtualized.
Parameters:
- url (str) – The URL of the data source to virtualize. The URL should include a scheme. For example:
  - url="file:///Users/my-name/Documents/my-project/my-data.nc" for a local data source.
  - url="s3://my-bucket/my-project/my-data.nc" for a remote data source on an S3-compatible cloud.
- registry (ObjectStoreRegistry) – An ObjectStoreRegistry for resolving urls and reading data.
- parser (Parser) – A parser to use for the given data source. For example:
  - virtualizarr.parsers.HDFParser for virtualizing NetCDF4 or HDF5 files.
  - virtualizarr.parsers.FITSParser for virtualizing FITS files.
  - virtualizarr.parsers.NetCDF3Parser for virtualizing NetCDF3 files.
  - virtualizarr.parsers.KerchunkJSONParser for re-opening Kerchunk JSONs.
  - virtualizarr.parsers.KerchunkParquetParser for re-opening Kerchunk Parquets.
  - virtualizarr.parsers.ZarrParser for virtualizing Zarr stores.
  - virtual_tiff.VirtualTIFF for virtualizing TIFFs.
- loadable_variables (Iterable[str] | None, default: None) – If None (the default), dimension coordinate variables (1D variables whose name matches their dimension) will be loaded automatically to enable xarray indexing. If an empty iterable, no variables will be loaded. Other options are not yet supported.
- decode_times (bool | None, default: None) – Bool that is passed into xarray.open_dataset. Allows times to be decoded into datetime objects.
Returns:
- vds – An xarray.DataTree containing virtual chunk references for all variables.
Examples:
Virtualize a Cloud Optimized GeoTIFF (COG) using virtual_tiff.VirtualTIFF:
from obstore.store import S3Store
from virtualizarr import open_virtual_datatree
from obspec_utils.registry import ObjectStoreRegistry
from virtual_tiff import VirtualTIFF
# Access a public Sentinel-2 COG from AWS
store = S3Store("sentinel-cogs", region="us-west-2", skip_signature=True)
registry = ObjectStoreRegistry({"s3://sentinel-cogs/": store})
url = "s3://sentinel-cogs/sentinel-s2-l2a-cogs/12/S/UF/2022/6/S2B_12SUF_20220609_0_L2A/B04.tif"
parser = VirtualTIFF(ifd_layout="nested")
with open_virtual_datatree(url=url, parser=parser, registry=registry) as vdt:
print(vdt)
Virtualize a NetCDF4 file using the virtualizarr.parsers.HDFParser:
from obstore.store import HTTPStore
from virtualizarr import open_virtual_datatree
from virtualizarr.parsers import HDFParser
from obspec_utils.registry import ObjectStoreRegistry
base = "https://github.com"
url = f"{base}/pydata/xarray-data/raw/refs/heads/master/precipitation.nc4"
store = HTTPStore(base)
parser = HDFParser()
registry = ObjectStoreRegistry({base: store})
vdt = open_virtual_datatree(url=url, registry=registry, parser=parser)
print(vdt)
Prevent loading variables from any groups (by default the coordinate variables "time", "lat", and "lon" are loaded):
vdt = open_virtual_datatree(
url=url,
registry=registry,
parser=parser,
loadable_variables=[],
)
Drop variables from a specific group after opening:
vdt = open_virtual_datatree(
url=url,
registry=registry,
parser=parser,
)
vdt["/observed"] = vdt["/observed"].to_dataset().drop_vars(["lon"])
Information¶
virtualizarr.accessor.VirtualiZarrDatasetAccessor.nbytes property ¶
nbytes: int
Size required to hold these references in memory in bytes.
Note this is not the size of the referenced chunks if they were actually loaded into memory, this is only the size of the pointers to the chunk locations. If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
In-memory (loadable) variables are included in the total using xarray's normal .nbytes method.
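For example, assuming vds is a virtual dataset opened as above:

# Size of the references themselves (typically kilobytes)...
print(vds.vz.nbytes)
# ...versus the nominal size of the data they point to.
print(vds.nbytes)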
Renaming paths¶
virtualizarr.accessor.VirtualiZarrDatasetAccessor.rename_paths ¶
Rename paths to chunks in every ManifestArray in this dataset.
Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.
Parameters:
- new (str | Callable[[str], str]) – New path to use for all chunks, either as a string, or as a function which accepts the old path and returns the new path.
Returns:
- Dataset – A dataset with the chunk paths renamed.
See Also
virtualizarr.ManifestArray.rename_paths
virtualizarr.ChunkManifest.rename_paths
Examples:
Rename paths to reflect moving the referenced files from local storage to an S3 bucket.
>>> def local_to_s3_url(old_local_path: str) -> str:
...     from pathlib import Path
...
...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
...     filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> ds.vz.rename_paths(local_to_s3_url)
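Since new can also be a plain string, a single path can be applied to every chunk at once, which is useful when all references point into one file (the bucket URL here is illustrative):

>>> ds.vz.rename_paths("s3://my_bucket/my_file.nc")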