V2 Migration Guide¶
VirtualiZarr V2 includes breaking changes and other conceptual differences relative to V1. The goal of this guide is to provide some context around the core changes and demonstrate the updated usage.
Breaking API changes in open_virtual_dataset¶
Filetype identification, parsers, and stores¶
In V1 there was a lot of auto-magic guesswork about filetypes, URLs, and types of remote storage happening under the hood. While this made it easy to get started, it could lead to a lot of foot-guns and unexpected behavior.
For example, the following V1-style usage would guess that your data is in the NetCDF file format and stored in a local file. However, this approach gave people no way to develop their own utilities for other data formats or specific datasets. The guesswork also made it harder for developers to avoid bugs and for users to understand VirtualiZarr's behavior.
from virtualizarr import open_virtual_dataset
vds = open_virtual_dataset("data1.nc")
To provide a more extensible and reliable API, VirtualiZarr V2 requires more explicit configuration by the user. You must now pass a valid Parser and an obspec_utils.registry.ObjectStoreRegistry to virtualizarr.open_virtual_dataset. This change adds a bit more verbosity, but is intended to make virtualizing datasets more robust. Most commonly the ObjectStoreRegistry contains one or more ObjectStores for reading the original data, but some parsers may accept an empty ObjectStoreRegistry.
from obstore.store import S3Store
from obspec_utils.registry import ObjectStoreRegistry
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser
bucket = "nex-gddp-cmip6"
store = S3Store(
    bucket=bucket,
    region="us-west-2",
    # Required for this specific example: the data is in a public bucket,
    # so the S3Store shouldn't fetch and use credentials.
    skip_signature=True,
)
registry = ObjectStoreRegistry({f"s3://{bucket}": store})
parser = HDFParser()
vds = open_virtual_dataset(
    url=f"s3://{bucket}/NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_2015_v2.0.nc",
    registry=registry,
    parser=parser,
)
print(vds)
<xarray.Dataset> Size: 1GB
Dimensions: (time: 365, lat: 600, lon: 1440)
Coordinates:
* time (time) datetime64[ns] 3kB 2015-01-01T12:00:00 ... 2015-12-31T12:...
* lat (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
* lon (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
tasmax (time, lat, lon) float32 1GB ManifestArray<shape=(365, 600, 1440...
Attributes: (12/22)
cmip6_source_id: ACCESS-CM2
cmip6_institution_id: CSIRO-ARCCSS
cmip6_license: CC-BY-SA 4.0
activity: NEX-GDDP-CMIP6
Conventions: CF-1.7
frequency: day
... ...
doi: https://doi.org/10.7917/OFSG3345
external_variables: areacella
contact: Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
creation_date: Sat Nov 16 13:31:18 PST 2024
disclaimer: These data are considered provisional and subject ...
tracking_id: d4b2123b-abf9-4c3c-a780-58df6ce4e67f
from obstore.store import LocalStore
from obspec_utils.registry import ObjectStoreRegistry
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser
from pathlib import Path
store_path = Path.cwd()
file_path = str(store_path / "tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_2015_v2.0.nc")
file_url = f"file://{file_path}"
store = LocalStore(prefix=store_path)
registry = ObjectStoreRegistry({file_url: store})
parser = HDFParser()
vds = open_virtual_dataset(
    url=file_url,
    registry=registry,
    parser=parser,
)
print(vds)
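To make the role of the registry more concrete, here is a toy model of URL-to-store lookup. It is only a sketch of the general idea (match a URL against registered prefixes, then hand back the store plus the remaining path) and not the obspec_utils implementation. ToyRegistry and its resolve method are hypothetical names, plain strings stand in for store objects, and the second bucket is made up.

```python
class ToyRegistry:
    """Hypothetical stand-in illustrating prefix-based store lookup."""

    def __init__(self, stores: dict[str, object]) -> None:
        self._stores = dict(stores)

    def resolve(self, url: str) -> tuple[object, str]:
        """Return the store under the longest matching prefix and the rest of the URL."""
        matches = [
            prefix
            for prefix in self._stores
            if url == prefix or url.startswith(prefix.rstrip("/") + "/")
        ]
        if not matches:
            raise KeyError(f"no store registered for {url!r}")
        prefix = max(matches, key=len)
        remainder = url[len(prefix):].lstrip("/")
        return self._stores[prefix], remainder


registry = ToyRegistry(
    {
        "s3://nex-gddp-cmip6": "public-s3-store",
        "s3://some-other-bucket": "other-s3-store",  # made-up bucket for illustration
    }
)
store, path = registry.resolve("s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/file.nc")
print(store, path)
```

In this model a URL that matches no registered prefix cannot be resolved, which mirrors why the examples above register a key that the URL passed to open_virtual_dataset starts with.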
Deprecation of other kwargs¶
We have removed some keyword arguments to open_virtual_dataset that were deprecated, saw little use, or are now redundant. Specifically:
- indexes: there is little need to control this separately from loadable_variables,
- cftime_variables: this argument is deprecated upstream in favor of decode_times,
- backend: replaced by the parser kwarg,
- virtual_backend_kwargs: replaced by arguments to the parser instance,
- reader_options: replaced by arguments to the ObjectStore instance,
- virtual_array_class: so far has not been needed.
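When migrating a larger codebase, it can help to scan existing call sites for the removed names. The helper below is a minimal sketch of that idea and is not part of VirtualiZarr: REMOVED_KWARGS and find_removed_kwargs are hypothetical names, the hints paraphrase the list above, and the sample V1 kwargs (including the "hdf5" value) are made up.

```python
# Migration hints paraphrased from the list of removed keyword arguments above.
REMOVED_KWARGS = {
    "indexes": "no longer controlled separately from loadable_variables",
    "cftime_variables": "use decode_times",
    "backend": "pass a parser instance via the parser kwarg",
    "virtual_backend_kwargs": "pass arguments to the parser instance",
    "reader_options": "pass arguments to the ObjectStore instance",
    "virtual_array_class": "no replacement; it has not been needed so far",
}


def find_removed_kwargs(kwargs: dict) -> dict[str, str]:
    """Return {name: migration hint} for every removed kwarg present in kwargs."""
    return {name: hint for name, hint in REMOVED_KWARGS.items() if name in kwargs}


# Made-up V1-style keyword arguments, for illustration only.
v1_call_kwargs = {"backend": "hdf5", "reader_options": {}, "loadable_variables": ["time"]}
for name, hint in find_removed_kwargs(v1_call_kwargs).items():
    print(f"{name}: {hint}")
```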
Missing features¶
We have worked hard to ensure that nearly all features from VirtualiZarr V1 are available in V2. To our knowledge, the only functionality regression is the ability to "glob" in virtualizarr.open_virtual_mfdataset. We aim to support this in the future. Please see issue #569 for progress towards this feature.
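Until globbing is supported, one possible workaround for local files is to expand the pattern yourself. The sketch below uses only the Python standard library; expand_local_glob is a hypothetical helper name and "data/*.nc" is a made-up pattern. It assumes that open_virtual_mfdataset accepts an explicit list of URLs alongside a registry and parser.

```python
import glob
from pathlib import Path


def expand_local_glob(pattern: str) -> list[str]:
    """Expand a local glob pattern into a sorted list of file:// URLs."""
    paths = sorted(Path(match).resolve() for match in glob.glob(pattern))
    return [path.as_uri() for path in paths if path.is_file()]


urls = expand_local_glob("data/*.nc")  # made-up pattern; yields [] if nothing matches
print(urls)
# Assumption: this list can be passed to open_virtual_mfdataset in place of a
# glob string, together with a registry and a parser.
```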
Xarray accessor name¶
In VirtualiZarr V2 you should use the shorthand .vz accessor for Xarray operations. The previous accessor name, virtualize, is still available but emits a DeprecationWarning. It may be removed entirely in the future. Here is an example of using the new accessor name:
vds.vz.to_icechunk(icechunk_store)
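One way to find leftover uses of the old accessor name is to turn DeprecationWarning into an error, for example while running a test suite. The sketch below demonstrates the mechanism with Python's standard warnings module. ToyDataset is a hypothetical stand-in with the same two accessor names, not a VirtualiZarr or Xarray class, and the warning message is made up.

```python
import warnings


class ToyDataset:
    """Hypothetical stand-in for a dataset exposing both accessor names."""

    @property
    def vz(self) -> str:
        return "accessor"

    @property
    def virtualize(self) -> str:
        # The old name keeps working but warns, mirroring the behavior described above.
        warnings.warn("use .vz instead of .virtualize", DeprecationWarning, stacklevel=2)
        return self.vz


ds = ToyDataset()
with warnings.catch_warnings():
    # Escalate deprecation warnings to exceptions inside this block only.
    warnings.simplefilter("error", DeprecationWarning)
    try:
        ds.virtualize
    except DeprecationWarning as exc:
        print(f"found deprecated usage: {exc}")
    ds.vz  # the new name raises nothing
```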
New functionality¶
Reading chunks without writing to disk¶
In VirtualiZarr V1, if you wanted to access the underlying chunks of a dataset, you first had to write the references to disk. From there you could read those references back into Xarray and access the chunks like you would with a normal Xarray dataset.
In V2 you can now directly read the chunks from a Parser into Xarray without writing them to disk first. 🤯
Each Parser is now responsible for creating a ManifestStore, and the ManifestStore can fetch data through any ObjectStore in the ObjectStoreRegistry. This means you can load data using the ManifestStore via either Zarr or Xarray. Here's an example using Xarray:
import xarray as xr
from obstore.store import S3Store
from obspec_utils.registry import ObjectStoreRegistry
from virtualizarr.parsers import HDFParser
bucket = "nex-gddp-cmip6"
store = S3Store(
    bucket=bucket,
    region="us-west-2",
    skip_signature=True,
)
registry = ObjectStoreRegistry({f"s3://{bucket}": store})
parser = HDFParser()
manifest_store = parser(
    url=f"s3://{bucket}/NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_2015_v2.0.nc",
    registry=registry,
)
loadable_ds = xr.open_zarr(
    manifest_store,
    consolidated=False,
    zarr_format=3,
)
print(loadable_ds)
<xarray.Dataset> Size: 1GB
Dimensions: (time: 365, lat: 600, lon: 1440)
Coordinates:
* time (time) datetime64[ns] 3kB 2015-01-01T12:00:00 ... 2015-12-31T12:...
* lat (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
* lon (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
tasmax (time, lat, lon) float32 1GB dask.array<chunksize=(1, 600, 1440), meta=np.ndarray>
Attributes: (12/22)
cmip6_source_id: ACCESS-CM2
cmip6_institution_id: CSIRO-ARCCSS
cmip6_license: CC-BY-SA 4.0
activity: NEX-GDDP-CMIP6
Conventions: CF-1.7
frequency: day
... ...
doi: https://doi.org/10.7917/OFSG3345
external_variables: areacella
contact: Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
creation_date: Sat Nov 16 13:31:18 PST 2024
disclaimer: These data are considered provisional and subject ...
tracking_id: d4b2123b-abf9-4c3c-a780-58df6ce4e67f
Note how the Xarray dataset contains loadable Dask arrays rather than manifest arrays.
Bring your own parser¶
The V2 API means that you can develop a parser completely independently of the VirtualiZarr library while still using VirtualiZarr's data structures and Xarray's functionality for merging and combining datasets! Virtual-Tiff and the hrrr-parser are examples of this pattern. For instructions on how to write a parser, see the Custom Parsers page.
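As a rough illustration of the shape a custom parser takes, the sketch below follows the call pattern used earlier in this guide, parser(url=..., registry=...), which returns a store object. Everything else is an assumption made for illustration: ToyManifestStore, ParserLike, and SingleChunkParser are hypothetical names, the chunk key layout and the (path, offset, length) tuple are simplifications, and the URL is made up. Refer to the Custom Parsers page for the real interface.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ToyManifestStore:
    """Hypothetical stand-in for the store a real parser would return."""

    url: str
    # chunk key -> (path, byte offset, byte length); a simplified assumption
    chunk_refs: dict[str, tuple[str, int, int]] = field(default_factory=dict)


class ParserLike(Protocol):
    """Call shape taken from the example above: parser(url=..., registry=...)."""

    def __call__(self, url: str, registry: object) -> ToyManifestStore: ...


class SingleChunkParser:
    """Toy parser that describes a whole file as one chunk of one array."""

    def __init__(self, variable: str, nbytes: int) -> None:
        self.variable = variable
        self.nbytes = nbytes

    def __call__(self, url: str, registry: object) -> ToyManifestStore:
        # A real parser would read file metadata through a store from the registry;
        # here the byte range is supplied by hand.
        refs = {f"{self.variable}/c/0": (url, 0, self.nbytes)}
        return ToyManifestStore(url=url, chunk_refs=refs)


parser: ParserLike = SingleChunkParser(variable="tasmax", nbytes=1024)
store = parser(url="s3://example-bucket/file.bin", registry=None)  # made-up URL
print(store.chunk_refs)
```

The point of the sketch is that a parser only needs to be a callable with the right signature, so it can live in its own package.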