V2 Migration Guide

VirtualiZarr V2 includes breaking changes and other conceptual differences relative to V1. The goal of this guide is to provide some context around the core changes and demonstrate the updated usage.

Breaking API changes in `open_virtual_dataset`

Filetype identification, parsers, and stores

In V1, a lot of automagic guesswork about filetypes, URLs, and types of remote storage happened under the hood. While this made it easy to get started, it could lead to foot-guns and unexpected behavior.

For example, the following V1-style usage would guess that your data is in a NetCDF file format and that it is stored in a local file. However, this did not provide a way for people to develop their own utilities for data formats or specific datasets. This guesswork also made it harder for developers to avoid bugs and for users to understand VirtualiZarr's behavior.

from virtualizarr import open_virtual_dataset
vds = open_virtual_dataset("data1.nc")

To provide a more extensible and reliable API, VirtualiZarr V2 requires more explicit configuration by the user. You must now pass a valid Parser and an obspec_utils.registry.ObjectStoreRegistry to virtualizarr.open_virtual_dataset. This change adds a bit more verbosity, but is intended to make virtualizing datasets more robust. The ObjectStoreRegistry most commonly contains one or more ObjectStores for reading the original data, but some parsers may accept an empty ObjectStoreRegistry.

S3 Store (first example) and Local Store (second example):

from obstore.store import S3Store
from obspec_utils.registry import ObjectStoreRegistry

from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser

bucket = "nex-gddp-cmip6"
store = S3Store(
    bucket=bucket,
    region="us-west-2",
    skip_signature=True,  # public bucket: the S3Store should not fetch or use credentials
)
registry = ObjectStoreRegistry({f"s3://{bucket}": store})
parser = HDFParser()
vds = open_virtual_dataset(
    url=f"s3://{bucket}/NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_2015_v2.0.nc",
    registry=registry,
    parser=parser
)
print(vds)

<xarray.Dataset> Size: 1GB
Dimensions:  (time: 365, lat: 600, lon: 1440)
Coordinates:
  * time     (time) datetime64[ns] 3kB 2015-01-01T12:00:00 ... 2015-12-31T12:...
  * lat      (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
    tasmax   (time, lat, lon) float32 1GB ManifestArray<shape=(365, 600, 1440...
Attributes: (12/22)
    cmip6_source_id:       ACCESS-CM2
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0
    activity:              NEX-GDDP-CMIP6
    Conventions:           CF-1.7
    frequency:             day
    ...                    ...
    doi:                   https://doi.org/10.7917/OFSG3345
    external_variables:    areacella
    contact:               Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
    creation_date:         Sat Nov 16 13:31:18 PST 2024
    disclaimer:            These data are considered provisional and subject ...
    tracking_id:           d4b2123b-abf9-4c3c-a780-58df6ce4e67f

from obstore.store import LocalStore
from obspec_utils.registry import ObjectStoreRegistry

from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser

from pathlib import Path

store_path = Path.cwd()
file_path = str(store_path / "tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_2015_v2.0.nc")
file_url = f"file://{file_path}"

store = LocalStore(prefix=store_path)
registry = ObjectStoreRegistry({file_url: store})
parser = HDFParser()

vds = open_virtual_dataset(
    url=file_url,
    registry=registry,
    parser=parser
)
print(vds)
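
To make the role of the registry more concrete, here is a toy model of how a URL might be matched to a registered store. This is only a sketch of the idea: `ToyRegistry`, its `resolve` method, and the longest-prefix rule are assumptions made for illustration and are not the obspec_utils implementation. The stored value is a plain string standing in for a real store object.

```python
from typing import Any


class ToyRegistry:
    """Hypothetical stand-in for a store registry (not obspec_utils code)."""

    def __init__(self, stores: dict[str, Any]) -> None:
        # Normalise prefixes so "s3://bucket" and "s3://bucket/" behave the same.
        self._stores = {prefix.rstrip("/"): store for prefix, store in stores.items()}

    def resolve(self, url: str) -> tuple[Any, str]:
        """Return the store with the longest matching prefix and the remaining path."""
        matches = [
            prefix
            for prefix in self._stores
            if url == prefix or url.startswith(prefix + "/")
        ]
        if not matches:
            raise ValueError(f"no store registered for {url!r}")
        prefix = max(matches, key=len)
        return self._stores[prefix], url[len(prefix):].lstrip("/")


# A string is used in place of a real store object for this illustration.
toy_registry = ToyRegistry({"s3://nex-gddp-cmip6": "public-s3-store"})
toy_store, toy_path = toy_registry.resolve("s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/file.nc")
print(toy_store, toy_path)
```

The point of the sketch is that the URL passed to open_virtual_dataset has to fall under a prefix that the registry knows about; a URL for an unregistered bucket would fail to resolve.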

Deprecation of other kwargs

We have removed some keyword arguments to open_virtual_dataset that were deprecated, saw little use, or are now redundant. Specifically:

indexes - there is little need to control this separately from loadable_variables,
cftime_variables - this argument is deprecated upstream in favor of decode_times,
backend - replaced by the parser kwarg,
virtual_backend_kwargs - replaced by arguments to the parser instance,
reader_options - replaced by arguments to the ObjectStore instance,
virtual_array_class - so far has not been needed.
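
When migrating a larger codebase, it can help to scan existing call sites for the removed keywords. The helper below is a minimal sketch of that idea: the mapping is transcribed from the list above, while `REMOVED_KWARGS`, `find_removed_kwargs`, and the example call are hypothetical names introduced here and are not part of VirtualiZarr.

```python
# Removed keyword -> migration hint, transcribed from the list above.
REMOVED_KWARGS = {
    "indexes": "usually covered by loadable_variables",
    "cftime_variables": "use decode_times",
    "backend": "pass a parser instance via the parser kwarg",
    "virtual_backend_kwargs": "pass arguments to the parser instance",
    "reader_options": "pass arguments to the ObjectStore instance",
    "virtual_array_class": "no replacement; so far has not been needed",
}


def find_removed_kwargs(kwargs: dict) -> dict[str, str]:
    """Return the removed V1 keywords present in kwargs, with migration hints."""
    return {name: hint for name, hint in REMOVED_KWARGS.items() if name in kwargs}


# Keyword arguments of a hypothetical V1-style call; the values are placeholders.
v1_call = {"backend": None, "reader_options": {}, "loadable_variables": ["time"]}
for name, hint in find_removed_kwargs(v1_call).items():
    print(f"{name}: {hint}")
```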

Missing features

We have worked hard to ensure that nearly all features from VirtualiZarr V1 are available in V2. To our knowledge, the only functionality regression is the ability to "glob" in virtualizarr.open_virtual_mfdataset. We aim to support this in the future. Please see issue #569 for progress towards this feature.
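
Until globbing is supported, one possible workaround for local files is to expand the pattern yourself and pass the resulting URLs explicitly. The sketch below uses only the standard library; `expand_local_glob` is a hypothetical helper, and it is an assumption that open_virtual_mfdataset accepts an explicit list of URLs in place of a glob string.

```python
from pathlib import Path


def expand_local_glob(directory: str, pattern: str) -> list[str]:
    """Expand a glob pattern under a local directory into sorted file:// URLs."""
    base = Path(directory).resolve()
    return [p.as_uri() for p in sorted(base.glob(pattern)) if p.is_file()]


# Assumption: this list could be passed to open_virtual_mfdataset together
# with a registry and a parser, instead of a glob string.
urls = expand_local_glob(".", "*.nc")
print(urls)
```

Sorting the matches keeps the file order deterministic, which matters when the files are later combined along a dimension such as time.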

Xarray accessor name

In VirtualiZarr V2 you should use the shorthand .vz accessor for Xarray operations. The previous accessor name virtualize is still available but will raise a DeprecationWarning. It may be removed entirely in the future. Here is an example of using the new accessor name:

vds.vz.to_icechunk(icechunk_store)
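
To track down remaining uses of the old accessor name, one option is to escalate DeprecationWarning to an error while running your own code or tests. The sketch below shows the general technique with the standard library only; `run_strict` and `legacy_accessor_call` are hypothetical names, and the latter merely stands in for code that still uses the old accessor.

```python
import warnings


def run_strict(func, *args, **kwargs):
    """Call func with DeprecationWarning escalated to an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", DeprecationWarning)
        return func(*args, **kwargs)


def legacy_accessor_call():
    """Hypothetical stand-in for code that still uses the old accessor name."""
    warnings.warn("use .vz instead of .virtualize", DeprecationWarning, stacklevel=2)
    return "ok"


try:
    run_strict(legacy_accessor_call)
except DeprecationWarning as exc:
    print(f"found deprecated usage: {exc}")
```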

New functionality

Reading chunks without writing to disk

In VirtualiZarr V1, if you wanted to access the underlying chunks of a dataset, you first had to write the references to disk. From there you could read those references back into Xarray and access the chunks as you would with a normal Xarray dataset.

In V2 you can now read the chunks from a Parser directly into Xarray without writing references to disk first. 🤯 This works because each Parser is now responsible for creating a ManifestStore, and the ManifestStore can fetch data through any ObjectStore in the ObjectStoreRegistry. You can load data using the ManifestStore via either Zarr or Xarray. Here's an example using Xarray:

import xarray as xr
from obstore.store import S3Store
from obspec_utils.registry import ObjectStoreRegistry

from virtualizarr.parsers import HDFParser

bucket = "nex-gddp-cmip6"
store = S3Store(
    bucket=bucket,
    region="us-west-2",
    skip_signature=True
)
registry = ObjectStoreRegistry({f"s3://{bucket}": store})
parser = HDFParser()
manifest_store = parser(
    url=f"s3://{bucket}/NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_2015_v2.0.nc",
    registry=registry
)
loadable_ds = xr.open_zarr(
    manifest_store,
    consolidated=False,
    zarr_format=3,
)
print(loadable_ds)

<xarray.Dataset> Size: 1GB
Dimensions:  (time: 365, lat: 600, lon: 1440)
Coordinates:
  * time     (time) datetime64[ns] 3kB 2015-01-01T12:00:00 ... 2015-12-31T12:...
  * lat      (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
    tasmax   (time, lat, lon) float32 1GB dask.array<chunksize=(1, 600, 1440), meta=np.ndarray>
Attributes: (12/22)
    cmip6_source_id:       ACCESS-CM2
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0
    activity:              NEX-GDDP-CMIP6
    Conventions:           CF-1.7
    frequency:             day
    ...                    ...
    doi:                   https://doi.org/10.7917/OFSG3345
    external_variables:    areacella
    contact:               Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
    creation_date:         Sat Nov 16 13:31:18 PST 2024
    disclaimer:            These data are considered provisional and subject ...
    tracking_id:           d4b2123b-abf9-4c3c-a780-58df6ce4e67f

Note how the Xarray dataset contains loadable Dask arrays rather than manifest arrays.
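
For the Zarr route mentioned earlier, a minimal sketch might look like the following. It assumes zarr-python 3 is installed and that the ManifestStore can be passed directly as the `store` argument of `zarr.open_group`; the helper name `open_manifest_store_with_zarr` is hypothetical.

```python
def open_manifest_store_with_zarr(manifest_store):
    """Open a ManifestStore as a read-only Zarr v3 group (sketch)."""
    # Imported lazily so the helper can be defined without zarr installed.
    import zarr

    return zarr.open_group(store=manifest_store, mode="r", zarr_format=3)


# Usage, reusing manifest_store from the Xarray example:
# group = open_manifest_store_with_zarr(manifest_store)
# print(group["tasmax"].shape)
```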

Bring your own parser

The V2 API means that parsers developed completely independently of the VirtualiZarr library can use VirtualiZarr's data structures and xarray's functionality for merging and combining datasets. Virtual-Tiff and the hrrr-parser are examples of this pattern. For instructions on writing a parser, see the Custom Parsers page.
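
As a rough illustration of the call shape used throughout this guide (`parser(url=..., registry=...)`), the sketch below describes it with a typing Protocol. `ParserLike` and `EchoParser` are hypothetical names for this example only; the authoritative parser interface and the construction of a real ManifestStore are covered in the Custom Parsers page.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ParserLike(Protocol):
    """Hypothetical description of the parser call shape used in this guide."""

    def __call__(self, url: str, registry: Any) -> Any:
        """Return a store-like object describing the data at url."""
        ...


class EchoParser:
    """Toy parser that records its inputs instead of reading a file."""

    def __call__(self, url: str, registry: Any) -> dict[str, Any]:
        # A real parser would build and return a ManifestStore here.
        return {"url": url, "registry": registry}


toy_parser = EchoParser()
toy_result = toy_parser(url="file:///data/example.bin", registry=None)
print(isinstance(toy_parser, ParserLike), toy_result["url"])
```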