
VirtualiZarr

Create virtual Zarr stores for cloud-friendly access to archival data, using familiar xarray syntax.

The best way to distribute large scientific datasets is via the Cloud, in Cloud-Optimized formats. But often this data is stuck in archival pre-Cloud file formats such as netCDF.

VirtualiZarr makes it easy to create "virtual" Zarr stores, allowing performant access to archival data as if it were in the Cloud-Optimized Zarr format, without duplicating any data.

Motivation

"Virtualized data" solves an incredibly important problem: accessing big archival datasets via a cloud-optimized pattern, but without copying or modifying the original data in any way. This is a win-win-win for users, data engineers, and data providers. Users see fast-opening zarr-compliant stores that work performantly with libraries like xarray and dask, data engineers can provide this speed by adding a lightweight virtualization layer on top of existing data (without having to ask anyone's permission), and data providers don't have to change anything about their archival files for them to be used in a cloud-optimized way.

VirtualiZarr aims to make creating cloud-optimized, virtualized Zarr data from existing scientific data as easy as possible.

Features

Inspired by Kerchunk

VirtualiZarr grew out of discussions on the Kerchunk repository, and is an attempt to provide the game-changing power of Kerchunk in a Zarr-native way, with a familiar array-like API.

You now have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides almost all the same features as Kerchunk.

Quick usage example

Creating a virtual dataset looks quite similar to how we normally open data with xarray, but there are a few notable differences, which this example walks through.

First, import the necessary functions and classes:

import icechunk
import obstore

from obspec_utils.registry import ObjectStoreRegistry

from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.parsers import HDFParser

Zarr can emit a lot of warnings about Numcodecs codecs not yet being included in the Zarr version 3 specification; let's suppress those.

import warnings
warnings.filterwarnings(
    "ignore",
    message="Numcodecs codecs are not in the Zarr version 3 specification*",
    category=UserWarning,
)

We can use Obstore's obstore.store.from_url convenience function to create an ObjectStore that can fetch data from the specified URL.

bucket = "s3://nex-gddp-cmip6"
path = "NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_2015_v2.0.nc"
store = obstore.store.from_url(bucket, region="us-west-2", skip_signature=True)

We also need to create an ObjectStoreRegistry that maps the URL structure to the ObjectStore.

registry = ObjectStoreRegistry({bucket: store})

Now, let's instantiate a parser and create a virtual dataset by passing the URL, parser, and registry to virtualizarr.open_virtual_dataset.

parser = HDFParser()
vds = open_virtual_dataset(
    url=f"{bucket}/{path}",
    parser=parser,
    registry=registry,
    loadable_variables=[],
)
print(vds)
<xarray.Dataset> Size: 1GB
Dimensions:  (time: 365, lat: 600, lon: 1440)
Coordinates:
    time     (time) float64 3kB ManifestArray<shape=(365,), dtype=float64, ch...
    lat      (lat) float64 5kB ManifestArray<shape=(600,), dtype=float64, chu...
    lon      (lon) float64 12kB ManifestArray<shape=(1440,), dtype=float64, c...
Data variables:
    tasmax   (time, lat, lon) float32 1GB ManifestArray<shape=(365, 600, 1440...
Attributes: (12/22)
    cmip6_source_id:       ACCESS-CM2
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0
    activity:              NEX-GDDP-CMIP6
    Conventions:           CF-1.7
    frequency:             day
    ...                    ...
    doi:                   https://doi.org/10.7917/OFSG3345
    external_variables:    areacella
    contact:               Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
    creation_date:         Sat Nov 16 13:31:18 PST 2024
    disclaimer:            These data are considered provisional and subject ...
    tracking_id:           d4b2123b-abf9-4c3c-a780-58df6ce4e67f

Since we specified loadable_variables=[], no data has been loaded or copied in this process. We have merely created an in-memory lookup table pointing to the locations of chunks in the original netCDF file, to be fetched when the data is actually needed. The default behavior (loadable_variables=None) loads the data associated with coordinates but not with data variables. The size shown above is that of the original dataset; you can see the size of the virtual dataset itself using the vz accessor:

print(f"Original dataset size: {vds.nbytes} bytes")
print(f"Virtual dataset size: {vds.vz.nbytes} bytes")
Original dataset size: 1261459240 bytes
Virtual dataset size: 11776 bytes
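For comparison, here is a minimal sketch of reopening the same file with the default loadable_variables=None, reusing the bucket, path, parser, and registry from above. The coordinate variables should come back loaded into memory, while tasmax remains a virtual ManifestArray:

vds_loaded = open_virtual_dataset(
    url=f"{bucket}/{path}",
    parser=parser,
    registry=registry,
    # loadable_variables=None (the default) loads coordinate data eagerly
)
# tasmax should still be a virtual ManifestArray ...
print(type(vds_loaded["tasmax"].data))
# ... while coordinates like lat should now be ordinary in-memory arrays
print(type(vds_loaded["lat"].data))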

VirtualiZarr's other top-level function is virtualizarr.open_virtual_mfdataset, which can open and virtualize multiple data sources into a single virtual dataset, similar to how xarray.open_mfdataset opens multiple data files as a single dataset.

urls = [f"s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_{year}_v2.0.nc" for year in range(2015, 2017)]
vds = open_virtual_mfdataset(urls, parser=parser, registry=registry)
print(vds)
<xarray.Dataset> Size: 3GB
Dimensions:  (time: 731, lat: 600, lon: 1440)
Coordinates:
  * time     (time) datetime64[ns] 6kB 2015-01-01T12:00:00 ... 2016-12-31T12:...
  * lat      (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
    tasmax   (time, lat, lon) float32 3GB ManifestArray<shape=(731, 600, 1440...
Attributes: (12/22)
    cmip6_source_id:       ACCESS-CM2
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0
    activity:              NEX-GDDP-CMIP6
    Conventions:           CF-1.7
    frequency:             day
    ...                    ...
    doi:                   https://doi.org/10.7917/OFSG3345
    external_variables:    areacella
    contact:               Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
    creation_date:         Sat Nov 16 13:31:18 PST 2024
    disclaimer:            These data are considered provisional and subject ...
    tracking_id:           d4b2123b-abf9-4c3c-a780-58df6ce4e67f

The magic of VirtualiZarr is that you can persist the virtual dataset to disk in a chunk reference format such as Icechunk, meaning that the work of constructing the single coherent dataset only needs to happen once. For subsequent data access, you can use xarray.open_zarr to open that Icechunk store (as shown below), which on object storage is far faster than using xarray.open_mfdataset to open the original non-cloud-optimized files.

Let's persist the virtual dataset using Icechunk. First, let's create an Icechunk configuration with permissions to access our data.

config = icechunk.RepositoryConfig.default()
container = icechunk.VirtualChunkContainer(
    url_prefix="s3://nex-gddp-cmip6/",
    store=icechunk.s3_store(region="us-west-2", anonymous=True),
)
config.set_virtual_chunk_container(container)

Now we can store the references to our data. Here we keep the references in an Icechunk store that lives only in memory, but in most cases you'll store the "virtual" Icechunk store in the cloud.

icechunk_store = icechunk.in_memory_storage()
repo = icechunk.Repository.create(icechunk_store, config)
session = repo.writable_session("main")
vds.vz.to_icechunk(session.store)
session.commit("Create virtual store")
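To read the data back later, you can reopen the committed snapshot with xarray. A minimal sketch, assuming the repo object from above; chunk bytes are fetched on demand from the original netCDF files:

import xarray as xr

# Open the latest committed snapshot on the "main" branch, read-only
session = repo.readonly_session("main")

# Icechunk stores have no consolidated metadata, so pass consolidated=False
ds = xr.open_zarr(session.store, consolidated=False)
print(ds)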

See the Usage docs page for more details.

Talks and Presentations

  • 2025/04/30 - Cloud-Native Geospatial Forum - Tom Nicholas - Slides / Recording
  • 2024/11/21 - MET Office Architecture Guild - Tom Nicholas - Slides
  • 2024/11/13 - Cloud-Native Geospatial conference - Raphael Hagen - Slides
  • 2024/07/24 - ESIP Meeting - Sean Harkins - Event / Recording
  • 2024/05/15 - Pangeo showcase - Tom Nicholas - Event / Recording / Slides

Credits

This package was originally developed by Tom Nicholas whilst working at [C]Worthy, who deserve credit for allowing him to prioritise a generalizable open-source solution to the dataset virtualization problem. VirtualiZarr is now a community-owned multi-stakeholder project.

Licence

Apache 2.0
