
Usage

This page explains how to use VirtualiZarr. Review the Data Structures documentation if you want to understand the conceptual models underpinning VirtualiZarr.

Opening files as virtual datasets

VirtualiZarr is for creating and manipulating "virtual" references to pre-existing data stored in the cloud or on disk in a variety of formats, by representing it in terms of the Zarr data model of chunked N-dimensional arrays.

The first step to virtualizing data is to create an ObjectStore instance that can access your data. Available ObjectStores are described in the obstore docs.

AWS S3

Note

Here, we use skip_signature=True because the data is public. We also need to set the cloud region for any data stored in AWS (this isn't required for all S3-compatible clouds).

import xarray as xr
from obstore.store import from_url

from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.parsers import HDFParser
from obspec_utils.registry import ObjectStoreRegistry

bucket = "s3://nex-gddp-cmip6"
path = "NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_2015_v2.0.nc"
url = f"{bucket}/{path}"
store = from_url(bucket, region="us-west-2", skip_signature=True)
registry = ObjectStoreRegistry({bucket: store})
Google Cloud Storage

import xarray as xr
from obstore.store import from_url

from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.parsers import HDFParser
from obspec_utils.registry import ObjectStoreRegistry

bucket = "gs://data-bucket"
path = "file-path/data.nc"
url = f"{bucket}/{path}"
store = from_url(bucket)
registry = ObjectStoreRegistry({bucket: store})
Azure Blob Storage

import xarray as xr
from obspec_utils.registry import ObjectStoreRegistry
from obstore.store import from_url

from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.parsers import HDFParser

bucket = "abfs://data-container"
path = "file-path/data.nc"
url = f"{bucket}/{path}"
store = from_url(bucket)
registry = ObjectStoreRegistry({bucket: store})
Cloudflare R2

import xarray as xr
from obstore.store import S3Store
from obspec_utils.registry import ObjectStoreRegistry

from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.parsers import HDFParser

endpoint = "https://f0b62eebfbdde1133378bfe3958325f6.r2.cloudflarestorage.com"
access_key_id = "<access_key_id>"
secret_access_key = "<secret_access_key>"
path = "<path_to_files>"
scheme = "s3://"
bucket_name = "<bucket_name>"
bucket = f"{scheme}{bucket_name}"

# Option 1: anonymous access to a public bucket
store = S3Store.from_url(bucket, endpoint=endpoint, skip_signature=True)

# Option 2: AWS-style credentials
store = S3Store.from_url(bucket, endpoint=endpoint, access_key_id=access_key_id, secret_access_key=secret_access_key)

registry = ObjectStoreRegistry({bucket: store})
HTTP

import xarray as xr
from obstore.store import from_url
from obspec_utils.registry import ObjectStoreRegistry

from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.parsers import HDFParser

# This example uses a NetCDF file of CMIP6 data from ESGF.
bucket = 'https://esgf-data.ucar.edu'
path = 'thredds/fileServer/esg_dataroot/CMIP6/CMIP/NCAR/CESM2/historical/r3i1p1f1/day/tas/gn/v20190308/tas_day_CESM2_historical_r3i1p1f1_gn_19200101-19291231.nc'
store = from_url(bucket)
registry = ObjectStoreRegistry({bucket: store})
Open Storage Network (OSN)

import xarray as xr
from obstore.store import S3Store
from obspec_utils.registry import ObjectStoreRegistry

from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.parsers import HDFParser

endpoint = "https://nyu1.osn.mghpcc.org"
access_key_id = "<access_key_id>"
secret_access_key = "<secret_access_key>"
path = "<path_to_files>"
scheme = "s3://"
bucket_name = "<bucket_name>"
bucket = f"{scheme}{bucket_name}"

# Option 1: anonymous access to a public bucket
store = S3Store.from_url(bucket, endpoint=endpoint, skip_signature=True)

# Option 2: AWS-style credentials
store = S3Store.from_url(bucket, endpoint=endpoint, access_key_id=access_key_id, secret_access_key=secret_access_key)

registry = ObjectStoreRegistry({bucket: store})
Local filesystem

import xarray as xr
from obstore.store import LocalStore
from obspec_utils.registry import ObjectStoreRegistry

from virtualizarr import open_virtual_dataset, open_virtual_mfdataset
from virtualizarr.parsers import HDFParser

from pathlib import Path

store_path = Path.cwd()
file_path = str(store_path / "data.nc")
file_url = f"file://{file_path}"

store = LocalStore(prefix=store_path)
registry = ObjectStoreRegistry({file_url: store})

Zarr can emit a lot of warnings about Numcodecs codecs not yet being included in the Zarr version 3 specification -- let's suppress those.

import warnings
warnings.filterwarnings(
  "ignore",
  message="Numcodecs codecs are not in the Zarr version 3 specification*",
  category=UserWarning
)

We can open a virtual representation of this file using virtualizarr.open_virtual_dataset. VirtualiZarr has various "parsers" that understand different file formats. You must supply a parser; as all netCDF4 files are valid HDF5 files, here we use the HDFParser.

parser = HDFParser()
vds = open_virtual_dataset(
  url=f"{bucket}/{path}",
  parser=parser,
  registry=registry,
)

Important

It is good practice to use open_virtual_dataset as a context manager to automatically close file handles. For example, the above code would become:

with open_virtual_dataset(url=url, registry=registry, parser=parser) as vds:
    # do things with vds
    ...

This avoids accumulating open file handles and resource leaks, so it is recommended for production code. However, we omit the context managers from the examples in this documentation for brevity.

Printing this "virtual dataset" shows that although it is an instance of xarray.Dataset, unlike a typical xarray dataset, it wraps virtualizarr.manifests.ManifestArray objects in addition to a few in-memory NumPy arrays. You can learn more about the ManifestArray class in the Data Structures documentation.

print(vds)
<xarray.Dataset> Size: 1GB
Dimensions:  (time: 365, lat: 600, lon: 1440)
Coordinates:
  * time     (time) datetime64[ns] 3kB 2015-01-01T12:00:00 ... 2015-12-31T12:...
  * lat      (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
    tasmax   (time, lat, lon) float32 1GB ManifestArray<shape=(365, 600, 1440...
Attributes: (12/22)
    cmip6_source_id:       ACCESS-CM2
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0
    activity:              NEX-GDDP-CMIP6
    Conventions:           CF-1.7
    frequency:             day
    ...                    ...
    doi:                   https://doi.org/10.7917/OFSG3345
    external_variables:    areacella
    contact:               Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
    creation_date:         Sat Nov 16 13:31:18 PST 2024
    disclaimer:            These data are considered provisional and subject ...
    tracking_id:           d4b2123b-abf9-4c3c-a780-58df6ce4e67f

Generally a "virtual dataset" is any xarray.Dataset which wraps one or more virtualizarr.manifests.ManifestArray objects.

These particular virtualizarr.manifests.ManifestArray objects are each a virtual reference to some data in the source NetCDF file, with the references stored in the form of ChunkManifests.

As the manifest contains only addresses at which to find large binary chunks, the virtual dataset takes up far less space in memory than the original dataset does:

print(vds.nbytes)
1261459240
print(vds.vz.nbytes)
30920

Important

Virtual datasets are not normal xarray datasets!

Although the top-level type is still xarray.Dataset, they are intended only as an abstract representation of a set of data files, not as something you can do analysis with. If you try to load, view, or plot any data you will get a NotImplementedError. Virtual datasets only support a very limited subset of normal xarray operations, primarily functions and methods for concatenating, merging, and extracting variables, as well as operations for renaming dimensions and variables.

The only use case for a virtual dataset is combining references to files before writing out those references to disk.

Loading variables

Once a virtual dataset is created, you won't be able to load the values of the virtual variables into memory. Instead, you can load specific variables during virtual dataset creation using the loadable_variables parameter. Loading variables during creation has several benefits, detailed in the FAQ.

vds = open_virtual_dataset(
    url=url,
    registry=registry,
    parser=parser,
    loadable_variables=['time']
)
print(vds)
<xarray.Dataset> Size: 1GB
Dimensions:  (time: 365, lat: 600, lon: 1440)
Coordinates:
  * time     (time) datetime64[ns] 3kB 2015-01-01T12:00:00 ... 2015-12-31T12:...
    lat      (lat) float64 5kB ManifestArray<shape=(600,), dtype=float64, chu...
    lon      (lon) float64 12kB ManifestArray<shape=(1440,), dtype=float64, c...
Data variables:
    tasmax   (time, lat, lon) float32 1GB ManifestArray<shape=(365, 600, 1440...
Attributes: (12/22)
    cmip6_source_id:       ACCESS-CM2
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0
    activity:              NEX-GDDP-CMIP6
    Conventions:           CF-1.7
    frequency:             day
    ...                    ...
    doi:                   https://doi.org/10.7917/OFSG3345
    external_variables:    areacella
    contact:               Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
    creation_date:         Sat Nov 16 13:31:18 PST 2024
    disclaimer:            These data are considered provisional and subject ...
    tracking_id:           d4b2123b-abf9-4c3c-a780-58df6ce4e67f

You can see that the dataset contains a mixture of virtual variables backed by ManifestArray objects (tasmax, lat, and lon), and loadable variables backed by (lazy) numpy arrays (time).

The default value of loadable_variables is None, which effectively specifies all the "dimension coordinates" in the file, i.e. all one-dimensional coordinate variables whose name is the same as the name of their dimensions. Xarray indexes will also be automatically created for these variables. Together these defaults mean that your virtual dataset will be opened with the same indexes as it would have been if it had been opened with just xarray.open_dataset.
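To illustrate that default rule, here is a small sketch. The helper below is not part of VirtualiZarr; it just restates the rule ("1D variables named after their own dimension") in plain Python, using the variables from the example file above.

```python
# Illustrative helper (not part of VirtualiZarr): the default
# loadable_variables are the 1D variables whose name matches their dimension.
def default_loadable_variables(variables: dict[str, tuple[str, ...]]) -> list[str]:
    """`variables` maps each variable name to its dimension names."""
    return [
        name
        for name, dims in variables.items()
        if len(dims) == 1 and dims[0] == name
    ]

# The variables in the example NetCDF file above:
vars_in_file = {
    "time": ("time",),
    "lat": ("lat",),
    "lon": ("lon",),
    "tasmax": ("time", "lat", "lon"),
}

default_loadable_variables(vars_in_file)  # ['time', 'lat', 'lon']
```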

Note

In general, it is recommended to load all of your low-dimensional (e.g. scalar and 1D) variables.

Whilst this does mean the original data will be duplicated in your new virtual zarr store, by loading your coordinates into memory they can be inlined in the reference file or stored as single chunks, rather than as large numbers of extremely tiny chunks, which speeds up loading that data on subsequent uses of the virtual dataset.

However, you should not do this for much higher-dimensional variables, as then you might use a lot of storage duplicating them, defeating the point of the virtual zarr approach.

Also, anything duplicated could become out of sync with the referenced original files, especially if not using a transactional storage engine such as Icechunk.

Loading CF-encoded time variables

To decode time variables according to the CF conventions upon loading, you must ensure that variable is one of the loadable_variables and the decode_times argument of open_virtual_dataset is set to True (decode_times defaults to None).

vds = open_virtual_dataset(
    url=url,
    registry=registry,
    parser=parser,
    loadable_variables=['time'],
    decode_times=True,
)
print(vds)
<xarray.Dataset> Size: 1GB
Dimensions:  (time: 365, lat: 600, lon: 1440)
Coordinates:
  * time     (time) datetime64[ns] 3kB 2015-01-01T12:00:00 ... 2015-12-31T12:...
    lat      (lat) float64 5kB ManifestArray<shape=(600,), dtype=float64, chu...
    lon      (lon) float64 12kB ManifestArray<shape=(1440,), dtype=float64, c...
Data variables:
    tasmax   (time, lat, lon) float32 1GB ManifestArray<shape=(365, 600, 1440...
Attributes: (12/22)
    cmip6_source_id:       ACCESS-CM2
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0
    activity:              NEX-GDDP-CMIP6
    Conventions:           CF-1.7
    frequency:             day
    ...                    ...
    doi:                   https://doi.org/10.7917/OFSG3345
    external_variables:    areacella
    contact:               Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
    creation_date:         Sat Nov 16 13:31:18 PST 2024
    disclaimer:            These data are considered provisional and subject ...
    tracking_id:           d4b2123b-abf9-4c3c-a780-58df6ce4e67f

Combining virtual datasets

In general we should be able to combine all the datasets from our archival files into one using some combination of calls to xarray.concat and xarray.merge. For combining along multiple dimensions in one call we also have xarray.combine_nested and xarray.combine_by_coords. If you're not familiar with any of these functions we recommend you skim through xarray's docs on combining.

Important

Currently the virtual approach requires the same chunking and encoding across datasets. See the FAQ for more details.

Manual concatenation ordering

The simplest case of concatenation is when you have a set of files and you know the order in which they should be concatenated, without looking inside the files. In this case it is sufficient to open the files one-by-one, then pass the virtual datasets as a list to the concatenation function.

url_1 = "s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_2015_v2.0.nc"
url_2 = "s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/ssp126/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_2016_v2.0.nc"

vds1 = open_virtual_dataset(url=url_1, registry=registry, parser=parser)
vds2 = open_virtual_dataset(url=url_2, registry=registry, parser=parser)

As we know the correct order a priori, we can just combine along one dimension using xarray.concat.

combined_vds = xr.concat([vds1, vds2], dim='time')
print(combined_vds)
<xarray.Dataset> Size: 3GB
Dimensions:  (time: 731, lat: 600, lon: 1440)
Coordinates:
  * time     (time) datetime64[ns] 6kB 2015-01-01T12:00:00 ... 2016-12-31T12:...
  * lat      (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
    tasmax   (time, lat, lon) float32 3GB ManifestArray<shape=(731, 600, 1440...
Attributes: (12/22)
    cmip6_source_id:       ACCESS-CM2
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0
    activity:              NEX-GDDP-CMIP6
    Conventions:           CF-1.7
    frequency:             day
    ...                    ...
    doi:                   https://doi.org/10.7917/OFSG3345
    external_variables:    areacella
    contact:               Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
    creation_date:         Sat Nov 16 13:31:18 PST 2024
    disclaimer:            These data are considered provisional and subject ...
    tracking_id:           d4b2123b-abf9-4c3c-a780-58df6ce4e67f

Note

If you have any virtual coordinate variables, you will likely need to specify the keyword arguments coords='minimal' and compat='override' to xarray.concat(), because the default behaviour of xarray will attempt to load coordinates in order to check their compatibility with one another. Similarly, if there are data variables that do not include the concatenation dimension, you will likely need to specify data_vars='minimal'.

In the future these defaults will be changed, such that passing these arguments explicitly will become unnecessary.

The general multi-dimensional version of this concatenation-by-order-supplied can be achieved using xarray.combine_nested().

combined_vds = xr.combine_nested([vds1, vds2], concat_dim=['time'])

In N-dimensions the datasets would need to be passed as an N-deep nested list-of-lists, see the xarray docs.
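As a sketch of the 2D case, suppose files are laid out on a time-by-ensemble-member grid (the urls and the "member" dimension name below are hypothetical; open each file with open_virtual_dataset as above before combining):

```python
# Hypothetical file layout on a 2D (time x ensemble member) grid.
years = ["2015", "2016"]
members = ["r1i1p1f1", "r2i1p1f1"]

# Build the 2-deep nested list-of-lists that xr.combine_nested expects:
# the outer list varies along "time", each inner list along "member".
nested_urls = [
    [f"s3://my-bucket/tasmax_{m}_{y}.nc" for m in members]
    for y in years
]
# [['s3://my-bucket/tasmax_r1i1p1f1_2015.nc', 's3://my-bucket/tasmax_r2i1p1f1_2015.nc'],
#  ['s3://my-bucket/tasmax_r1i1p1f1_2016.nc', 's3://my-bucket/tasmax_r2i1p1f1_2016.nc']]

# Open each file as a virtual dataset, preserving the nesting, then combine:
# nested_vds = [
#     [open_virtual_dataset(url=u, registry=registry, parser=parser) for u in row]
#     for row in nested_urls
# ]
# combined_vds = xr.combine_nested(nested_vds, concat_dim=["time", "member"])
```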

Note

For manual concatenation we can actually avoid creating any xarray indexes, as we won't need them. Without indexes we can avoid loading any data whatsoever from the files. However, you should first be confident that the archival files actually do have compatible data, as the coordinate values then cannot be efficiently compared for consistency (i.e. aligned).

You can achieve both the opening and combining steps for multiple files in one go by using open_virtual_mfdataset.

combined_vds = open_virtual_mfdataset(
    [url_1, url_2],
    registry=registry,
    parser=parser,
    combine="nested",
    concat_dim="time"
)

We passed combine='nested' to specify that we want the datasets to be combined in the order they appear, using xr.combine_nested under the hood.
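Because combine='nested' concatenates in the order the urls appear, make sure the list is ordered appropriately before passing it in. For filenames like these (hypothetical), lexical order happens to match temporal order:

```python
# Hypothetical urls; note they are initially out of order.
urls = [
    "s3://my-bucket/tasmax_2016.nc",
    "s3://my-bucket/tasmax_2015.nc",
]

# A lexical sort puts the years in ascending order, matching the desired
# concatenation order along "time".
ordered_urls = sorted(urls)
# ['s3://my-bucket/tasmax_2015.nc', 's3://my-bucket/tasmax_2016.nc']
```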

Ordering by coordinate values

If you're happy to load 1D dimension coordinates into memory, you can use their values to do the ordering for you!

vds1 = open_virtual_dataset(url=url_1, registry=registry, parser=parser, loadable_variables=['time','lat', 'lon'], decode_times=True)
vds2 = open_virtual_dataset(url=url_2, registry=registry, parser=parser, loadable_variables=['time','lat', 'lon'], decode_times=True)

combined_vds = xr.combine_by_coords([vds2, vds1], combine_attrs="drop_conflicts")

Notice we don't have to specify the concatenation dimension explicitly; xarray works out the correct ordering for us. Even though we passed the virtual datasets in the wrong order, they have been combined such that the 1-dimensional time coordinate has ascending values, so our virtual dataset still has the data in the correct order.

print(combined_vds)
<xarray.Dataset> Size: 3GB
Dimensions:  (time: 731, lat: 600, lon: 1440)
Coordinates:
  * time     (time) datetime64[ns] 6kB 2015-01-01T12:00:00 ... 2016-12-31T12:...
  * lat      (lat) float64 5kB -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 12kB 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
Data variables:
    tasmax   (time, lat, lon) float32 3GB ManifestArray<shape=(731, 600, 1440...
Attributes: (12/20)
    cmip6_source_id:       ACCESS-CM2
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0
    activity:              NEX-GDDP-CMIP6
    Conventions:           CF-1.7
    frequency:             day
    ...                    ...
    title:                 ACCESS-CM2, r1i1p1f1, ssp126, global downscaled CM...
    resolution_id:         0.25 degree
    doi:                   https://doi.org/10.7917/OFSG3345
    external_variables:    areacella
    contact:               Dr. Bridget Thrasher: bridget@climateanalyticsgrou...
    disclaimer:            These data are considered provisional and subject ...

Again, we can achieve both the opening and combining steps for multiple files in one go by using open_virtual_mfdataset, this time passing combine='by_coords'.

combined_vds = open_virtual_mfdataset(
    [url_1, url_2],
    registry=registry,
    parser=parser,
    combine="by_coords",
    combine_attrs="drop_conflicts",
)

In the future, we aim to provide globbing utilities to simplify finding datasets to include.

Ordering using metadata

You can create a new index from the url by passing a function to the preprocess parameter of open_virtual_mfdataset. An example will be added in the future.
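Pending an official example, here is a heavily hedged sketch. It assumes preprocess behaves like the preprocess argument of xarray.open_mfdataset, i.e. it receives each dataset after opening and returns a possibly modified one; that assumption and the preprocess body below are untested, and only the year_from_url helper is concretely runnable.

```python
import re

# Helper (not part of VirtualiZarr): parse the year out of urls like the
# NEX-GDDP-CMIP6 filenames used above, e.g. "..._gn_2015_v2.0.nc".
def year_from_url(url: str) -> int:
    match = re.search(r"_gn_(\d{4})_", url)
    if match is None:
        raise ValueError(f"no year found in {url!r}")
    return int(match.group(1))

year_from_url("tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_2015_v2.0.nc")  # 2015

# Assuming preprocess works like xarray.open_mfdataset's, a function could
# attach the parsed value as a coordinate to order on (untested sketch):
# def preprocess(ds):
#     return ds.assign_coords(year=year_from_url(ds.encoding.get("source", "")))
#
# combined_vds = open_virtual_mfdataset(
#     urls, registry=registry, parser=parser, preprocess=preprocess,
#     combine="nested", concat_dim="time",
# )
```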

Combining many virtual datasets at once

Combining a large number (e.g., 1000s) of virtual datasets at once should be very quick (a few seconds), as we are manipulating only a few KBs of metadata in memory.

However creating 1000s of virtual datasets at once can take a very long time. (If it were quick to do so, there would be little need for this library!) See the page on Scaling for tips on how to create large numbers of virtual datasets at once.

Changing the prefix of urls in the virtual dataset

You can update the urls stored in a manifest or virtual dataset without changing the byte range information using the virtualizarr.accessor.VirtualiZarrDatasetAccessor.rename_paths accessor method.

For example, you may want to rewrite urls to reflect having moved the referenced files from local storage to an S3 bucket.

def local_to_s3_url(old_local_path: str) -> str:
    from pathlib import Path

    new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"

    filename = Path(old_local_path).name
    return new_s3_bucket_url + filename

renamed_vds = vds.vz.rename_paths(local_to_s3_url)

Writing virtual stores to disk

Once we've combined references to all the chunks of all our archival files into one virtual xarray dataset, we still need to store those references so that they can be read by our analysis code later.

Writing to an Icechunk Store

We can store these references using Icechunk. Icechunk is an open-source, cloud-native transactional tensor storage engine that is fully compatible with Zarr-Python version 3, as it conforms to the Zarr V3 specification. To export our virtual dataset to an Icechunk Store, we use the virtualizarr.accessor.VirtualiZarrDatasetAccessor.to_icechunk accessor method.

Here we use a memory store but in real use-cases you'll probably want to use icechunk.local_filesystem_storage, icechunk.s3_storage, icechunk.azure_storage, icechunk.gcs_storage, or a similar storage class.

import icechunk

# you need to explicitly grant permissions to icechunk to read from the locations of your archival files
# we use `anonymous=True` because this is a public bucket, otherwise you need to set credentials explicitly
config = icechunk.RepositoryConfig.default()
config.set_virtual_chunk_container(
    icechunk.VirtualChunkContainer(
        url_prefix="s3://nex-gddp-cmip6/",
        store=icechunk.s3_store(region="us-west-2", anonymous=True),
    ),
)

# create an in-memory icechunk repository that includes the virtual chunk containers
storage = icechunk.in_memory_storage()
repo = icechunk.Repository.create(storage, config)

# open a writable icechunk session to be able to add new contents to the store
session = repo.writable_session("main")

# write the virtual dataset to the session's IcechunkStore instance, using VirtualiZarr's `.vz` accessor
vds1.vz.to_icechunk(session.store)

# commit your changes so that they are permanently available as a new snapshot
snapshot_id = session.commit("Wrote first dataset")
print(snapshot_id)

# optionally persist the virtual chunk container config, which you probably want --
# otherwise every user who wants to read the referenced virtual data back later will
# have to repeat the `config.set_virtual_chunk_container` step at read time.
repo.save_config()
XBSASR94X7MZCPEA3FW0

Append to an existing Icechunk Store

You can append a virtual dataset to an existing Icechunk store using the append_dim argument. This option is designed to behave similarly to the append_dim option to xarray's xarray.Dataset.to_zarr method, and is especially useful for datasets that grow over time.

Important

Note again that the virtual Zarr approach requires the same chunking and encoding across datasets, including when appending to an existing Icechunk-backed Zarr store. See the FAQ for more details.

# write the virtual dataset to the session with the IcechunkStore
session = repo.writable_session("main")
vds2.vz.to_icechunk(session.store, append_dim="time")
snapshot_id = session.commit("Appended second dataset")
print(snapshot_id)
1T8VSFMMDD0QW5ASHVFG

See the Icechunk documentation for more details.

Writing to Kerchunk's format and reading data via fsspec

The kerchunk library has its own specification for serializing virtual datasets as a JSON file or Parquet directory.

To write out all the references in the virtual dataset as a single kerchunk-compliant JSON or parquet file, you can use the virtualizarr.accessor.VirtualiZarrDatasetAccessor.to_kerchunk accessor method.

combined_vds.vz.to_kerchunk('output/combined.json', format='json')

These zarr-like references can now be interpreted by fsspec, using kerchunk's built-in xarray backend (kerchunk must be installed to use engine='kerchunk').

combined_ds = xr.open_dataset('output/combined.json', engine="kerchunk")
print(combined_ds)

In-memory ("loadable") variables backed by numpy arrays can also be written out to kerchunk reference files, with the values serialized as bytes. This is equivalent to kerchunk's concept of "inlining", but done on a per-array basis using the loadable_variables kwarg rather than a per-chunk basis using kerchunk's inline_threshold kwarg.

Note

Currently you can only serialize in-memory variables to kerchunk references if they do not have any encoding.

When you have many chunks, the reference file can get large enough to be unwieldy as JSON. In that case the references can instead be stored as parquet, which again uses kerchunk internally.

combined_vds.vz.to_kerchunk('output/combined.parquet', format='parquet')

And again we can read these references using the "kerchunk" backend

combined_ds = xr.open_dataset('output/combined.parquet', engine="kerchunk")
print(combined_ds)

By default, references are placed in separate parquet files when the total number of references exceeds record_size. If a particular variable references fewer than categorical_threshold unique urls, the url column will be stored as a categorical variable.

Opening Kerchunk references as virtual datasets

You can open existing Kerchunk json or parquet references as VirtualiZarr virtual datasets. This may be useful for manipulating them or converting existing kerchunk-formatted references to other reference storage formats such as Icechunk.

from pathlib import Path
from virtualizarr.parsers import KerchunkJSONParser, KerchunkParquetParser

url_cwd = f"file://{str(Path.cwd())}"
store = from_url(url_cwd)
registry.register(url_cwd, store)
vds = open_virtual_dataset(url=f"{url_cwd}/output/combined.json", registry=registry, parser=KerchunkJSONParser())
# or
vds = open_virtual_dataset(url=f"{url_cwd}/output/combined.parquet", registry=registry, parser=KerchunkParquetParser())

One difference between the kerchunk references format and virtualizarr's internal manifest representation (as well as Icechunk's format) is that paths in kerchunk references can be relative paths.

Opening kerchunk references that contain relative local filepaths therefore requires supplying another piece of information: the root directory that the filepaths were defined relative to.

You can disambiguate kerchunk references containing relative paths by passing the fs_root kwarg to the Kerchunk parser.

# file `relative_refs.json` contains a path like './file.nc'

vds = open_virtual_dataset(
    'relative_refs.json',
    registry=registry,
    parser=KerchunkJSONParser(
        fs_root='file:///data_directory/',
    )
)

# the path in the virtual dataset will now be 'file:///data_directory/file.nc'
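The resolution behaves like standard URL joining, which can be illustrated with only the standard library:

```python
from urllib.parse import urljoin

# How a relative kerchunk reference resolves against fs_root:
urljoin("file:///data_directory/", "./file.nc")  # 'file:///data_directory/file.nc'
```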

Note that as the virtualizarr.accessor.VirtualiZarrDatasetAccessor.to_kerchunk method only writes absolute paths, the only scenario in which you might come across references containing relative paths is if you are opening references that were previously created using the kerchunk library alone.