Virtual Datasets¶
VirtualiZarr has a small API surface, because most of the complexity is handled by xarray functions like xarray.concat and xarray.merge. Users can use xarray for every step apart from reading and serializing virtual references.
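For example, once two virtual datasets are open, combining them is ordinary xarray. A minimal sketch, assuming vds1 and vds2 are already-opened virtual datasets covering consecutive time ranges:

import xarray as xr

# Concatenating virtual datasets only manipulates chunk references;
# no chunk data is read or copied. coords="minimal" and
# compat="override" stop xarray from trying to compare the values of
# the virtual variables, which cannot be loaded.
combined_vds = xr.concat(
    [vds1, vds2], dim="time", coords="minimal", compat="override"
)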
Reading¶
virtualizarr.open_virtual_dataset ¶
open_virtual_dataset(
url: str,
registry: ObjectStoreRegistry,
parser: Parser,
drop_variables: Iterable[str] | None = None,
loadable_variables: Iterable[str] | None = None,
decode_times: bool | None = None,
) -> Dataset
Open an archival data source as an xarray.Dataset wrapping virtualized zarr arrays.
No data variables will be loaded unless specified in the loadable_variables kwarg (in which case they will open as lazily indexed arrays using xarray's standard lazy indexing classes).
Coordinate variables are loaded by default following xarray's behavior.
Parameters:
- url (str) – The URL of the data source to virtualize. The URL should include a scheme. For example:
  - url="file:///Users/my-name/Documents/my-project/my-data.nc" for a local data source.
  - url="s3://my-bucket/my-project/my-data.nc" for a remote data source on an S3-compatible cloud.
- registry (ObjectStoreRegistry) – An ObjectStoreRegistry for resolving urls and reading data.
- parser (Parser) – A parser to use for the given data source. For example:
  - virtualizarr.parsers.HDFParser for virtualizing NetCDF4 or HDF5 files.
  - virtualizarr.parsers.FITSParser for virtualizing FITS files.
  - virtualizarr.parsers.NetCDF3Parser for virtualizing NetCDF3 files.
  - virtualizarr.parsers.DMRPPParser for virtualizing DMR++ files.
  - virtualizarr.parsers.KerchunkJSONParser for re-opening Kerchunk JSONs.
  - virtualizarr.parsers.KerchunkParquetParser for re-opening Kerchunk Parquets.
  - virtualizarr.parsers.ZarrParser for virtualizing Zarr stores.
  - virtual_tiff.VirtualTIFF for virtualizing TIFFs.
- drop_variables (Iterable[str] | None, default: None) – Variables in the data source to drop before returning.
- loadable_variables (Iterable[str] | None, default: None) – Variables in the data source to load as Dask/NumPy arrays instead of as virtual arrays.
- decode_times (bool | None, default: None) – Bool that is passed into xarray.open_dataset. Allows times to be decoded into datetime objects.
Returns:
- vds – An xarray.Dataset containing virtual chunk references for all variables not included in loadable_variables, and normal lazily indexed arrays for each variable in loadable_variables.
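A minimal sketch of a single-file open, assuming a local NetCDF4 file at the hypothetical path from the url example above (the "file://" registry prefix is assumed to mirror the s3/https prefixes used elsewhere on this page):

from obstore.store import LocalStore
from obspec_utils.registry import ObjectStoreRegistry
from virtualizarr import open_virtual_dataset
from virtualizarr.parsers import HDFParser

# Hypothetical local file, matching the url example above.
url = "file:///Users/my-name/Documents/my-project/my-data.nc"
registry = ObjectStoreRegistry({"file://": LocalStore()})

# Variables listed in loadable_variables are read into memory
# (assuming the file has a "time" variable); everything else stays
# as virtual chunk references.
vds = open_virtual_dataset(
    url=url,
    registry=registry,
    parser=HDFParser(),
    loadable_variables=["time"],
)
print(vds)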
virtualizarr.open_virtual_mfdataset ¶
open_virtual_mfdataset(
urls: str
| PathLike
| Sequence[str | PathLike]
| NestedSequence[str | PathLike],
registry: ObjectStoreRegistry,
parser: Parser,
concat_dim: str
| DataArray
| Index
| Sequence[str]
| Sequence[DataArray]
| Sequence[Index]
| None = None,
compat: "CompatOptions" = "no_conflicts",
preprocess: Callable[[Dataset], Dataset] | None = None,
data_vars: Literal["all", "minimal", "different"] | list[str] = "all",
coords="different",
combine: Literal["by_coords", "nested"] = "by_coords",
parallel: Literal["dask", "lithops", False] | type[Executor] = False,
join: "JoinOptions" = "outer",
attrs_file: str | PathLike | None = None,
combine_attrs: "CombineAttrsOptions" = "override",
**kwargs,
) -> Dataset
Open multiple data sources as a single virtual dataset.
This function is explicitly modelled after xarray.open_mfdataset, and works in the same way.
If combine='by_coords' then the function combine_by_coords is used to combine
the datasets into one before returning the result, and if combine='nested' then
combine_nested is used. The urls must be structured according to which
combining function is used, the details of which are given in the documentation for
combine_by_coords and combine_nested. By default combine='by_coords'
will be used. Global attributes from the attrs_file are used
for the combined dataset.
Parameters:
- urls (str | PathLike | Sequence[str | PathLike] | NestedSequence[str | PathLike]) – Same as in virtualizarr.open_virtual_dataset.
- registry (ObjectStoreRegistry) – An ObjectStoreRegistry for resolving urls and reading data.
- concat_dim (str | DataArray | Index | Sequence[str] | Sequence[DataArray] | Sequence[Index] | None, default: None) – Same as in xarray.open_mfdataset.
- compat (CompatOptions, default: 'no_conflicts') – Same as in xarray.open_mfdataset.
- preprocess (Callable[[Dataset], Dataset] | None, default: None) – Same as in xarray.open_mfdataset.
- data_vars (Literal['all', 'minimal', 'different'] | list[str], default: 'all') – Same as in xarray.open_mfdataset.
- coords – Same as in xarray.open_mfdataset.
- combine (Literal['by_coords', 'nested'], default: 'by_coords') – Same as in xarray.open_mfdataset.
- parallel ("dask", "lithops", False, or a concurrent.futures.Executor subclass, default: False) – Whether the open and preprocess steps will be performed in parallel using lithops, dask.delayed, or any executor compatible with the concurrent.futures interface, or in serial. The default of False executes these steps in serial.
- join (JoinOptions, default: 'outer') – Same as in xarray.open_mfdataset.
- attrs_file (str | PathLike | None, default: None) – Same as in xarray.open_mfdataset.
- combine_attrs (CombineAttrsOptions, default: 'override') – Same as in xarray.open_mfdataset.
- **kwargs (optional) – Additional arguments passed on to virtualizarr.open_virtual_dataset; see its documentation for an overview of the possible options.
Returns:
- vds – An xarray.Dataset containing virtual chunk references for all variables not included in loadable_variables, and normal lazily indexed arrays for each variable in loadable_variables.
Notes
The results of opening each virtual dataset in parallel are sent back to the client process, so they must not be too large. See the docs page on Scaling.
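As a sketch, assuming a hypothetical bucket holding one file per year, each with a shared "time" coordinate:

from obstore.store import S3Store
from obspec_utils.registry import ObjectStoreRegistry
from virtualizarr import open_virtual_mfdataset
from virtualizarr.parsers import HDFParser

# Hypothetical bucket and file names, for illustration only.
store = S3Store("my-bucket", region="us-east-1", skip_signature=True)
registry = ObjectStoreRegistry({"s3://my-bucket/": store})
urls = [f"s3://my-bucket/my-project/data_{year}.nc" for year in (2023, 2024)]

# combine="by_coords" stitches the files along their shared "time"
# coordinate; parallel="dask" would instead open each file in
# parallel via dask.delayed.
vds = open_virtual_mfdataset(
    urls,
    registry=registry,
    parser=HDFParser(),
    combine="by_coords",
)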
virtualizarr.open_virtual_datatree ¶
open_virtual_datatree(
url: str,
registry: ObjectStoreRegistry,
parser: Parser,
*,
loadable_variables: Iterable[str] | None = None,
decode_times: bool | None = None,
) -> DataTree
Open an archival data source as an xarray.DataTree wrapping virtualized zarr arrays.
See the loadable_variables kwarg for a description of which data variables are loaded vs.
virtualized.
Parameters:
- url (str) – The URL of the data source to virtualize. The URL should include a scheme. For example:
  - url="file:///Users/my-name/Documents/my-project/my-data.nc" for a local data source.
  - url="s3://my-bucket/my-project/my-data.nc" for a remote data source on an S3-compatible cloud.
- registry (ObjectStoreRegistry) – An ObjectStoreRegistry for resolving urls and reading data.
- parser (Parser) – A parser to use for the given data source. For example:
  - virtualizarr.parsers.HDFParser for virtualizing NetCDF4 or HDF5 files.
  - virtualizarr.parsers.FITSParser for virtualizing FITS files.
  - virtualizarr.parsers.NetCDF3Parser for virtualizing NetCDF3 files.
  - virtualizarr.parsers.KerchunkJSONParser for re-opening Kerchunk JSONs.
  - virtualizarr.parsers.KerchunkParquetParser for re-opening Kerchunk Parquets.
  - virtualizarr.parsers.ZarrParser for virtualizing Zarr stores.
  - virtual_tiff.VirtualTIFF for virtualizing TIFFs.
- loadable_variables (Iterable[str] | None, default: None) – If None (the default), dimension coordinate variables (1D variables whose name matches their dimension) will be loaded automatically to enable xarray indexing. If an empty iterable, no variables will be loaded. Other options are not yet supported.
- decode_times (bool | None, default: None) – Bool that is passed into xarray.open_dataset. Allows times to be decoded into datetime objects.
Returns:
- vds – An xarray.DataTree containing virtual chunk references for all variables.
Examples:
Virtualize a Cloud Optimized GeoTIFF (COG) using virtual_tiff.VirtualTIFF:
from obstore.store import S3Store
from virtualizarr import open_virtual_datatree
from obspec_utils.registry import ObjectStoreRegistry
from virtual_tiff import VirtualTIFF
# Access a public Sentinel-2 COG from AWS
store = S3Store("sentinel-cogs", region="us-west-2", skip_signature=True)
registry = ObjectStoreRegistry({"s3://sentinel-cogs/": store})
url = "s3://sentinel-cogs/sentinel-s2-l2a-cogs/12/S/UF/2022/6/S2B_12SUF_20220609_0_L2A/B04.tif"
parser = VirtualTIFF(ifd_layout="nested")
with open_virtual_datatree(url=url, parser=parser, registry=registry) as vdt:
print(vdt)
Virtualize a NetCDF4 file using the virtualizarr.parsers.HDFParser:
from obstore.store import HTTPStore
from virtualizarr import open_virtual_datatree
from virtualizarr.parsers import HDFParser
from obspec_utils.registry import ObjectStoreRegistry
base = "https://github.com"
url = f"{base}/pydata/xarray-data/raw/refs/heads/master/precipitation.nc4"
store = HTTPStore(base)
parser = HDFParser()
registry = ObjectStoreRegistry({base: store})
vdt = open_virtual_datatree(url=url, registry=registry, parser=parser)
print(vdt)
Prevent loading variables from any groups (by default the coordinate variables "time", "lat", and "lon" are loaded):
vdt = open_virtual_datatree(
url=url,
registry=registry,
parser=parser,
loadable_variables=[],
)
Drop variables from a specific group after opening:
vdt = open_virtual_datatree(
url=url,
registry=registry,
parser=parser,
)
vdt["/observed"] = vdt["/observed"].to_dataset().drop_vars(["lon"])
Information¶
virtualizarr.accessor.VirtualiZarrDatasetAccessor.nbytes property ¶
nbytes: int
Size required to hold these references in memory in bytes.
Note this is not the size of the referenced chunks if they were actually loaded into memory, this is only the size of the pointers to the chunk locations. If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
In-memory (loadable) variables are included in the total using xarray's normal .nbytes method.
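For example, assuming vds is a virtual dataset opened as above:

# Size of the references themselves (typically kilobytes)...
print(vds.vz.nbytes)
# ...versus the nominal size of the data they point to.
print(vds.nbytes)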
Renaming paths¶
virtualizarr.accessor.VirtualiZarrDatasetAccessor.rename_paths ¶
Rename paths to chunks in every ManifestArray in this dataset.
Accepts either a string, in which case this new path will be used for all chunks, or a function which accepts the old path and returns the new path.
Parameters:
- new (str | Callable[[str], str]) – New path to use for all chunks, either as a string, or as a function which accepts the old path and returns the new path.
Returns:
- Dataset – A dataset with the chunk paths renamed.
See Also
virtualizarr.ManifestArray.rename_paths
virtualizarr.ChunkManifest.rename_paths
Examples:
Rename paths to reflect moving the referenced files from local storage to an S3 bucket.
>>> def local_to_s3_url(old_local_path: str) -> str:
...     from pathlib import Path
...
...     new_s3_bucket_url = "http://s3.amazonaws.com/my_bucket/"
...
...     filename = Path(old_local_path).name
...     return new_s3_bucket_url + filename
>>>
>>> ds.vz.rename_paths(local_to_s3_url)
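Since new can also be a plain string, a single path can be applied to every chunk at once, which is useful when all references point into one file (the bucket URL here is illustrative):

>>> ds.vz.rename_paths("s3://my_bucket/my_file.nc")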