Serialization

virtualizarr.accessor.VirtualiZarrDatasetAccessor.to_icechunk

to_icechunk(
    store: IcechunkStore,
    *,
    group: str | None = None,
    append_dim: str | None = None,
    region: Literal["auto"]
    | Mapping[str, Literal["auto"] | slice]
    | None = None,
    validate_containers: bool = True,
    last_updated_at: datetime | None = None,
) -> None

Write an xarray dataset to an Icechunk store.

Any variables backed by ManifestArray objects will be written as virtual references. Any other variables will be loaded into memory before their binary chunk data is written into the store.

If append_dim is provided, the virtual dataset will be appended to the existing IcechunkStore along the append_dim dimension.

If last_updated_at is provided, it will be used as a checksum for any virtual chunks written to the store with this operation. At read time, if any of the virtual chunks have been updated since this provided datetime, an error will be raised. This protects against reading outdated virtual chunks that have been updated since the last read. When not provided, the current time is used. This value is stored in Icechunk with seconds precision, so be sure to take that into account when providing this value.

Parameters:

  • store (IcechunkStore) –

    Store to write dataset into.

  • group (str | None, default: None ) –

    Path of the group to write the dataset into (default: the root group).

  • append_dim (str | None, default: None ) –

    Dimension along which to append the virtual dataset.

  • region (Literal['auto'] | Mapping[str, Literal['auto'] | slice] | None, default: None ) –

    Optional mapping from dimension names to either a) "auto", or b) integer slices, indicating the region of existing zarr array(s) in which to write this dataset's data.

    See xarray.Dataset.to_zarr documentation for details.

  • validate_containers (bool, default: True ) –

    If True, raise if any virtual chunks refer to locations that don't match any existing virtual chunk container set on this Icechunk repository.

    It is not generally recommended to set this to False, because it can lead to confusing runtime results and errors when reading data back.

  • last_updated_at (datetime | None, default: None ) –

    Datetime to use as a checksum for any virtual chunks written to the store with this operation. When not provided, the current time is used.
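A minimal sketch of writing and then appending a virtual dataset, assuming `vds` is a virtual xarray.Dataset (e.g. produced by virtualizarr) and `repo` is an existing icechunk.Repository — both names are placeholders, not part of this API:

```python
# Sketch only: `repo` is assumed to be an icechunk.Repository and `vds`
# a virtual xarray.Dataset; neither is defined by this method.
session = repo.writable_session("main")

# Append this dataset's virtual references along the "time" dimension
# of the arrays already present in the store.
vds.vz.to_icechunk(session.store, append_dim="time")

session.commit("append new time steps as virtual references")
```

Omitting append_dim writes the dataset as a fresh set of arrays (into the root group, or into group if given).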

virtualizarr.accessor.VirtualiZarrDatasetAccessor.to_kerchunk

to_kerchunk(filepath: None, format: Literal['dict']) -> KerchunkStoreRefs
to_kerchunk(filepath: str | Path, format: Literal['json']) -> None
to_kerchunk(
    filepath: str | Path,
    format: Literal["parquet"],
    record_size: int = 100000,
    categorical_threshold: int = 10,
) -> None
to_kerchunk(
    filepath: str | Path | None = None,
    format: Literal["dict", "json", "parquet"] = "dict",
    record_size: int = 100000,
    categorical_threshold: int = 10,
) -> KerchunkStoreRefs | None

Serialize all virtualized arrays in this xarray dataset into the kerchunk references format.

Parameters:

  • filepath (str | Path | None, default: None ) –

    File path to write kerchunk references into. Not required if format is 'dict'.

  • format (Literal['dict', 'json', 'parquet'], default: 'dict' ) –

    Format to serialize the kerchunk references as. If 'json' or 'parquet' then the 'filepath' argument is required.

  • record_size (int, default: 100000 ) –

    Number of references to store in each reference file (default 100,000). Bigger values mean fewer read requests but larger memory footprint. Only available when format is 'parquet'.

  • categorical_threshold (int, default: 10 ) –

    Encode urls as pandas.Categorical to reduce memory footprint if the ratio of the number of unique urls to total number of refs for each variable is greater than or equal to this number (default 10). Only available when format is 'parquet'.
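A short sketch of the three output formats, assuming `vds` is a virtual xarray.Dataset (the filenames are placeholders):

```python
# In-memory dict of kerchunk references; no filepath needed.
refs = vds.vz.to_kerchunk(format="dict")

# Write the references to a JSON file instead.
vds.vz.to_kerchunk("combined.json", format="json")

# Parquet output, tuning how many references go into each record file.
vds.vz.to_kerchunk("combined.parq", format="parquet", record_size=50_000)
```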

References

fsspec.github.io/kerchunk/spec.html

virtualizarr.accessor.VirtualiZarrDataTreeAccessor.to_icechunk

to_icechunk(
    store: IcechunkStore,
    *,
    write_inherited_coords: bool = False,
    validate_containers: bool = True,
    last_updated_at: datetime | None = None,
    **kwargs,
) -> None

Write an xarray DataTree to an Icechunk store.

Any variables backed by ManifestArray objects will be written as virtual references. Any other variables will be loaded into memory before their binary chunk data is written into the store.

If last_updated_at is provided, it will be used as a checksum for any virtual chunks written to the store with this operation. At read time, if any of the virtual chunks have been updated since this provided datetime, an error will be raised. This protects against reading outdated virtual chunks that have been updated since the last read. When not provided, no check is performed. This value is stored in Icechunk with seconds precision, so be sure to take that into account when providing this value.

Parameters:

  • store (IcechunkStore) –

    Store to write dataset into.

  • write_inherited_coords (bool, default: False ) –

    If True, replicate inherited coordinates on all descendant nodes. Otherwise, only write coordinates at the level at which they are originally defined. This saves disk space, but requires opening the full tree to load inherited coordinates.

  • validate_containers (bool, default: True ) –

    If True, raise if any virtual chunks refer to locations that don't match any existing virtual chunk container set on this Icechunk repository.

    It is not generally recommended to set this to False, because it can lead to confusing runtime results and errors when reading data back.

  • last_updated_at (datetime | None, default: None ) –

    Datetime to use as a checksum for any virtual chunks written to the store with this operation. When not provided, no check is performed.

  • **kwargs

    Additional keyword arguments to be passed to xarray.Dataset.vz.to_icechunk.

Examples:

To ensure an error is raised if the files containing referenced virtual chunks are modified at any time from now on, pass the current time to last_updated_at.

>>> from datetime import datetime
>>> vdt.vz.to_icechunk(
...     icechunkstore,
...     last_updated_at=datetime.now(),
... )