Zarr + STAC

Approaches to integrating STAC and Zarr

Authors

Julia Signell, Max Jones

Background + Motivation

STAC (SpatioTemporal Asset Catalogs) is a specification for defining and searching any type of data that has spatial and temporal dimensions. STAC has seen significant adoption in the earth observation community. Zarr is a specification for storing self-describing groups of cloud-optimized arrays. Zarr has been adopted by the earth modeling community (led by Pangeo). Both STAC and Zarr offer a flexible nested structure with arbitrary metadata at a variety of levels (for STAC: catalog, collection, item, and asset; for Zarr: group and array). This flexibility has contributed to their popularity, but their overlapping goals can make it hard to tell where to draw the line between what belongs in Zarr and what belongs in STAC.

Comparison Table

The main thing to keep in mind is that Zarr is a data format and STAC is a catalog. Sometimes we conflate “STAC” with “COGs stored in a STAC catalog”, but STAC can be used to catalog anything as long as it has spatial and temporal dimensions.

STAC	Zarr (+xarray for the last 2 rows)
for data with spatial and temporal dimensions	for groups of arrays with any type of dimensions
supports arbitrary metadata for catalogs, collections, items, assets	supports arbitrary metadata for groups, arrays
storage of STAC metadata is completely decoupled from storage of data	storage of metadata is coupled to data (i.e., in the same directory, except when virtualized)
good for discovering datasets of interest	good at filtering within a dataset
searching returns STAC items or collections	filtering returns subsets of arrays (potentially composed of parts of multiple chunks)
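
To make the last two rows concrete, the sketch below contrasts the two kinds of query using only the Python standard library. The trimmed item dictionaries, the `search` and `chunks_touched` helpers, and the chunk size are all hypothetical stand-ins; a real catalog would be queried through a STAC API client and a real array through a library such as xarray.

```python
from datetime import datetime

# Hypothetical, heavily trimmed STAC items: only id, bbox, and datetime.
items = [
    {"id": "scene-a", "bbox": [0, 0, 10, 10], "datetime": "2024-01-05T00:00:00Z"},
    {"id": "scene-b", "bbox": [20, 20, 30, 30], "datetime": "2024-01-06T00:00:00Z"},
    {"id": "scene-c", "bbox": [5, 5, 15, 15], "datetime": "2024-03-01T00:00:00Z"},
]


def bboxes_intersect(a, b):
    """True when two [west, south, east, north] boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def parse(timestamp):
    """Parse an RFC 3339 UTC timestamp such as 2024-01-05T00:00:00Z."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def search(items, bbox, start, end):
    """Catalog-style search: returns whole items, never pixel values."""
    return [
        item
        for item in items
        if bboxes_intersect(item["bbox"], bbox)
        and parse(start) <= parse(item["datetime"]) <= parse(end)
    ]


def chunks_touched(start, stop, chunk_size):
    """Array-style filter: chunk indices read by the slice [start, stop)."""
    return list(range(start // chunk_size, (stop - 1) // chunk_size + 1))


found = search(items, [8, 8, 12, 12], "2024-01-01T00:00:00Z", "2024-01-31T00:00:00Z")
print([item["id"] for item in found])  # ['scene-a']
print(chunks_touched(90, 210, chunk_size=100))  # [0, 1, 2]
```

The search hands back a whole item to open later, while the array slice covers only part of chunks 0 and 2 and all of chunk 1.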

Typical approach to storing data in STAC

This is the most common approach for organizing data with STAC:

Create a COG for each variable (for satellites: band) at each time and place (in satellite terminology: each scene).
Define a STAC collection for each dataset (i.e., set of data collected using the same platform, algorithms, model, etc.).
Define a STAC item for each unique temporal and regional extent within that dataset.
In each STAC item include assets for each variable or band with href links to the COGs containing that data.
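
The steps above can be sketched as a plain dictionary that follows the STAC item layout. This is a minimal illustration, not a complete or validated item: the `cog_item` helper, the ids, the `example.com` hrefs, and the relative collection link are all hypothetical.

```python
import json


def cog_item(scene_id, bbox, when, band_hrefs, collection_id):
    """Build a minimal STAC item with one COG asset per band (illustrative only)."""
    west, south, east, north = bbox
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": scene_id,
        "collection": collection_id,
        "bbox": bbox,
        "geometry": {
            "type": "Polygon",
            # Closed ring: the first and last positions are identical.
            "coordinates": [[
                [west, south], [east, south], [east, north],
                [west, north], [west, south],
            ]],
        },
        "properties": {"datetime": when},
        # Hypothetical relative link back to the parent collection.
        "links": [
            {"rel": "collection", "href": "./collection.json", "type": "application/json"}
        ],
        "assets": {
            band: {
                "href": href,
                "type": "image/tiff; application=geotiff; profile=cloud-optimized",
                "roles": ["data"],
            }
            for band, href in band_hrefs.items()
        },
    }


item = cog_item(
    scene_id="scene-2024-01-05-tile-001",
    bbox=[10.0, 45.0, 11.0, 46.0],
    when="2024-01-05T10:30:00Z",
    band_hrefs={
        "red": "https://example.com/scene-001/red.tif",
        "nir": "https://example.com/scene-001/nir.tif",
    },
    collection_id="example-collection",
)
print(json.dumps(item, indent=2))
```

Every additional band adds an asset and every additional scene adds an item, so the number of files to open grows with both.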

This works well for Level 1 or Level 2 data, which tends not to be aligned and is potentially on different reference systems. When applied to Level 3 or Level 4 data (which tends to be aligned and on the same reference system), this approach has downsides:

Client libraries need to scan the metadata for each file in order to lazily load the data cube, which can result in lots of tiny GET requests. Depending on where the data is being used, that can be slow and expensive.
Data consumers need to use client libraries for concatenation (e.g., stackstac or odc-stac); choosing between them can be confusing and can have unintended performance implications.
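
As a rough illustration of the first downside, the arithmetic below counts header reads for a hypothetical cube. The scene, timestep, and band counts are made up, and one request per file is an assumed lower bound rather than a measurement of any particular client library.

```python
def metadata_requests(n_scenes, n_timesteps, n_bands, requests_per_file=1):
    """Estimate the GET requests needed just to read every file's header."""
    n_files = n_scenes * n_timesteps * n_bands  # one COG per band per scene per time
    return n_files * requests_per_file


# Hypothetical cube: 50 scenes, one year of daily data, 4 bands.
cog_requests = metadata_requests(n_scenes=50, n_timesteps=365, n_bands=4)
print(cog_requests)  # 73000
```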

New Approaches

Data producers can use Zarr + STAC to make their data accessible and discoverable. More details are in:

The Data Producers section – explores approaches for how to store data cubes using Zarr + STAC.
The Data Consumers section – deals with how to reproducibly consume data cubes stored in Zarr + STAC.

How you set up the Zarr stores depends on what type of data you have.

One big Zarr store

Good for aligned data cubes (Level 3 and 4 data). Using one big Zarr store gives the data producer full control over the shape of the data cube (or tree of data cubes).
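
One way this could look on the STAC side is a collection whose only data asset points at the root of the store. The sketch below is an assumption about layout, not a recommendation from this report: the id, bucket URL, license, and extent are hypothetical, and `application/vnd+zarr` is used on the assumption that it is an acceptable media type for Zarr assets.

```python
import json

# Hypothetical collection describing a single large Zarr store.
collection = {
    "type": "Collection",
    "stac_version": "1.0.0",
    "id": "example-l4-cube",
    "description": "Hypothetical Level 4 data cube stored as one Zarr store.",
    "license": "CC-BY-4.0",
    "extent": {
        "spatial": {"bbox": [[-180.0, -90.0, 180.0, 90.0]]},
        # An open-ended interval: the end of the time range is null.
        "temporal": {"interval": [["2000-01-01T00:00:00Z", None]]},
    },
    "links": [],
    "assets": {
        "zarr": {
            "href": "s3://example-bucket/example-l4-cube.zarr",
            "type": "application/vnd+zarr",  # assumed media type for Zarr
            "roles": ["data"],
            "title": "Full data cube",
        }
    },
}
print(json.dumps(collection, indent=2))
```

Here STAC only describes where the cube is and what it covers; the shape, chunking, and variables live in the Zarr metadata.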

Many smaller Zarr stores

This mimics the common STAC + COG approach, but instead of a COG for each band at each scene, there is one Zarr store for the whole scene, potentially with different groups representing different resolutions (in COG terminology: overviews).

This is the approach that ESA is planning to use with the new distribution of Sentinel-2 L2A data.

Virtual references

This is similar to the “One big Zarr store” approach, except that the data does not need to be stored in a Zarr store. It can be stored in any number of COG, NetCDF, or HDF5 files: anything that is accessible via a GET request and has consistent chunking and encoding and aligned coordinates on disk.

This is the approach that Planetary Computer takes for Earth Exchange Global Daily Downscaled Projections (NEX-GDDP-CMIP6).
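
To show what a virtual reference amounts to, the sketch below hand-builds a small reference set in the style of the kerchunk JSON format, where each chunk key maps to a URL, a byte offset, and a byte length. Everything here is an assumption for illustration: the `virtual_refs` helper, the file URLs, the offsets, the variable name, and the choice of one uncompressed float32 chunk per file. In practice these references are produced by scanning the files with a tool rather than written by hand.

```python
import json
from math import prod


def virtual_refs(variable, files, chunk_shape, itemsize=4):
    """Map one uncompressed chunk per file onto a single virtual Zarr v2 array.

    files is a list of (url, byte_offset) pairs, stacked along the first axis.
    """
    shape = [len(files) * chunk_shape[0], *chunk_shape[1:]]
    zarray = {
        "zarr_format": 2,
        "shape": shape,
        "chunks": list(chunk_shape),
        "dtype": "<f4",  # must agree with itemsize
        "compressor": None,
        "filters": None,
        "fill_value": None,
        "order": "C",
    }
    refs = {
        ".zgroup": json.dumps({"zarr_format": 2}),
        f"{variable}/.zarray": json.dumps(zarray),
    }
    length = prod(chunk_shape) * itemsize  # bytes in one uncompressed chunk
    trailing = ["0"] * (len(chunk_shape) - 1)
    for i, (url, offset) in enumerate(files):
        key = ".".join([str(i), *trailing])
        # Chunk key -> [url, byte offset, byte length] inside the original file.
        refs[f"{variable}/{key}"] = [url, offset, length]
    return {"version": 1, "refs": refs}


# Hypothetical yearly NetCDF files, each holding one chunk of "tas".
files = [
    ("https://example.com/tas_2015.nc", 8192),
    ("https://example.com/tas_2016.nc", 8200),
]
refs = virtual_refs("tas", files, chunk_shape=(1, 600, 1440))
print(json.dumps(refs["refs"]["tas/1.0.0"]))
```

The single `.zarray` entry only makes sense if every file shares the same chunk shape, dtype, and encoding, which is why the consistency requirement above matters.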

Goals

This report will discuss the partially overlapping goals of STAC and Zarr and offer suggestions for how to use them together, answering questions like:

What does each of these specifications excel at?
How can they be used together to get the maximum benefit out of both?
How do we refer to NetCDF or Zarr assets from STAC?
How do we represent the typical deeply nested hierarchy of earth-system model fields in a STAC catalog?
Where do virtualized datasets (kerchunk references and Icechunk virtual stores produced by VirtualiZarr) fit in?

References

This report does not represent novel work, but instead tries to aggregate ideas from a variety of sources:

https://github.com/cloudnativegeo/cloud-optimized-geospatial-formats-guide/pull/139
https://discourse.pangeo.io/t/stac-and-earth-systems-datasets/1472
https://discourse.pangeo.io/t/pangeo-showcase-high-performance-python-stac-tooling-backed-by-rust-feb-5-2025/4847
https://github.com/cloudnativegeo/cloud-optimized-geospatial-formats-guide/issues/134
https://github.com/stac-utils/xpystac/pull/33#issuecomment-1785892112