Zarr
Chunked, Compressed N-Dimensional Arrays
What is Zarr?
Zarr, despite its name, is not a scary format. It is designed for data that is too big for a single user's machine: Zarr compresses the data and organizes it so that users can read just the pieces they need, or distribute the processing of many pieces (stored as chunks) across many machines.
The Zarr data format is a community-maintained format for large-scale n-dimensional data. A Zarr store consists of chunked and compressed n-dimensional arrays. Zarr’s flexible indexing and compatibility with object storage lend themselves to parallel processing.
A chunk is Zarr’s unit of data storage. Each chunk of a Zarr array is an equally sized block of that array, and a Zarr store is composed of one or more arrays and groups of arrays. Chunks are normally stored as separate objects in object storage, which makes reading and updating individual chunks more efficient.
Read more in the official zarr-python user guide: Zarr User Guide
Zarr Version 2 and Version 3
Zarr Version 3 is a new specification of the same array-based data model. The concepts remain largely the same, but some metadata field names and their organization have changed. Zarr Version 3 support is included in the canonical Python implementation, zarr-python, as of January 2025 (read more in the release blog post). Zarr Version 2 data is still usable in newer versions of the zarr-python library. The examples in this guide use zarr-python >= 3 and Zarr Version 3 data unless otherwise specified.
Zarr Version 3 specification changes from Version 2:
- dtype has been renamed to data_type,
- chunks has been replaced with chunk_grid,
- dimension_separator has been replaced with chunk_key_encoding,
- order has been replaced by the transpose codec,
- multiple chunks can be stored within a single object on object storage (via the sharding codec),
- the separate filters and compressor fields have been combined into the single codecs field.
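To make the renamed fields concrete, here is a minimal sketch that rewrites a Version 2 style metadata dictionary into a Version 3 style one. The helper sketch_v3_metadata, the small dtype lookup table, and the sample values are all hypothetical and exist only for this illustration. It ignores filters and assumes little-endian data, so treat it as a picture of the field mapping rather than a real converter.

```python
import json

# Hypothetical Zarr Version 2 array metadata (the content of a ".zarray" object).
v2_meta = {
    "zarr_format": 2,
    "shape": [1000, 1000],
    "chunks": [100, 100],
    "dtype": "<f4",
    "order": "C",
    "dimension_separator": ".",
    "compressor": {"id": "gzip", "level": 5},
    "filters": None,
    "fill_value": 0.0,
}

# Only the dtypes needed for this illustration (all little-endian).
V2_TO_V3_DTYPE = {"<f4": "float32", "<f8": "float64", "<i4": "int32"}


def sketch_v3_metadata(v2):
    """Illustrate the renamed fields; this is not a real converter."""
    ndim = len(v2["shape"])
    codecs = []
    if v2["order"] == "F":
        # "order" is gone in Version 3; column-major layout becomes a transpose codec.
        codecs.append({"name": "transpose",
                       "configuration": {"order": list(range(ndim))[::-1]}})
    # Array-to-bytes step, assuming little-endian data.
    codecs.append({"name": "bytes", "configuration": {"endian": "little"}})
    if v2.get("compressor"):
        # The former "compressor" becomes one more entry in "codecs".
        compressor = dict(v2["compressor"])
        codecs.append({"name": compressor.pop("id"), "configuration": compressor})
    return {
        "zarr_format": 3,
        "node_type": "array",
        "shape": v2["shape"],
        "data_type": V2_TO_V3_DTYPE[v2["dtype"]],
        "chunk_grid": {"name": "regular",
                       "configuration": {"chunk_shape": v2["chunks"]}},
        "chunk_key_encoding": {"name": "v2",
                               "configuration": {"separator": v2["dimension_separator"]}},
        "fill_value": v2["fill_value"],
        "codecs": codecs,
    }


v3_meta = sketch_v3_metadata(v2_meta)
print(json.dumps(v3_meta, indent=2))
```

In practice zarr-python reads and writes this metadata for you; the sketch is only meant to show where each Version 2 field ends up.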
Zarr Data Organization
Arrays
Zarr arrays are similar to numpy arrays, but chunked and compressed.
Hierarchy via Groups
Zarr supports hierarchical organization via groups. Each node in the Zarr hierarchy is either a group or an array.
Dimensions and Shape
A Zarr array has zero or more dimensions. A Zarr array’s shape is the tuple of the length of the array in each respective dimension.
Coordinates and Indexes
Zarr indexing supports array subsetting (both reading and writing) without loading the whole array into memory. Advanced indexing operations, such as block indexing, are detailed in the zarr-python user guide: Advanced indexing.
The Zarr format is language-agnostic, but this indexing reference is specific to Python.
The Xarray library provides a rich API for slicing and subselecting data. In addition to positional indexing, Xarray supports label-based indexing. For geospatial data, the labels (or coordinates) often include latitude and longitude (or y and x); another common coordinate for data cubes is time. These coordinates can be used to read and write data without needing to know the positional index values.
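Here is a minimal sketch of positional versus label-based selection, assuming Xarray and NumPy are installed. The data cube is built in memory with made-up coordinate values; the same selection calls apply to data opened from a Zarr store.

```python
import numpy as np
import xarray as xr

# A tiny (time, lat, lon) cube with illustrative coordinate values.
da = xr.DataArray(
    np.arange(2 * 3 * 4, dtype="float32").reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
    coords={
        "time": np.array(["2024-01-01", "2024-01-02"], dtype="datetime64[ns]"),
        "lat": [-10.0, 0.0, 10.0],
        "lon": [0.0, 90.0, 180.0, 270.0],
    },
    name="temperature",
)

# Positional indexing: you need to know where each value sits.
by_position = da.isel(time=0, lat=1, lon=2)

# Label-based indexing: select by coordinate value instead.
by_label = da.sel(time=np.datetime64("2024-01-01"), lat=0.0, lon=180.0)

# Labels do not have to match exactly when a lookup method is given.
nearest = da.sel(lat=3.2, lon=100.0, method="nearest")

# Writing by label: set every value at lat = 10.0 to zero.
da.loc[dict(lat=10.0)] = 0.0

print(float(by_position), float(by_label), nearest.values)
```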
Consolidated Metadata
Every Zarr group and every Zarr array has its own metadata. On cloud object storage, where per-request latency is high and the total number of requests should be kept low, it helps to consolidate metadata at the root of the Zarr store so that all metadata can be read from one object.
Read more on consolidated metadata.
Zarr Data Storage
Storage
At its core Zarr is a very flexible format that does not have any requirements regarding the actual storage system. Zarr can be stored in memory, on disk, in Zip files, and in any key-value store, such as object storage like S3. Learn more in the Storage section of the Zarr specification.
Zarr data chunks do not necessarily need to be stored in the same storage system as the Zarr metadata. This is what enables virtual Zarr stores (kerchunk, icechunk) where the metadata references data in legacy chunked data formats (such as NetCDF and HDF5).
Chunking
Chunking is the process of dividing the data arrays into smaller pieces. This allows for parallel processing and efficient storage.
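Once data is chunked, applications may read one or many chunks. Because the data is compressed at the chunk level, a chunk is generally the smallest unit that can be read: retrieving even a single value means fetching and decompressing the whole chunk that contains it.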
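The cost of a read therefore depends on how many chunks a selection overlaps. The following pure-Python sketch works this out for a regular chunk grid; chunks_touched is a hypothetical helper written for this guide, not part of any library, and it assumes non-empty selections.

```python
from itertools import product


def chunks_touched(selection, chunk_shape):
    """Return the grid index of every chunk that overlaps `selection`.

    `selection` holds one (start, stop) pair per dimension, stop exclusive.
    Assumes each pair is non-empty (stop > start).
    """
    per_dim = []
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size
        last_exclusive = -(-stop // size)  # ceiling division
        per_dim.append(range(first, last_exclusive))
    return list(product(*per_dim))


# Rows 150-159 and columns 90-109 of an array with 100 x 100 chunks.
touched = chunks_touched(((150, 160), (90, 110)), (100, 100))
print(touched)
```

Here a request for 200 values overlaps two chunks, so two chunks of 10,000 values each have to be fetched and decompressed. Matching chunk shape to the expected access pattern keeps this overhead low.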
Traditionally, each chunk is stored as a separate object in object storage, but with the sharding codec in Zarr Version 3, several chunks can be stored within a single object. This is an important enhancement because it keeps Zarr hierarchies from accumulating so many objects that they become impractical to manage.
Compression
Zarr supports a range of compression algorithms, applied to each chunk, for efficient storage and retrieval.
To explore these concepts in practice, see the Zarr in Practice notebook.
Other Things to Know about Zarr
What Zarr is not
Zarr is not designed for vector, point cloud, or sparse data, although there are ongoing investigations into supporting a greater variety of data types.
Zarr is in Development
Zarr has some limitations, which is why there are Zarr Enhancement Proposals (ZEPs).
Zarr Version 3 was itself a ZEP, which has been accepted.
Draft ZEPs are recommended reading for anyone considering creating a new Zarr store, since they address common challenges with Zarr data to date.