Zarr

Chunked, Compressed N-Dimensional Arrays

What is Zarr?

Zarr, despite its name, is not a scary format. It’s designed for data that is too big for users’ machines, but Zarr makes data small and organizes it in a way where users can take just the bits they need or distribute the load of processing lots of those bits (stored as chunks) across many machines.

The Zarr data format is a community-maintained format for large-scale n-dimensional data. A Zarr store consists of compressed and chunked n-dimensional arrays. Zarr’s flexible indexing and compatibility with object storage lends itself to parallel processing.

A Zarr chunk is Zarr’s unit of data storage. Each chunk of a Zarr array is an equally-sized block of the array within a larger Zarr store comprised of one or more arrays and array groups. These blocks or chunks of data are stored separately to make reading and updating small chunks more efficient.

Read more in the official tutorial: Zarr Tutorial

Zarr Version 2 and Version 3

Important

Zarr Version 3 is underway but not released yet, so all the examples in this guide are for Zarr Version 2 data. The concepts in this page are consistent across both Zarr Version 2 and Zarr Version 3, however some metadata field names and organization are changing from Version 2 to version 3.

Version 3 changes from Version 2:

  • dtype has been renamed to data_type,
  • chunks has been replaced with chunk_grid,
  • dimension_separator has been replaced with chunk_key_encoding,
  • order has been replaced by the transpose codec,
  • the separate filters and compressor fields been combined into the single codecs field.

Read more:

Zarr Data Organization

Arrays

Zarr arrays are similar to numpy arrays, but chunked and compressed. We will add details about chunking and compression to this guide soon.

Hierarchy via Groups

Zarr supports hierarchical organization via groups. Each node in the Zarr hierarchy is either a group or an array.

Dimensions and Shape

A Zarr array has zero or more dimensions. A Zarr array’s shape is the tuple of the length of the array in each respective dimension.

Coordinates and Indexes

Zarr indexing supports array subsetting (both reading and writing) without loading the whole array into memory. Advanced indexing operations, such as block indexing, are detailed in the Zarr tutorial: Advanced indexing.

Note

The Zarr format is language-agnostic, but this indexing reference is specific to Python.

The Xarray library provides a rich API for slicing and subselecting data. In addition to providing a positional index to subselect data, xarray supports label-based indexing. Labels, or coordinates, in the case of geospatial data, often include latitude and longitude (or y and x). These coordinates (also called names or labels) can be used to read and write data when the position is unknown.

Consolidated Metadata

Every Zarr array has its own metadata. When considering cloud storage options, where latency is high so total requests should be limited, it is important to consolidate metadata so all metadata can be read from one object.

Read more on consolidating metadata.

Zarr Data Storage

Storage

Zarr can be stored in memory, on disk, in Zip files, and in object storage like S3.

Note

Any backend that implements MutableMapping interface from the Python collections module can be used to store Zarr. Learn more and see all the options on the Storage (zarr.storage) documentation page.

As of Zarr version 2.5, Zarr store URLs can be passed to fsspec and it will create a MutableMapping automatically.

Chunking

Chunking is the process of dividing the data arrays into smaller pieces. This allows for parallel processing and efficient storage.

Once data is chunked, applications may read in 1 or many chunks. Because the data is compressed, within-chunk reads are not possible.

Compression

Zarr supports compression algorithms to support efficient storage and retrieval.

To explore these concepts in practice, see the Zarr in Practice notebook.

Other Things to Know about Zarr

What Zarr is not

Zarr is not designed for vector, point cloud or sparse data, although there is investigations into supporting a greater variety of data types.

Zarr is in Development

There are some limitations of Zarr which is why there are Zarr Enhancement Proposals.

Zarr Version 3 was itself a ZEP, which has been accepted.

Draft ZEPs are recommended reading for anyone considering creating a new Zarr store, since they address common challenges with Zarr data to date.