Zarr in Practice
This notebook demonstrates how to create, explore and modify a Zarr store. It also shows the use of public Zarr stores for geospatial data.
These concepts are explored in more detail in the official Zarr Tutorial.
import sys
import numpy as np
import xarray as xr
import zarr
How to create a Zarr store
# Here we create a simple Zarr store.
zstore = zarr.array(np.arange(10))
This is an in-memory Zarr store. To persist it to disk, we can use zarr.save:
zarr.save("test.zarr", zstore)
We can open the metadata about this dataset, which gives us some interesting information. The dataset has a shape of 10 and chunks of 10, so we know all the data was stored in 1 chunk, and that it was compressed with the blosc compressor.
!cat test.zarr/.zarray
{
    "chunks": [
        10
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dtype": "<i8",
    "fill_value": 0,
    "filters": null,
    "order": "C",
    "shape": [
        10
    ],
    "zarr_format": 2
}
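Because the array metadata is plain JSON, it can be inspected with standard tools. The following is a minimal sketch, assuming the Zarr v2 `.zarray` layout shown above. The `describe_zarray` helper is hypothetical (it is not part of Zarr), and the JSON is an inline copy rather than being read from `test.zarr`.

```python
import json
import math

# Inline copy of a Zarr v2 .zarray document (assumed to match the file above).
ZARRAY_TEXT = """
{
    "chunks": [10],
    "compressor": {"blocksize": 0, "clevel": 5, "cname": "lz4",
                   "id": "blosc", "shuffle": 1},
    "dtype": "<i8",
    "fill_value": 0,
    "filters": null,
    "order": "C",
    "shape": [10],
    "zarr_format": 2
}
"""


def describe_zarray(text):
    """Summarize shape, chunking and compressor from .zarray JSON text."""
    meta = json.loads(text)
    # Number of chunks along each dimension: ceil(shape / chunk length).
    grid = [math.ceil(s / c) for s, c in zip(meta["shape"], meta["chunks"])]
    compressor = meta["compressor"]
    return {
        "shape": tuple(meta["shape"]),
        "chunks": tuple(meta["chunks"]),
        "n_chunks": math.prod(grid),
        "compressor": None if compressor is None else compressor["id"],
    }


summary = describe_zarray(ZARRAY_TEXT)
print(summary)
```

With a shape of 10 and chunks of 10, the summary reports a single chunk, which matches the reading of the metadata given above.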
This was a pretty basic example. Let’s explore the other things we might want to do when creating Zarr.
How to create a group
root = zarr.group()
group1 = root.create_group('group1')
group2 = root.create_group('group2')
z1 = group1.create_dataset('ds_in_group', shape=(100,100), chunks=(10,10), dtype='i4')
z2 = group2.create_dataset('ds_in_group', shape=(1000,1000), chunks=(10,10), dtype='i4')
root.tree(expand=True)
How to Examine and Modify the Chunk Shape
If your data is sufficiently large, Zarr will choose a chunk size for you.
zarr_no_chunks = zarr.array(np.arange(100), chunks=True)
zarr_no_chunks.chunks, zarr_no_chunks.shape
((100,), (100,))
zarr_with_chunks = zarr.array(np.arange(10000000), chunks=True)
zarr_with_chunks.chunks, zarr_with_chunks.shape
((156250,), (10000000,))
For zarr_with_chunks we see the chunks are smaller than the shape, so we know the data has been chunked. Other ways to examine the chunk structure are the array's info and cdata_shape properties.
Type: property
String form: <property object at 0x7efde6ecfb00>
Docstring: A tuple of integers describing the number of chunks along each dimension of the array.
zarr_no_chunks.cdata_shape, zarr_with_chunks.cdata_shape
((1,), (64,))
The zarr store with chunks has 64 chunks. The number of chunks multiplied by the chunk size equals the length of the whole array.
zarr_with_chunks.cdata_shape[0] * zarr_with_chunks.chunks[0] == zarr_with_chunks.shape[0]
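The equality above holds exactly only when the array length is a multiple of the chunk length. In general, the number of chunks along a dimension is a ceiling division, and the last chunk may be only partly filled. A minimal sketch in plain Python follows. Both helper names are hypothetical, and the `'i.j'` key format assumes the default `.` separator of the Zarr v2 format.

```python
import math


def chunk_grid_shape(shape, chunks):
    """Number of chunks along each dimension (ceiling division)."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))


def chunk_key(index, chunks):
    """Zarr v2-style chunk key ('i.j.k') of the chunk holding `index`."""
    return ".".join(str(i // c) for i, c in zip(index, chunks))


shape, chunks = (10_000_000,), (156_250,)
grid = chunk_grid_shape(shape, chunks)
first_key = chunk_key((0,), chunks)
last_key = chunk_key((9_999_999,), chunks)
uneven_grid = chunk_grid_shape((100, 100), (30, 30))

print(grid, first_key, last_key, uneven_grid)
```

Here `grid` reproduces the 64 chunks reported by cdata_shape, and the first element lives in the chunk with key `'0'`. The `(100, 100)` array with `(30, 30)` chunks needs a 4 by 4 grid even though 4 * 30 is larger than 100.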
What’s the storage size of these chunks?
The default chunks are pretty small.
sys.getsizeof(zarr_with_chunks.chunk_store['0'])  # this is in bytes
zarr_with_big_chunks = zarr.array(np.arange(10000000), chunks=(500000,))
zarr_with_big_chunks.chunks, zarr_with_big_chunks.shape, zarr_with_big_chunks.cdata_shape
((500000,), (10000000,), (20,))
This Zarr store has 10 million values, stored in 20 chunks of 500,000 data values.
sys.getsizeof(zarr_with_big_chunks.chunk_store['0'])
These chunks are still pretty small, but this is just a silly example. In the real world, you will likely want to work with Zarr chunks of 1 MB or greater, especially with remote storage options, where data is read over a network and the number of requests should be minimized.
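As a rough aid for picking a chunk size, the sketch below estimates the uncompressed size of a chunk as the number of elements times the item size (8 bytes for int64). The helper names are hypothetical. Note that the sys.getsizeof calls above measure the stored, compressed bytes, which for this highly compressible data are far smaller than the uncompressed chunk.

```python
def uncompressed_chunk_bytes(chunks, itemsize):
    """Uncompressed size in bytes of one full chunk."""
    n_elements = 1
    for length in chunks:
        n_elements *= length
    return n_elements * itemsize


def chunk_length_for_target(target_bytes, itemsize):
    """Largest 1-D chunk length whose uncompressed size fits in target_bytes."""
    return max(1, target_bytes // itemsize)


INT64_BYTES = 8
default_bytes = uncompressed_chunk_bytes((156_250,), INT64_BYTES)
big_bytes = uncompressed_chunk_bytes((500_000,), INT64_BYTES)
ten_mib_length = chunk_length_for_target(10 * 1024**2, INT64_BYTES)

print(default_bytes, big_bytes, ten_mib_length)
```

Under these assumptions, the default chunks hold 1,250,000 bytes before compression and the larger chunks hold 4,000,000 bytes. What ends up in storage depends on how well the data compresses.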
Exploring and Modifying Data Compression
Continuing with the data from the example above, we can see that Zarr has also compressed the data for us by inspecting the array's info or compressor property.
Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
The Blosc compressor is a meta-compressor: it wraps several different internal compressors. In this case, it used lz4 compression. We can also explore how much space was saved by using this compression method.
Type | zarr.core.Array |
Data type | int64 |
Shape | (10000000,) |
Chunk shape | (156250,) |
Order | C |
Read-only | False |
Compressor | Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0) |
Store type | |
No. bytes | 80000000 (76.3M) |
No. bytes stored | 514193 (502.1K) |
Storage ratio | 155.6 |
Chunks initialized | 64/64 |
We can see from the storage ratio above that compression has made our data about 155 times smaller 😱.
You can set compressor=None when creating a Zarr array to turn off this behavior, but I’m not sure why you would do that.
Let’s see what happens when we use a different compression method. The numcodecs documentation has a full list of available compressors.
from numcodecs import GZip
compressor = GZip()
zarr_with_gzip = zarr.array(np.arange(10000000), chunks=True, compressor=compressor)
zarr_with_gzip.info
Type | zarr.core.Array |
Data type | int64 |
Shape | (10000000,) |
Chunk shape | (156250,) |
Order | C |
Read-only | False |
Compressor | GZip(level=1) |
Store type | |
No. bytes | 80000000 (76.3M) |
No. bytes stored | 15086009 (14.4M) |
Storage ratio | 5.3 |
Chunks initialized | 64/64 |
In this case, the storage ratio is 5.3, so not as good! How to choose a compression algorithm is a topic for future investigation.
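One likely contributor to the gap is the byte shuffle that Blosc applied before compressing (shuffle=SHUFFLE in the output above), which the GZip codec does not do. The sketch below illustrates the effect using only the standard library. It uses zlib at level 1 as a stand-in for GZip(level=1), and the `byte_shuffle` helper is a hypothetical, simplified illustration, not Blosc's implementation. It also uses 100,000 values rather than 10 million so that it runs quickly, so the ratios will not match the tables above.

```python
import array
import zlib

N = 100_000   # far fewer values than above, to keep the sketch fast
ITEMSIZE = 8  # bytes per signed 64-bit integer

values = array.array("q", range(N))
raw = values.tobytes()


def byte_shuffle(data, itemsize):
    """Group the k-th byte of every element together (a byte transpose)."""
    return b"".join(data[k::itemsize] for k in range(itemsize))


plain = zlib.compress(raw, 1)
shuffled = zlib.compress(byte_shuffle(raw, ITEMSIZE), 1)

plain_ratio = len(raw) / len(plain)
shuffled_ratio = len(raw) / len(shuffled)

print(f"zlib level 1, no shuffle: {plain_ratio:.1f}x")
print(f"zlib level 1, shuffled:   {shuffled_ratio:.1f}x")
```

After shuffling, the mostly constant high-order bytes of neighbouring integers sit next to each other in long runs, which a general-purpose compressor handles much better than the interleaved layout.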
Consolidating metadata
It’s important to consolidate metadata to minimize requests. Each group and array has its own metadata file, so reading the metadata for a whole hierarchy can take many requests. To avoid this, Zarr can consolidate all metadata into a single metadata file at the root of the store.
So far we have only dealt with single-array Zarr stores. In this next example, we will create a Zarr store with multiple arrays and then consolidate its metadata. The speedup with local storage is insignificant, but it becomes significant with remote storage, as we will see in the following example on accessing cloud storage.
root = zarr.group()
zarr_store = 'example.zarr'
# Let's create many groups and many arrays
num_groups, num_arrays_per_group = 100, 100
for i in range(num_groups):
    group = root.create_group(f'group-{i}')
    for j in range(num_arrays_per_group):
        group.create_dataset(f'array-{j}', shape=(1000,1000), dtype='i4')
store = zarr.DirectoryStore(zarr_store)
zarr.save(store, root)
# We don't expect it to exist yet!
!cat {zarr_store}/.zmetadata
cat: {zarr_store}/.zmetadata: No such file or directory
<zarr.core.Array (100,) <U8>
<zarr.core.Array (100,) <U8>
!cat {zarr_store}/.zmetadata
{
    "metadata": {
        ".zarray": {
            "chunks": [
                100
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<U8",
            "fill_value": "",
            "filters": null,
            "order": "C",
            "shape": [
                100
            ],
            "zarr_format": 2
        }
    },
    "zarr_consolidated_format": 1
}
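To make the idea concrete, the sketch below mimics consolidation with the standard library alone: it walks a directory, gathers every Zarr v2 metadata file, and writes them into one `.zmetadata` document keyed by relative path. The `consolidate` helper and the tiny fake store are hypothetical and heavily abbreviated. This is not Zarr's implementation, only an illustration of why a single request can replace many.

```python
import json
import os
import tempfile

METADATA_FILES = {".zarray", ".zgroup", ".zattrs"}  # Zarr v2 metadata file names


def consolidate(store_dir):
    """Collect every metadata file under store_dir into one .zmetadata file."""
    metadata = {}
    for dirpath, _dirnames, filenames in os.walk(store_dir):
        for name in filenames:
            if name in METADATA_FILES:
                path = os.path.join(dirpath, name)
                # Keys are store-relative paths with forward slashes.
                key = os.path.relpath(path, store_dir).replace(os.sep, "/")
                with open(path) as f:
                    metadata[key] = json.load(f)
    doc = {"metadata": metadata, "zarr_consolidated_format": 1}
    with open(os.path.join(store_dir, ".zmetadata"), "w") as f:
        json.dump(doc, f, indent=4, sort_keys=True)
    return doc


def write_json(path, obj):
    with open(path, "w") as f:
        json.dump(obj, f)


# Build a tiny fake store: a root group, one subgroup, one array.
# The .zarray content is abbreviated for illustration.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "group-0", "array-0"))
    write_json(os.path.join(tmp, ".zgroup"), {"zarr_format": 2})
    write_json(os.path.join(tmp, "group-0", ".zgroup"), {"zarr_format": 2})
    write_json(
        os.path.join(tmp, "group-0", "array-0", ".zarray"),
        {"shape": [1000, 1000], "dtype": "<i4", "zarr_format": 2},
    )
    consolidated = consolidate(tmp)
    zmetadata_exists = os.path.exists(os.path.join(tmp, ".zmetadata"))

print(sorted(consolidated["metadata"]))
```

A reader that fetches only `.zmetadata` learns about every group and array in the hierarchy at once, instead of issuing one request per metadata file.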
Example of Cloud-Optimized Access for this Format
Fortunately, there are many publicly accessible cloud archives of Zarr data.
Zarr provides storage backends for each of the cloud providers listed below; see the Distributed/cloud storage section of the Zarr Tutorial.
Here are a few we are aware of:
- Zarr data in Microsoft’s Planetary Computer
- Zarr data from Google
- Amazon Sustainability Data Initiative, available from the Registry of Open Data on AWS (enter “Zarr” in the search box)
- Pangeo-Forge Data Catalog
The Pangeo-Forge Data Catalog provides handy examples of how to open each dataset, for example, from the Global Precipitation Climatology Project (GPCP) page:
store = ''
ds = xr.open_dataset(store, engine='zarr', chunks={}, consolidated=True)
ds
<xarray.Dataset>
Dimensions:      (latitude: 180, nv: 2, longitude: 360, time: 9226)
Coordinates:
    lat_bounds   (latitude, nv) float32 dask.array<chunksize=(180, 2), meta=np.ndarray>
  * latitude     (latitude) float32 -90.0 -89.0 -88.0 -87.0 ... 87.0 88.0 89.0
    lon_bounds   (longitude, nv) float32 dask.array<chunksize=(360, 2), meta=np.ndarray>
  * longitude    (longitude) float32 0.0 1.0 2.0 3.0 ... 356.0 357.0 358.0 359.0
  * time         (time) datetime64[ns] 1996-10-01 1996-10-02 ... 2021-12-31
    time_bounds  (time, nv) datetime64[ns] dask.array<chunksize=(200, 2), meta=np.ndarray>
Dimensions without coordinates: nv
Data variables:
    precip       (time, latitude, longitude) float32 dask.array<chunksize=(200, 180, 360), meta=np.ndarray>
Attributes: (12/45)
    Conventions:               CF-1.6, ACDD 1.3
    Metadata_Conventions:      CF-1.6, Unidata Dataset Discovery v1.0, NOAA ...
    acknowledgment:            This project was supported in part by a grant...
    cdm_data_type:             Grid
    cdr_program:               NOAA Climate Data Record Program for satellit...
    cdr_variable:              precipitation
    ...                        ...
    standard_name_vocabulary:  CF Standard Name Table (v41, 22 February 2017)
    summary:                   Global Precipitation Climatology Project (GPC...
    time_coverage_duration:    P1D
    time_coverage_end:         1996-10-01T23:59:59Z
    time_coverage_start:       1996-10-01T00:00:00Z
    title:                     Global Precipitation Climatatology Project (G...
Microsoft’s Planetary Computer goes above and beyond, providing tutorials alongside each dataset. We recommend exploring these on your own to get an idea of what you can do with Zarr and Xarray. See all tutorials here: microsoft/PlanetaryComputerExamples. Note, this repo contains ALL tutorials, not just Zarr tutorials, so you may want to filter for Zarr.
For example, here is some code from the Daymet Puerto Rico Dataset on MS Planetary Computer:
import cartopy.crs as ccrs
import fsspec
import matplotlib.pyplot as plt
import pystac
import xarray as xr
import warnings

warnings.simplefilter("ignore", RuntimeWarning)

url = ""
collection = pystac.read_file(url)
asset = collection.assets["zarr-https"]
store = fsspec.get_mapper(asset.href)
ds = xr.open_zarr(store, **asset.extra_fields["xarray:open_kwargs"])
ds
<xarray.Dataset>
Dimensions:                  (time: 14965, y: 584, x: 284, nv: 2)
Coordinates:
    lat                      (y, x) float32 dask.array<chunksize=(584, 284), meta=np.ndarray>
    lon                      (y, x) float32 dask.array<chunksize=(584, 284), meta=np.ndarray>
  * time                     (time) datetime64[ns] 1980-01-01T12:00:00 ... 20...
  * x                        (x) float32 -5.802e+06 -5.801e+06 ... -5.519e+06
  * y                        (y) float32 -3.9e+04 -4e+04 ... -6.21e+05 -6.22e+05
Dimensions without coordinates: nv
Data variables:
    dayl                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    lambert_conformal_conic  int16 ...
    prcp                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    srad                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    swe                      (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    time_bnds                (time, nv) datetime64[ns] dask.array<chunksize=(365, 2), meta=np.ndarray>
    tmax                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    tmin                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    vp                       (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    yearday                  (time) int16 dask.array<chunksize=(365,), meta=np.ndarray>
Attributes:
    Conventions:       CF-1.6
    Version_data:      Daymet Data Version 4.0
    Version_software:  Daymet Software Version 4.0
    citation:          Please see for current Daymet ...
    references:        Please see for current informa...
    source:            Daymet Software Version 4.0
    start_year:        1980