Zarr in Practice
This notebook demonstrates how to create, explore, and modify a Zarr store. These concepts are explored in more detail in the official Zarr Tutorial. It also shows the use of public Zarr stores for geospatial data.
import sys
import numpy as np
import xarray as xr
import zarr
How to create a Zarr store
# Here we create a simple Zarr store.
zstore = zarr.array(np.arange(10))
This is an in-memory Zarr store. To persist it to disk, we can use zarr.save.
zarr.save("test.zarr", zstore)
We can inspect the metadata for this dataset, which gives us some interesting information. The dataset has a shape of 10 and a chunk size of 10, so we know all the data was stored in a single chunk, and it was compressed with the blosc compressor.
!cat test.zarr/.zarray
{
"chunks": [
10
],
"compressor": {
"blocksize": 0,
"clevel": 5,
"cname": "lz4",
"id": "blosc",
"shuffle": 1
},
"dtype": "<i8",
"fill_value": 0,
"filters": null,
"order": "C",
"shape": [
10
],
"zarr_format": 2
}
This was a pretty basic example. Let’s explore the other things we might want to do when creating Zarr.
How to create a group
root = zarr.group()
group1 = root.create_group('group1')
group2 = root.create_group('group2')
z1 = group1.create_dataset('ds_in_group', shape=(100,100), chunks=(10,10), dtype='i4')
z2 = group2.create_dataset('ds_in_group', shape=(1000,1000), chunks=(10,10), dtype='i4')
root.tree(expand=True)
How to Examine and Modify the Chunk Shape
If your data is sufficiently large, Zarr will choose a chunk size for you.
zarr_no_chunks = zarr.array(np.arange(100), chunks=True)
zarr_no_chunks.chunks, zarr_no_chunks.shape
((100,), (100,))
zarr_with_chunks = zarr.array(np.arange(10000000), chunks=True)
zarr_with_chunks.chunks, zarr_with_chunks.shape
((156250,), (10000000,))
For zarr_with_chunks, we see the chunks are smaller than the shape, so we know the data has been chunked. Other ways to examine the chunk structure are zarr.info and zarr.cdata_shape.
?zarr_no_chunks.cdata_shape
Type: property
String form: <property object at 0x7efde6ecfb00>
Docstring: A tuple of integers describing the number of chunks along each dimension of the array.
zarr_no_chunks.cdata_shape, zarr_with_chunks.cdata_shape
((1,), (64,))
The zarr store with chunks has 64 chunks. The number of chunks multiplied by the chunk size equals the length of the whole array.
zarr_with_chunks.cdata_shape[0] * zarr_with_chunks.chunks[0] == zarr_with_chunks.shape[0]
True
What’s the storage size of these chunks?
The default chunks are pretty small.
sys.getsizeof(zarr_with_chunks.chunk_store['0'])  # this is in bytes
8049
zarr_with_big_chunks = zarr.array(np.arange(10000000), chunks=(500000))
zarr_with_big_chunks.chunks, zarr_with_big_chunks.shape, zarr_with_big_chunks.cdata_shape
((500000,), (10000000,), (20,))
This Zarr store has 10 million values, stored in 20 chunks of 500,000 data values.
sys.getsizeof(zarr_with_big_chunks.chunk_store['0'])
24941
These chunks are still pretty small, but this is just a toy example. In the real world, you will likely want to deal in Zarr chunks of 1MB or greater, especially when dealing with remote storage options, where data is read over a network and the number of requests should be minimized.
Exploring and Modifying Data Compression
Continuing with the data from the example above, we can tell that Zarr has also compressed the data for us, using zarr.info or zarr.compressor.
zarr_with_chunks.compressor
Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
The Blosc compressor is actually a meta-compressor, meaning it wraps several different internal compressors. In this case, it applied lz4 compression. We can also explore how much space was saved by using this compression method.
zarr_with_chunks.info
Type | zarr.core.Array |
Data type | int64 |
Shape | (10000000,) |
Chunk shape | (156250,) |
Order | C |
Read-only | False |
Compressor | Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0) |
Store type | zarr.storage.KVStore |
No. bytes | 80000000 (76.3M) |
No. bytes stored | 514193 (502.1K) |
Storage ratio | 155.6 |
Chunks initialized | 64/64 |
We can see, from the storage ratio above, that compression has made our data 155 times smaller 😱 .
You can set compressor=None when creating a Zarr array to turn off this behavior, but I'm not sure why you would do that.
Let’s see what happens when we use a different compression method. We can check out a full list of numcodecs compressors here: https://numcodecs.readthedocs.io/.
from numcodecs import GZip
compressor = GZip()
zstore_gzip_compressed = zarr.array(np.arange(10000000), chunks=True, compressor=compressor)
zstore_gzip_compressed.info
Type | zarr.core.Array |
Data type | int64 |
Shape | (10000000,) |
Chunk shape | (156250,) |
Order | C |
Read-only | False |
Compressor | GZip(level=1) |
Store type | zarr.storage.KVStore |
No. bytes | 80000000 (76.3M) |
No. bytes stored | 15086009 (14.4M) |
Storage ratio | 5.3 |
Chunks initialized | 64/64 |
In this case, the storage ratio is 5.3, so not as good! How to choose a compression algorithm is a topic for future investigation.
Consolidating metadata
It’s important to consolidate metadata to minimize requests. Each group and array has its own metadata file, so in order to avoid reading the whole tree of metadata files, Zarr provides the ability to consolidate metadata into a single metadata file at the root of the store.
So far we have only been dealing with single-array Zarr stores. In this next example, we will create a Zarr store with multiple arrays and then consolidate its metadata. The speed-up with local storage is insignificant, but becomes significant when dealing with remote storage options, as we will see in the following example on accessing cloud storage.
root = zarr.group()
zarr_store = 'example.zarr'
# Let's create many groups and many arrays
num_groups, num_arrays_per_group = 100, 100
for i in range(num_groups):
    group = root.create_group(f'group-{i}')
    for j in range(num_arrays_per_group):
        group.create_dataset(f'array-{j}', shape=(1000,1000), dtype='i4')

store = zarr.DirectoryStore(zarr_store)
zarr.save(store, root)
# We don't expect it to exist yet!
!cat {zarr_store}/.zmetadata
cat: {zarr_store}/.zmetadata: No such file or directory
zarr.consolidate_metadata(zarr_store)
<zarr.core.Array (100,) <U8>
zarr.open_consolidated(zarr_store)
<zarr.core.Array (100,) <U8>
!cat {zarr_store}/.zmetadata
{
"metadata": {
".zarray": {
"chunks": [
100
],
"compressor": {
"blocksize": 0,
"clevel": 5,
"cname": "lz4",
"id": "blosc",
"shuffle": 1
},
"dtype": "<U8",
"fill_value": "",
"filters": null,
"order": "C",
"shape": [
100
],
"zarr_format": 2
}
},
"zarr_consolidated_format": 1
}
Example of Cloud-Optimized Access for this Format
Fortunately, there are many publicly accessible cloud archives of Zarr data.
Zarr provides storage backends for all of these cloud providers: Zarr Tutorial - Distributed/cloud storage.
Here are a few we are aware of:
- Zarr data in Microsoft’s Planetary Computer
- Zarr data from Google
- Amazon Sustainability Data Initiative available from Registry of Open Data on AWS - Enter “Zarr” in the Search input box.
- Pangeo-Forge Data Catalog
The Pangeo-Forge Data Catalog provides handy examples of how to open each dataset, for example, from the Global Precipitation Climatology Project (GPCP) page:
store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr'
ds = xr.open_dataset(store, engine='zarr', chunks={}, consolidated=True)
ds
<xarray.Dataset> Dimensions: (latitude: 180, nv: 2, longitude: 360, time: 9226) Coordinates: lat_bounds (latitude, nv) float32 dask.array<chunksize=(180, 2), meta=np.ndarray> * latitude (latitude) float32 -90.0 -89.0 -88.0 -87.0 ... 87.0 88.0 89.0 lon_bounds (longitude, nv) float32 dask.array<chunksize=(360, 2), meta=np.ndarray> * longitude (longitude) float32 0.0 1.0 2.0 3.0 ... 356.0 357.0 358.0 359.0 * time (time) datetime64[ns] 1996-10-01 1996-10-02 ... 2021-12-31 time_bounds (time, nv) datetime64[ns] dask.array<chunksize=(200, 2), meta=np.ndarray> Dimensions without coordinates: nv Data variables: precip (time, latitude, longitude) float32 dask.array<chunksize=(200, 180, 360), meta=np.ndarray> Attributes: (12/45) Conventions: CF-1.6, ACDD 1.3 Metadata_Conventions: CF-1.6, Unidata Dataset Discovery v1.0, NOAA ... acknowledgment: This project was supported in part by a grant... cdm_data_type: Grid cdr_program: NOAA Climate Data Record Program for satellit... cdr_variable: precipitation ... ... standard_name_vocabulary: CF Standard Name Table (v41, 22 February 2017) summary: Global Precipitation Climatology Project (GPC... time_coverage_duration: P1D time_coverage_end: 1996-10-01T23:59:59Z time_coverage_start: 1996-10-01T00:00:00Z title: Global Precipitation Climatatology Project (G...