Zarr in Practice

This notebook demonstrates how to create, explore and modify a Zarr store.

These concepts are explored in more detail in the official Zarr Tutorial.

It also shows the use of public Zarr stores for geospatial data.

How to create a Zarr store

import sys
import numpy as np
import xarray as xr
import zarr

# Here we create a simple Zarr array, backed by an in-memory store.
zstore = zarr.array(np.arange(10))

This array lives in an in-memory store. To persist it to disk, we can use zarr.save.

zarr.save("test.zarr", zstore)

We can inspect the metadata for this array, which gives us some interesting information. The array has a shape of 10 and a chunk size of 10, so we know all the data was stored in a single chunk, and that it was compressed with the Blosc compressor.

!cat test.zarr/.zarray 
{
    "chunks": [
        10
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dtype": "<i8",
    "fill_value": 0,
    "filters": null,
    "order": "C",
    "shape": [
        10
    ],
    "zarr_format": 2
}
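
To read the array back from disk, we can use zarr.open (a minimal sketch; mode='r' opens the store read-only):

z = zarr.open('test.zarr', mode='r')
z[:]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])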

This was a pretty basic example. Let’s explore the other things we might want to do when creating Zarr arrays and groups.

How to create a group

root = zarr.group()
group1 = root.create_group('group1')
group2 = root.create_group('group2')
z1 = group1.create_dataset('ds_in_group', shape=(100,100), chunks=(10,10), dtype='i4')
z2 = group2.create_dataset('ds_in_group', shape=(1000,1000), chunks=(10,10), dtype='i4')
root.tree(expand=True)
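
Arrays in the hierarchy can then be addressed with path-style keys (a quick sketch using the group created above):

root['group1/ds_in_group'].shape, root['group2/ds_in_group'].shape
((100, 100), (1000, 1000))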

How to Examine and Modify the Chunk Shape

If your data is sufficiently large, Zarr will choose a chunk size for you.

zarr_no_chunks = zarr.array(np.arange(100), chunks=True)
zarr_no_chunks.chunks, zarr_no_chunks.shape
((100,), (100,))
zarr_with_chunks = zarr.array(np.arange(10000000), chunks=True)
zarr_with_chunks.chunks, zarr_with_chunks.shape
((156250,), (10000000,))

For zarr_with_chunks we see the chunks are smaller than the shape, so we know the data has been chunked. Other ways to examine the chunk structure are the array's .info and .cdata_shape attributes.

?zarr_no_chunks.cdata_shape
Type:        property
String form: <property object at 0x7efde6ecfb00>
Docstring:  
A tuple of integers describing the number of chunks along each
dimension of the array.
zarr_no_chunks.cdata_shape, zarr_with_chunks.cdata_shape
((1,), (64,))

The zarr store with chunks has 64 chunks. The number of chunks multiplied by the chunk size equals the length of the whole array.

zarr_with_chunks.cdata_shape[0] * zarr_with_chunks.chunks[0] == zarr_with_chunks.shape[0]
True

What’s the storage size of these chunks?

The default chunks are pretty small.

sys.getsizeof(zarr_with_chunks.chunk_store['0']) # this is in bytes
8049
zarr_with_big_chunks = zarr.array(np.arange(10000000), chunks=(500000,))
zarr_with_big_chunks.chunks, zarr_with_big_chunks.shape, zarr_with_big_chunks.cdata_shape
((500000,), (10000000,), (20,))

This Zarr array holds 10 million values, stored in 20 chunks of 500,000 values each.

sys.getsizeof(zarr_with_big_chunks.chunk_store['0'])
24941

These chunks are still pretty small, but this is just a toy example. In the real world, you will likely want to deal in Zarr chunks of 1 MB or greater, especially with remote storage options, where data is read over a network and the number of requests should be minimized.
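
As a rough sketch of sizing chunks toward a ~1 MB target (the target value here is an assumption for illustration; real choices depend on your access patterns):

target_bytes = 1_000_000  # assumed target of ~1 MB uncompressed per chunk
itemsize = np.dtype('i8').itemsize  # 8 bytes per int64 value
chunk_len = target_bytes // itemsize  # 125,000 values per chunk
zarr_1mb_chunks = zarr.array(np.arange(10_000_000), chunks=(chunk_len,))
zarr_1mb_chunks.chunks, zarr_1mb_chunks.cdata_shape
((125000,), (80,))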

Exploring and Modifying Data Compression

Continuing with the data from the example above, we can tell that Zarr has also compressed the data for us, using the array's .info or .compressor attributes.

zarr_with_chunks.compressor
Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)

The Blosc compressor is actually a meta-compressor, wrapping several different internal compressors. In this case, it used lz4 compression. We can also explore how much space this compression method saved.

zarr_with_chunks.info
Type               : zarr.core.Array
Data type          : int64
Shape              : (10000000,)
Chunk shape        : (156250,)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : zarr.storage.KVStore
No. bytes          : 80000000 (76.3M)
No. bytes stored   : 514193 (502.1K)
Storage ratio      : 155.6
Chunks initialized : 64/64

We can see, from the storage ratio above, that compression has made our data over 155 times smaller 😱.

You can set compressor=None when creating a Zarr array to turn off this behavior, though there is rarely a reason to unless your data is already compressed or essentially incompressible.
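
For example (a quick sketch), with the compressor disabled the storage ratio reported by .info drops to roughly 1:

# No compressor, so the bytes stored are roughly the raw bytes of the array
zarr_uncompressed = zarr.array(np.arange(10000000), compressor=None)
zarr_uncompressed.info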

Let’s see what happens when we use a different compression method. We can check out a full list of numcodecs compressors here: https://numcodecs.readthedocs.io/.

from numcodecs import GZip
compressor = GZip()
zstore_gzip_compressed = zarr.array(np.arange(10000000), chunks=True, compressor=compressor)
zstore_gzip_compressed.info
Type               : zarr.core.Array
Data type          : int64
Shape              : (10000000,)
Chunk shape        : (156250,)
Order              : C
Read-only          : False
Compressor         : GZip(level=1)
Store type         : zarr.storage.KVStore
No. bytes          : 80000000 (76.3M)
No. bytes stored   : 15086009 (14.4M)
Storage ratio      : 5.3
Chunks initialized : 64/64

In this case, the storage ratio is 5.3, so not as good! How to choose a compression algorithm is a topic for future investigation.
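
As a starting point for that investigation, here is a quick sketch comparing storage ratios across a few numcodecs compressors (the compressors and levels chosen here are arbitrary):

from numcodecs import Blosc, GZip, Zstd

data = np.arange(10000000)
for compressor in [Blosc(cname='lz4'), Blosc(cname='zstd'), GZip(level=1), Zstd(level=3)]:
    z = zarr.array(data, chunks=True, compressor=compressor)
    # Storage ratio = raw bytes / stored bytes, as reported by .info
    print(compressor, round(z.nbytes / z.nbytes_stored, 1))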

Consolidating metadata

It’s important to consolidate metadata to minimize requests. Each group and array has its own metadata file, so to avoid issuing one request per file when reading the whole tree, Zarr provides the ability to consolidate all of the metadata into a single file at the root of the store.

So far we have only been dealing with single-array Zarr stores. In this next example, we will create a Zarr store with many groups and arrays and then consolidate its metadata. The speedup with local storage is insignificant, but it becomes significant with remote storage, as we will see in the following example on accessing cloud storage.

zarr_store = 'example.zarr'
# Create the root group directly in a directory store, so the groups and
# arrays are written to disk as they are created.
store = zarr.DirectoryStore(zarr_store)
root = zarr.group(store=store)
# Let's create many groups and many arrays
num_groups, num_arrays_per_group = 100, 100
for i in range(num_groups):
    group = root.create_group(f'group-{i}')
    for j in range(num_arrays_per_group):
        group.create_dataset(f'array-{j}', shape=(1000,1000), dtype='i4')

# We don't expect it to exist yet!
!cat {zarr_store}/.zmetadata
cat: example.zarr/.zmetadata: No such file or directory
zarr.consolidate_metadata(store)
<zarr.hierarchy.Group '/'>
zarr.open_consolidated(store)
<zarr.hierarchy.Group '/'>
!cat {zarr_store}/.zmetadata
{
    "metadata": {
        ".zgroup": {
            "zarr_format": 2
        },
        "group-0/.zgroup": {
            "zarr_format": 2
        },
        "group-0/array-0/.zarray": {
            ...
        },
        ...
    },
    "zarr_consolidated_format": 1
}

(Output truncated: the consolidated file contains one metadata entry for every group and array in the store.)
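
To see why this matters, we can count the metadata files a client would otherwise have to request one at a time (a quick sketch using the store we just wrote):

import pathlib
# 1 root .zgroup + 100 group-level .zgroup files + 10,000 .zarray files (plus .zmetadata)
len(list(pathlib.Path(zarr_store).rglob('.z*')))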

Example of Cloud-Optimized Access for this Format

Fortunately, there are many publicly accessible cloud archives of Zarr data.

Zarr provides storage backends for all of the major cloud providers: see Zarr Tutorial - Distributed/cloud storage.
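
For example, a public store on S3 can be opened through s3fs (a hedged sketch; the bucket path below is a hypothetical placeholder):

import s3fs

fs = s3fs.S3FileSystem(anon=True)  # anonymous access for public buckets
store = s3fs.S3Map(root='some-public-bucket/path/to/data.zarr', s3=fs)  # hypothetical path
ds = xr.open_zarr(store, consolidated=True)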

Here are a few we are aware of:

  • The Pangeo-Forge Data Catalog provides handy examples of how to open each dataset; for example, from the Global Precipitation Climatology Project (GPCP) page:

store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr'
ds = xr.open_dataset(store, engine='zarr', chunks={}, consolidated=True)
ds
<xarray.Dataset>
Dimensions:      (latitude: 180, nv: 2, longitude: 360, time: 9226)
Coordinates:
    lat_bounds   (latitude, nv) float32 dask.array<chunksize=(180, 2), meta=np.ndarray>
  * latitude     (latitude) float32 -90.0 -89.0 -88.0 -87.0 ... 87.0 88.0 89.0
    lon_bounds   (longitude, nv) float32 dask.array<chunksize=(360, 2), meta=np.ndarray>
  * longitude    (longitude) float32 0.0 1.0 2.0 3.0 ... 356.0 357.0 358.0 359.0
  * time         (time) datetime64[ns] 1996-10-01 1996-10-02 ... 2021-12-31
    time_bounds  (time, nv) datetime64[ns] dask.array<chunksize=(200, 2), meta=np.ndarray>
Dimensions without coordinates: nv
Data variables:
    precip       (time, latitude, longitude) float32 dask.array<chunksize=(200, 180, 360), meta=np.ndarray>
Attributes: (12/45)
    Conventions:                CF-1.6, ACDD 1.3
    Metadata_Conventions:       CF-1.6, Unidata Dataset Discovery v1.0, NOAA ...
    acknowledgment:             This project was supported in part by a grant...
    cdm_data_type:              Grid
    cdr_program:                NOAA Climate Data Record Program for satellit...
    cdr_variable:               precipitation
    ...                         ...
    standard_name_vocabulary:   CF Standard Name Table (v41, 22 February 2017)
    summary:                    Global Precipitation Climatology Project (GPC...
    time_coverage_duration:     P1D
    time_coverage_end:          1996-10-01T23:59:59Z
    time_coverage_start:        1996-10-01T00:00:00Z
    title:                      Global Precipitation Climatatology Project (G...
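
Because the variables are backed by Dask arrays, access is lazy: only the chunks needed for a computation are downloaded. For example (a quick sketch), the global mean precipitation for a single day:

ds['precip'].sel(time='2020-01-01').mean().compute()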
  • Microsoft’s Planetary Computer goes above and beyond, providing tutorials alongside each dataset. We recommend exploring these on your own to get an idea of what you can do with Zarr and Xarray. See all tutorials here: microsoft/PlanetaryComputerExamples. Note that this repo contains ALL tutorials, not just Zarr tutorials, so you may want to filter for Zarr.

    For example, here is some code from the Daymet Hawaii Dataset on MS Planetary Computer:

    import cartopy.crs as ccrs
    import fsspec
    import matplotlib.pyplot as plt
    import pystac
    import xarray as xr
    import warnings
    
    warnings.simplefilter("ignore", RuntimeWarning)
    url = "https://planetarycomputer.microsoft.com/api/stac/v1/collections/daymet-daily-hi"
    collection = pystac.read_file(url)
    asset = collection.assets["zarr-https"]
    store = fsspec.get_mapper(asset.href)
    ds = xr.open_zarr(store, **asset.extra_fields["xarray:open_kwargs"])
    ds
    <xarray.Dataset>
    Dimensions:                  (time: 14965, y: 584, x: 284, nv: 2)
    Coordinates:
        lat                      (y, x) float32 dask.array<chunksize=(584, 284), meta=np.ndarray>
        lon                      (y, x) float32 dask.array<chunksize=(584, 284), meta=np.ndarray>
      * time                     (time) datetime64[ns] 1980-01-01T12:00:00 ... 20...
      * x                        (x) float32 -5.802e+06 -5.801e+06 ... -5.519e+06
      * y                        (y) float32 -3.9e+04 -4e+04 ... -6.21e+05 -6.22e+05
    Dimensions without coordinates: nv
    Data variables:
        dayl                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
        lambert_conformal_conic  int16 ...
        prcp                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
        srad                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
        swe                      (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
        time_bnds                (time, nv) datetime64[ns] dask.array<chunksize=(365, 2), meta=np.ndarray>
        tmax                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
        tmin                     (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
        vp                       (time, y, x) float32 dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
        yearday                  (time) int16 dask.array<chunksize=(365,), meta=np.ndarray>
    Attributes:
        Conventions:       CF-1.6
        Version_data:      Daymet Data Version 4.0
        Version_software:  Daymet Software Version 4.0
        citation:          Please see http://daymet.ornl.gov/ for current Daymet ...
        references:        Please see http://daymet.ornl.gov/ for current informa...
        source:            Daymet Software Version 4.0
        start_year:        1980
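
    Since matplotlib is already imported above, a natural next step (a quick sketch; the variable and timestamp are taken from the repr above) is to plot one day of maximum temperature:

    # Select a single timestamp and plot the resulting 2D (y, x) field
    ds.tmax.sel(time="2009-01-01T12:00:00").plot()
    plt.show()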
Additional Resources