import numpy as np
import xarray as xr
import zarr
Zarr in Practice
This notebook demonstrates how to create, explore and modify a Zarr store.
These concepts are explored in more detail in the official Zarr User Guide.
It also shows the use of public Zarr stores for geospatial data.
Environment
The packages needed for this notebook can be installed with conda or mamba. Using the environment.yml from this folder, run:
conda env create -f environment.yml
or
mamba env create -f environment.yml
Finally, activate the environment and select the corresponding kernel in the notebook (if running in Jupyter):
conda activate coguide-zarr
The notebook has been tested to work with the listed Conda environment.
How to create a Zarr store
First let’s clean up in case we are running this notebook multiple times.
!rm -rf test.zarr
!rm -rf example.zarr
Here we will create a simple local Zarr store.
store = zarr.storage.LocalStore("test.zarr")
The Zarr store does not write anything to disk until there is data in it.
!tree test.zarr
test.zarr [error opening dir]
0 directories, 0 files
Now let’s create an array. Just calling zarr.create_array
writes the array to disk.
arr = zarr.create_array(store=store, data=np.arange(10))
!tree test.zarr
test.zarr
├── c
│   └── 0
└── zarr.json

1 directory, 2 files
We can open the metadata about this Zarr array, which gives us some interesting information. The dataset has a shape
of 10 and chunk_shape
of 10, so we know all the data was stored in 1 chunk. From the codecs
we can tell that the data was compressed with the zstd
compressor.
Notice that the zarr_format
(this is the Zarr format version) and node_type
are also specified in the metadata.
!cat test.zarr/zarr.json
{
"shape": [
10
],
"data_type": "int64",
"chunk_grid": {
"name": "regular",
"configuration": {
"chunk_shape": [
10
]
}
},
"chunk_key_encoding": {
"name": "default",
"configuration": {
"separator": "/"
}
},
"fill_value": 0,
"codecs": [
{
"name": "bytes",
"configuration": {
"endian": "little"
}
},
{
"name": "zstd",
"configuration": {
"level": 0,
"checksum": false
}
}
],
"attributes": {},
"zarr_format": 3,
"node_type": "array",
"storage_transformers": []
}
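The same information is available through the Python API, so we don’t have to read the JSON by hand. A minimal sketch (attribute names taken from the zarr-python 3 Array API) using the arr object created above:
# Inspect the array's metadata programmatically rather than reading the JSON
arr.shape        # (10,)
arr.chunks       # (10,) -> all of the data fits in a single chunk
arr.compressors  # (ZstdCodec(level=0, checksum=False),)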
This was a pretty basic example. Let’s explore the other things we might want to do when creating Zarr.
How to create a group
In this example we’ll use a MemoryStore
. This is an in-memory representation of a Zarr store that has no physical representation on disk, similar to how a NumPy array is an in-memory representation of an array.
store = zarr.storage.MemoryStore()
root = zarr.create_group(store)
group1 = root.create_group('group1')
group2 = root.create_group('group2')
z1 = group1.create_array(name='array_in_group', shape=(100,100), chunks=(10,10), dtype='i4')
z2 = group2.create_array(name='array_in_group', shape=(1000,1000), chunks=(10,10), dtype='i4')
root.tree()
/
├── group1
│   └── array_in_group (100, 100) int32
└── group2
    └── array_in_group (1000, 1000) int32
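Arrays can be pulled back out of the hierarchy by indexing into the groups, so we don’t need to keep the z1 and z2 handles around. A small sketch using the objects created above:
# Retrieve an array through the group hierarchy by key
arr_back = root['group1']['array_in_group']
arr_back.shape, arr_back.chunks  # ((100, 100), (10, 10))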
How to Examine and Modify the Chunk Shape
If your data is sufficiently large, Zarr will choose a chunk size for you.
store = zarr.storage.MemoryStore()
zarr_no_chunks = zarr.create_array(store, name="no_chunks", data=np.arange(100))
zarr_no_chunks.chunks, zarr_no_chunks.shape
((100,), (100,))
zarr_with_chunks = zarr.create_array(store, name="with_chunks", data=np.arange(10000000))
zarr_with_chunks.chunks, zarr_with_chunks.shape
((156250,), (10000000,))
For zarr_with_chunks
we see the chunks are smaller than the shape, so we know the data has been chunked. Other ways to examine the chunk structure are the array’s info_complete()
method and its cdata_shape
property.
?zarr_no_chunks.cdata_shape
Type:        property
String form: <property object at 0x71251d4a9670>
Docstring:   The shape of the chunk grid for this array.
zarr_no_chunks.cdata_shape, zarr_with_chunks.cdata_shape
((1,), (64,))
The zarr store with chunks has 64 chunks. The number of chunks multiplied by the chunk size equals the length of the whole array.
zarr_with_chunks.cdata_shape[0] * zarr_with_chunks.chunks[0] == zarr_with_chunks.shape[0]
True
What’s the storage size of these chunks?
First we need to know how to access a particular chunk via its prefix. Think of a prefix as the relative path for every item in the MemoryStore
.
[p async for p in store.list()][:10]
['no_chunks/zarr.json',
'zarr.json',
'no_chunks/c/0',
'with_chunks/zarr.json',
'with_chunks/c/1',
'with_chunks/c/6',
'with_chunks/c/7',
'with_chunks/c/0',
'with_chunks/c/9',
'with_chunks/c/2']
Now we can use these prefixes to get the size of the first chunk in bytes. Note that this is the compressed size. We will talk more about compression in the next section.
await store.getsize_prefix("with_chunks/c/0")
195984
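Since getsize_prefix sums the size of every key under a prefix, pointing it at the array’s chunk directory gives the total compressed size of all 64 chunks at once (a quick sketch):
# Total compressed size (in bytes) of every chunk in "with_chunks"
await store.getsize_prefix("with_chunks/c")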
If we want the chunk size to be bigger, we can specify the chunks explicitly rather than using the default.
zarr_with_big_chunks = zarr.create_array(store, name="with_big_chunks", data=np.arange(10000000), chunks=(500000,))
zarr_with_big_chunks.chunks, zarr_with_big_chunks.shape, zarr_with_big_chunks.cdata_shape
((500000,), (10000000,), (20,))
This Zarr store has 10 million values, stored in 20 chunks of 500,000 data values.
await store.getsize_prefix("with_big_chunks/c/0")
544926
In the real world, you will likely want to deal in Zarr chunks of 1MB uncompressed or greater, especially when dealing with remote storage, where data is read over a network and the number of requests should be minimized.
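To check a chunk shape against that guideline, compute the uncompressed chunk size directly: the number of elements per chunk times the bytes per element. A quick sketch for the big-chunk array (assuming .dtype returns a NumPy dtype, as in zarr-python 3):
# Uncompressed bytes per chunk = elements per chunk * bytes per element
nbytes_per_chunk = np.prod(zarr_with_big_chunks.chunks) * zarr_with_big_chunks.dtype.itemsize
nbytes_per_chunk / 1e6  # 4.0 -> 4MB per chunk, comfortably above 1MB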
Exploring and Modifying Data Compression
Continuing with data from the example above, we can use the array’s info_complete()
method or its compressors
property to learn more about how Zarr has compressed the data for us.
zarr_with_chunks.info_complete()
Type : Array
Zarr format : 3
Data type : Int64(endianness='little')
Fill value : 0
Shape : (10000000,)
Chunk shape : (156250,)
Order : C
Read-only : False
Store type : MemoryStore
Filters : ()
Serializer : BytesCodec(endian=<Endian.little: 'little'>)
Compressors : (ZstdCodec(level=0, checksum=False),)
No. bytes : 80000000 (76.3M)
No. bytes stored : 10192187 (9.7M)
Storage ratio : 7.8
Chunks Initialized : 64
From the information above, we can see that the default compression, zstd
, is a good choice for n-dimensional data.
The Blosc
compressor is a meta-compressor, which means it wraps multiple different internal compressors. Let’s try it with lz4
compression (if available; this was the default compression for zarr-python<3).
zarr_blosc_compressed = zarr.create_array(
    store,
    name="blosc_compressed",
    data=np.arange(10000000),
    compressors=zarr.codecs.BloscCodec(cname='lz4', clevel=5, shuffle="shuffle", blocksize=0),
)
zarr_blosc_compressed.info_complete()
Type : Array
Zarr format : 3
Data type : Int64(endianness='little')
Fill value : 0
Shape : (10000000,)
Chunk shape : (156250,)
Order : C
Read-only : False
Store type : MemoryStore
Filters : ()
Serializer : BytesCodec(endian=<Endian.little: 'little'>)
Compressors : (BloscCodec(typesize=8, cname=<BloscCname.lz4: 'lz4'>, clevel=5, shuffle=<BloscShuffle.shuffle: 'shuffle'>, blocksize=0),)
No. bytes : 80000000 (76.3M)
No. bytes stored : 514573 (502.5K)
Storage ratio : 155.5
Chunks Initialized : 64
We can see, from the storage ratio above, that compression has made our data 155 times smaller 😱 .
You can set compressors=None
when creating a Zarr array to turn off all compression, but I’m not sure why you would do that.
zarr_uncompressed = zarr.create_array(
    store,
    name="not_compressed",
    data=np.arange(10000000),
    compressors=None,
)
zarr_uncompressed.info_complete()
Type : Array
Zarr format : 3
Data type : Int64(endianness='little')
Fill value : 0
Shape : (10000000,)
Chunk shape : (156250,)
Order : C
Read-only : False
Store type : MemoryStore
Filters : ()
Serializer : BytesCodec(endian=<Endian.little: 'little'>)
Compressors : ()
No. bytes : 80000000 (76.3M)
No. bytes stored : 80000511 (76.3M)
Storage ratio : 1.0
Chunks Initialized : 64
Let’s see what happens when we use a different compression method. You can check out a full list of registered Zarr codecs here: https://zarr.readthedocs.io/en/stable/api/zarr/codecs/index.html.
zarr_gzip_compressed = zarr.create_array(
    store,
    name="gzip_compressed",
    data=np.arange(10000000),
    compressors=zarr.codecs.GzipCodec(),
)
zarr_gzip_compressed.info_complete()
Type : Array
Zarr format : 3
Data type : Int64(endianness='little')
Fill value : 0
Shape : (10000000,)
Chunk shape : (156250,)
Order : C
Read-only : False
Store type : MemoryStore
Filters : ()
Serializer : BytesCodec(endian=<Endian.little: 'little'>)
Compressors : (GzipCodec(level=5),)
No. bytes : 80000000 (76.3M)
No. bytes stored : 15141856 (14.4M)
Storage ratio : 5.3
Chunks Initialized : 64
In this case, the storage ratio is 5.3, so not as good! How to choose a compression algorithm is a topic for future investigation. Read more in the Compressors section of the Zarr User Guide.
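As a starting point for that investigation, here is a hedged sketch that builds the same array with a few registered codecs and compares their storage ratios (the overwrite flag and nbytes_stored() method are assumed from the zarr-python 3 API):
# Compare storage ratios for a few codecs on the same data (illustrative only)
for label, codec in [
    ("zstd", zarr.codecs.ZstdCodec()),
    ("blosc-lz4", zarr.codecs.BloscCodec(cname="lz4", clevel=5)),
    ("gzip", zarr.codecs.GzipCodec()),
]:
    a = zarr.create_array(
        store, name=f"compare_{label}", data=np.arange(10000000),
        compressors=codec, overwrite=True,
    )
    print(label, round(a.nbytes / a.nbytes_stored(), 1))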
Consolidating metadata
It’s important to consolidate metadata to minimize requests. Each group and array has a metadata file, and when opening the Zarr store every metadata file has to be read. This is the case even if you open the file lazily (for instance using dask within xarray). In order to limit the requests needed to read the whole tree of metadata files, Zarr provides the ability to consolidate metadata at the root of the store.
So far we have only been dealing with single-array Zarr stores. In this next example, we will create a Zarr store with multiple arrays and then consolidate metadata. The speed-up with local storage is insignificant, but it becomes significant with remote storage, which we will see in the following example on accessing cloud storage.
store = zarr.storage.LocalStore("example.zarr")
root = zarr.create_group(store, path="root")

# Let's create many groups and many arrays
num_groups, num_arrays_per_group = 10, 10
for i in range(num_groups):
    group = root.create_group(f'group-{i}')
    for j in range(num_arrays_per_group):
        group.create_array(f'array-{j}', shape=(1000,1000), dtype='i4')
We don’t expect the consolidated metadata to be populated.
!cat {store.root}/zarr.json
{
"attributes": {},
"zarr_format": 3,
"consolidated_metadata": null,
"node_type": "group"
}
zarr.consolidate_metadata(store)
/home/jsignell/zarr-python/src/zarr/api/asynchronous.py:227: UserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
warnings.warn(
<Group file://example.zarr>
There is a warning about consolidated metadata not being an official part of the specification, but don’t worry about that; it’s normal for implementation to sometimes outpace specification.
Now if we check the root metadata file again we see that the consolidated metadata is populated.
!head -n100 {store.root}/zarr.json
{
"attributes": {},
"zarr_format": 3,
"consolidated_metadata": {
"kind": "inline",
"must_understand": false,
"metadata": {
"root": {
"attributes": {},
"zarr_format": 3,
"consolidated_metadata": {
"kind": "inline",
"must_understand": false,
"metadata": {}
},
"node_type": "group"
},
"root/group-1": {
"attributes": {},
"zarr_format": 3,
"consolidated_metadata": {
"kind": "inline",
"must_understand": false,
"metadata": {}
},
"node_type": "group"
},
"root/group-1/array-4": {
"shape": [
1000,
1000
],
"data_type": "int32",
"chunk_grid": {
"name": "regular",
"configuration": {
"chunk_shape": [
250,
500
]
}
},
"chunk_key_encoding": {
"name": "default",
"configuration": {
"separator": "/"
}
},
"fill_value": 0,
"codecs": [
{
"name": "bytes",
"configuration": {
"endian": "little"
}
},
{
"name": "zstd",
"configuration": {
"level": 0,
"checksum": false
}
}
],
"attributes": {},
"zarr_format": 3,
"node_type": "array",
"storage_transformers": []
},
"root/group-1/array-1": {
"shape": [
1000,
1000
],
"data_type": "int32",
"chunk_grid": {
"name": "regular",
"configuration": {
"chunk_shape": [
250,
500
]
}
},
"chunk_key_encoding": {
"name": "default",
"configuration": {
"separator": "/"
}
},
"fill_value": 0,
"codecs": [
{
"name": "bytes",
"configuration": {
"endian": "little"
}
},
{
"name": "zstd",
When opening Zarr stores that have consolidated metadata, use use_consolidated=True
.
zarr.open(store, use_consolidated=True)
<Group file://example.zarr>
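Once opened this way, the consolidated metadata travels with the root group, so child groups and arrays can be listed without any further metadata requests. A sketch, with the consolidated_metadata attribute path assumed from zarr-python 3’s GroupMetadata:
# Peek at a few of the consolidated entries attached to the root group
root = zarr.open(store, use_consolidated=True)
list(root.metadata.consolidated_metadata.metadata)[:3]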
Example of Cloud-Optimized Access for this Format
Fortunately, there are many publicly accessible cloud archives of Zarr data.
Zarr provides storage backends for many different cloud providers, as described in the Zarr Storage guide.
Here are a few we are aware of:
- Zarr data in Microsoft’s Planetary Computer
- Zarr data from Google
- Amazon Sustainability Data Initiative available from Registry of Open Data on AWS - Enter “Zarr” in the Search input box.
Once you have identified a remote Zarr store, it should be straightforward to open it using xarray.
store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr'
Tip: You might have seen chunks={}
instead of chunks="auto"
. The difference is that chunks={}
means use exactly the same chunks as the Zarr store, whereas chunks="auto"
allows chunks to be combined into meta-chunks if that produces a better task size for dask.
ds = xr.open_dataset(store, engine='zarr', chunks="auto", consolidated=True, zarr_format=2)
ds
<xarray.Dataset> Size: 2GB
Dimensions:      (latitude: 180, nv: 2, longitude: 360, time: 9226)
Coordinates:
    lat_bounds   (latitude, nv) float32 1kB dask.array<chunksize=(180, 2), meta=np.ndarray>
  * latitude     (latitude) float32 720B -90.0 -89.0 -88.0 ... 87.0 88.0 89.0
    lon_bounds   (longitude, nv) float32 3kB dask.array<chunksize=(360, 2), meta=np.ndarray>
  * longitude    (longitude) float32 1kB 0.0 1.0 2.0 3.0 ... 357.0 358.0 359.0
  * time         (time) datetime64[ns] 74kB 1996-10-01 1996-10-02 ... 2021-12-31
    time_bounds  (time, nv) datetime64[ns] 148kB dask.array<chunksize=(9226, 2), meta=np.ndarray>
Dimensions without coordinates: nv
Data variables:
    precip       (time, latitude, longitude) float32 2GB dask.array<chunksize=(400, 180, 360), meta=np.ndarray>
Attributes: (12/45)
    Conventions:               CF-1.6, ACDD 1.3
    Metadata_Conventions:      CF-1.6, Unidata Dataset Discovery v1.0, NOAA ...
    acknowledgment:            This project was supported in part by a grant...
    cdm_data_type:             Grid
    cdr_program:               NOAA Climate Data Record Program for satellit...
    cdr_variable:              precipitation
    ...                        ...
    standard_name_vocabulary:  CF Standard Name Table (v41, 22 February 2017)
    summary:                   Global Precipitation Climatology Project (GPC...
    time_coverage_duration:    P1D
    time_coverage_end:         1996-10-01T23:59:59Z
    time_coverage_start:       1996-10-01T00:00:00Z
    title:                     Global Precipitation Climatatology Project (G...
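Because the variables are dask arrays, opening the dataset only reads metadata; chunks are fetched when we compute. For example, this sketch (which requires network access) pulls just the chunks needed for one day’s global mean precipitation:
# Only the chunks overlapping time=0 are downloaded
ds.precip.isel(time=0).mean().compute()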
Microsoft’s Planetary Computer goes above and beyond, providing tutorials alongside each dataset. We recommend exploring these on your own to get an idea of what you can do with Zarr and Xarray. The following example is based on the Zarr tutorial.
This example demonstrates how to access the Daymet daily Hawaii dataset on MS Planetary Computer:
import pystac
import planetary_computer
= "https://planetarycomputer.microsoft.com/api/stac/v1/collections/daymet-daily-hi"
url = pystac.read_file(url)
collection
= collection.assets["zarr-abfs"]
asset
# planetary computer requires a special token to be added to the url
planetary_computer.sign_inplace(asset)
ds = xr.open_zarr(
    asset.href,
    **asset.extra_fields["xarray:open_kwargs"],
    storage_options=asset.extra_fields["xarray:storage_options"],
    zarr_format=2,
)
ds
<xarray.Dataset> Size: 69GB
Dimensions:                  (time: 14965, y: 584, x: 284, nv: 2)
Coordinates:
    lat                      (y, x) float32 663kB dask.array<chunksize=(584, 284), meta=np.ndarray>
    lon                      (y, x) float32 663kB dask.array<chunksize=(584, 284), meta=np.ndarray>
  * time                     (time) datetime64[ns] 120kB 1980-01-01T12:00:00 ...
  * x                        (x) float32 1kB -5.802e+06 ... -5.519e+06
  * y                        (y) float32 2kB -3.9e+04 -4e+04 ... -6.22e+05
Dimensions without coordinates: nv
Data variables:
    dayl                     (time, y, x) float32 10GB dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    lambert_conformal_conic  int16 2B ...
    prcp                     (time, y, x) float32 10GB dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    srad                     (time, y, x) float32 10GB dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    swe                      (time, y, x) float32 10GB dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    time_bnds                (time, nv) datetime64[ns] 239kB dask.array<chunksize=(365, 2), meta=np.ndarray>
    tmax                     (time, y, x) float32 10GB dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    tmin                     (time, y, x) float32 10GB dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    vp                       (time, y, x) float32 10GB dask.array<chunksize=(365, 584, 284), meta=np.ndarray>
    yearday                  (time) int16 30kB dask.array<chunksize=(365,), meta=np.ndarray>
Attributes:
    Conventions:       CF-1.6
    Version_data:      Daymet Data Version 4.0
    Version_software:  Daymet Software Version 4.0
    citation:          Please see http://daymet.ornl.gov/ for current Daymet ...
    references:        Please see http://daymet.ornl.gov/ for current informa...
    source:            Daymet Software Version 4.0
    start_year:        1980
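As before, everything stays lazy until we ask for values. For instance, a quick sketch (again requiring network access) of the mean daily maximum temperature over the first year of the record:
# Select one year of daily tmax, then reduce to a single mean value
ds.tmax.sel(time="1980").mean().compute()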