Zarr in Practice
This notebook demonstrates how to create, explore, and modify a Zarr store. These concepts are explored in more detail in the official Zarr Tutorial. It also shows the use of public Zarr stores for geospatial data.
import sys
import numpy as np
import xarray as xr
import zarr
How to create a Zarr store
# Here we create a simple Zarr store.
zstore = zarr.array(np.arange(10))
This is an in-memory Zarr store. To persist it to disk, we can use zarr.save.
zarr.save("test.zarr", zstore)
We can inspect the metadata for this dataset, which gives us some interesting information. The dataset has a shape of 10 and a chunk size of 10, so we know all the data was stored in a single chunk, and it was compressed with the blosc compressor.
!cat test.zarr/.zarray
{
"chunks": [
10
],
"compressor": {
"blocksize": 0,
"clevel": 5,
"cname": "lz4",
"id": "blosc",
"shuffle": 1
},
"dtype": "<i8",
"fill_value": 0,
"filters": null,
"order": "C",
"shape": [
10
],
"zarr_format": 2
}
This was a pretty basic example. Let’s explore the other things we might want to do when creating Zarr.
How to create a group
root = zarr.group()
group1 = root.create_group('group1')
group2 = root.create_group('group2')
z1 = group1.create_dataset('ds_in_group', shape=(100,100), chunks=(10,10), dtype='i4')
z2 = group2.create_dataset('ds_in_group', shape=(1000,1000), chunks=(10,10), dtype='i4')
root.tree(expand=True)
How to Examine and Modify the Chunk Shape
If your data is sufficiently large, Zarr will choose a chunk size for you.
zarr_no_chunks = zarr.array(np.arange(100), chunks=True)
zarr_no_chunks.chunks, zarr_no_chunks.shape
((100,), (100,))
zarr_with_chunks = zarr.array(np.arange(10000000), chunks=True)
zarr_with_chunks.chunks, zarr_with_chunks.shape
((156250,), (10000000,))
For zarr_with_chunks, we see the chunks are smaller than the shape, so we know the data has been chunked. Other ways to examine the chunk structure are zarr.info and zarr.cdata_shape.
?zarr_no_chunks.cdata_shape
Type: property
String form: <property object at 0x7efde6ecfb00>
Docstring: A tuple of integers describing the number of chunks along each dimension of the array.
zarr_no_chunks.cdata_shape, zarr_with_chunks.cdata_shape
((1,), (64,))
The zarr store with chunks has 64 chunks. The number of chunks multiplied by the chunk size equals the length of the whole array.
zarr_with_chunks.cdata_shape[0] * zarr_with_chunks.chunks[0] == zarr_with_chunks.shape[0]
True
What’s the storage size of these chunks?
The default chunks are pretty small.
sys.getsizeof(zarr_with_chunks.chunk_store['0'])  # this is in bytes
8049
zarr_with_big_chunks = zarr.array(np.arange(10000000), chunks=(500000))
zarr_with_big_chunks.chunks, zarr_with_big_chunks.shape, zarr_with_big_chunks.cdata_shape
((500000,), (10000000,), (20,))
This Zarr store has 10 million values, stored in 20 chunks of 500,000 data values.
sys.getsizeof(zarr_with_big_chunks.chunk_store['0'])
24941
These chunks are still pretty small, but this is just a toy example. In the real world, you will likely want to deal in Zarr chunks of 1MB or greater, especially when dealing with remote storage options, where data is read over a network and the number of requests should be minimized.
Exploring and Modifying Data Compression
Continuing with the data from the example above, we can tell that Zarr has also compressed the data for us, using zarr.info or zarr.compressor.
zarr_with_chunks.compressor
Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
The Blosc compressor is actually a meta-compressor, meaning it wraps several different internal compressors. In this case, it applied lz4 compression. We can also explore how much space was saved by using this compression method.
zarr_with_chunks.info
Type | zarr.core.Array |
Data type | int64 |
Shape | (10000000,) |
Chunk shape | (156250,) |
Order | C |
Read-only | False |
Compressor | Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0) |
Store type | zarr.storage.KVStore |
No. bytes | 80000000 (76.3M) |
No. bytes stored | 514193 (502.1K) |
Storage ratio | 155.6 |
Chunks initialized | 64/64 |
We can see, from the storage ratio above, that compression has made our data 155 times smaller 😱 .
You can set compressor=None when creating a Zarr array to turn off this behavior, but I'm not sure why you would do that.
Let’s see what happens when we use a different compression method. We can check out a full list of numcodecs compressors here: https://numcodecs.readthedocs.io/.
from numcodecs import GZip
compressor = GZip()
zstore_gzip_compressed = zarr.array(np.arange(10000000), chunks=True, compressor=compressor)
zstore_gzip_compressed.info
Type | zarr.core.Array |
Data type | int64 |
Shape | (10000000,) |
Chunk shape | (156250,) |
Order | C |
Read-only | False |
Compressor | GZip(level=1) |
Store type | zarr.storage.KVStore |
No. bytes | 80000000 (76.3M) |
No. bytes stored | 15086009 (14.4M) |
Storage ratio | 5.3 |
Chunks initialized | 64/64 |
In this case, the storage ratio is 5.3, so not as good! How to choose a compression algorithm is a topic for future investigation.
Consolidating metadata
It’s important to consolidate metadata to minimize requests. Each group and array has its own metadata file, so in order to avoid reading the whole tree of metadata files, Zarr provides the ability to consolidate metadata into a single metadata file at the root of the store.
So far we have only been dealing with single-array Zarr stores. In this next example, we will create a Zarr store with multiple arrays and then consolidate its metadata. The speed-up with local storage is insignificant, but becomes significant when dealing with remote storage options, as we will see in the following example on accessing cloud storage.
root = zarr.group()
zarr_store = 'example.zarr'
# Let's create many groups and many arrays
num_groups, num_arrays_per_group = 100, 100
for i in range(num_groups):
    group = root.create_group(f'group-{i}')
    for j in range(num_arrays_per_group):
        group.create_dataset(f'array-{j}', shape=(1000,1000), dtype='i4')

store = zarr.DirectoryStore(zarr_store)
zarr.save(store, root)
# We don't expect it to exist yet!
!cat {zarr_store}/.zmetadata
cat: {zarr_store}/.zmetadata: No such file or directory
zarr.consolidate_metadata(zarr_store)
<zarr.core.Array (100,) <U8>
zarr.open_consolidated(zarr_store)
<zarr.core.Array (100,) <U8>
!cat {zarr_store}/.zmetadata
{
"metadata": {
".zarray": {
"chunks": [
100
],
"compressor": {
"blocksize": 0,
"clevel": 5,
"cname": "lz4",
"id": "blosc",
"shuffle": 1
},
"dtype": "<U8",
"fill_value": "",
"filters": null,
"order": "C",
"shape": [
100
],
"zarr_format": 2
}
},
"zarr_consolidated_format": 1
}
Example of Cloud-Optimized Access for this Format
Fortunately, there are many publicly accessible cloud archives of Zarr data.
Zarr provides storage backends for all of these cloud providers: Zarr Tutorial - Distributed/cloud storage.
Here are a few we are aware of:
- Zarr data in Microsoft’s Planetary Computer
- Zarr data from Google
- Amazon Sustainability Data Initiative available from Registry of Open Data on AWS - Enter “Zarr” in the Search input box.
- Pangeo-Forge Data Catalog
The Pangeo-Forge Data Catalog provides handy examples of how to open each dataset, for example, from the Global Precipitation Climatology Project (GPCP) page:
store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr'
ds = xr.open_dataset(store, engine='zarr', chunks={}, consolidated=True)
ds
<xarray.Dataset> Dimensions: (latitude: 180, nv: 2, longitude: 360, time: 9226) Coordinates: lat_bounds (latitude, nv) float32 dask.array<chunksize=(180, 2), meta=np.ndarray> * latitude (latitude) float32 -90.0 -89.0 -88.0 -87.0 ... 87.0 88.0 89.0 lon_bounds (longitude, nv) float32 dask.array<chunksize=(360, 2), meta=np.ndarray> * longitude (longitude) float32 0.0 1.0 2.0 3.0 ... 356.0 357.0 358.0 359.0 * time (time) datetime64[ns] 1996-10-01 1996-10-02 ... 2021-12-31 time_bounds (time, nv) datetime64[ns] dask.array<chunksize=(200, 2), meta=np.ndarray> Dimensions without coordinates: nv Data variables: precip (time, latitude, longitude) float32 dask.array<chunksize=(200, 180, 360), meta=np.ndarray> Attributes: (12/45) Conventions: CF-1.6, ACDD 1.3 Metadata_Conventions: CF-1.6, Unidata Dataset Discovery v1.0, NOAA ... acknowledgment: This project was supported in part by a grant... cdm_data_type: Grid cdr_program: NOAA Climate Data Record Program for satellit... cdr_variable: precipitation ... ... standard_name_vocabulary: CF Standard Name Table (v41, 22 February 2017) summary: Global Precipitation Climatology Project (GPC... time_coverage_duration: P1D time_coverage_end: 1996-10-01T23:59:59Z time_coverage_start: 1996-10-01T00:00:00Z title: Global Precipitation Climatatology Project (G...