import json
import os
from tempfile import TemporaryDirectory
import fsspec
import imagecodecs.numcodecs
import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr
imagecodecs.numcodecs.register_codecs()
Kerchunk in Practice
In this notebook, we demonstrate how to create a kerchunk reference file for one and then multiple publicly available NetCDF files and how to open a kerchunk store with xarray
.
Generally, NetCDF should work with kerchunk. Some nested data structures and data types, such as those that can exist in HDF5, won’t work with kerchunk. A future release of this guide will provide further information and/or resources on limitations of kerchunk.
Environment
The packages needed for this notebook can be installed with conda
or mamba
. Using the environment.yml
from this folder run:
conda create -f environment.yml
or
mamba create -f environment.yml
This notebook has been tested to work with the listed Conda environment.
How to create a kerchunk store
We can use the publicly available NEX GDDP CMIP6 dataset for this example. This dataset is provided by NASA and publicly available on AWS S3. You can browse that data in the nex-gddp-cmip6 file browser.
# Set variables
## Since there are a number of CMIP6 models and variables to chose from, we make the model and variable selections variables.
= "ACCESS-CM2"
model # `tasmax` is daily-maximum near-surface air temperature, see https://pcmdi.llnl.gov/mips/cmip3/variableList.html.
= "tasmax"
variable ## Note we are only reading historical data here, but model data is available for SSPs (Shared Socio-economic Pathways) as well.
## SSPs are scenarios are used to model the future, so SSP files predict environment variables into the future.
= f"s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/{model}/historical/r1i1p1*/{variable}/*"
s3_path
# Initiate fsspec filesystem for reading.
## We set anon=True because this specific dataset on AWS does not require users to be logged in to access.
= fsspec.filesystem("s3", anon=True)
fs_read
# Retrieve list of available data.
= fs_read.glob(s3_path)
file_paths print(f"{len(file_paths)} discovered from {s3_path}")
65 discovered from s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1*/tasmax/*
To start, we are just going to create a single reference file for a single NetCDF file.
= file_paths[0]
netcdf_file netcdf_file
'nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc'
# Define a function to generate the kerchunk file so we can use it again for other files.
def generate_json_reference(input_file, temp_dir: str):
"""
Use Kerchunk's `SingleHdf5ToZarr` method to create a `Kerchunk` index from a NetCDF file.
"""
with fs_read.open(input_file, **dict(mode="rb")) as infile:
print(f"Running kerchunk generation for {input_file}...")
# Chunks smaller than `inline_threshold` will be stored directly in the reference file as data (as opposed to a URL and byte range).
= SingleHdf5ToZarr(infile, input_file, inline_threshold=300)
h5chunks = input_file.split("/")[-1].strip(".nc")
fname = os.path.join(temp_dir, f"{fname}.json")
outf with open(outf, "wb") as f:
f.write(json.dumps(h5chunks.translate()).encode())return outf
# Create a temporary directory to store the .json reference files.
# Alternately, you could write these to cloud storage.
= TemporaryDirectory()
td = td.name
temp_dir print(f"Writing single file references to {temp_dir}")
Writing single file references to /var/folders/42/5jr6891d4ds4xysz7q0rsghw0000gn/T/tmpn1bas0mo
= generate_json_reference(netcdf_file, temp_dir) single_kerchunk_reference
Running kerchunk generation for nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc...
We might also want to inspect what was just created. Below we load just the first few keys and values of the “refs” part of the kerchunk reference file.
# Read and load the JSON file
with open(single_kerchunk_reference, 'r') as file:
= json.load(file)
data = ['.zgroup', 'tasmax/.zarray', 'tasmax/0.0.0']
keys_to_select
# Pretty print JSON data
= {}
data_to_print for key, value in data['refs'].items():
if key in keys_to_select:
if isinstance(value, str):
= json.loads(value)
data_to_print[key] else:
= value
data_to_print[key] print(json.dumps(data_to_print, indent=4))
{
".zgroup": {
"zarr_format": 2
},
"tasmax/.zarray": {
"chunks": [
1,
600,
1440
],
"compressor": {
"id": "zlib",
"level": 5
},
"dtype": "<f4",
"fill_value": 1.0000000200408773e+20,
"filters": [
{
"elementsize": 4,
"id": "shuffle"
}
],
"order": "C",
"shape": [
365,
600,
1440
],
"zarr_format": 2
},
"tasmax/0.0.0": [
"nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc",
18097,
674483
]
}
We can also check that our reference file works with xarray
.
# Open dataset as zarr object using fsspec reference file system and Xarray
= fsspec.filesystem(
fs_single "reference", fo=single_kerchunk_reference, remote_protocol="https"
)= fs_single.get_mapper("") single_map
= xr.open_dataset(single_map, engine="zarr", backend_kwargs=dict(consolidated=False))
ds_single ds_single
<xarray.Dataset> Dimensions: (lat: 600, lon: 1440, time: 365) Coordinates: * lat (lat) float64 0.0 1.23e-321 0.0 ... -3.218e+163 -3.218e+163 * lon (lon) float64 0.0 2.164e-314 0.0 ... -2.022e-53 -1.699e+282 * time (time) datetime64[ns] 1950-01-01T12:00:00 ... 1950-12-31T12:00:00 Data variables: tasmax (time, lat, lon) float32 ... Attributes: (12/22) Conventions: CF-1.7 activity: NEX-GDDP-CMIP6 cmip6_institution_id: CSIRO-ARCCSS cmip6_license: CC-BY-SA 4.0 cmip6_source_id: ACCESS-CM2 contact: Dr. Rama Nemani: rama.nemani@nasa.gov, Dr. Bridget... ... ... scenario: historical source: BCSD title: ACCESS-CM2, r1i1p1f1, historical, global downscale... tracking_id: f85d4c2e-48e4-484f-aad4-6a3f30a04326 variant_label: r1i1p1f1 version: 1.0
It worked! But we can do even better. What if you want to open multiple NetCDF files with xarray
? Let’s create kerchunk references for 3 files and then combine them.
= file_paths[0:3]
subset_files subset_files
['nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc',
'nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1951.nc',
'nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1952.nc']
# Iterate through filelist to generate Kerchunked files. Good use for `dask.bag`, see: https://docs.dask.org/en/stable/bag.html.
= []
output_files for single_file in subset_files:
= generate_json_reference(single_file, temp_dir)
out_file output_files.append(out_file)
Running kerchunk generation for nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc...
Running kerchunk generation for nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1951.nc...
Running kerchunk generation for nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1952.nc...
output_files
['/var/folders/42/5jr6891d4ds4xysz7q0rsghw0000gn/T/tmpn1bas0mo/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.json',
'/var/folders/42/5jr6891d4ds4xysz7q0rsghw0000gn/T/tmpn1bas0mo/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1951.json',
'/var/folders/42/5jr6891d4ds4xysz7q0rsghw0000gn/T/tmpn1bas0mo/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1952.json']
# combine individual references into single consolidated reference
= MultiZarrToZarr(
mzz
output_files,='s3',
remote_protocol={'anon': True},
remote_options=['time'],
concat_dims={'time': 'cf:time'},
coo_map# inline_threshold=0 means don't story any raw data in the kerchunk reference file.
=0
inline_threshold
)= mzz.translate() multi_kerchunk
# Write kerchunk .json record
= os.path.join(temp_dir, f"combined_CMIP6_daily_{model}_{variable}_kerchunk.json")
output_fname with open(f"{output_fname}", "wb") as f:
print(f"Writing combined kerchunk reference file {output_fname}")
f.write(json.dumps(multi_kerchunk).encode())
Writing combined kerchunk reference file /var/folders/42/5jr6891d4ds4xysz7q0rsghw0000gn/T/tmpn1bas0mo/combined_CMIP6_daily_ACCESS-CM2_tasmax_kerchunk.json
# open dataset as zarr object using fsspec reference file system and Xarray
= fsspec.filesystem(
fs_multi "reference",
=multi_kerchunk,
fo="s3"
remote_protocol
)= fs_multi.get_mapper("") multi_map
= xr.open_dataset(multi_map, engine="zarr", backend_kwargs=dict(consolidated=False))
ds_multi ds_multi
<xarray.Dataset> Dimensions: (lat: 600, lon: 1440, time: 1096) Coordinates: * lat (lat) float64 0.0 2.164e-314 0.0 ... 2.961e-314 2.961e-314 * lon (lon) float64 0.0 2.164e-314 0.0 ... -6.915e+193 -4.603e+95 * time (time) datetime64[ns] 1950-01-01T12:00:00 ... 1952-12-31T12:00:00 Data variables: tasmax (time, lat, lon) float32 ... Attributes: (12/22) Conventions: CF-1.7 activity: NEX-GDDP-CMIP6 cmip6_institution_id: CSIRO-ARCCSS cmip6_license: CC-BY-SA 4.0 cmip6_source_id: ACCESS-CM2 contact: Dr. Rama Nemani: rama.nemani@nasa.gov, Dr. Bridget... ... ... scenario: historical source: BCSD title: ACCESS-CM2, r1i1p1f1, historical, global downscale... tracking_id: f85d4c2e-48e4-484f-aad4-6a3f30a04326 variant_label: r1i1p1f1 version: 1.0
Cool! Now we have 1096 days (3 years) of data.
How to read a Kerchunk Store
We’ve already demonstrated how to open the datasets with Xarray
:
= fsspec.filesystem(
fs_multi "reference",
=multi_kerchunk,
fo="s3"
remote_protocol )
Let’s take it line by line to understand what’s happening.
fsspec.filesystem
is used to open the kerchunk reference. It is not necessary to have kerchunk installed to read data.- The first argument to
fsspec.filesystem
is the protocol. In the case of a kerchunk reference the protocol is the string"reference"
. - The
fo
argument is the set of reference files used to create aReferenceFileSystem
instance. - The
remote_protocol
argument is the protocol of the filesystem on which the references will be evaluated (unless fs is provided). If not given, will be derived from the first URL that has a protocol in the templates or in the references.
Notice how the fs_multi
object we’ve created is a fsspec.implementations.reference.ReferenceFileSystem
.
type(fs_multi)
fsspec.implementations.reference.ReferenceFileSystem
Read about all the options for a fsspec.ReferenceFileSystem
in the fsspec
docs.
One other common situation is to load data over HTTP (as opposed to a local filesystem or via the S3 protocol). Here’s an example from the kerchunk case studies that loads a reference file and data files over HTTP:
='https://sentinel-1-global-coherence-earthbigdata.s3.us-west-2.amazonaws.com/data/wrappers/zarr-all.json'
zarr_all_url
= fsspec.get_mapper(
mapper 'reference://',
=zarr_all_url,
fo='http',
target_protocol='http'
remote_protocol
)= xr.open_dataset(
dataset ='zarr', backend_kwargs={'consolidated': False}
mapper, engine
) dataset
<xarray.Dataset> Dimensions: (season: 4, polarization: 4, latitude: 193200, longitude: 432000, coherence: 6, flightdirection: 2, orbit: 175) Coordinates: * coherence (coherence) float32 6.0 12.0 18.0 24.0 36.0 48.0 * flightdirection (flightdirection) object 'A' 'D' * latitude (latitude) float32 82.0 82.0 82.0 ... -79.0 -79.0 -79.0 * longitude (longitude) float32 -180.0 -180.0 -180.0 ... 180.0 180.0 * orbit (orbit) float64 1.0 2.0 3.0 4.0 ... 172.0 173.0 174.0 175.0 * polarization (polarization) object 'vv' 'vh' 'hv' 'hh' * season (season) object 'winter' 'spring' 'summer' 'fall' Data variables: AMP (season, polarization, latitude, longitude) float32 ... COH (season, polarization, coherence, latitude, longitude) float32 ... inc (orbit, flightdirection, latitude, longitude) float32 ... lsmap (orbit, flightdirection, latitude, longitude) float32 ... rho (season, polarization, latitude, longitude) float32 ... rmse (season, polarization, latitude, longitude) float32 ... tau (season, polarization, latitude, longitude) float32 ...
Because xarray uses fsspec to read data, you can also bypass creating a fsspec object explicitly. Here’s an example using of opening a kerchunk reference generated with pangeo-forge for the NOAA 1/4° daily Optimum Interpolation Sea Surface Temperature (or daily OISST) Climate Data Record (CDR).
= "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/pangeo-forge/aws-noaa-oisst-feedstock/aws-noaa-oisst-avhrr-only.zarr/reference.json"
url = xr.open_dataset(
ds "reference://",
='zarr',
engine={
backend_kwargs'consolidated': False,
'storage_options': {
'fo': url,
'remote_options': {'anon': True},
'remote_protocol': 's3'}},
={})
chunks ds
<xarray.Dataset> Dimensions: (time: 15044, zlev: 1, lat: 720, lon: 1440) Coordinates: * lat (lat) float32 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88 * lon (lon) float32 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9 * time (time) datetime64[ns] 1981-09-01T12:00:00 ... 2022-11-08T12:00:00 * zlev (zlev) float32 0.0 Data variables: anom (time, zlev, lat, lon) float32 dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray> err (time, zlev, lat, lon) float32 dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray> ice (time, zlev, lat, lon) float32 dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray> sst (time, zlev, lat, lon) float32 dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray> Attributes: (12/37) Conventions: CF-1.6, ACDD-1.3 cdm_data_type: Grid comment: Data was converted from NetCDF-3 to NetCDF-4 ... creator_email: oisst-help@noaa.gov creator_url: https://www.ncei.noaa.gov/ date_created: 2020-05-08T19:05:13Z ... ... source: ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin... standard_name_vocabulary: CF Standard Name Table (v40, 25 January 2017) summary: NOAAs 1/4-degree Daily Optimum Interpolation ... time_coverage_end: 1981-09-01T23:59:59Z time_coverage_start: 1981-09-01T00:00:00Z title: NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...