Kerchunk in Practice

In this notebook, we demonstrate how to create a kerchunk reference file for a single publicly available NetCDF file, then for multiple files, and how to open the resulting kerchunk store with xarray.

NetCDF files generally work with kerchunk, but some nested data structures and data types that HDF5 allows are not supported. A future release of this guide will provide further information and/or resources on kerchunk's limitations.

Environment

The packages needed for this notebook can be installed with conda or mamba. Using the environment.yml from this folder, run:

conda env create -f environment.yml

or

mamba env create -f environment.yml

This notebook has been tested to work with the conda environment defined in that file; a rough sketch of its contents follows.
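
This is a hypothetical sketch of what that environment.yml might contain, with package names inferred from the imports used in this notebook (version pins omitted; defer to the actual file in this folder):

name: kerchunk-demo   # hypothetical name; use the environment.yml from this folder
channels:
  - conda-forge
dependencies:
  - python
  - xarray
  - zarr
  - kerchunk
  - fsspec
  - s3fs          # S3 support for fsspec
  - h5py          # required by kerchunk's SingleHdf5ToZarr
  - imagecodecs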

How to create a kerchunk store

We can use the publicly available NEX GDDP CMIP6 dataset for this example. This dataset is provided by NASA and publicly available on AWS S3. You can browse that data in the nex-gddp-cmip6 file browser.

import json
import os
from tempfile import TemporaryDirectory

import fsspec
import imagecodecs.numcodecs
import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr

# Register imagecodecs' numcodecs codecs so zarr can decode HDF5 filters that rely on them.
imagecodecs.numcodecs.register_codecs()

# Set variables
## Since there are a number of CMIP6 models and variables to choose from, we store the model and variable selections in variables.
model = "ACCESS-CM2"
# `tasmax` is daily-maximum near-surface air temperature, see https://pcmdi.llnl.gov/mips/cmip3/variableList.html.
variable = "tasmax"
## Note we are only reading historical data here, but model data is available for SSPs (Shared Socioeconomic Pathways) as well.
## SSPs are scenarios used to model the future, so SSP files project environmental variables into the future.
s3_path = f"s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/{model}/historical/r1i1p1*/{variable}/*"

# Initiate fsspec filesystem for reading.
## We set anon=True because this specific dataset on AWS does not require users to be logged in to access.
fs_read = fsspec.filesystem("s3", anon=True)

# Retrieve list of available data.
file_paths = fs_read.glob(s3_path)
print(f"{len(file_paths)} discovered from {s3_path}")
65 discovered from s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1*/tasmax/*

To start, we are just going to create a single reference file for a single NetCDF file.

netcdf_file = file_paths[0]
netcdf_file
'nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc'
# Define a function to generate the kerchunk file so we can use it again for other files.
def generate_json_reference(input_file, temp_dir: str):
    """
    Use kerchunk's `SingleHdf5ToZarr` class to create a kerchunk index from a NetCDF file.
    """
    with fs_read.open(input_file, mode="rb") as infile:
        print(f"Running kerchunk generation for {input_file}...")
        # Chunks smaller than `inline_threshold` (bytes) are stored directly in the
        # reference file as data, rather than as a URL and byte range.
        h5chunks = SingleHdf5ToZarr(infile, input_file, inline_threshold=300)
        fname = input_file.split("/")[-1].removesuffix(".nc")
        outf = os.path.join(temp_dir, f"{fname}.json")
        with open(outf, "wb") as f:
            f.write(json.dumps(h5chunks.translate()).encode())
        return outf
# Create a temporary directory to store the .json reference files.
# Alternatively, you could write these to cloud storage.
td = TemporaryDirectory()
temp_dir = td.name
print(f"Writing single file references to {temp_dir}")
Writing single file references to /var/folders/42/5jr6891d4ds4xysz7q0rsghw0000gn/T/tmpn1bas0mo
single_kerchunk_reference = generate_json_reference(netcdf_file, temp_dir)
Running kerchunk generation for nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc...

We might also want to inspect what was just created. Below we load a few selected keys and values from the “refs” part of the kerchunk reference file.

# Read and load the JSON reference file
with open(single_kerchunk_reference, "r") as file:
    data = json.load(file)
keys_to_select = [".zgroup", "tasmax/.zarray", "tasmax/0.0.0"]

# Pretty-print the selected keys
data_to_print = {}
for key in keys_to_select:
    value = data["refs"][key]
    # Zarr metadata values are stored as JSON strings; decode them for readability.
    data_to_print[key] = json.loads(value) if isinstance(value, str) else value
print(json.dumps(data_to_print, indent=4))
{
    ".zgroup": {
        "zarr_format": 2
    },
    "tasmax/.zarray": {
        "chunks": [
            1,
            600,
            1440
        ],
        "compressor": {
            "id": "zlib",
            "level": 5
        },
        "dtype": "<f4",
        "fill_value": 1.0000000200408773e+20,
        "filters": [
            {
                "elementsize": 4,
                "id": "shuffle"
            }
        ],
        "order": "C",
        "shape": [
            365,
            600,
            1440
        ],
        "zarr_format": 2
    },
    "tasmax/0.0.0": [
        "nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc",
        18097,
        674483
    ]
}
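
Each value under "refs" is either inlined data or a [url, offset, length] triple: the chunk tasmax/0.0.0 above lives at byte offset 18097 in the source NetCDF file and is 674483 bytes long. To convince yourself that a reference really is just a byte range, you can fetch those raw (still compressed) bytes directly with the fs_read filesystem from above; a minimal sketch:

# A chunk reference is just [url, offset, length]; grab exactly those bytes.
url, offset, length = data["refs"]["tasmax/0.0.0"]
raw = fs_read.read_block(url, offset, length)
# `raw` holds the shuffled, zlib-compressed float32 bytes of the first chunk.
print(len(raw))  # should equal `length`, i.e. 674483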

We can also check that our reference file works with xarray.

# Open the dataset as a zarr store using the fsspec reference filesystem and xarray.
# The references point at bucket-relative S3 paths, so we evaluate them with the
# (anonymous) S3 protocol.
fs_single = fsspec.filesystem(
    "reference",
    fo=single_kerchunk_reference,
    remote_protocol="s3",
    remote_options={"anon": True},
)
single_map = fs_single.get_mapper("")
ds_single = xr.open_dataset(single_map, engine="zarr", backend_kwargs=dict(consolidated=False))
ds_single
<xarray.Dataset>
Dimensions:  (lat: 600, lon: 1440, time: 365)
Coordinates:
  * lat      (lat) float64 0.0 1.23e-321 0.0 ... -3.218e+163 -3.218e+163
  * lon      (lon) float64 0.0 2.164e-314 0.0 ... -2.022e-53 -1.699e+282
  * time     (time) datetime64[ns] 1950-01-01T12:00:00 ... 1950-12-31T12:00:00
Data variables:
    tasmax   (time, lat, lon) float32 ...
Attributes: (12/22)
    Conventions:           CF-1.7
    activity:              NEX-GDDP-CMIP6
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0
    cmip6_source_id:       ACCESS-CM2
    contact:               Dr. Rama Nemani: rama.nemani@nasa.gov, Dr. Bridget...
    ...                    ...
    scenario:              historical
    source:                BCSD
    title:                 ACCESS-CM2, r1i1p1f1, historical, global downscale...
    tracking_id:           f85d4c2e-48e4-484f-aad4-6a3f30a04326
    variant_label:         r1i1p1f1
    version:               1.0

It worked! But we can do even better. What if you want to open multiple NetCDF files with xarray? Let’s create kerchunk references for 3 files and then combine them.

subset_files = file_paths[0:3]
subset_files
['nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc',
 'nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1951.nc',
 'nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1952.nc']
# Iterate through the file list to generate kerchunk reference files. This is a good use case for `dask.bag`, see: https://docs.dask.org/en/stable/bag.html.
output_files = []
for single_file in subset_files:
    out_file = generate_json_reference(single_file, temp_dir)
    output_files.append(out_file)
Running kerchunk generation for nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc...
Running kerchunk generation for nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1951.nc...
Running kerchunk generation for nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1952.nc...
output_files
['/var/folders/42/5jr6891d4ds4xysz7q0rsghw0000gn/T/tmpn1bas0mo/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.json',
 '/var/folders/42/5jr6891d4ds4xysz7q0rsghw0000gn/T/tmpn1bas0mo/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1951.json',
 '/var/folders/42/5jr6891d4ds4xysz7q0rsghw0000gn/T/tmpn1bas0mo/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1952.json']
# Combine the individual references into a single consolidated reference.
mzz = MultiZarrToZarr(
    output_files,
    remote_protocol="s3",
    remote_options={"anon": True},
    concat_dims=["time"],
    # Decode the `time` coordinate with its CF metadata so values concatenate correctly.
    coo_map={"time": "cf:time"},
    # inline_threshold=0 means: don't store any raw data inline in the reference file.
    inline_threshold=0,
)
multi_kerchunk = mzz.translate()
# Write the combined kerchunk reference to a .json file.
output_fname = os.path.join(temp_dir, f"combined_CMIP6_daily_{model}_{variable}_kerchunk.json")
with open(output_fname, "wb") as f:
    print(f"Writing combined kerchunk reference file {output_fname}")
    f.write(json.dumps(multi_kerchunk).encode())
Writing combined kerchunk reference file /var/folders/42/5jr6891d4ds4xysz7q0rsghw0000gn/T/tmpn1bas0mo/combined_CMIP6_daily_ACCESS-CM2_tasmax_kerchunk.json
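
Before opening the combined reference, we can sanity-check it in memory: with chunks of (1, 600, 1440), there should be one tasmax data chunk per day across the three years. A quick check of the in-memory reference dict (a sketch; the "refs" layout is the same version 1 structure we inspected earlier):

# Count `tasmax` data-chunk keys, skipping metadata keys like `tasmax/.zarray`.
n_time_chunks = sum(
    1
    for key in multi_kerchunk["refs"]
    if key.startswith("tasmax/") and not key.startswith("tasmax/.")
)
print(n_time_chunks)  # expect 1096 (365 + 365 + 366 days)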
# Open the combined dataset as a zarr store using the fsspec reference filesystem and xarray.
fs_multi = fsspec.filesystem(
    "reference",
    fo=multi_kerchunk,
    remote_protocol="s3",
    remote_options={"anon": True},
)
multi_map = fs_multi.get_mapper("")
ds_multi = xr.open_dataset(multi_map, engine="zarr", backend_kwargs=dict(consolidated=False))
ds_multi
<xarray.Dataset>
Dimensions:  (lat: 600, lon: 1440, time: 1096)
Coordinates:
  * lat      (lat) float64 0.0 2.164e-314 0.0 ... 2.961e-314 2.961e-314
  * lon      (lon) float64 0.0 2.164e-314 0.0 ... -6.915e+193 -4.603e+95
  * time     (time) datetime64[ns] 1950-01-01T12:00:00 ... 1952-12-31T12:00:00
Data variables:
    tasmax   (time, lat, lon) float32 ...
Attributes: (12/22)
    Conventions:           CF-1.7
    activity:              NEX-GDDP-CMIP6
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0
    cmip6_source_id:       ACCESS-CM2
    contact:               Dr. Rama Nemani: rama.nemani@nasa.gov, Dr. Bridget...
    ...                    ...
    scenario:              historical
    source:                BCSD
    title:                 ACCESS-CM2, r1i1p1f1, historical, global downscale...
    tracking_id:           f85d4c2e-48e4-484f-aad4-6a3f30a04326
    variant_label:         r1i1p1f1
    version:               1.0

Cool! Now we have 1096 days (three years: 365 + 365 + 366, since 1952 was a leap year) of data.

How to read a kerchunk store

We’ve already demonstrated how to open the datasets with Xarray:

fs_multi = fsspec.filesystem(
    "reference",
    fo=multi_kerchunk,
    remote_protocol="s3",
    remote_options={"anon": True},
)

Let’s take it line by line to understand what’s happening.

  1. fsspec.filesystem is used to open the kerchunk reference. Note that kerchunk itself does not need to be installed to read the data.
  2. The first argument to fsspec.filesystem is the protocol. For a kerchunk reference, the protocol is the string "reference".
  3. The fo argument is the set of references (a path to a JSON file or an in-memory dict) used to create a ReferenceFileSystem instance.
  4. The remote_protocol argument is the protocol of the filesystem on which the references will be evaluated (unless fs is provided). If not given, it will be derived from the first URL in the templates or references that has a protocol.
  5. The remote_options argument is passed through to that remote filesystem; here, anon=True requests anonymous access to the public S3 bucket.

Notice that the fs_multi object we’ve created is an instance of fsspec.implementations.reference.ReferenceFileSystem.

type(fs_multi)
fsspec.implementations.reference.ReferenceFileSystem

Read about all the options for a fsspec.ReferenceFileSystem in the fsspec docs.
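
Since a ReferenceFileSystem is still an ordinary fsspec filesystem, you can inspect it directly. A small sketch using the fs_multi object from above (assuming the .references attribute, which holds the raw key-to-reference mapping):

# List the top-level keys the reference filesystem exposes.
print(fs_multi.ls("", detail=False))
# Where is the second time chunk of `tasmax` stored? -> [url, offset, length]
print(fs_multi.references["tasmax/1.0.0"])
# Read a metadata key through the filesystem, just as zarr does.
print(fs_multi.cat("tasmax/.zarray").decode())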

One other common situation is loading data over HTTP (as opposed to a local filesystem or via the S3 protocol). Here’s an example from the kerchunk case studies that loads both the reference file and the data files over HTTP. Note that target_protocol tells fsspec how to fetch the reference JSON itself, while remote_protocol applies to the data it points to:

zarr_all_url = 'https://sentinel-1-global-coherence-earthbigdata.s3.us-west-2.amazonaws.com/data/wrappers/zarr-all.json'

mapper = fsspec.get_mapper(
    'reference://',
    fo=zarr_all_url,
    target_protocol='http',
    remote_protocol='http'
)
dataset = xr.open_dataset(
    mapper, engine='zarr', backend_kwargs={'consolidated': False}
)
dataset
<xarray.Dataset>
Dimensions:          (season: 4, polarization: 4, latitude: 193200,
                      longitude: 432000, coherence: 6, flightdirection: 2,
                      orbit: 175)
Coordinates:
  * coherence        (coherence) float32 6.0 12.0 18.0 24.0 36.0 48.0
  * flightdirection  (flightdirection) object 'A' 'D'
  * latitude         (latitude) float32 82.0 82.0 82.0 ... -79.0 -79.0 -79.0
  * longitude        (longitude) float32 -180.0 -180.0 -180.0 ... 180.0 180.0
  * orbit            (orbit) float64 1.0 2.0 3.0 4.0 ... 172.0 173.0 174.0 175.0
  * polarization     (polarization) object 'vv' 'vh' 'hv' 'hh'
  * season           (season) object 'winter' 'spring' 'summer' 'fall'
Data variables:
    AMP              (season, polarization, latitude, longitude) float32 ...
    COH              (season, polarization, coherence, latitude, longitude) float32 ...
    inc              (orbit, flightdirection, latitude, longitude) float32 ...
    lsmap            (orbit, flightdirection, latitude, longitude) float32 ...
    rho              (season, polarization, latitude, longitude) float32 ...
    rmse             (season, polarization, latitude, longitude) float32 ...
    tau              (season, polarization, latitude, longitude) float32 ...

Because xarray uses fsspec to read data, you can also bypass creating an fsspec object explicitly. Here’s an example of opening a kerchunk reference generated with pangeo-forge for the NOAA 1/4° daily Optimum Interpolation Sea Surface Temperature (daily OISST) Climate Data Record (CDR). The reference filesystem arguments are passed via storage_options, and chunks={} asks xarray to create dask arrays using the stored chunking (visible in the repr below).

url = "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/pangeo-forge/aws-noaa-oisst-feedstock/aws-noaa-oisst-avhrr-only.zarr/reference.json"
ds = xr.open_dataset(
    "reference://",
    engine='zarr',
    backend_kwargs={
        'consolidated': False,
        'storage_options': {
            'fo': url,
            'remote_options': {'anon': True},
            'remote_protocol': 's3'}},
    chunks={})
ds
<xarray.Dataset>
Dimensions:  (time: 15044, zlev: 1, lat: 720, lon: 1440)
Coordinates:
  * lat      (lat) float32 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float32 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
  * time     (time) datetime64[ns] 1981-09-01T12:00:00 ... 2022-11-08T12:00:00
  * zlev     (zlev) float32 0.0
Data variables:
    anom     (time, zlev, lat, lon) float32 dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray>
    err      (time, zlev, lat, lon) float32 dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray>
    ice      (time, zlev, lat, lon) float32 dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray>
    sst      (time, zlev, lat, lon) float32 dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray>
Attributes: (12/37)
    Conventions:                CF-1.6, ACDD-1.3
    cdm_data_type:              Grid
    comment:                    Data was converted from NetCDF-3 to NetCDF-4 ...
    creator_email:              oisst-help@noaa.gov
    creator_url:                https://www.ncei.noaa.gov/
    date_created:               2020-05-08T19:05:13Z
    ...                         ...
    source:                     ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
    standard_name_vocabulary:   CF Standard Name Table (v40, 25 January 2017)
    summary:                    NOAAs 1/4-degree Daily Optimum Interpolation ...
    time_coverage_end:          1981-09-01T23:59:59Z
    time_coverage_start:        1981-09-01T00:00:00Z
    title:                      NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...