Kerchunk in Practice

In this notebook, we demonstrate how to create a kerchunk reference file for a single publicly available NetCDF file, then for multiple files, and how to open the resulting kerchunk store with xarray.

NetCDF files generally work with kerchunk, but some nested data structures and data types that HDF5 allows are not supported. A future release of this guide will provide further information and/or resources on kerchunk's limitations.

Environment

The packages needed for this notebook can be installed with conda or mamba. Using the environment.yml from this folder, run:

conda env create -f environment.yml

or

mamba env create -f environment.yml

This notebook has been tested to work with the conda environment defined in that file; a rough sketch of its contents follows.
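
This is a hypothetical sketch of what that environment.yml might contain, with package names inferred from the imports used in this notebook (version pins omitted; defer to the actual file in this folder):

name: kerchunk-demo   # hypothetical name; use the environment.yml from this folder
channels:
  - conda-forge
dependencies:
  - python
  - xarray
  - zarr
  - kerchunk
  - fsspec
  - s3fs          # S3 support for fsspec
  - h5py          # required by kerchunk's SingleHdf5ToZarr
  - imagecodecs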

How to create a kerchunk store

We can use the publicly available NEX GDDP CMIP6 dataset for this example. This dataset is provided by NASA and publicly available on AWS S3. You can browse that data in the nex-gddp-cmip6 file browser.

import json
import os
from tempfile import TemporaryDirectory

import fsspec
import imagecodecs.numcodecs
import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr

# Register imagecodecs' numcodecs codecs so zarr can decode HDF5 filters that rely on them.
imagecodecs.numcodecs.register_codecs()

# Set variables
## Since there are a number of CMIP6 models and variables to choose from, we store the model and variable selections in variables.
model = "ACCESS-CM2"
# `tasmax` is daily-maximum near-surface air temperature, see https://pcmdi.llnl.gov/mips/cmip3/variableList.html.
variable = "tasmax"
## Note we are only reading historical data here, but model data is available for SSPs (Shared Socioeconomic Pathways) as well.
## SSPs are scenarios used to model the future, so SSP files project environmental variables into the future.
s3_path = f"s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/{model}/historical/r1i1p1*/{variable}/*"

# Initiate fsspec filesystem for reading.
## We set anon=True because this specific dataset on AWS does not require users to be logged in to access.
fs_read = fsspec.filesystem("s3", anon=True)

# Retrieve list of available data.
file_paths = fs_read.glob(s3_path)
print(f"{len(file_paths)} discovered from {s3_path}")
65 discovered from s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1*/tasmax/*

To start, we are just going to create a single reference file for a single NetCDF file.

netcdf_file = file_paths[0]
netcdf_file
'nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc'
# Define a function to generate the kerchunk file so we can use it again for other files.
def generate_json_reference(input_file, temp_dir: str):
    """
    Use kerchunk's `SingleHdf5ToZarr` class to create a kerchunk index from a NetCDF file.
    """
    with fs_read.open(input_file, mode="rb") as infile:
        print(f"Running kerchunk generation for {input_file}...")
        # Chunks smaller than `inline_threshold` (bytes) are stored directly in the
        # reference file as data, rather than as a URL and byte range.
        h5chunks = SingleHdf5ToZarr(infile, input_file, inline_threshold=300)
        fname = input_file.split("/")[-1].removesuffix(".nc")
        outf = os.path.join(temp_dir, f"{fname}.json")
        with open(outf, "wb") as f:
            f.write(json.dumps(h5chunks.translate()).encode())
        return outf
# Create a temporary directory to store the .json reference files.
# Alternatively, you could write these to cloud storage.
td = TemporaryDirectory()
temp_dir = td.name
print(f"Writing single file references to {temp_dir}")
Writing single file references to /var/folders/42/5jr6891d4ds4xysz7q0rsghw0000gn/T/tmpn1bas0mo
single_kerchunk_reference = generate_json_reference(netcdf_file, temp_dir)
Running kerchunk generation for nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc...

We might also want to inspect what was just created. Below we load a few selected keys and values from the “refs” part of the kerchunk reference file.

# Read and load the JSON reference file
with open(single_kerchunk_reference, "r") as file:
    data = json.load(file)
keys_to_select = [".zgroup", "tasmax/.zarray", "tasmax/0.0.0"]

# Pretty-print the selected keys
data_to_print = {}
for key in keys_to_select:
    value = data["refs"][key]
    # Zarr metadata values are stored as JSON strings; decode them for readability.
    data_to_print[key] = json.loads(value) if isinstance(value, str) else value
print(json.dumps(data_to_print, indent=4))
{
    ".zgroup": {
        "zarr_format": 2
    },
    "tasmax/.zarray": {
        "chunks": [
            1,
            600,
            1440
        ],
        "compressor": {
            "id": "zlib",
            "level": 5
        },
        "dtype": "<f4",
        "fill_value": 1.0000000200408773e+20,
        "filters": [
            {
                "elementsize": 4,
                "id": "shuffle"
            }
        ],
        "order": "C",
        "shape": [
            365,
            600,
            1440
        ],
        "zarr_format": 2
    },
    "tasmax/0.0.0": [
        "nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc",
        18097,
        674483
    ]
}
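
Each value under "refs" is either inlined data or a [url, offset, length] triple: the chunk tasmax/0.0.0 above lives at byte offset 18097 in the source NetCDF file and is 674483 bytes long. To convince yourself that a reference really is just a byte range, you can fetch those raw (still compressed) bytes directly with the fs_read filesystem from above; a minimal sketch:

# A chunk reference is just [url, offset, length]; grab exactly those bytes.
url, offset, length = data["refs"]["tasmax/0.0.0"]
raw = fs_read.read_block(url, offset, length)
# `raw` holds the shuffled, zlib-compressed float32 bytes of the first chunk.
print(len(raw))  # should equal `length`, i.e. 674483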

We can also check that our reference file works with xarray.

# Open the dataset as a zarr store using the fsspec reference filesystem and xarray.
# The references point at bucket-relative S3 paths, so we evaluate them with the
# (anonymous) S3 protocol.
fs_single = fsspec.filesystem(
    "reference",
    fo=single_kerchunk_reference,
    remote_protocol="s3",
    remote_options={"anon": True},
)
single_map = fs_single.get_mapper("")
ds_single = xr.open_dataset(single_map, engine="zarr", backend_kwargs=dict(consolidated=False))
ds_single
<xarray.Dataset>
Dimensions:  (lat: 600, lon: 1440, time: 365)
Coordinates:
  * lat      (lat) float64 0.0 1.23e-321 0.0 ... -3.218e+163 -3.218e+163
  * lon      (lon) float64 0.0 2.164e-314 0.0 ... -2.022e-53 -1.699e+282
  * time     (time) datetime64[ns] 1950-01-01T12:00:00 ... 1950-12-31T12:00:00
Data variables:
    tasmax   (time, lat, lon) float32 ...
Attributes: (12/22)
    Conventions:           CF-1.7
    activity:              NEX-GDDP-CMIP6
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0
    cmip6_source_id:       ACCESS-CM2
    contact:               Dr. Rama Nemani: rama.nemani@nasa.gov, Dr. Bridget...
    ...                    ...
    scenario:              historical
    source:                BCSD
    title:                 ACCESS-CM2, r1i1p1f1, historical, global downscale...
    tracking_id:           f85d4c2e-48e4-484f-aad4-6a3f30a04326
    variant_label:         r1i1p1f1
    version:               1.0

It worked! But we can do even better. What if you want to open multiple NetCDF files with xarray? Let’s create kerchunk references for 3 files and then combine them.

subset_files = file_paths[0:3]
subset_files
['nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc',
 'nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1951.nc',
 'nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1952.nc']
# Iterate through the file list to generate kerchunk reference files. This is a good use case for `dask.bag`, see: https://docs.dask.org/en/stable/bag.html.
output_files = []
for single_file in subset_files:
    out_file = generate_json_reference(single_file, temp_dir)
    output_files.append(out_file)
Running kerchunk generation for nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.nc...
Running kerchunk generation for nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1951.nc...
Running kerchunk generation for nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tasmax/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1952.nc...
output_files
['/var/folders/42/5jr6891d4ds4xysz7q0rsghw0000gn/T/tmpn1bas0mo/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1950.json',
 '/var/folders/42/5jr6891d4ds4xysz7q0rsghw0000gn/T/tmpn1bas0mo/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1951.json',
 '/var/folders/42/5jr6891d4ds4xysz7q0rsghw0000gn/T/tmpn1bas0mo/tasmax_day_ACCESS-CM2_historical_r1i1p1f1_gn_1952.json']
# Combine the individual references into a single consolidated reference.
mzz = MultiZarrToZarr(
    output_files,
    remote_protocol="s3",
    remote_options={"anon": True},
    concat_dims=["time"],
    # Decode the `time` coordinate with its CF metadata so values concatenate correctly.
    coo_map={"time": "cf:time"},
    # inline_threshold=0 means: don't store any raw data inline in the reference file.
    inline_threshold=0,
)
multi_kerchunk = mzz.translate()
# Write the combined kerchunk reference to a .json file.
output_fname = os.path.join(temp_dir, f"combined_CMIP6_daily_{model}_{variable}_kerchunk.json")
with open(output_fname, "wb") as f:
    print(f"Writing combined kerchunk reference file {output_fname}")
    f.write(json.dumps(multi_kerchunk).encode())
Writing combined kerchunk reference file /var/folders/42/5jr6891d4ds4xysz7q0rsghw0000gn/T/tmpn1bas0mo/combined_CMIP6_daily_ACCESS-CM2_tasmax_kerchunk.json
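
Before opening the combined reference, we can sanity-check it in memory: with chunks of (1, 600, 1440), there should be one tasmax data chunk per day across the three years. A quick check of the in-memory reference dict (a sketch; the "refs" layout is the same version 1 structure we inspected earlier):

# Count `tasmax` data-chunk keys, skipping metadata keys like `tasmax/.zarray`.
n_time_chunks = sum(
    1
    for key in multi_kerchunk["refs"]
    if key.startswith("tasmax/") and not key.startswith("tasmax/.")
)
print(n_time_chunks)  # expect 1096 (365 + 365 + 366 days)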
# Open the combined dataset as a zarr store using the fsspec reference filesystem and xarray.
fs_multi = fsspec.filesystem(
    "reference",
    fo=multi_kerchunk,
    remote_protocol="s3",
    remote_options={"anon": True},
)
multi_map = fs_multi.get_mapper("")
ds_multi = xr.open_dataset(multi_map, engine="zarr", backend_kwargs=dict(consolidated=False))
ds_multi
<xarray.Dataset>
Dimensions:  (lat: 600, lon: 1440, time: 1096)
Coordinates:
  * lat      (lat) float64 0.0 2.164e-314 0.0 ... 2.961e-314 2.961e-314
  * lon      (lon) float64 0.0 2.164e-314 0.0 ... -6.915e+193 -4.603e+95
  * time     (time) datetime64[ns] 1950-01-01T12:00:00 ... 1952-12-31T12:00:00
Data variables:
    tasmax   (time, lat, lon) float32 ...
Attributes: (12/22)
    Conventions:           CF-1.7
    activity:              NEX-GDDP-CMIP6
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0
    cmip6_source_id:       ACCESS-CM2
    contact:               Dr. Rama Nemani: rama.nemani@nasa.gov, Dr. Bridget...
    ...                    ...
    scenario:              historical
    source:                BCSD
    title:                 ACCESS-CM2, r1i1p1f1, historical, global downscale...
    tracking_id:           f85d4c2e-48e4-484f-aad4-6a3f30a04326
    variant_label:         r1i1p1f1
    version:               1.0

Cool! Now we have 1096 days (three years: 365 + 365 + 366, since 1952 was a leap year) of data.

How to read a kerchunk store

We’ve already demonstrated how to open the datasets with Xarray:

fs_multi = fsspec.filesystem(
    "reference",
    fo=multi_kerchunk,
    remote_protocol="s3",
    remote_options={"anon": True},
)

Let’s take it line by line to understand what’s happening.

  1. fsspec.filesystem is used to open the kerchunk reference. Note that kerchunk itself does not need to be installed to read the data.
  2. The first argument to fsspec.filesystem is the protocol. For a kerchunk reference, the protocol is the string "reference".
  3. The fo argument is the set of references (a path to a JSON file or an in-memory dict) used to create a ReferenceFileSystem instance.
  4. The remote_protocol argument is the protocol of the filesystem on which the references will be evaluated (unless fs is provided). If not given, it will be derived from the first URL in the templates or references that has a protocol.
  5. The remote_options argument is passed through to that remote filesystem; here, anon=True requests anonymous access to the public S3 bucket.

Notice that the fs_multi object we’ve created is an instance of fsspec.implementations.reference.ReferenceFileSystem.

type(fs_multi)
fsspec.implementations.reference.ReferenceFileSystem

Read about all the options for a fsspec.ReferenceFileSystem in the fsspec docs.
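
Since a ReferenceFileSystem is still an ordinary fsspec filesystem, you can inspect it directly. A small sketch using the fs_multi object from above (assuming the .references attribute, which holds the raw key-to-reference mapping):

# List the top-level keys the reference filesystem exposes.
print(fs_multi.ls("", detail=False))
# Where is the second time chunk of `tasmax` stored? -> [url, offset, length]
print(fs_multi.references["tasmax/1.0.0"])
# Read a metadata key through the filesystem, just as zarr does.
print(fs_multi.cat("tasmax/.zarray").decode())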

One other common situation is loading data over HTTP (as opposed to a local filesystem or via the S3 protocol). Here’s an example from the kerchunk case studies that loads both the reference file and the data files over HTTP. Note that target_protocol tells fsspec how to fetch the reference JSON itself, while remote_protocol applies to the data it points to:

zarr_all_url = 'https://sentinel-1-global-coherence-earthbigdata.s3.us-west-2.amazonaws.com/data/wrappers/zarr-all.json'

mapper = fsspec.get_mapper(
    'reference://',
    fo=zarr_all_url,
    target_protocol='http',
    remote_protocol='http'
)
dataset = xr.open_dataset(
    mapper, engine='zarr', backend_kwargs={'consolidated': False}
)
dataset
<xarray.Dataset>
Dimensions:          (season: 4, polarization: 4, latitude: 193200,
                      longitude: 432000, coherence: 6, flightdirection: 2,
                      orbit: 175)
Coordinates:
  * coherence        (coherence) float32 6.0 12.0 18.0 24.0 36.0 48.0
  * flightdirection  (flightdirection) object 'A' 'D'
  * latitude         (latitude) float32 82.0 82.0 82.0 ... -79.0 -79.0 -79.0
  * longitude        (longitude) float32 -180.0 -180.0 -180.0 ... 180.0 180.0
  * orbit            (orbit) float64 1.0 2.0 3.0 4.0 ... 172.0 173.0 174.0 175.0
  * polarization     (polarization) object 'vv' 'vh' 'hv' 'hh'
  * season           (season) object 'winter' 'spring' 'summer' 'fall'
Data variables:
    AMP              (season, polarization, latitude, longitude) float32 ...
    COH              (season, polarization, coherence, latitude, longitude) float32 ...
    inc              (orbit, flightdirection, latitude, longitude) float32 ...
    lsmap            (orbit, flightdirection, latitude, longitude) float32 ...
    rho              (season, polarization, latitude, longitude) float32 ...
    rmse             (season, polarization, latitude, longitude) float32 ...
    tau              (season, polarization, latitude, longitude) float32 ...

Because xarray uses fsspec to read data, you can also bypass creating an fsspec object explicitly. Here’s an example of opening a kerchunk reference generated with pangeo-forge for the NOAA 1/4° daily Optimum Interpolation Sea Surface Temperature (daily OISST) Climate Data Record (CDR). The reference filesystem arguments are passed via storage_options, and chunks={} asks xarray to create dask arrays using the stored chunking (visible in the repr below).

url = "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/pangeo-forge/aws-noaa-oisst-feedstock/aws-noaa-oisst-avhrr-only.zarr/reference.json"
ds = xr.open_dataset(
    "reference://",
    engine='zarr',
    backend_kwargs={
        'consolidated': False,
        'storage_options': {
            'fo': url,
            'remote_options': {'anon': True},
            'remote_protocol': 's3'}},
    chunks={})
ds
<xarray.Dataset>
Dimensions:  (time: 15044, zlev: 1, lat: 720, lon: 1440)
Coordinates:
  * lat      (lat) float32 -89.88 -89.62 -89.38 -89.12 ... 89.38 89.62 89.88
  * lon      (lon) float32 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
  * time     (time) datetime64[ns] 1981-09-01T12:00:00 ... 2022-11-08T12:00:00
  * zlev     (zlev) float32 0.0
Data variables:
    anom     (time, zlev, lat, lon) float32 dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray>
    err      (time, zlev, lat, lon) float32 dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray>
    ice      (time, zlev, lat, lon) float32 dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray>
    sst      (time, zlev, lat, lon) float32 dask.array<chunksize=(1, 1, 720, 1440), meta=np.ndarray>
Attributes: (12/37)
    Conventions:                CF-1.6, ACDD-1.3
    cdm_data_type:              Grid
    comment:                    Data was converted from NetCDF-3 to NetCDF-4 ...
    creator_email:              oisst-help@noaa.gov
    creator_url:                https://www.ncei.noaa.gov/
    date_created:               2020-05-08T19:05:13Z
    ...                         ...
    source:                     ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfin...
    standard_name_vocabulary:   CF Standard Name Table (v40, 25 January 2017)
    summary:                    NOAAs 1/4-degree Daily Optimum Interpolation ...
    time_coverage_end:          1981-09-01T23:59:59Z
    time_coverage_start:        1981-09-01T00:00:00Z
    title:                      NOAA/NCEI 1/4 Degree Daily Optimum Interpolat...