Cloud-Optimized Geospatial Formats Overview

These slides summarize the Cloud-Optimized Geospatial Formats Guide to support presentations.

Authors + Credits: Aimee Barciauskas, Alex Mandel, Brianna Pagán, Vincent Sarago, Chris Holmes, Patrick Quinn, Matt Hanson, Ryan Abernathey

Cloud-Optimized Geospatial Formats Overview

Google Slides version of this content: Cloud-Optimized Geospatial Formats.

What Makes Cloud-Optimized Challenging?

  • No one-size-fits-all approach
  • Earth observation data may be processed into raster, vector and point cloud data types and stored in a long list of data formats and structures.
  • Optimization depends on the user.
  • Users must learn new tools, and which data is accessed and how it is accessed may differ depending on the user.
  • … hopefully only a few new methods and concepts are necessary.

What Makes Cloud-optimized Challenging?

image source: ui.josiahparry.com/spatial-analysis.html

What Makes Cloud-optimized Challenging?

There is no one-size-fits-all packaging for data, as the optimal packaging is highly use-case dependent.

Task 51 - Cloud-Optimized Format Study

Authors: Chris Durbin, Patrick Quinn, Dana Shum

What Does Cloud-Optimized Mean?

File formats are read-oriented to support:

  • Partial reads
  • Parallel reads

What Does Cloud-Optimized Mean?

  • File metadata in one read
  • When accessing data over the internet, such as when data is in cloud storage, latency is high compared with local storage, so it is preferable to fetch lots of data in fewer reads.
  • An easy win is storing the file metadata so it can be fetched in a single read; that metadata then tells a client how to read the rest of a cloud-native dataset (see the sketch after this list).
  • A cloud-native dataset is one with small addressable chunks via files, internal tiles, or both.
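
As a rough illustration of "metadata in one read", the sketch below issues a single HTTP range request for the first bytes of a hypothetical cloud-optimized file; the URL and byte count are assumptions for illustration, not from the guide. A well-organized file keeps its metadata near the start, so one small request like this is enough to plan all later reads.

```python
import requests

# Hypothetical URL; any object store or web server that honors range
# requests would behave the same way.
url = "https://example-bucket.s3.amazonaws.com/dem_cog.tif"

# Fetch only the first 16 KB. For a well-organized cloud-optimized file,
# this typically covers the header and internal metadata.
resp = requests.get(url, headers={"Range": "bytes=0-16383"})

print(resp.status_code)                    # 206 Partial Content when ranges are honored
print(len(resp.content))                   # at most 16384 bytes, not the whole file
print(resp.headers.get("Content-Range"))   # e.g. "bytes 0-16383/123456789"
```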

What Does Cloud-Optimized Mean?

  • Accessible over HTTP using range requests.
  • This makes the data compatible with object storage (an alternative to local-disk file storage) and thus accessible over HTTP from many compute instances.
  • Supports lazy access and intelligent subsetting.
  • Integrates with high-level analysis libraries and distributed frameworks (see the sketch after this list).
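
A minimal sketch of lazy access to object storage, assuming fsspec with s3fs installed and a hypothetical, anonymously readable bucket. Opening the object does not download it; bytes are fetched via range requests only as they are read.

```python
import fsspec

# Hypothetical bucket/key; anonymous S3 access is an assumption.
fs = fsspec.filesystem("s3", anon=True)

# Reads are translated into range requests against object storage
# only for the bytes actually requested.
with fs.open("example-bucket/dem_cog.tif", "rb") as f:
    header = f.read(16384)   # only this leading byte range is fetched
    print(len(header))
```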

(Diagram: higher-level libraries.)

Formats by Data Type

| Format | Data Type | Standard Status |
| --- | --- | --- |
| Cloud-Optimized GeoTIFF (COG) | Raster | OGC standard for comment |
| Zarr, Kerchunk | Multi-dimensional raster | ESDIS and OGC standards in development |
| Cloud-Optimized Point Cloud (COPC), Entwine Point Tiles (EPT) | Point clouds | No known ESDIS or OGC standard |
| FlatGeobuf, GeoParquet | Vector | No known ESDIS standard; draft OGC standard |

Formats by Adoption

| Format | Adoption | Standard Status |
| --- | --- | --- |
| Cloud-Optimized GeoTIFF (COG) | Widely adopted | OGC standard for comment |
| Zarr, Kerchunk | (Less) widely adopted, especially in specific communities | ESDIS and OGC standards in development |
| Entwine Point Tiles (EPT), Cloud-Optimized Point Cloud (COPC) | Less common (PDAL supported) | No known ESDIS or OGC standard |
| GeoParquet, FlatGeobuf | Less common (OGR supported) | No known ESDIS standard; draft OGC standard |

What are COGs?

  • COGs are raster data representing a snapshot in time of gridded data, for example digital elevation models (DEMs).
  • COGs are a de facto standard, with an Open Geospatial Consortium (OGC) standard under review.
  • The standard specifies how the GeoTIFF must be formatted, with additional requirements for internal tiling and overviews (a creation sketch follows this list).
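
A minimal sketch of producing a COG, assuming GDAL 3.1+ (which ships the "COG" output driver) and hypothetical input/output filenames. The driver writes the internal tiles and overview levels that the specification requires.

```python
from osgeo import gdal

gdal.UseExceptions()

# Hypothetical filenames; the COG driver handles tiling and overview
# generation so the output conforms to the cloud-optimized layout.
gdal.Translate(
    "dem_cog.tif",
    "dem.tif",
    format="COG",
    creationOptions=["COMPRESS=DEFLATE", "BLOCKSIZE=512"],
)
```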

What are COGs?

  • COGs have image file directories (IFDs) which are used to tell clients where to find different overview levels and data within the file.
  • Clients can use this metadata to read only the data they need to visualize or calculate.
  • This internal organization is friendly for consumption by clients issuing HTTP GET range requests (a "Range: bytes=start_offset-end_offset" header), as in the sketch after this list.
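
A minimal sketch of a partial read with rasterio, assuming a hypothetical COG URL. Opening the file reads the header/IFDs, and a windowed read only fetches the internal tiles that intersect the window.

```python
import rasterio
from rasterio.windows import Window

# Hypothetical COG URL; rasterio/GDAL access it via HTTP range requests.
url = "https://example-bucket.s3.amazonaws.com/dem_cog.tif"

with rasterio.open(url) as src:
    print(src.profile)        # metadata from the header/IFDs, read up front
    print(src.overviews(1))   # decimation factors of the internal overviews

    # Read a single 512 x 512 window of band 1; only the internal tiles
    # intersecting this window are requested from the server.
    data = src.read(1, window=Window(col_off=0, row_off=0, width=512, height=512))
    print(data.shape)
```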

What is Zarr?

  • Zarr is used to represent multidimensional raster data, or “data cubes”, for example weather data and climate model output.
  • Chunked, compressed, N-dimensional arrays.
  • The metadata is stored external to the data files themselves. The data itself is often reorganized and compressed into many files, which can be accessed according to which chunks the user is interested in (see the sketch after this list).
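
A minimal sketch of chunked access with xarray, assuming a hypothetical Zarr store in object storage; the store path, variable name, and time value are placeholders. Opening the store reads only its metadata; computing a selection fetches only the overlapping chunks.

```python
import xarray as xr

# Hypothetical Zarr store; anonymous access is an assumption.
store = "s3://example-bucket/climate-cube.zarr"

# Opening the store reads only its (often consolidated) metadata.
ds = xr.open_zarr(store, storage_options={"anon": True})
print(ds)

# Selecting a slice and computing it fetches only the chunks that
# overlap the selection, not the whole array.
subset = ds["temperature"].sel(time="2020-01-01").compute()
```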

What is Kerchunk?

  • Kerchunk is a way to create Zarr metadata for archival formats, so that you can leverage the benefits of partial and parallel reads for archives in NetCDF4, HDF5, GRIB2, TIFF, and FITS (see the sketch below).
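
A minimal sketch of the Kerchunk workflow for an HDF5/NetCDF4 file, assuming the kerchunk, fsspec/s3fs, and xarray stack and a hypothetical, anonymously readable file. The file is scanned once to build Zarr-style references (byte offsets and lengths), which are then opened like a Zarr store.

```python
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

# Hypothetical NetCDF4/HDF5 file in object storage.
src = "s3://example-bucket/archive/granule.nc"

# Scan the archival file once and record references to its internal chunks.
with fsspec.open(src, "rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, src).translate()

# Treat the references as a Zarr store; reads become byte-range requests
# against the original, unmodified file.
mapper = fsspec.get_mapper(
    "reference://",
    fo=refs,
    remote_protocol="s3",
    remote_options={"anon": True},
)
ds = xr.open_dataset(mapper, engine="zarr", backend_kwargs={"consolidated": False})
print(ds)
```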

Zarr Specs in Development

  • Zarr v2 and older specs exist; however:
  • A cross-organization GeoZarr standards working group has recently formed, organized by Brianna Pagán (NASA), with representatives from many other organizations in the industry.
  • The GeoZarr spec defines conventions for how geospatial data should be organized in a Zarr store. The spec details how Zarr DataArray and DataSet metadata, and the organization of the data itself, must be structured in order to conform as a GeoZarr archive.
  • There is a proposal for Zarr v3 that addresses challenges in language support and in storage organization, targeting high-latency reads and the volume of reads required for the many objects stored.
  • There is recent work on a Parquet alternative to JSON for indexing.

COPC (Cloud-Optimized Point Clouds)

  • Point clouds are a set of data points in space, such as gathered from LiDAR measurements.
  • COPC is a valid LAZ file.
  • Similar to COGs but for point clouds: COPC is just one file, but data is reorganized into a clustered octree instead of regularly gridded overviews (see the sketch after this list).
  • 2 key features:
    • Support for partial decompression via storage of data in a series of chunks
    • Variable-length records (VLRs) can store application-specific metadata of any kind. VLRs describe the octree structure.
  • Limitation: Not all attribute types are compatible.
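
A minimal sketch of a spatially limited COPC read, assuming the python-pdal bindings; the filename, bounds, and resolution are placeholders. The readers.copc stage walks the octree and chunk table so that only points inside the requested window are decompressed.

```python
import json
import pdal

# Hypothetical COPC file and query window.
pipeline_json = json.dumps({
    "pipeline": [
        {
            "type": "readers.copc",
            "filename": "https://example-bucket.s3.amazonaws.com/lidar.copc.laz",
            "bounds": "([636800, 637800], [851000, 852000])",
            "resolution": 2.0,
        }
    ]
})

pipeline = pdal.Pipeline(pipeline_json)
count = pipeline.execute()
points = pipeline.arrays[0]   # structured NumPy array of the selected points
print(count, points.dtype.names)
```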

FlatGeobuf

  • Vector data is traditionally stored as rows representing points, lines, or polygons with an attribute table.
  • FlatGeobuf is a single-file binary encoding format for geographic data: FlatBuffers that hold a collection of Simple Features.
  • A row-based, streamable spatial index optimizes for remote reading.
  • Developed with OGR compatibility in mind; it works with existing OGR-based APIs, e.g. in Python and R (see the sketch after this list).
  • Works with HTTP range requests, and has CDN compatibility.
  • Limitation: not compressed, specifically so that random reads remain possible.
  • Learn more: https://github.com/flatgeobuf/flatgeobuf, Kicking the Tires: Flatgeobuf
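
A minimal sketch of a bounding-box read with geopandas, assuming a hypothetical FlatGeobuf URL and the pyogrio engine; whether the read is streamed via range requests or fully downloaded depends on the engine and GDAL configuration.

```python
import geopandas as gpd

# Hypothetical FlatGeobuf file served over HTTP.
url = "https://example-bucket.s3.amazonaws.com/buildings.fgb"

# The bbox filter is handed to the OGR-based reader, which can use the
# file's spatial index to return only intersecting features.
gdf = gpd.read_file(url, bbox=(-105.3, 39.9, -105.1, 40.1), engine="pyogrio")
print(len(gdf), gdf.geom_type.unique())
```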

GeoParquet

  • Vector data is traditionally stored as rows representing points, lines, or polygons with an attribute table
  • GeoParquet defines how to store vector data in Apache Parquet, a columnar storage format (as used by many cloud data warehouses) that suits attribute queries such as “give me all points with height greater than 10m” (see the sketch after this list).
  • Highly compressed
  • Single-file or multi-file
  • Recent support added to geopandas as a distinct function; R support via geoarrow.
  • Potential for cross-language in-memory shared access.
  • Specifications for spatial indexing, projection handling, etc. are still under discussion.
  • Learn more: https://github.com/opengeospatial/geoparquet
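
A minimal sketch of a column-oriented attribute query with geopandas and pyarrow, assuming a hypothetical file and a placeholder "height" column; the same call works against object storage given an fsspec-backed path.

```python
import geopandas as gpd

# Hypothetical GeoParquet file; only the requested columns are read
# and decompressed thanks to the columnar layout.
gdf = gpd.read_parquet("points.parquet", columns=["height", "geometry"])

# "Give me all points with height greater than 10m."
tall = gdf[gdf["height"] > 10]
print(len(tall))
```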


The End?

Return to Cloud-Optimized Geospatial Formats Guide or …

Not Quite

  • These formats and their tooling are in active development
  • Some formats were not covered in detail, such as EPT, GeoPackage, TileDB, and Cloud-Optimized HDF5. This presentation was scoped to the formats known best by the authors.
  • This site will continue to be updated with new content.

References

Prior presentations and studies discussing multiple formats

Format Homepages and Explainers