Glossary

This glossary aims to describe all the jargon of geospatial in the cloud! Are we missing something? Create an issue to suggest an improvement.

Amazon S3 (S3)

The object storage service offered by Amazon. Part of Amazon Web Services.

Amazon Web Services (AWS)

Cloud computing services offered by Amazon.

Archive format

A file format which stores one or more other files, possibly with compression. Examples include ZIP archives and PMTiles.

Array Dimensions

The number of axes along which an array is indexed. If an array stores a temperature value for every combination of longitude, latitude, and time, the array has three dimensions.

Asynchronous

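A manner of scaling computing in which one task can make progress while another is waiting (for example, on a network response), so that more operations can be in flight at the same time.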

Think of a glass of water. Synchronous computing is akin to having one straw: when you’ve finished drinking all you wish to drink, you give the straw to your friend for them to drink. Asynchronous computing is akin to sharing the straw between you and your friend. There’s still only one straw, but you can hand off sips. Parallel computing (like multithreading or multiprocessing) is like having two straws, where both you and your friend can drink out of the glass at the same time.
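The one-straw analogy can be sketched with Python's standard-library asyncio module. This is only an illustration: the names drink and main are made up, and asyncio.sleep(0) stands in for any point where a task would wait (such as a network request).

```python
import asyncio

async def drink(name, sips, log):
    for sip in range(sips):
        log.append(f"{name} sip {sip}")
        # Hand the "straw" back to the event loop so another task can run.
        await asyncio.sleep(0)

async def main():
    log = []
    # Both tasks share one thread (one straw) and take turns.
    await asyncio.gather(drink("you", 2, log), drink("friend", 2, log))
    return log

log = asyncio.run(main())
print(log)
```

The sips interleave even though only one task runs at any instant.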

Bandwidth

The rate at which data can be transferred over a network, typically measured in bits or bytes per second. Usually used in reference to downloading or uploading files.

See also: latency.

Chunk

A grouping of data as part of a file format.

In a COG, this refers to a slice of the full array, usually 256 pixels high by 256 pixels wide (256x256), or 512 pixels high by 512 pixels wide (512x512).

In a GeoParquet file, this refers to a slice of a group of columns, where the slice has the same number of rows in each column.

Chunk size

The size of each chunk in a file format.

The chunk size plays a large part in how efficient random access within the file can be. If the chunk size is too small, then the metadata describing the file and the chunk byte ranges will be very large, and many HTTP range requests may have to be made for each small piece desired within the file. On the other hand, if the chunk size is too large, then a reader will have to read a large amount of data even for a very small query.
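A rough sketch of this trade-off, under simplifying assumptions that are not part of any file format: a square single-band image, square chunks, one byte per pixel, no compression, and a query window aligned to the chunk grid. The function name chunk_tradeoff is hypothetical.

```python
import math

def chunk_tradeoff(image_size, chunk_size, query_size, bytes_per_pixel=1):
    """Return (chunks in file, chunks touched by the query, bytes read)."""
    chunks_per_side = math.ceil(image_size / chunk_size)
    total_chunks = chunks_per_side ** 2  # each one needs a metadata entry
    touched_per_side = math.ceil(query_size / chunk_size)
    touched = touched_per_side ** 2  # roughly one range request each
    bytes_read = touched * chunk_size * chunk_size * bytes_per_pixel
    return total_chunks, touched, bytes_read

# A 16384x16384 image queried for a 1024x1024 window.
for size in (16, 512, 8192):
    print(size, chunk_tradeoff(16384, size, 1024))
```

With 16-pixel chunks the file has over a million chunks to index and the query touches 4,096 of them. With 8,192-pixel chunks a single request suffices, but it reads 64 times more data than the query needs.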

Cloud

Computing services hosted by an external provider, where the provider pays for the upfront cost of buying hardware, earning a profit by selling services. This allows users to scale workloads efficiently because users do not need to pay large upfront costs for computers. These rented services can include compute time or object storage.

Usually refers to services hosted by Amazon, Google, or Microsoft.

Cloud-Optimized

The property of a file format to be able to read a meaningful part of the file without needing to download all of the file. In particular, this means the file can be used efficiently from cloud storage via HTTP range requests.

Cloud-Optimized GeoTIFF (COG)

An extension of GeoTIFF with well-defined internal chunking, designed for efficient random access of the contained raster data.

Cloud-Optimized Point Cloud (COPC)

A cloud-optimized file format for point cloud data.

Compression

An algorithm that makes data smaller, at the cost of having to encode data into the compressed format before saving and having to decode data out of the compressed format before usage. In most cases, the benefits of smaller file sizes when stored outweigh the time it takes to encode and decode the compressed format.

Compression can either be external or internal to a file, and can either be lossless or lossy.
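As a small illustration of the encode and decode steps, the sketch below uses Python's standard-library zlib module (which implements deflate) on made-up, highly repetitive data. Real compression ratios depend heavily on the data.

```python
import zlib

original = b"elevation,elevation,elevation," * 1000

compressed = zlib.compress(original)    # encode before saving
restored = zlib.decompress(compressed)  # decode before use

print(len(original), len(compressed))
print(restored == original)  # lossless: the exact bytes come back
```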

Content Delivery Network (CDN)

A globally-distributed network of storage servers designed to cache HTTP requests so that future requests can use the cached copy instead of asking the origin server or storage.

Coordinate Reference System (CRS)

A system that defines how coordinates map to locations on the Earth’s surface. Also called a projection.

Data Type

The data type refers to the specific encoding in which values are stored in binary. Data types can be numeric or non-numeric, including string, binary, or nested data structures. The numeric data types used most often for scientific data include:

  • Byte or Int8: signed integer with 8-bit capacity, which can hold values from -128 to 127 (inclusive).
  • Unsigned byte or Uint8: unsigned integer with 8-bit capacity, which can hold values from 0 to 255 (inclusive).
  • Short or Int16: signed integer with 16-bit capacity, which can hold values from -32,768 to 32,767 (inclusive).
  • Unsigned short or Uint16: unsigned integer with 16-bit capacity, which can hold values from 0 to 65,535 (inclusive).
  • Int or Int32: signed integer with 32-bit capacity, which can hold values from -2,147,483,648 to 2,147,483,647 (inclusive).
  • Unsigned int or Uint32: unsigned integer with 32-bit capacity, which can hold values from 0 to 4,294,967,295 (inclusive).
  • Long or Int64: signed integer with 64-bit capacity, which can hold values from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 (inclusive).
  • Unsigned long or Uint64: unsigned integer with 64-bit capacity, which can hold values from 0 to 18,446,744,073,709,551,615 (inclusive).
  • float: 32 bit floating point number.
  • double: 64 bit floating point number.

For a good explainer on how floating point numbers work, refer to this blog post.
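The integer ranges listed above follow directly from the bit width. A short sketch of the arithmetic (the helper name integer_range is made up for illustration):

```python
def integer_range(bits, signed):
    """Return the (min, max) values an integer of this width can hold."""
    if signed:
        # One bit encodes the sign, leaving bits - 1 for magnitude.
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

for bits in (8, 16, 32, 64):
    print(bits, integer_range(bits, signed=True), integer_range(bits, signed=False))
```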

Deflate

A lossless compression codec used as part of ZIP archives and internally within COG and GeoTIFF files.

Entwine Point Cloud (EPT)

A cloud-optimized file format for point cloud data. Entwine has largely been superseded by COPC because COPC is backwards-compatible with previous point cloud data formats.

EPSG code

A projection definition referring to the EPSG database. EPSG codes tend to be four or five digits and tend to be easier to remember and use than longer definitions, such as WKT strings or PROJJSON. The downside of EPSG codes is that the program needs to have the EPSG database available so that it can perform a lookup from the EPSG code to the full projection definition.

External compression

Compression that is not part of a file format’s own specification, and which is added on after the main file has been saved. This tends to be used as part of ZIP archives (with file extension .zip) or with standalone gzip compression (file extension .gz).

External compression tends to make a file no longer cloud-optimized: the entire file is usually necessary for decompression, so it is no longer possible to read part of the file without fetching all of it.

This is in contrast to internal compression.

fsspec

A Python library for abstracting across several different file storage solutions, including local file storage, cloud storage, and HTTP URLs. Allows uploading and downloading files to each backend with a consistent API.

GDAL

The Geospatial Data Abstraction Library, a widely-used open-source library for converting between different raster data formats, as well as reprojecting between coordinate reference systems.

Its common command-line tools include gdalinfo and gdal_translate. It can be used from Python with the rasterio library or from R with the terra library.

GDAL includes OGR for processing vector data.

GeoJSON

A file format for vector data, built on top of JSON. GeoJSON is a common format for transferring vector data to web browsers, because it’s easy for most programming languages to read and write, but tends to have a large size. It’s not cloud-optimized because it can’t be partially parsed; the entire file needs to be downloaded in order to use it.

GeoPackage

A file format for vector data. GeoPackage supports multiple layers as part of a single file. A GeoPackage is internally stored as a SQLite database, so it is not cloud-optimized: the entire file must be downloaded in order to read any part of it.

GeoPandas

A Python library for using and managing vector data, organized around geospatial data frames.

GeoParquet

An extension of the Parquet file format to store geospatial vector data. Can be read and written by tools including GDAL and GeoPandas.

Geospatial Data Frame

A tabular data structure for storing geospatial vector data, where each geometry is paired with one or more attributes in a given row.

A geospatial data frame structure works best when every feature has the same range of attributes, such as when there is a timestamp or other value associated with every geometry.

GeoPandas in Python and sf in R are two common implementations of the geospatial data frame concept.

GeoTIFF

An extension of TIFF to store geospatially-referenced image and raster data. Includes extra information such as the coordinate reference system and geotransform.

Geotransform

A set of six numbers that describe where a raster image lies within its coordinate reference system.

The geotransform describes the resolution and real-world location of each pixel. The geotransform needs to be used in conjunction with a projection definition for pixels to be located accurately.

For more information (in a Python context) read Python affine transforms.
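As a sketch of how the six numbers are applied, the function below assumes the GDAL ordering of the geotransform (x origin, pixel width, row rotation, y origin, column rotation, pixel height). Other libraries, such as the Python affine package, order the same six coefficients differently. The function name and the sample values are made up for illustration.

```python
def pixel_to_coord(geotransform, col, row):
    """Return the (x, y) coordinate of the top-left corner of a pixel."""
    x_origin, pixel_width, row_rotation, y_origin, col_rotation, pixel_height = geotransform
    x = x_origin + col * pixel_width + row * row_rotation
    y = y_origin + col * col_rotation + row * pixel_height
    return x, y

# A north-up raster with 30-unit pixels; pixel height is negative
# because row numbers increase downward while y increases upward.
gt = (500000.0, 30.0, 0.0, 4600000.0, 0.0, -30.0)
print(pixel_to_coord(gt, 0, 0))    # (500000.0, 4600000.0)
print(pixel_to_coord(gt, 10, 20))  # (500300.0, 4599400.0)
```

The resulting x and y values are only meaningful together with the raster's coordinate reference system.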

Google Cloud (GCP)

Cloud computing services provided by Google.

gzip

A type of lossless compression for general use. Gzip is based on the deflate algorithm and tends to be used standalone for external compression. Files that end with .gz have been encoded with gzip compression.

Hilbert curve

A type of space-filling curve used as part of many spatial indexes that ensures that objects near each other in two-dimensional space (e.g. longitude-latitude) are also near each other when ordered in a file.
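A minimal sketch of one well-known way to compute a cell's position along a Hilbert curve on a square grid whose side is a power of two. Real spatial indexes may use different variants or orientations; the function name hilbert_index is chosen here for illustration.

```python
def hilbert_index(side, x, y):
    """Map grid cell (x, y) to its distance along the Hilbert curve."""
    d = 0
    s = side // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so each sub-curve lines up with its neighbors.
        if ry == 0:
            if rx == 1:
                x = side - 1 - x
                y = side - 1 - y
            x, y = y, x
        s //= 2
    return d

side = 4
cells = [(x, y) for x in range(side) for y in range(side)]
ordered = sorted(cells, key=lambda cell: hilbert_index(side, *cell))
print(ordered)
```

Sorting features by this index before writing them tends to place spatial neighbors close together in the file.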

HTTP Range Request

HTTP is the protocol that governs how computers ask for data across a network. HTTP range requests are a part of the HTTP specification that defines how to ask for a specific byte range from a file, instead of the entire file.

HTTP range requests are a core part of what makes a file format cloud-optimized, because they mean that part of a geospatial data file can be read and used without needing to download the entire file.
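A sketch of what such a request looks like, using Python's standard-library urllib. The URL is a placeholder and the request is only constructed, not sent. A server that supports range requests would normally answer with status 206 (Partial Content).

```python
import urllib.request

def range_request(url, start, length):
    """Build a request for `length` bytes beginning at byte offset `start`."""
    end = start + length - 1  # the end of an HTTP byte range is inclusive
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})

# Hypothetical file; ask for only its first 16 KiB.
req = range_request("https://example.com/data.tif", 0, 16384)
print(req.get_header("Range"))  # bytes=0-16383
```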

Internal compression

Compression that is part of a file format’s own specification.

File formats such as COG, COPC, and GeoParquet include internal compression. Internal compression is useful for cloud-optimized data formats because it allows internal chunks to be fetched with range requests but still have smaller sizes from compression.

For files that have already been internally compressed, adding another layer of external compression, such as ZIP or gzip, will likely not make the file smaller, and only serve to reduce performance by requiring an extra decompression step before the data can be used.
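The idea can be sketched in a few lines. This is a toy layout, not any real file format: each chunk is compressed on its own with zlib, and a small index of byte ranges plays the role of the file's metadata.

```python
import zlib

# Four made-up chunks of raw data.
chunks = [bytes([i]) * 4096 for i in range(4)]

body = b""
index = []  # (offset, length) of each compressed chunk within the body
for chunk in chunks:
    compressed = zlib.compress(chunk)
    index.append((len(body), len(compressed)))
    body += compressed

def read_chunk(body, index, i):
    """Decompress one chunk without touching the others."""
    offset, length = index[i]
    # Over a network, this slice would be a single HTTP range request.
    return zlib.decompress(body[offset:offset + length])

print(len(body), read_chunk(body, index, 2) == chunks[2])
```

Because every chunk is compressed independently, a reader needs only the index and the byte range of the chunk it wants.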

JPEG

A lossy compression codec used for visual images. It tends to have a better compression ratio than lossless compression codecs like deflate or LZW.

Latency

The time it takes for data to start being retrieved from a server.

See also: bandwidth.
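A back-of-the-envelope sketch of how latency and bandwidth combine, assuming requests are made one after another and that each pays the full latency once. The function name and the numbers are illustrative only.

```python
def transfer_seconds(size_bytes, latency_s, bandwidth_bytes_per_s, requests=1):
    """Estimate the time to fetch size_bytes split across sequential requests."""
    return requests * latency_s + size_bytes / bandwidth_bytes_per_s

# 100 MB over a 10 MB/s link with 0.1 s of latency per request.
one_request = transfer_seconds(100_000_000, 0.1, 10_000_000)
many_requests = transfer_seconds(100_000_000, 0.1, 10_000_000, requests=1000)
print(one_request, many_requests)
```

With many small requests, latency rather than bandwidth dominates the total time.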

LERC

LERC (Limited Error Raster Compression) is a lossy but very efficient compression algorithm for floating point raster data. This compression rounds values to a precision provided by the user and tends to be useful e.g. for elevation data where the source data is known to not have precision beyond a known value.

LERC is a relatively new algorithm and may not be supported everywhere. For example, GDAL needs to be compiled with the LERC driver in order to load a GeoTIFF with LERC compression.
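The idea of rounding to a user-provided precision can be illustrated with simple quantization. This is not the LERC algorithm itself, only a sketch of how a maximum error bound turns floating point values into small integers that compress well. The function names and sample elevations are made up.

```python
def quantize(values, max_error):
    """Round each value to the nearest multiple of 2 * max_error."""
    step = 2 * max_error
    return [round(v / step) for v in values]

def dequantize(codes, max_error):
    step = 2 * max_error
    return [c * step for c in codes]

elevations = [1203.4172, 1203.4391, 1204.0008]
codes = quantize(elevations, 0.05)
restored = dequantize(codes, 0.05)
print(codes, restored)
```

Each restored value differs from the original by at most the chosen maximum error (up to floating point rounding).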

Lossless compression

A type of compression where the exact original values can be recovered after decompression. This means that the compression process does not lose any information. Lossless compression codecs tend to give larger file sizes than lossy compression codecs.

Examples include deflate, LZW, gzip, and ZSTD.

Lossy compression

A type of compression where the exact original values cannot be recovered after decompression. This means that the compression process will lose information. Lossy compression codecs tend to give smaller file sizes than lossless compression codecs.

Examples include LERC and JPEG.

LZW

A lossless compression codec for general use. It tends to be slightly slower than deflate.

Mapbox Vector Tile

A file format for tiled vector data, usually used for visualization on web maps. PMTiles is a cloud-optimized archive format for storing millions of Mapbox Vector Tile files in an efficient manner, accessible via HTTP range requests.

Metadata

Information about the actual data, saved as part of the file format. This allows for recreating the exact data that existed before saving. For cloud-optimized data formats, the metadata usually also stores the byte ranges of relevant data sections within the file, which allows readers to use HTTP range requests for efficient random access to those sections.

Microsoft Azure

Cloud computing services offered by Microsoft.

Multi-dimensional raster data

A type of gridded raster data where multiple dimensions help conceptualize various attributes. For example, a data value may exist for every longitude, latitude, time, and elevation, in which case the raster data would have four dimensions.

Multithreading

A manner of scaling computing in which a single program runs multiple threads, to allow more operations to happen at the same time.

Think of a glass of water. Synchronous computing is akin to having one straw: when you’ve finished drinking all you wish to drink, you give the straw to your friend for them to drink. Asynchronous computing is akin to sharing the straw between you and your friend. There’s still only one straw, but you can hand off sips. Parallel computing (including multithreading and multiprocessing) is like having two straws, where both you and your friend can drink out of the glass at the same time.

NumPy

The foundational Python library for managing multi-dimensional array data.

Object storage (Cloud storage)

Object storage, or cloud storage, refers to massively scalable cloud storage solutions like Amazon S3. It is relatively cheap, able to hold files small or large, and supports reading data via HTTP range requests. Most open geospatial data is hosted in such cloud storage solutions.

OGR

A widely-used open-source library for converting between different vector data formats, as well as reprojecting between coordinate reference systems.

Its common command-line tools include ogrinfo and ogr2ogr. It can be used from Python with the pyogrio or fiona libraries or from R with the sf library.

OGR is installed as part of GDAL.

Overviews

Downsampled (aggregated) data intended for visualization and stored as part of a file format. Overviews are part of the COG specification, and allow reading “zoomed out” data without needing to read and downsample from “full resolution” data.

Overviews are also known as pyramids.
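A minimal sketch of building one overview level by averaging each 2x2 block of pixels. It assumes a grid with even dimensions stored as nested lists; real tools offer several resampling methods, and the function name build_overview is hypothetical.

```python
def build_overview(grid):
    """Halve the resolution of a 2D grid by averaging 2x2 blocks."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            (grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]

full_resolution = [
    [1, 3, 5, 7],
    [1, 3, 5, 7],
    [2, 2, 6, 6],
    [2, 2, 6, 6],
]
overview = build_overview(full_resolution)
print(overview)  # [[2.0, 6.0], [2.0, 6.0]]
```

Applying the function repeatedly yields the successive levels of the pyramid.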

Parquet

A file format for tabular data with internal chunking and internal compression. Data are stored per column instead of per row, making it very fast to select all data from a specific column.
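The difference between row and column layouts can be sketched with plain Python structures. This is only a conceptual illustration with made-up data, not how Parquet encodes bytes on disk.

```python
# Row-oriented: each record keeps all of its fields together.
rows = [
    {"id": 1, "name": "a", "height": 10.0},
    {"id": 2, "name": "b", "height": 12.5},
    {"id": 3, "name": "c", "height": 9.0},
]

# Column-oriented: all values of one field are stored together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Selecting one column touches a single contiguous list
# instead of visiting every record.
heights = columns["height"]
print(heights)
```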

PDAL

The Point Data Abstraction Library, a widely-used open-source library for converting between different point cloud data formats and managing point cloud data.

PMTiles

A type of cloud-optimized archive format for tiled data. It can be used with either vector or raster tiled data, but is most often used with Mapbox Vector Tile files. The individual tiles in a PMTiles file are accessible via HTTP range requests.

Point Cloud Data

A type of geospatial data storing three-dimensional point locations along with attributes for each point. Point cloud data may come from LIDAR sensors or from photogrammetry, and may represent three-dimensional terrain or buildings.

PROJJSON

A projection definition that uses JSON for encoding.

The specification is defined as part of the PROJ project and used as part of the GeoParquet vector file format.

Random access

The ability to quickly fetch part of a file without reading the entire file.

For example, consider videos on YouTube. If you choose to watch a video starting from the ten-minute mark, YouTube does not need to download the video up to that point. Rather, it is able to use the metadata from the video file to know what byte range in the video file corresponds to the ten-minute mark, and then use HTTP range requests to download only the part of the video you’ve requested.

The ability to perform efficient random access over a network is a core part of what makes data cloud-optimized.

Raster data

A type of geospatial data that stores regularly-gridded data with cells of known and constant size. This often comes from aerial or satellite imagery sensors.

sf

An R library for using and managing vector data, organized around geospatial data frames.

Shapefile

A vector data file format. There are many reasons to no longer use Shapefile.

Space-filling curve

An algorithm that translates two- or n-dimensional data into one-dimensional data. In practice, this is used as part of spatial indexes to group vector geometries nearby within a file according to their two-dimensional location.

Read more at Wikipedia.

Spatial index

A data structure used for searching through spatial data more efficiently.

For further reading: A dive into spatial search algorithms.

Tagged Image File Format (TIFF)

A file format for image and raster data that supports lossless compression.

Vector data

A type of geospatial data to represent points, lines, and polygons.

Web Mercator

A coordinate reference system often used with tiled data for web maps.

Well-Known Binary (WKB)

A binary encoding for vector geometries that many systems can read and write. For example, GeoParquet uses WKB in its definition of the geometry column.

WKT (Geometry encoding)

A text encoding for vector geometries that many systems can read and write. WKT tends to be larger in size and slower to read and write than WKB, so only use WKT if you need to store geometries in a text file, such as a CSV.

Note that this is different from WKT (Projection definition).
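To make the comparison between the two geometry encodings concrete, the sketch below encodes a single two-dimensional point both ways using only the Python standard library. The helper names are made up; in practice a geometry library would handle this.

```python
import struct

def point_wkb(x, y):
    """Encode a 2D point as little-endian WKB."""
    # 1 byte: byte order (1 = little endian)
    # 4 bytes: geometry type (1 = Point)
    # 16 bytes: x and y as 64-bit floats
    return struct.pack("<BIdd", 1, 1, x, y)

def point_wkt(x, y):
    """Encode a 2D point as WKT text."""
    return f"POINT ({x} {y})"

wkb = point_wkb(1.5, 2.5)
wkt = point_wkt(1.5, 2.5)
print(len(wkb), wkb.hex())
print(wkt)
```

The WKB point is always 21 bytes and stores coordinates at full double precision, while the length of the WKT text grows with the number of digits written out.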

WKT (Projection definition)

An encoding to store coordinate reference system information.

There have been multiple versions of WKT; it is suggested to use WKT2 whenever possible.

Zarr

A chunked, compressed file format for multi-dimensional raster data.

ZIP Archive

A type of archive format that is used to group together existing files. It can also be used to apply external compression onto existing files.

ZSTD

A very efficient lossless compression codec. ZSTD tends to give a very good compression ratio at very good performance, but may not be available everywhere. Check that your expected programs have access to ZSTD before using this on your data.