Accessing Data in Cloud-Optimized GeoTIFFs (COGs) with terra in R

Cloud-Optimized GeoTIFFs (COGs) are a specialized format of GeoTIFF designed to enable efficient access to raster data, particularly in cloud environments. By organizing data into tiled structures and enabling partial reads, COGs allow users to fetch only the portions they need, significantly reducing bandwidth and storage costs.

In this notebook, we will explore how terra, an R package for spatial data, interacts with COGs. While it is generally unnecessary to think about tiling explicitly, this notebook aligns data requests with internal tile boundaries to improve performance. Understanding how COGs structure data can help in designing efficient workflows. Another option for accessing cogs in in R is the stars package, for which another version of this notebook is available here.

Using COGs provides significant advantages over traditional GeoTIFFs with minimal additional effort. As such, when writing raster data in GeoTIFF format, it is best practice to produce valid COGs whenever possible. For more details on how to generate COGs, see the guide on writing COGs with python.

If you need to verify whether a GeoTIFF is a COG, refer to this guide on COG validation


Demonstration Goals

This notebook demonstrates how to use terra to efficiently access and process COGs in R:

  • Accessing Metadata:
    • Using terra to inspect key metadata, including dimensions, resolution, projection, and internal tiling structure.
  • Visualizing the Internal Tiling Structure:
    • Overlaying a grid to illustrate COG tile boundaries and how they affect data access.
  • Reading Specific Regions:
    • Extracting a tile-aligned region to demonstrate efficient data access.
    • Extracting a region spanning multiple tiles to show a less-optimal but still performant case.

Key Takeaways

  • COGs allow for selective data access, reducing bandwidth use and improving efficiency.
  • Aligning requests with internal tile grids minimizes redundant reads and speeds up workflows.
  • Even when misaligned, COGs are significantly more efficient than traditional GeoTIFFs due to their internal organization.
  • Using terra provides an intuitive way to work with COGs, leveraging R’s geospatial ecosystem.

By the end of this notebook, you’ll have a practical understanding of how to work with COGs in R using terra and how to optimize data access patterns.

Environment

The packages needed for this notebook can be installed with conda or mamba. Using the environment.yml from this folder run:

conda env create -f environment.yml

or

mamba env create -f environment.yml

Finally, you may activate and select the kernel in the notebook (running in Jupyter)

conda activate coguide-cog

The notebook has been tested to work with the listed Conda environment.

TL;DR: Efficiently Read and Display the Bottom Half of a COG

The following example is a concise demonstration of how to efficiently read and visualize a portion of a Cloud-Optimized GeoTIFF (COG) using terra. Here, we extract and display the bottom half of the raster without loading the entire dataset. There’s nothing special about our choice of the bottom half of the tile - this just illustrates the process of reading with a randomly selected bounding box that intersects the tif in question.

library(terra)

# Define the COG URL
cog_url <- "/vsis3/sentinel-cogs/sentinel-s2-l2a-cogs/12/R/UU/2024/4/S2A_12RUU_20240421_0_L2A/B02.tif"

# Set environment variables for GDAL
setGDALconfig("AWS_NO_SIGN_REQUEST", "YES")
setGDALconfig("GDAL_DISABLE_READDIR_ON_OPEN", "EMPTY_DIR")

# Open the COG
r <- rast(cog_url)

# Get the extent and resolution
ext_r <- ext(r)
res_r <- res(r)
height <- nrow(r)

# ✅ Fix: Compute ymin for the bottom half correctly
bottom_half_ymax <- ymax(ext_r) - (height / 2) * abs(res_r[2])  # Use absolute value of res

# Define the correct extent for cropping
bottom_half_extent <- ext(xmin(ext_r), xmax(ext_r), ymin(ext_r), bottom_half_ymax)

# Crop to the bottom half and plot
bottom_half <- crop(r, bottom_half_extent)
plot(bottom_half)
terra 1.8.21

Detailed Demonstration

Now that we’ve seen how to quickly and efficiently read a specific region of a Cloud-Optimized GeoTIFF (COG) in the TL;DR section immediately above, let’s explore things in greater detail. Terra documentation can be found at https://rspatial.github.io/terra/.

Setting Up for Cloud-Optimized GeoTIFF (COG) Access with terra

This section configures the environment and prepares terra to access a Cloud-Optimized GeoTIFF (COG) hosted on AWS.

1. Loading terra

The only library required for this workflow is terra, which provides powerful tools for geospatial data analysis in R.

2. Configuring the environment

To enable seamless access to publicly accessible data on AWS: - Anonymous Access: - AWS-hosted public datasets allow anonymous access without credentials. This is enabled by setting the AWS_NO_SIGN_REQUEST environment variable to "YES". This configuration ensures that GDAL makes unsigned requests when accessing the file.

  • Metadata from File Header:
    • By setting the GDAL_DISABLE_READDIR_ON_OPEN environment variable to "EMPTY_DIR", we inform GDAL that all necessary metadata is contained within the file header. This avoids unnecessary checks for external “sidecar” metadata files, improving efficiency.

For more configuration options, consult the GDAL configuration documentation.

3. Defining the COG URL

The file is accessed using its virtual filesystem (VSI) path: - The path /vsis3/ indicates that the file is hosted on Amazon S3. - This example points to a Sentinel-2 COG available in the public Sentinel COG dataset. - GDAL supports other virtual filesystems, including http, zip, tar, and combinations thereof. For details, see GDAL virtual filesystem documentation.

4. Opening the COG and Inspecting Metadata

Using the rast() function from terra, we create a reference to the COG: - Important Note: No data is downloaded at this stage! Only the file’s metadata is read. - Metadata includes details such as resolution, coordinate reference system (CRS), and geographic extent, which can be accessed immediately without fetching pixel data.

library(terra)

setGDALconfig("AWS_NO_SIGN_REQUEST", "YES")
setGDALconfig("GDAL_DISABLE_READDIR_ON_OPEN", "EMPTY_DIR")

cog_url <- "/vsis3/sentinel-cogs/sentinel-s2-l2a-cogs/12/R/UU/2024/4/S2A_12RUU_20240421_0_L2A/B02.tif"
r <- rast(cog_url)
print(r)
class       : SpatRaster 
dimensions  : 10980, 10980, 1  (nrow, ncol, nlyr)
resolution  : 10, 10  (x, y)
extent      : 3e+05, 409800, 3290220, 3400020  (xmin, xmax, ymin, ymax)
coord. ref. : WGS 84 / UTM zone 12N 
source      : B02.tif 
name        : B02 

Exploring Metadata with terra::describe (basically gdalinfo)

Before diving further into visualization or analysis, it’s essential to understand the structure and metadata of the Cloud-Optimized GeoTIFF (COG) we’re working with. Metadata provides critical insights about the file, such as:

  • Dimensions: The raster’s size in pixels.
  • Projection: The coordinate reference system (CRS) used for the dataset.
  • Resolution: The pixel size in geographic or projected units.
  • Tile Information: The internal structure of the COG (e.g., block size and block layout).
  • Overviews: Precomputed lower-resolution representations for efficient access.
  • Statistics: Summary statistics about the dataset’s pixel values.

In the following cell, we use the describe command to retrieve this metadata. This step helps verify the internal structure of the TIFF and confirm details like tile size, overview levels, and compression, which are crucial for understanding data access and performance characteristics.

print(describe(cog_url))
 [1] "Driver: GTiff/GeoTIFF"                                                                           
 [2] "Files: /vsis3/sentinel-cogs/sentinel-s2-l2a-cogs/12/R/UU/2024/4/S2A_12RUU_20240421_0_L2A/B02.tif"
 [3] "Size is 10980, 10980"                                                                            
 [4] "Coordinate System is:"                                                                           
 [5] "PROJCRS[\"WGS 84 / UTM zone 12N\","                                                              
 [6] "    BASEGEOGCRS[\"WGS 84\","                                                                     
 [7] "        DATUM[\"World Geodetic System 1984\","                                                   
 [8] "            ELLIPSOID[\"WGS 84\",6378137,298.257223563,"                                         
 [9] "                LENGTHUNIT[\"metre\",1,"                                                         
[10] "                    ID[\"EPSG\",9001]]]],"                                                       
[11] "        PRIMEM[\"Greenwich\",0,"                                                                 
[12] "            ANGLEUNIT[\"degree\",0.0174532925199433,"                                            
[13] "                ID[\"EPSG\",9122]]]],"                                                           
[14] "    CONVERSION[\"Transverse Mercator\","                                                         
[15] "        METHOD[\"Transverse Mercator\","                                                         
[16] "            ID[\"EPSG\",9807]],"                                                                 
[17] "        PARAMETER[\"Latitude of natural origin\",0,"                                             
[18] "            ANGLEUNIT[\"degree\",0.0174532925199433],"                                           
[19] "            ID[\"EPSG\",8801]],"                                                                 
[20] "        PARAMETER[\"Longitude of natural origin\",-111,"                                         
[21] "            ANGLEUNIT[\"degree\",0.0174532925199433],"                                           
[22] "            ID[\"EPSG\",8802]],"                                                                 
[23] "        PARAMETER[\"Scale factor at natural origin\",0.9996,"                                    
[24] "            SCALEUNIT[\"unity\",1],"                                                             
[25] "            ID[\"EPSG\",8805]],"                                                                 
[26] "        PARAMETER[\"False easting\",500000,"                                                     
[27] "            LENGTHUNIT[\"metre\",1],"                                                            
[28] "            ID[\"EPSG\",8806]],"                                                                 
[29] "        PARAMETER[\"False northing\",0,"                                                         
[30] "            LENGTHUNIT[\"metre\",1],"                                                            
[31] "            ID[\"EPSG\",8807]]],"                                                                
[32] "    CS[Cartesian,2],"                                                                            
[33] "        AXIS[\"(E)\",east,"                                                                      
[34] "            ORDER[1],"                                                                           
[35] "            LENGTHUNIT[\"metre\",1,"                                                             
[36] "                ID[\"EPSG\",9001]]],"                                                            
[37] "        AXIS[\"(N)\",north,"                                                                     
[38] "            ORDER[2],"                                                                           
[39] "            LENGTHUNIT[\"metre\",1,"                                                             
[40] "                ID[\"EPSG\",9001]]]]"                                                            
[41] "Data axis to CRS axis mapping: 1,2"                                                              
[42] "Origin = (300000.000000000000000,3400020.000000000000000)"                                       
[43] "Pixel Size = (10.000000000000000,-10.000000000000000)"                                           
[44] "Metadata:"                                                                                       
[45] "  OVR_RESAMPLING_ALG=AVERAGE"                                                                    
[46] "  AREA_OR_POINT=Area"                                                                            
[47] "Image Structure Metadata:"                                                                       
[48] "  COMPRESSION=DEFLATE"                                                                           
[49] "  INTERLEAVE=BAND"                                                                               
[50] "  PREDICTOR=2"                                                                                   
[51] "Corner Coordinates:"                                                                             
[52] "Upper Left  (  300000.000, 3400020.000) (113d 5'18.97\"W, 30d42'58.55\"N)"                       
[53] "Lower Left  (  300000.000, 3290220.000) (113d 4' 4.00\"W, 29d43'33.78\"N)"                       
[54] "Upper Right (  409800.000, 3400020.000) (111d56'31.81\"W, 30d43'46.74\"N)"                       
[55] "Lower Right (  409800.000, 3290220.000) (111d55'57.97\"W, 29d44'20.10\"N)"                       
[56] "Center      (  354900.000, 3345120.000) (112d30'28.19\"W, 30d13'44.33\"N)"                       
[57] "Band 1 Block=1024x1024 Type=UInt16, ColorInterp=Gray"                                            
[58] "  NoData Value=0"                                                                                
[59] "  Overviews: 5490x5490, 2745x2745, 1373x1373, 687x687"                                           

Visualizing the Internal Tile Grid of the COG

After inspecting the metadata with describe, we can use the extracted information to visualize the internal tile grid of the Cloud-Optimized GeoTIFF (COG). This visualization helps confirm how the file is divided into tiles and ensures our understanding aligns with the metadata.

In this cell, we:

  1. Calculate Block Size in Geographic Units:
    • Using the block size in pixels (from the metadata, typically 1024x1024 for Sentinel-2 COGs) and the raster’s resolution, we compute the corresponding geographic dimensions of each tile.
  2. Generate Grid Coordinates:
    • Grid lines are defined by the tile boundaries. The code to generate such a grid is included in the collapsed code block below.
    • Horizontal (x) and vertical (y) grid coordinates are calculated to align with the raster’s extent, ensuring proper alignment starting from the top-left corner (consistent with COG tiling).
  3. Overlay the Grid on the Raster:
    • The raster is plotted at a low resolution (using overviews, if available) for performance.
    • Red grid lines are drawn to represent the tile boundaries.
    • The blue rectangle corresponds to the region or window we’ll be plotting.

This visualization depicts the COG’s internal tiling structure, which is crucial for efficient access to specific regions of the COG.

# COG statistics we'll want to use later
block_size_pixels <- 1024  # Tile size in pixels
r_extent <- ext(r)         # Extent of the raster (total geographic units)
r_res <- res(r)            # Resolution of the raster (units per pixel)

# Calculate block size in geographic units
block_size_x <- block_size_pixels * r_res[1]
block_size_y <- block_size_pixels * r_res[2]

# Define the boundaries of the first tile
first_tile_xmin <- xmin(r_extent)
first_tile_xmax <- xmin(r_extent) + block_size_x
first_tile_ymin <- ymax(r_extent) - block_size_y
first_tile_ymax <- ymax(r_extent)
# Plot the tif preview with grid
plot(
  r,
  overview = TRUE,
  xlab = "Easting (meters in EPSG 9807)",
  ylab = "Northing (meters in EPSG 9807)"
)

# Generate grid coordinates
x_coords <- seq(xmin(r_extent), xmax(r_extent), by = block_size_x)

# Reverse y_coords to start from the top (ymax) and move down
y_coords <- seq(ymax(r_extent), ymin(r_extent), by = -block_size_y)

# Add vertical grid lines
for (x in x_coords) {
  lines(c(x, x), c(ymin(r_extent), ymax(r_extent)), col = "red")
}

# Add horizontal grid lines
for (y in y_coords) {
  lines(c(xmin(r_extent), xmax(r_extent)), c(y, y), col = "red")
}

# Draw the rectangle
rect(
  first_tile_xmin,
  first_tile_ymin,
  first_tile_xmax,
  first_tile_ymax,
  border = "blue", lwd = 2
)

Extracting and Visualizing the First Full Tile (Top-Left 1024x1024 Block)

In this step, we extract the first full tile (a 1024x1024 block) of the dataset, aligned with the internal tiling structure of the Cloud-Optimized GeoTIFF (COG). This demonstrates the efficiency of accessing tiles directly, as COGs enable on-demand retrieval of specific data blocks without downloading the entire file.

Steps Performed:

  1. Define the Extent:
    • The geographic extent of the first tile is calculated, starting from the top-left corner of the raster’s full extent and spanning the dimensions of one full tile.
  2. Crop the Raster:
    • Using Terra’s crop method, the raster is cropped to the calculated extent to retrieve only the first tile.
  3. Visualize the Tile:
    • The cropped tile is plotted to confirm its alignment with the internal tiling structure.

This process demonstrates a best-case scenario for access efficiency: all and only the bytes necessary for the selected region are read. While this level of optimization isn’t necessary, it illustrates how COGs can minimize data access. In most workflows, reading a few extra bytes is acceptable, but the ability to avoid processing the bulk of a large TIFF is a major advantage enabled by this per-tile access.

first_tile_extent = ext(
    first_tile_xmin,
    first_tile_xmax,
    first_tile_ymin,
    first_tile_ymax
)

# Crop the raster to the defined extent
first_tile <- crop(r, first_tile_extent)

# Visualize the first tile
plot(
  first_tile,
  xlab = "Easting (meters in EPSG 9807)",
  ylab = "Northing (meters in EPSG 9807)"
)

Visualizing a Region Spanning Multiple Tiles

In this step, we visualize a 1024x1024 region (the same size as above). This time, however, the window we construct will span portions of 4 internal tiles. This is a less-optimal case because multiple tiles must be accessed to satisfy a single request. From the point of view of someone using the API, things remain seamless. But behind the scenes, 4 times the data is being requested compared to the request above which aligned neatly with the internal tile grid.

Steps:

  1. Define the Extent Centered on a Tile Intersection:
    • The region is centered on the intersection of 4 tiles, ensuring it overlaps multiple internal blocks of the TIFF.
    • Red grid lines represent the boundaries of the internal tiles, showing how the request spans these blocks.
  2. Visualize the Region:
    • The blue rectangle highlights the requested region, superimposed over the red tile grid lines. This illustrates which tiles will need to be accessed to satisfy the request.

Insights:

  • The red grid lines illustrate the boundaries of the COG’s internal tiling, highlighting the efficiency of accessing data aligned with these tiles.
  • The blue rectangle represents the requested region, showing how it spans multiple tiles.
  • This step is purely visual: no reading or cropping data yet.
# Determine the center point where 4 tiles intersect
center_x <- xmin(r) + block_size_x
center_y <- ymax(r) - block_size_y

# Define the 1024x1024 window centered on this point
four_tile_xmin <- center_x - (block_size_x / 2)
four_tile_xmax <- center_x + (block_size_x / 2)
four_tile_ymin <- center_y - (block_size_y / 2)
four_tile_ymax <- center_y + (block_size_y / 2)

# Plot the tif preview with grid
plot(
  r,
  overview = TRUE,
  xlab = "Easting (meters in EPSG 9807)",
  ylab = "Northing (meters in EPSG 9807)"
)

# Add grid lines to show individual tiles
x_coords <- seq(xmin(r), xmax(r), by = block_size_x)
y_coords <- seq(ymax(r), ymin(r), by = -block_size_y)

for (x in x_coords) {
  lines(c(x, x), c(ymin(r), ymax(r)), col = "red")
}
for (y in y_coords) {
  lines(c(xmin(r), xmax(r)), c(y, y), col = "red")
}

# Highlight the window in blue
rect(
    four_tile_xmin,
    four_tile_ymin,
    four_tile_xmax,
    four_tile_ymax,
    border = "blue",
    lwd = 2
)

Extracting and Visualizing the Selected Window

In this step, we read the raster data corresponding to the blue rectangle visualized in the previous step. This window spans portions of 4 internal tiles. Again, this is a less-optimal case where data from multiple tiles must be accessed to fulfill a single request.

Steps:

  1. Read the Window:
    • Using the previously defined extent (the blue rectangle), we extract the corresponding raster data.
    • The operation retrieves data from all 4 tiles that overlap with the window.
  2. Visualize the Region:
    • The extracted region is plotted, confirming its location and spatial coverage within the raster.

Insights:

  • While the request spans multiple tiles, the ability to target only the relevant portions of the raster remains far more efficient than downloading the entire file.
# Create the extent for the window
four_tile_extent <- ext(
    four_tile_xmin,
    four_tile_xmax,
    four_tile_ymin,
    four_tile_ymax
)

# Crop the raster to the 1024x1024 window
window_4_tiles <- crop(r, four_tile_extent)

# Plot the cropped window
plot(
  window_4_tiles,
  xlab = "Easting (meters in EPSG 9807)",
  ylab = "Northing (meters in EPSG 9807)"
)

Review of Key Points

In this notebook, we explored how to efficiently access and process geospatial data stored in a Cloud-Optimized GeoTIFF (COG) using terra in R. The key takeaways include:

Efficient Data Access:

  • COGs allow on-demand access to specific regions of a dataset without requiring the download of the entire file.
  • This capability minimizes bandwidth usage, accelerates workflows, and supports scalable analysis of large geospatial datasets.

Visualization of Internal Tiling:

  • We visualized the internal tile structure of the COG (1024x1024 tiles) by overlaying a grid on the raster.
  • A blue rectangle highlighted the region corresponding to the first tile, illustrating how COGs organize data into blocks for efficient access.

Aligned and Unaligned Access:

  1. Aligned Access:
    • A 1024x1024 region corresponding to the first tile (top-left corner) was extracted.
    • This represents the best-case scenario for COGs, where the requested region aligns perfectly with an internal tile, ensuring minimal data transfer.
  2. Unaligned Access:
    • A 1024x1024 region spanning portions of 4 tiles was extracted.
    • This less-optimal case demonstrates how misalignment with the internal tile grid increases the amount of data accessed while still avoiding the need to download the entire file.

Reflection on Tools:

  • The terra package provides a powerful and flexible interface for geospatial data processing in R, enabling tasks such as cropping, visualizing, and analyzing raster data.
  • By leveraging COG-specific features like internal tiling and on-demand access, workflows can be made more efficient and cost-effective.