import boto3
import rasterio
import matplotlib.pyplot as plt
from rasterio.plot import show
= 's3://sentinel-cogs/sentinel-s2-l2a-cogs/12/R/UU/2024/4/S2A_12RUU_20240421_0_L2A/B02.tif'
cog_url
# Don't forget to create an AWS session with unsigned requests enabled!
= rasterio.session.AWSSession(boto3.Session(), aws_unsigned=True)
aws_session
# Open the COG and read the right half
with rasterio.Env(aws_session):
with rasterio.open(cog_url) as dataset:
# Define the window for the bottom of the COG
= dataset.width
width = dataset.height
height = rasterio.windows.Window(
bottom_half_window =0, row_off=height // 2, width=width, height=height
col_off
)
# Read the data for the right half
= dataset.read(1, window=bottom_half_window)
bottom_half_data
show(bottom_half_data)
Accessing Data in Cloud-Optimized GeoTIFFs (COGs) with Python
Cloud-Optimized GeoTIFFs (COGs) are a sub-specification of GeoTIFF, which itself extends the TIFF format. COGs are designed to store raster data in a manner optimized for cloud environments. These files enable efficient, scalable workflows for geospatial analysis and are widely used in remote sensing and GIS. In this notebook, we’re going to align our reads according to the internal tile structure to illustrate how partial reads work to minimize the data that needs to be downloaded. Please keep in mind that in practice, the internal tile structure is not something you’ll need to think about explicitly.
The benefits to using the COG format, as we’ll see, are significant. At the same time, the overhead necessary to make a normal GeoTiff a COG is very small. As such, you should always produce valid COGs if you’re planning to write your data to the GeoTiff format. For more on authoring your own COGs, please see the guide on writing COGs.
If you’re not sure whether a GeoTiff you’re working with is a COG or not, refer to this brief guide on COG validation
Demonstration Goals
This notebook demonstrates how Python, using rasterio
, can leverage Cloud-Optimized GeoTIFFs for efficient geospatial workflows:
- Accessing Metadata and Logging Byte Ranges:
- Using
rasterio
to access key metadata such as dimensions, resolution, projection, internal tiling structure, and overviews. - Enabling detailed logging to display the specific byte-ranges being read during data access, providing insight into COG performance and efficiency.
- Using
- Visualizing the Internal Tiling Structure:
- Overlaying a grid representing the internal tiles on the raster.
- Highlighting specific windows to illustrate tile alignment and efficient data access.
- Reading Specific Regions:
- Accessing and visualizing a region aligned with a single internal tile to demonstrate optimal access.
- Extracting and visualizing a region spanning multiple tiles to demonstrate less-optimal, yet efficient, access.
Key Takeaways
- Metadata inspection reveals the structure and optimization of COGs, including their tiling and overviews.
- Detailed logging of byte-range requests provides transparency into the efficiency of COG access patterns.
- Efficient workflows are achievable by aligning data access with the COG’s internal tile grid.
- Even in less-optimal cases, COGs provide significant performance advantages over traditional raster formats.
Environment
The packages needed for this notebook can be installed with conda
or mamba
. Using the environment.yml
from this folder run:
conda env create -f environment.yml
or
mamba env create -f environment.yml
Finally, you may activate and select the kernel in the notebook (running in Jupyter)
conda activate coguide-cog
The notebook has been tested to work with the listed Conda environment.
TL;DR: Efficiently Read and Display the Bottom Half of a COG
The following cell is a very concise example which demonstrates how easily portions of a COG can be read and visualized. Here, we’ll read the bottom half of a Cloud-Optimized GeoTIFF (COG) and display it just to see how simple things are. Keep reading for a bit more detail about how this works!
Detailed Demonstration
Now that we’ve seen how to quickly and efficiently read a specific region of a Cloud-Optimized GeoTIFF (COG) in the TL;DR section immediately above, let’s explore things in greater detail. We’ll start with some necessary imports.
import boto3
import rasterio
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from rasterio.plot import show
from rasterio.session import AWSSession
from rasterio.windows import from_bounds, Window
Logging Byte-Range Requests
To demonstrate how Cloud-Optimized GeoTIFFs (COGs) enable request-specific access, we will first set up logging to surface the byte-ranges requested by GDAL. These logs should show how only the necessary portions of the dataset are fetched from the cloud, rather than downloading the entire file.
This logging provides a view into the mechanics of COG access. It is in no way necessary under normal usage of these libraries!
# imports specific to setting up GDAL logging
import logging
import os
# Enable GDAL logging; cf https://gdal.org/en/stable/user/configoptions.html
"CPL_LOG"] = "/dev/stdout"
os.environ["CPL_DEBUG"] = "ON"
os.environ[
=logging.DEBUG)
logging.basicConfig(level= logging.getLogger()
logger = logging.StreamHandler()
handler = logging.Formatter('%(message)s')
formatter
handler.setFormatter(formatter)= [handler]
logger.handlers
# The logs tend to be chatty, so we'll filter them down to only those describing what is being read
class GdalFilter(logging.Filter):
def filter(self, record):
= record.getMessage()
msg return "Downloading" in msg or "HTTP GET request" in msg
handler.addFilter(GdalFilter())
Setting Up the COG URL and AWS Session
For this demonstration, we can use a free, publicly available Sentinel-2 Cloud-Optimized GeoTIFF (COG) hosted on AWS. This dataset provides high-resolution satellite imagery and from the Sentinel-2 program.
The specific file we are accessing contains Band 2 (Blue) data for a tile in the UTM Zone 12N.
Since this dataset is publicly accessible, we need to set up an AWS session with unsigned requests to access the file without authentication.
= 's3://sentinel-cogs/sentinel-s2-l2a-cogs/12/R/UU/2024/4/S2A_12RUU_20240421_0_L2A/B02.tif'
cog_url
# Don't forget to create an AWS session with unsigned requests enabled!
= AWSSession(boto3.Session(), aws_unsigned=True) aws_session
Understanding the Internal Tile Grid of the COG
This section computes and visualizes the internal tile grid of a Cloud-Optimized GeoTIFF (COG). By understanding how tiles are arranged, we can align requests with the internal structure of the COG to optimize access.
Steps in the Code:
- Read Metadata:
- The raster’s metadata is accessed using
rasterio.open
. This includes:- Bounds: Geographic extent of the raster (xmin, xmax, ymin, ymax).
- Resolution: Geographic units per pixel (width, height).
- Block Shapes: Internal tile size in pixels, retrieved from
block_shapes
.
- The raster’s metadata is accessed using
- Calculate Tile Dimensions:
- The tile size in geographic units is computed by multiplying the pixel dimensions of a tile by the raster’s resolution:
tile_width = block_width * res[0]
tile_height = block_height * res[1]
- The tile size in geographic units is computed by multiplying the pixel dimensions of a tile by the raster’s resolution:
- Generate Grid Coordinates:
- The grid coordinates for tile boundaries are calculated based on the raster’s extent and tile size:
- Horizontal (
x
) coordinates span frombounds.left
tobounds.right
, incremented bytile_width
. - Vertical (
y
) coordinates span frombounds.top
tobounds.bottom
, decremented bytile_height
(to align with the raster’s coordinate system).
- Horizontal (
- The grid coordinates for tile boundaries are calculated based on the raster’s extent and tile size:
Notes on Efficiency:
- The logs confirm that only metadata stored in the file header is read during this operation. No pixel data is accessed and only the byte-range 0-16383 is transferred!
- This efficient metadata access highlights the design of COGs, which store critical structural information in the header.
This step sets the stage for visualizing the grid, constructing our read-window, and understanding how tiles align with the raster’s extent.
with rasterio.Env(aws_session): # Set environment for unsigned access
with rasterio.open(cog_url) as dataset:
# Get raster metadata
= dataset.bounds
bounds = dataset.res # Resolution (width, height) in geographic units per pixel
res = dataset.block_shapes[0] # Block size (tile dimensions in pixels)
block_width, block_height
= block_width * res[0]
tile_width = block_height * res[1]
tile_height
# Compute grid coordinates
= list(range(int(bounds.left), int(bounds.right), int(tile_width)))
x_coords = list(range(int(bounds.top), int(bounds.bottom), -int(tile_height))) y_coords
CPLE_None in S3: Downloading 0-16383 (https://sentinel-cogs.s3.amazonaws.com/sentinel-s2-l2a-cogs/12/R/UU/2024/4/S2A_12RUU_20240421_0_L2A/B02.tif)...
Overviews for Efficient Visualization
To visualize the raster efficiently, we leverage Cloud-Optimized GeoTIFF (COG) overviews. Overviews are precomputed, lower-resolution versions of the raster data stored within the file.
Let’s first see the available overviews and do a bit of math to estimate how much less data we need to read for each:
with rasterio.Env(aws_session):
with rasterio.open(cog_url) as dataset:
= dataset.overviews(1)
overviews = dataset.width
pixels_wide = dataset.height
pixel_high
= pixels_wide * pixel_high
full_res_pixels print(f"Available overviews: {overviews}; total pixels in full resolution image: {full_res_pixels}")
for idx, overview in enumerate(overviews):
print(f"With index {idx} and value {overview}, we should have approximately {int(full_res_pixels / (overview * overview))} pixels. That's {overview * overview}x fewer pixels!")
Available overviews: [2, 4, 8, 16]; total pixels in full resolution image: 120560400
With index 0 and value 2, we should have approximately 30140100 pixels. That's 4x fewer pixels!
With index 1 and value 4, we should have approximately 7535025 pixels. That's 16x fewer pixels!
With index 2 and value 8, we should have approximately 1883756 pixels. That's 64x fewer pixels!
With index 3 and value 16, we should have approximately 470939 pixels. That's 256x fewer pixels!
Explanation:
- 2: Represents the first overview, where the resolution is reduced to 1/2 in both dimensions.
- 4: Represents the second overview, where the resolution is reduced to 1/4th in both dimensions.
- 8: Represents the third overview, where the resolution is reduced to 1/8th in both dimensions.
- 16: Represents the fourth overview, where the resolution is reduced to 1/16th in both dimensions.
Why Overviews Matter:
- Faster Performance: By leveraging these precomputed overviews, applications can quickly fetch data at lower resolutions without processing the full dataset.
- Reduced Data Transfer: Overviews minimize the amount of data read from the file, which is especially important when working with large datasets in cloud environments.
In the following cell: - The OVERVIEW_LEVEL
is set to 3
, which selects the overview with the lowest resolution in this tif. Referring to the explanation above, that means 1/16th resolution in both x
and y
dimensions! - The dataset.read
function retrieves the first band, resampling the overview to an output shape of 512x512
pixels for display.
Take note of how small the byte-range request surfaced in logs are. We get to see a (low resolution) version of the entirety of this tif without moving much data across the wire. This is the advantage of provided overviews.
with rasterio.Env(aws_session):
with rasterio.open(cog_url, OVERVIEW_LEVEL=3) as dataset:
= dataset.read(1, out_shape=(512, 512)) overview
CPLE_None in S3: Downloading 16384-393215 (https://sentinel-cogs.s3.amazonaws.com/sentinel-s2-l2a-cogs/12/R/UU/2024/4/S2A_12RUU_20240421_0_L2A/B02.tif)...
Visualizing the First Tile with the Internal Tile Grid
This code block creates a visualization of the Cloud-Optimized GeoTIFF (COG) and its internal tiling structure:
- Plotting the Overview:
- The overview raster data is displayed using
imshow
, with the coordinate extent (bounds
) ensuring the geographic alignment of the visualization. - The title and axis labels provide context, specifying the coordinate reference system (CRS) as EPSG 9807.
- The overview raster data is displayed using
- Adding Grid Lines:
- Red vertical and horizontal lines are drawn to represent the boundaries of the internal tiles.
- These lines are calculated based on the tile dimensions and the raster’s geographic extent.
- Highlighting the First Tile:
- A blue rectangle is drawn around the first tile (top-left corner) of the raster.
- The rectangle is calculated using the geographic coordinates of the tile’s extent, emphasizing alignment with the internal tiling structure.
This visualization illustrates the COG’s tile grid. The grid is important because it is what lets tools like rasterio decide how - and how much - to read. > Note: At the risk of being repetitive, please be aware that you won’t actually have to think about this kind of implementation detail when working with COGs - in practice, they just work and will save you time, data, and memory if you only need part of a TIF!
= plt.subplots(figsize=(10, 8))
fig, ax ="gray", extent=[bounds.left, bounds.right, bounds.bottom, bounds.top])
ax.imshow(overview, cmap"COG Tile Grid; First Tile Highlighted")
ax.set_title("Easting (meters in EPSG 9807)")
ax.set_xlabel("Northing (meters in EPSG 9807)")
ax.set_ylabel(
# Add grid lines
for x in x_coords:
=x, color="red", linewidth=0.5)
ax.axvline(xfor y in y_coords:
=y, color="red", linewidth=0.5)
ax.axhline(y
# Highlight the first tile (top-left)
= bounds.left
first_tile_xmin = bounds.left + tile_width
first_tile_xmax = bounds.top
first_tile_ymax = bounds.top - tile_height
first_tile_ymin = Rectangle(
rect
(first_tile_xmin, first_tile_ymin),
tile_width,
tile_height,=4,
linewidth="blue",
edgecolor="none",
facecolor
)
ax.add_patch(rect)
plt.show()
Reading and Visualizing the First Full Tile
In this step, we read and visualize the first full tile of the dataset, aligned with the internal tiling structure of the Cloud-Optimized GeoTIFF (COG). This demonstrates the efficiency of accessing data in tile-sized chunks, a best-case scenario for COGs.
Steps:
- Define the tile window:
- The window corresponds to the first tile (top-left corner), with dimensions matching the internal tile size provided by the dataset’s
block_shapes
.
- The window corresponds to the first tile (top-left corner), with dimensions matching the internal tile size provided by the dataset’s
- Read the tile data:
- The
rasterio.read
function fetches the data for the first tile. - NOTE: Logs starting with
CPLE_None
describe the byte ranges accessed for this operation.
- The
- Visualize the Tile:
- The tile is plotted with pixel-based x and y axes.
Significance of logs
Pay attention to the logs, starting with ‘CPLE_None’. They indicate which bytes needed to be downloaded. As you can see, this read required only one request!
with rasterio.Env(aws_session):
with rasterio.open(cog_url) as src:
= Window(col_off=0, row_off=0, width=block_width, height=block_height)
first_tile_window = src.read(1, window=first_tile_window)
first_tile show(first_tile)
CPLE_None in S3: Downloading 31424512-32784383 (https://sentinel-cogs.s3.amazonaws.com/sentinel-s2-l2a-cogs/12/R/UU/2024/4/S2A_12RUU_20240421_0_L2A/B02.tif)...
Visualizing a Region Spanning Multiple Tiles with the Internal Tile Grid
This section visualizes a region that spans portions of multiple tiles in a Cloud Optimized GeoTiff (COG). This scenario is less optimal than the above, as requests are not neatly aligned with the internal tile grid.
- Plotting the Overview:
- The overview raster data is displayed using
imshow
, with the coordinate extent (bounds
) ensuring the geographic alignment of the visualization. - The title and axis labels provide context, specifying the coordinate reference system (CRS) as EPSG 9807.
- The overview raster data is displayed using
- Adding Grid Lines:
- Red vertical and horizontal lines are drawn to represent the boundaries of the internal tiles.
- These lines are calculated based on the tile dimensions and the raster’s geographic extent.
- Highlighting the Area of Interest:
- A blue rectangle is drawn around a region which is centered on the intersection of the square of tiles found in the top-left of the raster.
Insights:
- This demonstrates how COGs handle non-aligned access patterns while still reading only the necessary byte ranges.
- The visualization highlights the relationship between the requested region and the internal tile grid.
# Define the region spanning multiple tiles
= bounds.left + tile_width * 0.5
intersection_xmin = bounds.left + tile_width * 1.5
intersection_xmax = bounds.top - tile_height * 0.5
intersection_ymax = bounds.top - tile_height * 1.5
intersection_ymin
# Plot the raster with the grid
= plt.subplots(figsize=(10, 8))
fig, ax ="gray", extent=[bounds.left, bounds.right, bounds.bottom, bounds.top])
ax.imshow(overview, cmap"COG Tile Grid; Window Centered on Tile Intersection Highlighted")
ax.set_title("Easting (meters in EPSG 9807)")
ax.set_xlabel("Northing (meters in EPSG 9807)")
ax.set_ylabel(
# Add grid lines
for x in x_coords:
=x, color="red", linewidth=0.5)
ax.axvline(xfor y in y_coords:
=y, color="red", linewidth=0.5)
ax.axhline(y
# Highlight the read area
= Rectangle(
read_area
(intersection_xmin, intersection_ymin),- intersection_xmin,
intersection_xmax - intersection_ymin,
intersection_ymax =4,
linewidth="blue",
edgecolor="none",
facecolor
)
ax.add_patch(read_area)
plt.show()
Reading and Visualizing a Region Spanning Multiple Tiles
In this step, we read and visualize a region that spans portions of four tiles in the dataset. This operation demonstrates the efficiency of Cloud-Optimized GeoTIFFs (COGs) even in less-optimal scenarios where the requested region intersects multiple tiles.
Steps:
- Define the overlapping tile window:
- The window is defined using geographic bounds, ensuring it intersects the boundaries of four internal tiles.
- Read the data for the overlapping region:
- The
rasterio.read
function fetches the data for the specified window. - NOTE: Logs starting with
CPLE_None
describe the byte ranges accessed for this operation.
- The
- Visualize the Data:
- The overlapping region is plotted with pixel-based x and y axes.
Significance of Logs
Pay attention to the logs, starting with ‘CPLE_None’. They indicate which bytes were downloaded. Despite the region spanning four tiles, the file structure allows for only two HTTP range requests. How is that possible? Behind the scenes, rasterio (GDAL, actually) is able to determine that these four tiles consist of only two contiguous blocks of bytes!
with rasterio.Env(aws_session):
with rasterio.open(cog_url) as src:
= from_bounds(
intersection_window
intersection_xmin, intersection_ymin,
intersection_xmax, intersection_ymax,=src.transform
transform
)= src.read(1, window=intersection_window)
intersection_data show(intersection_data)
CPLE_None in S3: Downloading 31426851-33977114 (https://sentinel-cogs.s3.amazonaws.com/sentinel-s2-l2a-cogs/12/R/UU/2024/4/S2A_12RUU_20240421_0_L2A/B02.tif)...
CPLE_None in S3: Downloading 41301455-43892593 (https://sentinel-cogs.s3.amazonaws.com/sentinel-s2-l2a-cogs/12/R/UU/2024/4/S2A_12RUU_20240421_0_L2A/B02.tif)...
Review of Key Points
In this notebook, we demonstrated the efficiency of accessing geospatial data stored in a Cloud-Optimized GeoTIFF (COG) using rasterio
. The key takeaways include:
- Efficient Data Access:
- COGs enable on-demand access to specific subsets of a dataset, avoiding the need to download entire files.
- We saw the efficiency of reading a single tile (aligned with the grid) and a region spanning multiple tiles.
- By aligning requests with the tile grid, we get optimal performance, minimizing unnecessary reads.
- Insights From Logged Byte-Range Requests:
- Logs showed that COGs allow us to minimize data transfer. Reads are grouped according to contiguous byte ranges to reduce requests.
- For the single tile read, only one byte-range request was required, while the overlapping region spanning four tiles was fetched in just two requests.