FlatGeobuf example

This notebook will give an overview of how to read and write FlatGeobuf files with GeoPandas, putting an emphasis on cloud-native operations where possible.

The primary way to interact with FlatGeobuf in Python is via bindings to GDAL, as there is no pure-Python implementation of FlatGeobuf.

There are two Python libraries that provide bindings to GDAL's vector support: fiona and pyogrio. Both are integrated into geopandas.read_file via the engine keyword, but pyogrio is much faster. Set engine="pyogrio" when using read_file or GeoDataFrame.to_file to speed up reading and writing significantly. We also suggest passing use_arrow=True when reading for a slight extra speedup (this is not supported when writing).

Note

fiona is the default engine for geopandas.read_file. It provides full-featured bindings to GDAL but does not implement vectorized operations. Vectorization refers to operating on whole arrays of data at once rather than operating on individual values using a Python for loop. fiona's non-vectorized approach means that each row of the source file is read individually in Python, inside a Python for loop. In contrast, pyogrio's vectorized implementation reads all rows in C before passing the data to Python, allowing it to achieve vast speedups (up to 40x) over fiona.

You can opt in to using pyogrio with geopandas.read_file by passing engine="pyogrio".

Additionally, if you’re using GDAL version 3.6 or higher (usually the case when using pyogrio), you can pass use_arrow=True to geopandas.read_file to use pyogrio’s support for GDAL’s RFC 86, which speeds up data reading even more.
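If you don't want to pass engine="pyogrio" on every call, you can also set it as the global default. This is a minimal sketch, assuming the geopandas.options.io_engine option available in geopandas 0.13+ (the version used in this notebook):

import geopandas as gpd

# Make pyogrio the default engine for all read_file/to_file calls
# (geopandas 0.13+); individual calls can still override it.
gpd.options.io_engine = "pyogrio"

The rest of this notebook passes engine= explicitly so the two engines can be compared side by side.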

Local vs Remote reading

The cloud-native vector ecosystem is young and doesn't yet work as seamlessly as COG and Zarr do for subsetted access. We download data here to demonstrate important concepts, but we look to update these methods once better datasets and tools are available for working with FlatGeobuf without downloading entire files.

At the end of the notebook we have an example of reading via a spatial filter. Keep in mind that using a large spatial filter will cause the read to take a long time.

Environment

The packages needed for this notebook can be installed with conda or mamba. Using the environment.yml from this folder, run:

conda env create -f environment.yml

or

mamba env create -f environment.yml

Alternatively, you can install the versions of pyogrio and geopandas used in this notebook with pip:

pip install pyogrio==0.6.0 geopandas==0.13.2

Imports

from tempfile import TemporaryDirectory
from urllib.request import urlretrieve

import geopandas as gpd
import pyogrio
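
Since use_arrow=True requires GDAL 3.6 or higher, it can be useful to check which GDAL version pyogrio was built against. A quick sanity check, assuming pyogrio's __gdal_version__ attribute (present in recent releases):

# Confirm the installed versions; use_arrow=True needs GDAL >= 3.6
print(pyogrio.__version__)
print(pyogrio.__gdal_version__)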

Reading from local disk

First we'll cover reading FlatGeobuf from local disk storage, using the US counties FlatGeobuf data from this example. This file is only 13 MB in size; we'll download it to demonstrate simple loading from disk.

# Create a temporary directory in which to save the file
tmpdir = TemporaryDirectory()

# URL to download
url = "https://flatgeobuf.org/test/data/UScounties.fgb"

# Download, saving the output path
local_fgb_path, _ = urlretrieve(url, f"{tmpdir.name}/counties.fgb")
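
As a quick check that the download succeeded, we can inspect the file size on disk (standard library only):

import os

# The file should be roughly 13 MB
print(f"{os.path.getsize(local_fgb_path) / 1e6:.1f} MB")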

In each of the cases below, we use geopandas.read_file to read the file into a GeoDataFrame.

First we’ll show that reading this file with engine="fiona" (the default) is slower. Taking an extra 500 milliseconds might not seem like a lot, but this file contains only 3,000 rows, so this difference gets magnified with larger files.

%time gdf = gpd.read_file(local_fgb_path, engine="fiona")
CPU times: user 565 ms, sys: 16.9 ms, total: 582 ms
Wall time: 685 ms

Passing engine="pyogrio" speeds up loading by 18x here!

%time gdf = gpd.read_file(local_fgb_path, engine="pyogrio")
CPU times: user 25.3 ms, sys: 6.84 ms, total: 32.1 ms
Wall time: 31.3 ms

Using use_arrow=True often makes loading slightly faster again! We’re now 21x faster than using fiona.

%time gdf = gpd.read_file(local_fgb_path, engine="pyogrio", use_arrow=True)
CPU times: user 19.7 ms, sys: 10.1 ms, total: 29.7 ms
Wall time: 29.1 ms

Writing to local disk

Similarly, we can use GeoDataFrame.to_file to write to a local FlatGeobuf file. As expected, writing can be much faster if you pass engine="pyogrio".

By default, this writes a spatial index to the created FlatGeobuf file.

%time gdf.to_file(f"{tmpdir.name}/out_fiona.fgb")
CPU times: user 362 ms, sys: 44.4 ms, total: 407 ms
Wall time: 418 ms
%time gdf.to_file(f"{tmpdir.name}/out_pyogrio.fgb", engine="pyogrio")
CPU times: user 60.8 ms, sys: 23.4 ms, total: 84.2 ms
Wall time: 83.5 ms
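
If you don't need the spatial index (for example, if the file will only ever be read in full), you can skip writing it to produce a smaller file. This is a sketch, assuming the GDAL FlatGeobuf driver's SPATIAL_INDEX layer creation option, which pyogrio forwards from extra keyword arguments:

# Write without a spatial index; the file is smaller and faster to
# write, but it won't support efficient bbox reads.
gdf.to_file(
    f"{tmpdir.name}/out_no_index.fgb",
    engine="pyogrio",
    SPATIAL_INDEX="NO",
)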

Reading from the cloud

Knowing how to read and write local files is important, but given that FlatGeobuf is a cloud-optimized format, it’s important to be able to read from cloud-hosted files as well.

For this example, we'll use the EuroCrops data hosted on Source Cooperative, because it provides the same data in both FlatGeobuf and GeoParquet formats; using one dataset across both the FlatGeobuf and GeoParquet example notebooks makes them easier to compare.

url = "https://data.source.coop/cholmes/eurocrops/unprojected/flatgeobuf/FR_2018_EC21.fgb"

Usually when reading from the cloud, you want to filter on some spatial extent. Before filtering, it's useful to inspect the file; pyogrio offers a read_info function that returns many pieces of information about it:

pyogrio.read_info(url)
{'crs': 'EPSG:4326',
 'encoding': 'UTF-8',
 'fields': array(['ID_PARCEL', 'SURF_PARC', 'CODE_CULTU', 'CODE_GROUP', 'CULTURE_D1',
        'CULTURE_D2', 'EC_org_n', 'EC_trans_n', 'EC_hcat_n', 'EC_hcat_c'],
       dtype=object),
 'dtypes': array(['object', 'float64', 'object', 'object', 'object', 'object',
        'object', 'object', 'object', 'object'], dtype=object),
 'geometry_type': 'MultiPolygon',
 'features': 9517874,
 'driver': 'FlatGeobuf',
 'capabilities': {'random_read': 1,
  'fast_set_next_by_index': 0,
  'fast_spatial_filter': 1},
 'layer_metadata': None,
 'dataset_metadata': None}
Note

Sadly, the output of read_info does not yet include the bounding box of the file, even though the FlatGeobuf header contains that information. This may be a reason to consider externalizing metadata using SpatioTemporal Asset Catalog (STAC) files in the future.
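
If you do need the extent, one workaround is to ask GDAL for it directly. Here is a sketch using fiona's Collection.bounds property, assuming fiona routes the https:// URL through GDAL's /vsicurl/ virtual file system:

import fiona

# The layer extent comes from the FlatGeobuf header, so this should
# not require downloading the whole file.
with fiona.open(url) as src:
    print(src.bounds)  # (minx, miny, maxx, maxy)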

For now we'll hard-code a region around Valence in the south of France, which we know to be within our dataset.

# The order of bounds is
# (min longitude, min latitude, max longitude, max latitude)
bounds = (4.301042, 44.822783, 4.410535, 44.877149)

We can fetch a dataframe containing only the records in these bounds by passing a bbox argument to read_file. Note that the Coordinate Reference System of this bounding box must match the CRS of the dataset. Here, we know from the output of read_info that the CRS of the dataset is EPSG:4326, so we can pass a longitude-latitude bounding box.
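
If your area of interest were defined in a different CRS, you could reproject it to the dataset's CRS before filtering. A minimal sketch, using a hypothetical Web Mercator (EPSG:3857) bounding box:

from shapely.geometry import box

# Hypothetical AOI in EPSG:3857; reproject its bounds to EPSG:4326
# before passing them as bbox.
aoi = gpd.GeoSeries([box(478000, 5590000, 491000, 5599000)], crs="EPSG:3857")
bounds_4326 = tuple(aoi.to_crs("EPSG:4326").total_bounds)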

%time crops_gdf = gpd.read_file(url, bbox=bounds)
CPU times: user 144 ms, sys: 21.4 ms, total: 165 ms
Wall time: 6 s

Passing engine="pyogrio" is faster here as well, though the speedup is smaller than for local reads, which suggests that much of the time is taken up in network requests, not in parsing the actual data into Python.

%time crops_gdf = gpd.read_file(url, bbox=bounds, engine="pyogrio")
CPU times: user 26.9 ms, sys: 2.98 ms, total: 29.9 ms
Wall time: 490 ms

This gives us a much smaller dataset of only 415 rows (down from 9.5 million rows in the original dataset).

crops_gdf.head()
ID_PARCEL SURF_PARC CODE_CULTU CODE_GROUP CULTURE_D1 CULTURE_D2 EC_org_n EC_trans_n EC_hcat_n EC_hcat_c geometry
0 9484573 11.08 SPL 17 None None Surface pastorale - ressources fourragères lig... Pastoral area - predominant woody fodder resou... other_tree_wood_forest 3306990000 MULTIPOLYGON (((4.41142 44.85441, 4.41145 44.8...
1 487218 2.53 PPH 18 None None Prairie permanente - herbe prédominante (resso... Permanent pasture - predominantly grass (woody... pasture_meadow_grassland_grass 3302000000 MULTIPOLYGON (((4.41366 44.85898, 4.41373 44.8...
2 487224 0.89 CTG 22 None None Châtaigne Chestnut sweet_chestnuts 3303030500 MULTIPOLYGON (((4.41159 44.85891, 4.41034 44.8...
3 9484542 1.31 CTG 22 None None Châtaigne Chestnut sweet_chestnuts 3303030500 MULTIPOLYGON (((4.40904 44.85805, 4.41034 44.8...
4 487216 1.70 BOP 17 None None Bois pâturé Grazed wood other_tree_wood_forest 3306990000 MULTIPOLYGON (((4.41135 44.85476, 4.41134 44.8...
crops_gdf.shape
(415, 11)

There are other useful keyword arguments to read_file. Since we’re using the pyogrio engine, we can pass specific column names into read_file, and only those columns will be parsed. In the case of FlatGeobuf, this doesn’t save us much time, because the same amount of data needs to be fetched. (Though if running this cell soon after the previous one, the relevant data will be cached and won’t be downloaded again.)

column_names = ["ID_PARCEL", "SURF_PARC", "CODE_CULTU", "geometry"]
%time crops_gdf = gpd.read_file(url, bbox=bounds, columns=column_names, engine="pyogrio")
CPU times: user 25 ms, sys: 2.47 ms, total: 27.4 ms
Wall time: 706 ms
crops_gdf.head()
CODE_CULTU ID_PARCEL SURF_PARC geometry
0 SPL 9484573 11.08 MULTIPOLYGON (((4.41142 44.85441, 4.41145 44.8...
1 PPH 487218 2.53 MULTIPOLYGON (((4.41366 44.85898, 4.41373 44.8...
2 CTG 487224 0.89 MULTIPOLYGON (((4.41159 44.85891, 4.41034 44.8...
3 CTG 9484542 1.31 MULTIPOLYGON (((4.40904 44.85805, 4.41034 44.8...
4 BOP 487216 1.70 MULTIPOLYGON (((4.41135 44.85476, 4.41134 44.8...
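
The same filtered read can also be expressed directly against pyogrio, bypassing geopandas.read_file. A sketch using pyogrio.read_dataframe, which accepts the same bbox and columns parameters (attribute columns are selected with columns; the geometry column is returned by default):

# Equivalent direct pyogrio call; returns a GeoDataFrame
crops_gdf = pyogrio.read_dataframe(
    url,
    bbox=bounds,
    columns=["ID_PARCEL", "SURF_PARC", "CODE_CULTU"],
)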