FlatGeobuf example

This notebook will give an overview of how to read and write FlatGeobuf files with GeoPandas, putting an emphasis on cloud-native operations where possible.

The primary way to interact with FlatGeobuf in Python is via bindings to GDAL, as there is no pure-Python implementation of FlatGeobuf.

There are two Python libraries that provide bindings to GDAL's vector support: fiona and pyogrio. Both are integrated into geopandas.read_file via the engine keyword, but pyogrio is much faster. Set engine="pyogrio" when using read_file or GeoDataFrame.to_file to speed up reading and writing significantly. We also suggest passing use_arrow=True when reading for a slight extra speedup (this is not supported when writing).

Note

fiona is the default engine for geopandas.read_file. It provides full-featured bindings to GDAL but does not implement vectorized operations. Vectorization refers to operating on whole arrays of data at once rather than operating on individual values using a Python for loop. fiona's non-vectorized approach means that each row of the source file is read individually in Python, inside a Python for loop. In contrast, pyogrio's vectorized implementation reads all rows in C before passing the data to Python, allowing it to achieve vast speedups (up to 40x) over fiona.

You can opt in to using pyogrio with geopandas.read_file by passing engine="pyogrio".

Additionally, if you’re using GDAL version 3.6 or higher (usually the case when using pyogrio), you can pass use_arrow=True to geopandas.read_file to use pyogrio’s support for GDAL’s RFC 86, which speeds up data reading even more.
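If you don't want to pass engine="pyogrio" on every call, you can also set it as the global default. This is a minimal sketch, assuming the geopandas.options.io_engine option available in geopandas 0.13+ (the version used in this notebook):

import geopandas as gpd

# Make pyogrio the default engine for all read_file/to_file calls
# (geopandas 0.13+); individual calls can still override it.
gpd.options.io_engine = "pyogrio"

The rest of this notebook passes engine= explicitly so the two engines can be compared side by side.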

Local vs Remote reading

The cloud-native vector ecosystem is young and doesn't yet work as seamlessly as COG and Zarr do for subsetted access. We download data here to demonstrate important concepts, but we look to update these methods once better datasets and tools are available for working with FlatGeobuf without downloading entire files.

At the end of the notebook we have an example of reading via a spatial filter. Keep in mind that using a large spatial filter will cause the read to take a long time.

Environment

The packages needed for this notebook can be installed with conda or mamba. Using the environment.yml from this folder, run:

conda env create -f environment.yml

or

mamba env create -f environment.yml

Alternatively, you can install the versions of pyogrio and geopandas used in this notebook with pip:

pip install pyogrio==0.6.0 geopandas==0.13.2

Imports

from tempfile import TemporaryDirectory
from urllib.request import urlretrieve

import geopandas as gpd
import pyogrio
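
Since use_arrow=True requires GDAL 3.6 or higher, it can be useful to check which GDAL version pyogrio was built against. A quick sanity check, assuming pyogrio's __gdal_version__ attribute (present in recent releases):

# Confirm the installed versions; use_arrow=True needs GDAL >= 3.6
print(pyogrio.__version__)
print(pyogrio.__gdal_version__)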

Reading from local disk

First we'll cover reading FlatGeobuf from local disk storage, using the US counties FlatGeobuf data from this example. This file is only 13 MB in size; we'll download it to demonstrate simple loading from disk.

# Create a temporary directory in which to save the file
tmpdir = TemporaryDirectory()

# URL to download
url = "https://flatgeobuf.org/test/data/UScounties.fgb"

# Download, saving the output path
local_fgb_path, _ = urlretrieve(url, f"{tmpdir.name}/counties.fgb")
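
As a quick check that the download succeeded, we can inspect the file size on disk (standard library only):

import os

# The file should be roughly 13 MB
print(f"{os.path.getsize(local_fgb_path) / 1e6:.1f} MB")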

In each of the cases below, we use geopandas.read_file to read the file into a GeoDataFrame.

First we’ll show that reading this file with engine="fiona" (the default) is slower. Taking an extra 500 milliseconds might not seem like a lot, but this file contains only 3,000 rows, so this difference gets magnified with larger files.

%time gdf = gpd.read_file(local_fgb_path, engine="fiona")
CPU times: user 565 ms, sys: 16.9 ms, total: 582 ms
Wall time: 685 ms

Passing engine="pyogrio" speeds up loading by 18x here!

%time gdf = gpd.read_file(local_fgb_path, engine="pyogrio")
CPU times: user 25.3 ms, sys: 6.84 ms, total: 32.1 ms
Wall time: 31.3 ms

Using use_arrow=True often makes loading slightly faster again! We’re now 21x faster than using fiona.

%time gdf = gpd.read_file(local_fgb_path, engine="pyogrio", use_arrow=True)
CPU times: user 19.7 ms, sys: 10.1 ms, total: 29.7 ms
Wall time: 29.1 ms

Writing to local disk

Similarly, we can use GeoDataFrame.to_file to write to a local FlatGeobuf file. As expected, writing can be much faster if you pass engine="pyogrio".

By default, this writes a spatial index to the created FlatGeobuf file.

%time gdf.to_file(f"{tmpdir.name}/out_fiona.fgb")
CPU times: user 362 ms, sys: 44.4 ms, total: 407 ms
Wall time: 418 ms
%time gdf.to_file(f"{tmpdir.name}/out_pyogrio.fgb", engine="pyogrio")
CPU times: user 60.8 ms, sys: 23.4 ms, total: 84.2 ms
Wall time: 83.5 ms
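
If you don't need the spatial index (for example, if the file will only ever be read in full), you can skip writing it to produce a smaller file. This is a sketch, assuming the GDAL FlatGeobuf driver's SPATIAL_INDEX layer creation option, which pyogrio forwards from extra keyword arguments:

# Write without a spatial index; the file is smaller and faster to
# write, but it won't support efficient bbox reads.
gdf.to_file(
    f"{tmpdir.name}/out_no_index.fgb",
    engine="pyogrio",
    SPATIAL_INDEX="NO",
)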

Reading from the cloud

Knowing how to read and write local files is important, but given that FlatGeobuf is a cloud-optimized format, it’s important to be able to read from cloud-hosted files as well.

For this example, we'll use the EuroCrops data hosted on Source Cooperative, because it provides the same data in both FlatGeobuf and GeoParquet formats; using one dataset across both the FlatGeobuf and GeoParquet example notebooks makes them easier to compare.

url = "https://data.source.coop/cholmes/eurocrops/unprojected/flatgeobuf/FR_2018_EC21.fgb"

Usually when reading from the cloud, you want to filter on some spatial extent. Before filtering, it's useful to inspect the file; pyogrio offers a read_info function that returns many pieces of information about it:

pyogrio.read_info(url)
{'crs': 'EPSG:4326',
 'encoding': 'UTF-8',
 'fields': array(['ID_PARCEL', 'SURF_PARC', 'CODE_CULTU', 'CODE_GROUP', 'CULTURE_D1',
        'CULTURE_D2', 'EC_org_n', 'EC_trans_n', 'EC_hcat_n', 'EC_hcat_c'],
       dtype=object),
 'dtypes': array(['object', 'float64', 'object', 'object', 'object', 'object',
        'object', 'object', 'object', 'object'], dtype=object),
 'geometry_type': 'MultiPolygon',
 'features': 9517874,
 'driver': 'FlatGeobuf',
 'capabilities': {'random_read': 1,
  'fast_set_next_by_index': 0,
  'fast_spatial_filter': 1},
 'layer_metadata': None,
 'dataset_metadata': None}
Note

Sadly, the output of read_info does not yet include the bounding box of the file, even though the FlatGeobuf header contains that information. This may be a reason to consider externalizing metadata using SpatioTemporal Asset Catalog (STAC) files in the future.
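
If you do need the extent, one workaround is to ask GDAL for it directly. Here is a sketch using fiona's Collection.bounds property, assuming fiona routes the https:// URL through GDAL's /vsicurl/ virtual file system:

import fiona

# The layer extent comes from the FlatGeobuf header, so this should
# not require downloading the whole file.
with fiona.open(url) as src:
    print(src.bounds)  # (minx, miny, maxx, maxy)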

For now we'll hard-code a region around Valence in the south of France, which we know to be within our dataset.

# The order of bounds is
# (min longitude, min latitude, max longitude, max latitude)
bounds = (4.301042, 44.822783, 4.410535, 44.877149)

We can fetch a dataframe containing only the records in these bounds by passing a bbox argument to read_file. Note that the Coordinate Reference System of this bounding box must match the CRS of the dataset. Here, we know from the output of read_info that the CRS of the dataset is EPSG:4326, so we can pass a longitude-latitude bounding box.
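
If your area of interest were defined in a different CRS, you could reproject it to the dataset's CRS before filtering. A minimal sketch, using a hypothetical Web Mercator (EPSG:3857) bounding box:

from shapely.geometry import box

# Hypothetical AOI in EPSG:3857; reproject its bounds to EPSG:4326
# before passing them as bbox.
aoi = gpd.GeoSeries([box(478000, 5590000, 491000, 5599000)], crs="EPSG:3857")
bounds_4326 = tuple(aoi.to_crs("EPSG:4326").total_bounds)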

%time crops_gdf = gpd.read_file(url, bbox=bounds)
CPU times: user 144 ms, sys: 21.4 ms, total: 165 ms
Wall time: 6 s

Passing engine="pyogrio" is faster here as well, though the speedup is smaller than for local reads, which suggests that much of the time is taken up in network requests, not in parsing the actual data into Python.

%time crops_gdf = gpd.read_file(url, bbox=bounds, engine="pyogrio")
CPU times: user 26.9 ms, sys: 2.98 ms, total: 29.9 ms
Wall time: 490 ms

This gives us a much smaller dataset of only 415 rows (down from 9.5 million rows in the original dataset).

crops_gdf.head()
ID_PARCEL SURF_PARC CODE_CULTU CODE_GROUP CULTURE_D1 CULTURE_D2 EC_org_n EC_trans_n EC_hcat_n EC_hcat_c geometry
0 9484573 11.08 SPL 17 None None Surface pastorale - ressources fourragères lig... Pastoral area - predominant woody fodder resou... other_tree_wood_forest 3306990000 MULTIPOLYGON (((4.41142 44.85441, 4.41145 44.8...
1 487218 2.53 PPH 18 None None Prairie permanente - herbe prédominante (resso... Permanent pasture - predominantly grass (woody... pasture_meadow_grassland_grass 3302000000 MULTIPOLYGON (((4.41366 44.85898, 4.41373 44.8...
2 487224 0.89 CTG 22 None None Châtaigne Chestnut sweet_chestnuts 3303030500 MULTIPOLYGON (((4.41159 44.85891, 4.41034 44.8...
3 9484542 1.31 CTG 22 None None Châtaigne Chestnut sweet_chestnuts 3303030500 MULTIPOLYGON (((4.40904 44.85805, 4.41034 44.8...
4 487216 1.70 BOP 17 None None Bois pâturé Grazed wood other_tree_wood_forest 3306990000 MULTIPOLYGON (((4.41135 44.85476, 4.41134 44.8...
crops_gdf.shape
(415, 11)

There are other useful keyword arguments to read_file. Since we’re using the pyogrio engine, we can pass specific column names into read_file, and only those columns will be parsed. In the case of FlatGeobuf, this doesn’t save us much time, because the same amount of data needs to be fetched. (Though if running this cell soon after the previous one, the relevant data will be cached and won’t be downloaded again.)

column_names = ["ID_PARCEL", "SURF_PARC", "CODE_CULTU", "geometry"]
%time crops_gdf = gpd.read_file(url, bbox=bounds, columns=column_names, engine="pyogrio")
CPU times: user 25 ms, sys: 2.47 ms, total: 27.4 ms
Wall time: 706 ms
crops_gdf.head()
CODE_CULTU ID_PARCEL SURF_PARC geometry
0 SPL 9484573 11.08 MULTIPOLYGON (((4.41142 44.85441, 4.41145 44.8...
1 PPH 487218 2.53 MULTIPOLYGON (((4.41366 44.85898, 4.41373 44.8...
2 CTG 487224 0.89 MULTIPOLYGON (((4.41159 44.85891, 4.41034 44.8...
3 CTG 9484542 1.31 MULTIPOLYGON (((4.40904 44.85805, 4.41034 44.8...
4 BOP 487216 1.70 MULTIPOLYGON (((4.41135 44.85476, 4.41134 44.8...
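
The same filtered read can also be expressed directly against pyogrio, bypassing geopandas.read_file. A sketch using pyogrio.read_dataframe, which accepts the same bbox and columns parameters (attribute columns are selected with columns; the geometry column is returned by default):

# Equivalent direct pyogrio call; returns a GeoDataFrame
crops_gdf = pyogrio.read_dataframe(
    url,
    bbox=bounds,
    columns=["ID_PARCEL", "SURF_PARC", "CODE_CULTU"],
)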