GeoParquet Example

This notebook will give an overview of how to read and write GeoParquet files with GeoPandas, putting an emphasis on cloud-native operations where possible.

The easiest way to read and write GeoParquet files is to use GeoPandas’ read_parquet and to_parquet functions.

Note

Make sure to use the specific read_parquet and to_parquet functions. These will be much, much faster than using the usual read_file and to_file.
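For instance (a minimal sketch; the file name here is hypothetical):

import geopandas as gpd

# Fast, pyarrow-based Parquet I/O
gdf = gpd.read_parquet("example.parquet")
gdf.to_parquet("example_copy.parquet")

# gpd.read_file("example.parquet") would also work (via GDAL's Parquet driver),
# but it is much slower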

Environment

The packages needed for this notebook can be installed with conda or mamba. Using the environment.yml from this folder run:

conda env create -f environment.yml

or

mamba env create -f environment.yml

This notebook has been tested to work with the listed Conda environment. If you don’t want to use Conda or Mamba, install the latest versions of geopandas, fsspec, and pyarrow with pip. Note that you’ll also need the GDAL CLI with its Parquet driver; on macOS you can install it via brew install gdal.
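For a pip-based setup, that would look something like:

pip install -U geopandas fsspec pyarrow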

Imports

from urllib.request import urlretrieve

import fsspec
import geopandas as gpd
from fsspec.implementations.http import HTTPFileSystem

Comparison with FlatGeobuf

To compare GeoParquet with FlatGeobuf, we’ll first cover reading and writing GeoParquet files on local disk. To be consistent with the FlatGeobuf example, we’ll fetch the same US counties FlatGeobuf file (13 MB) and convert it to GeoParquet using ogr2ogr.

# URL to download
url = "https://flatgeobuf.org/test/data/UScounties.fgb"

# Download, saving to the current directory
local_fgb_path, _ = urlretrieve(url, "countries.fgb")
!ogr2ogr countries.parquet countries.fgb

Loading this GeoParquet file is really fast: about 13% faster than loading the same data via FlatGeobuf (as shown in the FlatGeobuf example notebook).

%time gdf = gpd.read_parquet("countries.parquet")
CPU times: user 23.8 ms, sys: 11.8 ms, total: 35.6 ms
Wall time: 34.1 ms
gdf
STATE_FIPS COUNTY_FIP FIPS STATE NAME LSAD geometry
0 23 009 23009 ME Hancock County MULTIPOLYGON (((-68.53108 44.33278, -68.53348 ...
1 33 007 33007 NH Coos County MULTIPOLYGON (((-71.05975 45.01485, -71.06939 ...
2 50 009 50009 VT Essex County MULTIPOLYGON (((-71.49463 44.90874, -71.49392 ...
3 50 019 50019 VT Orleans County MULTIPOLYGON (((-72.14193 45.00600, -72.16051 ...
4 23 007 23007 ME Franklin County MULTIPOLYGON (((-70.83471 45.27514, -70.77984 ...
... ... ... ... ... ... ... ...
3216 15 003 15003 HI Honolulu County MULTIPOLYGON (((-171.73761 25.79210, -171.7513...
3217 15 007 15007 HI Kauai County MULTIPOLYGON (((-160.55535 21.66345, -160.5541...
3218 15 009 15009 HI Maui County MULTIPOLYGON (((-157.06121 20.89150, -157.0611...
3219 15 001 15001 HI Hawaii County MULTIPOLYGON (((-155.08767 19.72887, -155.0909...
3220 15 005 15005 HI Kalawao County MULTIPOLYGON (((-157.01455 21.18550, -157.0145...

3221 rows × 7 columns

Writing to local disk

We can use GeoDataFrame.to_parquet to write this data out locally as GeoParquet. This is about 3x faster than writing the same dataset to FlatGeobuf, though note that FlatGeobuf’s writer is also computing a spatial index.

%time gdf.to_parquet("countries_written.parquet")
CPU times: user 42.3 ms, sys: 12.6 ms, total: 55 ms
Wall time: 53.9 ms

Reading from the cloud

As of GeoParquet version 1.0.0-rc.1, the specification does not yet include a spatial index. Therefore, there is not yet an API in GeoPandas to read only the data within a given bounding box.
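In the meantime, you can still filter spatially after the data is in memory, using GeoPandas’ .cx coordinate indexer. A minimal sketch, assuming gdf is a GeoDataFrame you’ve already loaded and using a made-up bounding box:

# Select rows whose geometries intersect a (hypothetical) bounding box,
# written as gdf.cx[xmin:xmax, ymin:ymax] in the data's CRS
subset = gdf.cx[2.0:3.0, 48.0:49.0]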

What is already efficient in GeoParquet is reading only specified columns from a dataset.

url = "https://data.source.coop/cholmes/eurocrops/unprojected/geoparquet/FR_2018_EC21.parquet"

Note that since we’re fetching this data directly from the cloud, we need to pass in an fsspec filesystem object. Otherwise GeoPandas will attempt to load a local file.

filesystem = HTTPFileSystem()

By default, calling read_parquet will fetch the entire file and parse it all into a single GeoDataFrame. Since this is a 3 GB file, downloading it takes a long time:

# This cell will take a few minutes to run, because it downloads the entire file
# %time gdf = gpd.read_parquet(url, filesystem=filesystem)

We can make this faster by only fetching specific columns. Because GeoParquet stores data in a columnar fashion, when selecting only specific columns we can download a lot less data.

# This cell still takes a while to run, because it downloads all rows of the selected columns
# %time gdf = gpd.read_parquet(url, columns=["ID_PARCEL", "geometry"], filesystem=filesystem)

Working with GeoParquet row groups (Advanced)

As described in the intro document, GeoParquet is a chunked format: each file is split into row groups, and any one row group can be fetched and decoded on its own very efficiently. This lets you stream a dataset, loading and operating on one chunk at a time, when the dataset is larger than your memory (see the streaming sketch at the end of this section).

GeoPandas does not yet have built-in support for working with row groups, so this section will use the underlying pyarrow library directly.

import pyarrow.parquet as pq
from geopandas.io.arrow import _arrow_to_geopandas

First, we’ll create a ParquetFile object from the remote URL. This loads only the file’s metadata, which lets you inspect the schema and the number of columns, rows, and row groups. Because no actual data is downloaded, it completes almost instantly.

parquet_file = pq.ParquetFile(url, filesystem=filesystem)

We can access the column names in the dataset:

parquet_file.schema_arrow.names
['ID_PARCEL',
 'SURF_PARC',
 'CODE_CULTU',
 'CODE_GROUP',
 'CULTURE_D1',
 'CULTURE_D2',
 'EC_org_n',
 'EC_trans_n',
 'EC_hcat_n',
 'EC_hcat_c',
 'geometry']

This Parquet file includes 9.5 million rows:

parquet_file.metadata.num_rows
9517874

And 146 row groups. Given that each row group holds about the same number of rows, each one contains around 9,517,874 / 146 ≈ 65,000 rows.

parquet_file.num_row_groups
146

Then to load one of the row groups by numeric index, we can call ParquetFile.read_row_group.

pyarrow_table = parquet_file.read_row_group(0)

Note that this returns a pyarrow.Table, not a geopandas.GeoDataFrame. To convert between the two, we can use the internal _arrow_to_geopandas helper. This conversion is very fast.

geopandas_gdf = _arrow_to_geopandas(pyarrow_table, parquet_file.metadata.metadata)

As expected, this row group contains right around 65,000 rows (exactly 65,536, i.e. 2^16):

geopandas_gdf.shape
(65536, 11)
geopandas_gdf.head()
ID_PARCEL SURF_PARC CODE_CULTU CODE_GROUP CULTURE_D1 CULTURE_D2 EC_org_n EC_trans_n EC_hcat_n EC_hcat_c geometry
0 123563 6.38 CZH 5 None None Colza d’hiver Winter rapeseed winter_rapeseed_rape 3301060401 MULTIPOLYGON (((3.33896 49.84122, 3.33948 49.8...
1 5527076 2.30 PPH 18 None None Prairie permanente - herbe prédominante (resso... Permanent pasture - predominantly grass (woody... pasture_meadow_grassland_grass 3302000000 MULTIPOLYGON (((-1.44483 49.61280, -1.44467 49...
2 11479241 6.33 PPH 18 None None Prairie permanente - herbe prédominante (resso... Permanent pasture - predominantly grass (woody... pasture_meadow_grassland_grass 3302000000 MULTIPOLYGON (((2.87821 46.53674, 2.87820 46.5...
3 12928442 5.10 PPH 18 None None Prairie permanente - herbe prédominante (resso... Permanent pasture - predominantly grass (woody... pasture_meadow_grassland_grass 3302000000 MULTIPOLYGON (((-0.19026 48.28723, -0.19025 48...
4 318389 0.92 PPH 18 None None Prairie permanente - herbe prédominante (resso... Permanent pasture - predominantly grass (woody... pasture_meadow_grassland_grass 3302000000 MULTIPOLYGON (((5.72084 44.03576, 5.72081 44.0...

As before, we can speed up the data fetching by requesting only specific columns in the read_row_group call:

pyarrow_table = parquet_file.read_row_group(0, columns=["ID_PARCEL", "geometry"])

Then the resulting GeoDataFrame will only have those two columns:

_arrow_to_geopandas(pyarrow_table, parquet_file.metadata.metadata).head()
ID_PARCEL geometry
0 123563 MULTIPOLYGON (((3.33896 49.84122, 3.33948 49.8...
1 5527076 MULTIPOLYGON (((-1.44483 49.61280, -1.44467 49...
2 11479241 MULTIPOLYGON (((2.87821 46.53674, 2.87820 46.5...
3 12928442 MULTIPOLYGON (((-0.19026 48.28723, -0.19025 48...
4 318389 MULTIPOLYGON (((5.72084 44.03576, 5.72081 44.0...
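
Putting these pieces together, here’s a minimal streaming sketch (not part of the original timings): it processes the file one row group at a time, so only one chunk is ever held in memory. Note that it still downloads every row group eventually, just not all at once; the per-chunk work below is a placeholder that simply counts rows.

# Stream over all row groups, converting each chunk to a GeoDataFrame
n_rows = 0
for i in range(parquet_file.num_row_groups):
    table = parquet_file.read_row_group(i, columns=["ID_PARCEL", "geometry"])
    chunk = _arrow_to_geopandas(table, parquet_file.metadata.metadata)
    n_rows += len(chunk)  # replace with your real per-chunk computation

print(n_rows)  # should equal parquet_file.metadata.num_rows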