GeoParquet Example

This notebook will give an overview of how to read and write GeoParquet files with GeoPandas, putting an emphasis on cloud-native operations where possible.

The easiest way to read and write GeoParquet files is to use GeoPandas’ read_parquet and to_parquet functions.

Note

Make sure to use the specific read_parquet and to_parquet functions. These will be much, much faster than using the usual read_file and to_file.
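For instance (a minimal sketch; the file name here is hypothetical):

import geopandas as gpd

# Fast, pyarrow-based Parquet I/O
gdf = gpd.read_parquet("example.parquet")
gdf.to_parquet("example_copy.parquet")

# gpd.read_file("example.parquet") would also work (via GDAL's Parquet driver),
# but it is much slower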

Environment

The packages needed for this notebook can be installed with conda or mamba. Using the environment.yml from this folder run:

conda env create -f environment.yml

or

mamba env create -f environment.yml

This notebook has been tested to work with the listed Conda environment. If you don’t want to use Conda or Mamba, install the latest versions of geopandas, fsspec, and pyarrow with pip. Note that you’ll also need the GDAL CLI with its Parquet driver; on macOS you can install it via brew install gdal.
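For a pip-based setup, that would look something like:

pip install -U geopandas fsspec pyarrow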

Imports

from urllib.request import urlretrieve

import fsspec
import geopandas as gpd
from fsspec.implementations.http import HTTPFileSystem

Comparison with FlatGeobuf

To compare GeoParquet with FlatGeobuf, we’ll first cover reading and writing GeoParquet files on local disk. To be consistent with the FlatGeobuf example, we’ll fetch the same US counties FlatGeobuf file (13 MB) and convert it to GeoParquet using ogr2ogr.

# URL to download
url = "https://flatgeobuf.org/test/data/UScounties.fgb"

# Download, saving to the current directory
local_fgb_path, _ = urlretrieve(url, "countries.fgb")
!ogr2ogr countries.parquet countries.fgb

Loading this GeoParquet file is really fast: about 13% faster than loading the same data via FlatGeobuf (as shown in the FlatGeobuf example notebook).

%time gdf = gpd.read_parquet("countries.parquet")
CPU times: user 23.8 ms, sys: 11.8 ms, total: 35.6 ms
Wall time: 34.1 ms
gdf
STATE_FIPS COUNTY_FIP FIPS STATE NAME LSAD geometry
0 23 009 23009 ME Hancock County MULTIPOLYGON (((-68.53108 44.33278, -68.53348 ...
1 33 007 33007 NH Coos County MULTIPOLYGON (((-71.05975 45.01485, -71.06939 ...
2 50 009 50009 VT Essex County MULTIPOLYGON (((-71.49463 44.90874, -71.49392 ...
3 50 019 50019 VT Orleans County MULTIPOLYGON (((-72.14193 45.00600, -72.16051 ...
4 23 007 23007 ME Franklin County MULTIPOLYGON (((-70.83471 45.27514, -70.77984 ...
... ... ... ... ... ... ... ...
3216 15 003 15003 HI Honolulu County MULTIPOLYGON (((-171.73761 25.79210, -171.7513...
3217 15 007 15007 HI Kauai County MULTIPOLYGON (((-160.55535 21.66345, -160.5541...
3218 15 009 15009 HI Maui County MULTIPOLYGON (((-157.06121 20.89150, -157.0611...
3219 15 001 15001 HI Hawaii County MULTIPOLYGON (((-155.08767 19.72887, -155.0909...
3220 15 005 15005 HI Kalawao County MULTIPOLYGON (((-157.01455 21.18550, -157.0145...

3221 rows × 7 columns

Writing to local disk

We can use GeoDataFrame.to_parquet to write this data out locally as GeoParquet. This is about 3x faster than writing the same dataset to FlatGeobuf, though note that FlatGeobuf’s writer is also computing a spatial index.

%time gdf.to_parquet("countries_written.parquet")
CPU times: user 42.3 ms, sys: 12.6 ms, total: 55 ms
Wall time: 53.9 ms

Reading from the cloud

As of GeoParquet version 1.0.0-rc.1, the specification does not yet include a spatial index. Therefore, there is not yet an API in GeoPandas to read only the data within a given bounding box.
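In the meantime, you can still filter spatially after the data is in memory, using GeoPandas’ .cx coordinate indexer. A minimal sketch, assuming gdf is a GeoDataFrame you’ve already loaded and using a made-up bounding box:

# Select rows whose geometries intersect a (hypothetical) bounding box,
# written as gdf.cx[xmin:xmax, ymin:ymax] in the data's CRS
subset = gdf.cx[2.0:3.0, 48.0:49.0]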

What is already efficient in GeoParquet is reading only specified columns from a dataset.

url = "https://data.source.coop/cholmes/eurocrops/unprojected/geoparquet/FR_2018_EC21.parquet"

Note that since we’re fetching this data directly from the cloud, we need to pass in an fsspec filesystem object. Otherwise GeoPandas will attempt to load a local file.

filesystem = HTTPFileSystem()

By default, calling read_parquet will fetch the entire file and parse it all into a single GeoDataFrame. Since this is a 3 GB file, downloading it takes a long time:

# This cell will take a few minutes to run, because it downloads the entire file
# %time gdf = gpd.read_parquet(url, filesystem=filesystem)

We can make this faster by only fetching specific columns. Because GeoParquet stores data in a columnar fashion, when selecting only specific columns we can download a lot less data.

# This cell still takes a while to run, because it downloads all rows of the selected columns
# %time gdf = gpd.read_parquet(url, columns=["ID_PARCEL", "geometry"], filesystem=filesystem)

Working with GeoParquet row groups (Advanced)

As described in the intro document, GeoParquet is a chunked format: each file is split into row groups, and any one row group can be fetched and decoded on its own very efficiently. This lets you stream a dataset, loading and operating on one chunk at a time, when the dataset is larger than your memory (see the streaming sketch at the end of this section).

GeoPandas does not yet have built-in support for working with row groups, so this section will use the underlying pyarrow library directly.

import pyarrow.parquet as pq
from geopandas.io.arrow import _arrow_to_geopandas

First, we’ll create a ParquetFile object from the remote URL. This loads only the file’s metadata, which lets you inspect the schema and the number of columns, rows, and row groups. Because no actual data is downloaded, it completes almost instantly.

parquet_file = pq.ParquetFile(url, filesystem=filesystem)

We can access the column names in the dataset:

parquet_file.schema_arrow.names
['ID_PARCEL',
 'SURF_PARC',
 'CODE_CULTU',
 'CODE_GROUP',
 'CULTURE_D1',
 'CULTURE_D2',
 'EC_org_n',
 'EC_trans_n',
 'EC_hcat_n',
 'EC_hcat_c',
 'geometry']

This Parquet file includes 9.5 million rows:

parquet_file.metadata.num_rows
9517874

And 146 row groups. Given that each row group holds about the same number of rows, each one contains around 9,517,874 / 146 ≈ 65,000 rows.

parquet_file.num_row_groups
146

Then to load one of the row groups by numeric index, we can call ParquetFile.read_row_group.

pyarrow_table = parquet_file.read_row_group(0)

Note that this returns a pyarrow.Table, not a geopandas.GeoDataFrame. To convert between the two, we can use the internal _arrow_to_geopandas helper. This conversion is very fast.

geopandas_gdf = _arrow_to_geopandas(pyarrow_table, parquet_file.metadata.metadata)

As expected, this row group contains right around 65,000 rows (exactly 65,536, i.e. 2^16):

geopandas_gdf.shape
(65536, 11)
geopandas_gdf.head()
ID_PARCEL SURF_PARC CODE_CULTU CODE_GROUP CULTURE_D1 CULTURE_D2 EC_org_n EC_trans_n EC_hcat_n EC_hcat_c geometry
0 123563 6.38 CZH 5 None None Colza d’hiver Winter rapeseed winter_rapeseed_rape 3301060401 MULTIPOLYGON (((3.33896 49.84122, 3.33948 49.8...
1 5527076 2.30 PPH 18 None None Prairie permanente - herbe prédominante (resso... Permanent pasture - predominantly grass (woody... pasture_meadow_grassland_grass 3302000000 MULTIPOLYGON (((-1.44483 49.61280, -1.44467 49...
2 11479241 6.33 PPH 18 None None Prairie permanente - herbe prédominante (resso... Permanent pasture - predominantly grass (woody... pasture_meadow_grassland_grass 3302000000 MULTIPOLYGON (((2.87821 46.53674, 2.87820 46.5...
3 12928442 5.10 PPH 18 None None Prairie permanente - herbe prédominante (resso... Permanent pasture - predominantly grass (woody... pasture_meadow_grassland_grass 3302000000 MULTIPOLYGON (((-0.19026 48.28723, -0.19025 48...
4 318389 0.92 PPH 18 None None Prairie permanente - herbe prédominante (resso... Permanent pasture - predominantly grass (woody... pasture_meadow_grassland_grass 3302000000 MULTIPOLYGON (((5.72084 44.03576, 5.72081 44.0...

As before, we can speed up the data fetching by requesting only specific columns in the read_row_group call:

pyarrow_table = parquet_file.read_row_group(0, columns=["ID_PARCEL", "geometry"])

Then the resulting GeoDataFrame will only have those two columns:

_arrow_to_geopandas(pyarrow_table, parquet_file.metadata.metadata).head()
ID_PARCEL geometry
0 123563 MULTIPOLYGON (((3.33896 49.84122, 3.33948 49.8...
1 5527076 MULTIPOLYGON (((-1.44483 49.61280, -1.44467 49...
2 11479241 MULTIPOLYGON (((2.87821 46.53674, 2.87820 46.5...
3 12928442 MULTIPOLYGON (((-0.19026 48.28723, -0.19025 48...
4 318389 MULTIPOLYGON (((5.72084 44.03576, 5.72081 44.0...
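
Putting these pieces together, here’s a minimal streaming sketch (not part of the original timings): it processes the file one row group at a time, so only one chunk is ever held in memory. Note that it still downloads every row group eventually, just not all at once; the per-chunk work below is a placeholder that simply counts rows.

# Stream over all row groups, converting each chunk to a GeoDataFrame
n_rows = 0
for i in range(parquet_file.num_row_groups):
    table = parquet_file.read_row_group(i, columns=["ID_PARCEL", "geometry"])
    chunk = _arrow_to_geopandas(table, parquet_file.metadata.metadata)
    n_rows += len(chunk)  # replace with your real per-chunk computation

print(n_rows)  # should equal parquet_file.metadata.num_rows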