from urllib.request import urlretrieve
import fsspec
import geopandas as gpd
from fsspec.implementations.http import HTTPFileSystem
GeoParquet Example
This notebook will give an overview of how to read and write GeoParquet files with GeoPandas, putting an emphasis on cloud-native operations where possible.
The easiest way to read and write GeoParquet files is to use GeoPandas’ read_parquet
and to_parquet
functions.
Make sure to use the specific read_parquet
and to_parquet
functions. These will be much, much faster than using the usual read_file
and to_file
.
Environment
The packages needed for this notebook can be installed with conda
or mamba
. Using the environment.yml
from this folder run:
conda create -f environment.yml
or
mamba create -f environment.yml
This notebook has been tested to work with the listed Conda environment. If you don’t want to use Conda or Mamba, install the latest versions of geopandas
, fsspec
, and pyarrow
with pip. Note that you’ll also need the GDAL CLI with Parquet driver. If you’re on MacOS, you can install that via brew install gdal
.
Imports
Comparison with FlatGeobuf
In order to compare reading GeoParquet with FlatGeobuf, we’ll cover reading and writing GeoParquet files on local disk storage. To be consistent with the FlatGeobuf example, we’ll fetch the same US counties FlatGeobuf file (13 MB) and convert it to GeoParquet using ogr2ogr
.
# URL to download
= "https://flatgeobuf.org/test/data/UScounties.fgb"
url
# Download, saving to the current directory
= urlretrieve(url, "countries.fgb") local_fgb_path, _
!ogr2ogr countries.parquet countries.fgb
Loading this GeoParquet file is really fast! 13% faster than loading the same data via FlatGeobuf (shown in the FlatGeobuf example notebook).
%time gdf = gpd.read_parquet("countries.parquet")
CPU times: user 23.8 ms, sys: 11.8 ms, total: 35.6 ms
Wall time: 34.1 ms
gdf
STATE_FIPS | COUNTY_FIP | FIPS | STATE | NAME | LSAD | geometry | |
---|---|---|---|---|---|---|---|
0 | 23 | 009 | 23009 | ME | Hancock | County | MULTIPOLYGON (((-68.53108 44.33278, -68.53348 ... |
1 | 33 | 007 | 33007 | NH | Coos | County | MULTIPOLYGON (((-71.05975 45.01485, -71.06939 ... |
2 | 50 | 009 | 50009 | VT | Essex | County | MULTIPOLYGON (((-71.49463 44.90874, -71.49392 ... |
3 | 50 | 019 | 50019 | VT | Orleans | County | MULTIPOLYGON (((-72.14193 45.00600, -72.16051 ... |
4 | 23 | 007 | 23007 | ME | Franklin | County | MULTIPOLYGON (((-70.83471 45.27514, -70.77984 ... |
... | ... | ... | ... | ... | ... | ... | ... |
3216 | 15 | 003 | 15003 | HI | Honolulu | County | MULTIPOLYGON (((-171.73761 25.79210, -171.7513... |
3217 | 15 | 007 | 15007 | HI | Kauai | County | MULTIPOLYGON (((-160.55535 21.66345, -160.5541... |
3218 | 15 | 009 | 15009 | HI | Maui | County | MULTIPOLYGON (((-157.06121 20.89150, -157.0611... |
3219 | 15 | 001 | 15001 | HI | Hawaii | County | MULTIPOLYGON (((-155.08767 19.72887, -155.0909... |
3220 | 15 | 005 | 15005 | HI | Kalawao | County | MULTIPOLYGON (((-157.01455 21.18550, -157.0145... |
3221 rows × 7 columns
Writing to local disk
We can use GeoDataFrame.to_parquet
to write out this data to GeoParquet files locally. This is about 3x faster than writing the same dataset to FlatGeobuf, but note that FlatGeobuf’s writing is also calculating a spatial index.
%time gdf.to_parquet("countries_written.parquet")
CPU times: user 42.3 ms, sys: 12.6 ms, total: 55 ms
Wall time: 53.9 ms
Reading from the cloud
As of GeoParquet version 1.0.0-rc.1, spatial indexing has not yet been implemented. Therefore, there is not yet an API in GeoPandas to read data given a specific bounding box.
What is already efficient in GeoParquet is reading only specified columns from a dataset.
= "https://data.source.coop/cholmes/eurocrops/unprojected/geoparquet/FR_2018_EC21.parquet" url
Note that since we’re fetching this data directly from the cloud, we need to pass in an fsspec
filesystem object. Otherwise GeoPandas will attempt to load a local file.
= HTTPFileSystem() filesystem
By default, calling read_parquet
will fetch the entire file and parse it all into a single GeoDataFrame
. Since this is a 3GB file, downloading the file takes a long time:
# This cell will take a few minutes to run, because it downloads the entire file
# %time gdf = gpd.read_parquet(url, filesystem=filesystem)
We can make this faster by only fetching specific columns. Because GeoParquet stores data in a columnar fashion, when selecting only specific columns we can download a lot less data.
# This cell will take a few minutes to run, because it downloads the entire file for these columns
# %time gdf = gpd.read_parquet(url, columns=["ID_PARCEL", "geometry"], filesystem=filesystem)
Working with GeoParquet row groups (Advanced)
As described in the intro document, GeoParquet is a chunked format, which allows you to access one of the chunks of rows very efficiently. This can allow you to stream a dataset — loading and operating on one chunk at a time — if the dataset is larger than your memory.
GeoPandas does not yet have built-in support for working with row groups, so this section will use the underlying pyarrow
library directly.
import pyarrow.parquet as pq
from geopandas.io.arrow import _arrow_to_geopandas
First, we’ll create a ParquetFile
object from the remote URL. All this does is load the metadata from the file, allowing you to inspect the schema and number of columns, rows, and row groups. Because this doesn’t load any actual data, it’s nearly instant to complete.
= pq.ParquetFile(url, filesystem=filesystem) parquet_file
We can access the column names in the dataset:
parquet_file.schema_arrow.names
['ID_PARCEL',
'SURF_PARC',
'CODE_CULTU',
'CODE_GROUP',
'CULTURE_D1',
'CULTURE_D2',
'EC_org_n',
'EC_trans_n',
'EC_hcat_n',
'EC_hcat_c',
'geometry']
This Parquet file includes 9.5 million rows:
parquet_file.metadata.num_rows
9517874
And 146 row groups. Given that each row group is about the same number of rows, each one contains around 65,000 rows.
parquet_file.num_row_groups
146
Then to load one of the row groups by numeric index, we can call ParquetFile.read_row_group
.
= parquet_file.read_row_group(0) pyarrow_table
Note that this returns a pyarrow.Table
, not a geopandas.GeoDataFrame
. To convert between the two, we can use _arrow_to_geopandas
. This conversion is very fast.
= _arrow_to_geopandas(pyarrow_table, parquet_file.metadata.metadata) geopandas_gdf
As expected, this row group contains right around 65,000 rows
geopandas_gdf.shape
(65536, 11)
geopandas_gdf.head()
ID_PARCEL | SURF_PARC | CODE_CULTU | CODE_GROUP | CULTURE_D1 | CULTURE_D2 | EC_org_n | EC_trans_n | EC_hcat_n | EC_hcat_c | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 123563 | 6.38 | CZH | 5 | None | None | Colza d’hiver | Winter rapeseed | winter_rapeseed_rape | 3301060401 | MULTIPOLYGON (((3.33896 49.84122, 3.33948 49.8... |
1 | 5527076 | 2.30 | PPH | 18 | None | None | Prairie permanente - herbe prédominante (resso... | Permanent pasture - predominantly grass (woody... | pasture_meadow_grassland_grass | 3302000000 | MULTIPOLYGON (((-1.44483 49.61280, -1.44467 49... |
2 | 11479241 | 6.33 | PPH | 18 | None | None | Prairie permanente - herbe prédominante (resso... | Permanent pasture - predominantly grass (woody... | pasture_meadow_grassland_grass | 3302000000 | MULTIPOLYGON (((2.87821 46.53674, 2.87820 46.5... |
3 | 12928442 | 5.10 | PPH | 18 | None | None | Prairie permanente - herbe prédominante (resso... | Permanent pasture - predominantly grass (woody... | pasture_meadow_grassland_grass | 3302000000 | MULTIPOLYGON (((-0.19026 48.28723, -0.19025 48... |
4 | 318389 | 0.92 | PPH | 18 | None | None | Prairie permanente - herbe prédominante (resso... | Permanent pasture - predominantly grass (woody... | pasture_meadow_grassland_grass | 3302000000 | MULTIPOLYGON (((5.72084 44.03576, 5.72081 44.0... |
As before, we can speed up the data fetching by requesting only specific columns in the read_row_group
call.:
= parquet_file.read_row_group(0, columns=["ID_PARCEL", "geometry"]) pyarrow_table
Then the resulting GeoDataFrame
will only have those two columns:
_arrow_to_geopandas(pyarrow_table, parquet_file.metadata.metadata).head()
ID_PARCEL | geometry | |
---|---|---|
0 | 123563 | MULTIPOLYGON (((3.33896 49.84122, 3.33948 49.8... |
1 | 5527076 | MULTIPOLYGON (((-1.44483 49.61280, -1.44467 49... |
2 | 11479241 | MULTIPOLYGON (((2.87821 46.53674, 2.87820 46.5... |
3 | 12928442 | MULTIPOLYGON (((-0.19026 48.28723, -0.19025 48... |
4 | 318389 | MULTIPOLYGON (((5.72084 44.03576, 5.72081 44.0... |