from urllib.request import urlretrieve
import fsspec
import geopandas as gpd
from fsspec.implementations.http import HTTPFileSystemGeoParquet Example
This notebook will give an overview of how to read and write GeoParquet files with GeoPandas, putting an emphasis on cloud-native operations where possible.
The easiest way to read and write GeoParquet files is to use GeoPandas’ read_parquet and to_parquet functions.
Make sure to use the specific read_parquet and to_parquet functions. These will be much, much faster than using the usual read_file and to_file.
Environment
The packages needed for this notebook can be installed with conda or mamba. Using the environment.yml from this folder run:
conda create -f environment.ymlor
mamba create -f environment.ymlThis notebook has been tested to work with the listed Conda environment. If you don’t want to use Conda or Mamba, install the latest versions of geopandas, fsspec, and pyarrow with pip. Note that you’ll also need the GDAL CLI with Parquet driver. If you’re on MacOS, you can install that via brew install gdal.
Imports
Comparison with FlatGeobuf
In order to compare reading GeoParquet with FlatGeobuf, we’ll cover reading and writing GeoParquet files on local disk storage. To be consistent with the FlatGeobuf example, we’ll fetch the same US counties FlatGeobuf file (13 MB) and convert it to GeoParquet using ogr2ogr.
# URL to download
url = "https://flatgeobuf.org/test/data/UScounties.fgb"
# Download, saving to the current directory
local_fgb_path, _ = urlretrieve(url, "countries.fgb")!ogr2ogr countries.parquet countries.fgbLoading this GeoParquet file is really fast! 13% faster than loading the same data via FlatGeobuf (shown in the FlatGeobuf example notebook).
%time gdf = gpd.read_parquet("countries.parquet")CPU times: user 23.8 ms, sys: 11.8 ms, total: 35.6 ms
Wall time: 34.1 ms
gdf| STATE_FIPS | COUNTY_FIP | FIPS | STATE | NAME | LSAD | geometry | |
|---|---|---|---|---|---|---|---|
| 0 | 23 | 009 | 23009 | ME | Hancock | County | MULTIPOLYGON (((-68.53108 44.33278, -68.53348 ... |
| 1 | 33 | 007 | 33007 | NH | Coos | County | MULTIPOLYGON (((-71.05975 45.01485, -71.06939 ... |
| 2 | 50 | 009 | 50009 | VT | Essex | County | MULTIPOLYGON (((-71.49463 44.90874, -71.49392 ... |
| 3 | 50 | 019 | 50019 | VT | Orleans | County | MULTIPOLYGON (((-72.14193 45.00600, -72.16051 ... |
| 4 | 23 | 007 | 23007 | ME | Franklin | County | MULTIPOLYGON (((-70.83471 45.27514, -70.77984 ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3216 | 15 | 003 | 15003 | HI | Honolulu | County | MULTIPOLYGON (((-171.73761 25.79210, -171.7513... |
| 3217 | 15 | 007 | 15007 | HI | Kauai | County | MULTIPOLYGON (((-160.55535 21.66345, -160.5541... |
| 3218 | 15 | 009 | 15009 | HI | Maui | County | MULTIPOLYGON (((-157.06121 20.89150, -157.0611... |
| 3219 | 15 | 001 | 15001 | HI | Hawaii | County | MULTIPOLYGON (((-155.08767 19.72887, -155.0909... |
| 3220 | 15 | 005 | 15005 | HI | Kalawao | County | MULTIPOLYGON (((-157.01455 21.18550, -157.0145... |
3221 rows × 7 columns
Writing to local disk
We can use GeoDataFrame.to_parquet to write out this data to GeoParquet files locally. This is about 3x faster than writing the same dataset to FlatGeobuf, but note that FlatGeobuf’s writing is also calculating a spatial index.
%time gdf.to_parquet("countries_written.parquet")CPU times: user 42.3 ms, sys: 12.6 ms, total: 55 ms
Wall time: 53.9 ms
Reading from the cloud
As of GeoParquet version 1.0.0-rc.1, spatial indexing has not yet been implemented. Therefore, there is not yet an API in GeoPandas to read data given a specific bounding box.
What is already efficient in GeoParquet is reading only specified columns from a dataset.
url = "https://data.source.coop/cholmes/eurocrops/unprojected/geoparquet/FR_2018_EC21.parquet"Note that since we’re fetching this data directly from the cloud, we need to pass in an fsspec filesystem object. Otherwise GeoPandas will attempt to load a local file.
filesystem = HTTPFileSystem()By default, calling read_parquet will fetch the entire file and parse it all into a single GeoDataFrame. Since this is a 3GB file, downloading the file takes a long time:
# This cell will take a few minutes to run, because it downloads the entire file
# %time gdf = gpd.read_parquet(url, filesystem=filesystem)We can make this faster by only fetching specific columns. Because GeoParquet stores data in a columnar fashion, when selecting only specific columns we can download a lot less data.
# This cell will take a few minutes to run, because it downloads the entire file for these columns
# %time gdf = gpd.read_parquet(url, columns=["ID_PARCEL", "geometry"], filesystem=filesystem)Working with GeoParquet row groups (Advanced)
As described in the intro document, GeoParquet is a chunked format, which allows you to access one of the chunks of rows very efficiently. This can allow you to stream a dataset — loading and operating on one chunk at a time — if the dataset is larger than your memory.
GeoPandas does not yet have built-in support for working with row groups, so this section will use the underlying pyarrow library directly.
import pyarrow.parquet as pq
from geopandas.io.arrow import _arrow_to_geopandasFirst, we’ll create a ParquetFile object from the remote URL. All this does is load the metadata from the file, allowing you to inspect the schema and number of columns, rows, and row groups. Because this doesn’t load any actual data, it’s nearly instant to complete.
parquet_file = pq.ParquetFile(url, filesystem=filesystem)We can access the column names in the dataset:
parquet_file.schema_arrow.names['ID_PARCEL',
'SURF_PARC',
'CODE_CULTU',
'CODE_GROUP',
'CULTURE_D1',
'CULTURE_D2',
'EC_org_n',
'EC_trans_n',
'EC_hcat_n',
'EC_hcat_c',
'geometry']
This Parquet file includes 9.5 million rows:
parquet_file.metadata.num_rows9517874
And 146 row groups. Given that each row group is about the same number of rows, each one contains around 65,000 rows.
parquet_file.num_row_groups146
Then to load one of the row groups by numeric index, we can call ParquetFile.read_row_group.
pyarrow_table = parquet_file.read_row_group(0)Note that this returns a pyarrow.Table, not a geopandas.GeoDataFrame. To convert between the two, we can use _arrow_to_geopandas. This conversion is very fast.
geopandas_gdf = _arrow_to_geopandas(pyarrow_table, parquet_file.metadata.metadata)As expected, this row group contains right around 65,000 rows
geopandas_gdf.shape(65536, 11)
geopandas_gdf.head()| ID_PARCEL | SURF_PARC | CODE_CULTU | CODE_GROUP | CULTURE_D1 | CULTURE_D2 | EC_org_n | EC_trans_n | EC_hcat_n | EC_hcat_c | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 123563 | 6.38 | CZH | 5 | None | None | Colza d’hiver | Winter rapeseed | winter_rapeseed_rape | 3301060401 | MULTIPOLYGON (((3.33896 49.84122, 3.33948 49.8... |
| 1 | 5527076 | 2.30 | PPH | 18 | None | None | Prairie permanente - herbe prédominante (resso... | Permanent pasture - predominantly grass (woody... | pasture_meadow_grassland_grass | 3302000000 | MULTIPOLYGON (((-1.44483 49.61280, -1.44467 49... |
| 2 | 11479241 | 6.33 | PPH | 18 | None | None | Prairie permanente - herbe prédominante (resso... | Permanent pasture - predominantly grass (woody... | pasture_meadow_grassland_grass | 3302000000 | MULTIPOLYGON (((2.87821 46.53674, 2.87820 46.5... |
| 3 | 12928442 | 5.10 | PPH | 18 | None | None | Prairie permanente - herbe prédominante (resso... | Permanent pasture - predominantly grass (woody... | pasture_meadow_grassland_grass | 3302000000 | MULTIPOLYGON (((-0.19026 48.28723, -0.19025 48... |
| 4 | 318389 | 0.92 | PPH | 18 | None | None | Prairie permanente - herbe prédominante (resso... | Permanent pasture - predominantly grass (woody... | pasture_meadow_grassland_grass | 3302000000 | MULTIPOLYGON (((5.72084 44.03576, 5.72081 44.0... |
As before, we can speed up the data fetching by requesting only specific columns in the read_row_group call.:
pyarrow_table = parquet_file.read_row_group(0, columns=["ID_PARCEL", "geometry"])Then the resulting GeoDataFrame will only have those two columns:
_arrow_to_geopandas(pyarrow_table, parquet_file.metadata.metadata).head()| ID_PARCEL | geometry | |
|---|---|---|
| 0 | 123563 | MULTIPOLYGON (((3.33896 49.84122, 3.33948 49.8... |
| 1 | 5527076 | MULTIPOLYGON (((-1.44483 49.61280, -1.44467 49... |
| 2 | 11479241 | MULTIPOLYGON (((2.87821 46.53674, 2.87820 46.5... |
| 3 | 12928442 | MULTIPOLYGON (((-0.19026 48.28723, -0.19025 48... |
| 4 | 318389 | MULTIPOLYGON (((5.72084 44.03576, 5.72081 44.0... |