FlatGeobuf Spatial Index

Breaking down the Hilbert Packed R-Tree

FlatGeobuf optionally supports including a spatial index that enables random access for each geometry in the file.

When to use

When writing a FlatGeobuf file, one must decide whether to include a spatial index. A spatial index cannot be added to a FlatGeobuf file after the file has been written.

A spatial index can enable much more efficient reading from FlatGeobuf, by allowing the reader to skip over portions of the file that fall outside of a qiven spatial query region.

Technical details

In this section we’ll get into some gory technical details of how FlatGeobuf’s spatial index works. Understanding the below isn’t necessary for using FlatGeobuf, but it may add context for understanding how to create and work with FlatGeobuf files, and why FlatGeobuf is performant.

FlatGeobuf’s spatial index is a static packed Hilbert R-tree index. That’s a mouthful, so let’s break it down:

R-tree index

An R-Tree is a hierarchical collection of bounding boxes. At the lowest level of the tree is a bounding box of every geometry. Then one level above the lowest level exists a collection of bounding boxes, each of which is formed as the union of all child boxes. This means that each box encompasses every child box. There are fewer boxes at this level, because each box contains many child boxes, each of which represents one original geometry. This process continues repeatedly until there’s only one bounding box that indirectly contains the entire dataset.

This index allows you to quickly search for features that intersect a given bounding box query. At the top level, compare the bounding box of each node to your query region. If those two don’t intersect, you can discard that node and all of its child nodes from the search, because you know that none of them could possibly fall within your search region.

Continuing this process allows you to quickly find only the specific items that are candidates for your search query.

R-Tree diagram from Wikipedia. From top to bottom, the three levels of this tree are the black, blue, and red boxes. The black boxes contain the most items and encompass the largest area, while the red boxes contain the fewest items and encompass a smaller area.

The Wikipedia article and this Mapbox blog post are great resources for better understanding how R-Trees work.

Hilbert

The elements of an R-Tree must be sorted before insertion to make the R-Tree useful. This is because the core benefit of an R-Tree is to exclude elements that aren’t within a spatial filter. If elements of each node are randomly drawn from different geographies, then each node’s bounding box will be so large that no nodes can be excluded.

But how do you sort geometries? They encompass two dimensions and a range of shapes. If you sort all geometries first on the x coordinate, then you may pair geometries that are far from each other on the y dimension. Instead, it’s ideal to use a space-filling curve. That’s math jargon, but essentially defines a way to sort elements in n dimensions using 1 dimensional numbers.

A Hilbert R-Tree uses a Hilbert Curve, a special type of space-filling curve, to sort the centers of geometries. This ensures that geometries that are nearby on both the x and y dimensions are placed close to each other in the R-Tree. This ensures that the resulting bounding boxes of the R-Tree are as small as possible, which means that the maximum number of elements can be discarded for any given spatial query.

This Crunchy Data blog post has helpful examples for why sorting input is important.

Static

FlatGeobuf files can’t be modified without rewriting the entire file, so this R-Tree is constructed in such a way that it can’t be modified, which allows for improved tree generation.

Packed

An R-Tree has a series of nodes at each level, where each node can contain up to n children. If the R-Tree might be updated, not every node will have a total of n children, because some space needs to be reserved for future elements.

Because the index is static and immutable, we can construct a packed index, where every node is completely full. This achieves better space utilization, and is more efficient for queries because there are fewer nodes to traverse.