Scaling geospatial vector data

Joris Van den Bossche, Julia Signell, Martin Fleischmann

Audience level:


The Python ecosystem provides a solid set of tools for working with geospatial vector data, including Shapely and GeoPandas. New efforts, such as dask-geopandas and spatialpandas, aim to improve the performance and scalability of those workflows. This workshop will give an overview of that community work and foster discussion on improvements and interoperability between the libraries.


The geospatial Python ecosystem provides a solid set of tools for working with vector data, including Shapely for geometry operations and GeoPandas for working with tabular data (along with many other packages for IO, visualization, domain-specific processing, …). The main limitations of those core tools are sub-optimal performance and limited options for scaling.

Over the last few years, effort has gone into improving performance through vectorized interfaces to GEOS, the C library underlying Shapely. That work in turn makes it possible to release the GIL, which makes the combination of Dask and GeoPandas more attractive. GeoPandas extends the pandas DataFrame, so the way Dask scales pandas can be applied to GeoPandas as well. An initial effort to build a bridge between Dask and GeoPandas is currently taking shape as the dask-geopandas library.

Other interesting efforts in this space are emerging as well. The SpatialPandas package provides alternative pandas and Dask extensions for vectorized spatial and geometric operations. Libraries such as datashader and pydeck can be used to visualize larger spatial datasets.

This workshop will give a brief overview of some of the packages and ongoing efforts, and provide a place to discuss further improvements and interoperability between the libraries, with an emphasis on the conceptual design of distributed computation on inherently unpredictable vector data.

More detailed agenda: