Dask provides many functions for array and dataframe IO. In this workshop, we will discuss the current status of various data-format integrations for Dask and, more generally, the parallel/cloud-friendly data storage landscape.
Detailed agenda
Tabular data
Array data
Catalogs
Topics
Parquet: the IO format of choice for most performance-focused Dask-DataFrame users. Dask exposes a large variety of read and write options for Parquet-based datasets via the PyArrow and fastparquet engines. Although the available API is undoubtedly rich, the most appropriate features and performance considerations are rarely obvious (see the Parquet sketch after this list).
Intake: a data cataloging and dissemination tool with specific Dask-integrated functionality and many data-format drivers. Intake offers a framework to “wrap” libraries offering low-level data access in a standard API, and a natural place to build in Dask support without submitting PRs to the Dask codebase (see the catalog sketch below).
Kartothek: specialising in tabular data on cloud object storage, this is a library built especially with distributed, parallel cloud processing in mind (see the Kartothek sketch below).
Zarr: a simple array format designed for parallel access in the cloud, with self-describing metadata and efficient compression, as well as support for the HDF/netCDF data model (see the Zarr sketch below).
TileDB: integrates particularly well with dask-array, giving multi-dimensional, chunkwise access to data stored in the cloud, as well as versioning and transactional updates (see the TileDB sketch below).
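A minimal sketch of Parquet round-tripping with Dask via the PyArrow engine; the path, column names, and filter are illustrative, not a recommendation:

```python
# Write a small partitioned Parquet dataset, then read it back with
# column pruning and row-group filtering. Paths/columns are illustrative.
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(
    pd.DataFrame({"x": range(8), "y": list("aabbccdd")}), npartitions=4
)
df.to_parquet("example.parquet", engine="pyarrow", write_index=False)

# Only column "x" is read, and row groups are skipped where the
# filter can be evaluated against Parquet statistics.
ddf = dd.read_parquet(
    "example.parquet",
    engine="pyarrow",
    columns=["x"],
    filters=[("y", "==", "a")],
)
print(ddf.compute())
```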
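A sketch of the Intake workflow; the catalog file "catalog.yml" and the entry name "mydata" are hypothetical:

```python
# Load a catalog entry lazily as a Dask object. A minimal catalog
# entry for a CSV source might look like:
#
#   sources:
#     mydata:
#       driver: csv
#       args:
#         urlpath: "data/*.csv"
import intake

cat = intake.open_catalog("catalog.yml")  # hypothetical catalog file
source = cat.mydata                       # driver resolved from the catalog
ddf = source.to_dask()                    # lazy: returns a dask.dataframe
print(ddf.head())
```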
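A hedged sketch of reading a Kartothek dataset as a Dask DataFrame; the store URL, dataset UUID, and table name are illustrative, and exact keyword arguments may vary between Kartothek versions:

```python
# Open an existing Kartothek dataset as a Dask DataFrame. The dataset
# "events" and the local "hfs://" store URL are illustrative; a cloud
# object-store URL would be used in practice.
from functools import partial

import storefact
from kartothek.io.dask.dataframe import read_dataset_as_ddf

store_factory = partial(storefact.get_store_from_url, "hfs:///tmp/kartothek")
ddf = read_dataset_as_ddf(
    dataset_uuid="events", store=store_factory, table="table"
)
print(ddf.head())
```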
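A minimal sketch of round-tripping a chunked Dask array through Zarr; the local store path is illustrative, and a cloud URL works the same way given the matching fsspec backend:

```python
# One Zarr chunk is written per Dask chunk, so parallel writers do not
# contend with each other. The store path is illustrative.
import dask.array as da

arr = da.random.random((4000, 4000), chunks=(1000, 1000))
da.to_zarr(arr, "example.zarr")

loaded = da.from_zarr("example.zarr")
print(loaded.mean().compute())
```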
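A sketch of the dask-array TileDB helpers (these require the tiledb package to be installed); the array URI is illustrative:

```python
# Store a Dask array as a dense TileDB array and re-open it lazily.
# The URI is illustrative; TileDB also supports cloud object-store URIs.
import dask.array as da

arr = da.random.random((2000, 2000), chunks=(500, 500))
da.to_tiledb(arr, "example_tiledb")

loaded = da.from_tiledb("example_tiledb")
print(loaded.sum().compute())
```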
Following the talks, there will be extensive time for discussion, in particular around the changing APIs of the target IO engines such as PyArrow.