19 May 14:00 – 19 May 17:00 in Tutorials / Workshops 1

Hacking Dask: Diving Into Dask’s Internals

Julia Signell, James Bourbeau

Audience level:
Intermediate

Description

This tutorial is intended for working and aspiring data professionals. A working knowledge of the basics of Dask and/or distributed computing is required, though knowledge of Dask’s internals is not. Tutorial attendees should walk away with a deeper understanding of Dask’s internals, an introduction to more advanced features, and ideas for how they can apply these features effectively to their own

Abstract

Introduction [10 minutes]
- Provide an overview of what will be covered in the tutorial.
- Ensure everyone is set up with tutorial materials.

An overview of Dask [30 minutes]
- Review Dask’s delayed, array, and DataFrame interfaces.
- Recap the components of Dask’s distributed scheduler.
- Participants will recap the basics of how to use the high-level Dask interfaces to accomplish well-supported tasks and parallelize custom algorithms.

10 minute break

Advanced Dask Collections [50 minutes]
- Discuss applying custom operations on Dask arrays and DataFrames.
- Discuss Dask’s graph optimization system.
- Review the Dask collection interface and implement our own custom collection.
- Participants will gain a deeper insight into the task graph system underlying Dask collections.
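As a rough illustration of the task graph idea mentioned above: Dask represents a computation as a plain dictionary that maps keys to either data or tasks, where a task is a tuple whose first element is a callable. The sketch below evaluates such a dictionary in pure Python. The names `toy_get`, `_evaluate`, and `is_task` are hypothetical helpers written for this example, and the evaluator is a toy (no caching, no parallelism, no optimization), not Dask's scheduler.

```python
from operator import add, mul


def is_task(value):
    """A task is a non-empty tuple whose first element is callable."""
    return isinstance(value, tuple) and len(value) > 0 and callable(value[0])


def _evaluate(graph, value):
    """Resolve a graph value: run tasks, follow keys, pass literals through."""
    if is_task(value):
        func, *args = value
        return func(*[_evaluate(graph, arg) for arg in args])
    if isinstance(value, list):
        return [_evaluate(graph, item) for item in value]
    try:
        if value in graph:  # the value is a reference to another key
            return _evaluate(graph, graph[value])
    except TypeError:
        pass  # unhashable values cannot be keys, so treat them as literals
    return value


def toy_get(graph, key):
    """Toy evaluator (an assumption for illustration, not Dask's scheduler)."""
    return _evaluate(graph, graph[key])


# A small graph in the dictionary-of-tasks style.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),        # depends on "x" and "y"
    "result": (mul, "sum", 10),    # depends on "sum"
}

print(toy_get(graph, "result"))
```

Dask collections build graphs of this general shape behind the scenes; a real scheduler additionally tracks dependencies, reuses intermediate results, and runs independent tasks in parallel.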

10 minute break

Hacking the distributed scheduler [50 minutes]
- Highlight built-in coordination primitives like Locks, Events, and Semaphores.
- Demonstrate how to customize worker and scheduler behavior using Dask’s plugin system.
- Learn how to inspect the internal state of a cluster’s scheduler and workers.
- Participants will gain a better understanding of Dask internals and how to troubleshoot common pitfalls of distributed computing.

Conclusion [10 minutes]
- Recap what we learned.
- Provide links to additional community resources.