This talk presents a blueprint for bringing Dask workloads to HPC grids. We implement an architecture that allows dynamic sharing of compute resources within and between multi-tenant environments, where Dask clusters are defined through secure, flexible, and repeatable templates.
Our presentation discusses the challenges of running Dask batch and interactive workloads on a multi-tenant HPC cluster. These include enforcing authentication and authorization of Dask clients, securing communications across all components of a Dask application, and enabling dynamic sharing of CPU and GPU resources, with safe reclamation of those resources between workloads while a Dask application is running. We also address the challenge of synchronizing the configuration and deployment of Dask clusters across the grid, so that the same package versions are used on every host for a given application.
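As one concrete ingredient of the security story, Dask's distributed library already supports mutual TLS between clients, the scheduler, and workers, driven entirely by configuration. A minimal sketch of such a configuration is shown below; the certificate paths are placeholders for illustration only:

```yaml
# Illustrative Dask configuration enabling TLS everywhere.
# Key names follow distributed's configuration schema; the
# /etc/dask/certs/ paths are hypothetical.
distributed:
  comm:
    require-encryption: true
    tls:
      ca-file: /etc/dask/certs/ca.pem
      scheduler:
        cert: /etc/dask/certs/scheduler.pem
        key: /etc/dask/certs/scheduler.key
      worker:
        cert: /etc/dask/certs/worker.pem
        key: /etc/dask/certs/worker.key
      client:
        cert: /etc/dask/certs/client.pem
        key: /etc/dask/certs/client.key
```

In a multi-tenant setting, per-tenant certificates issued against a shared CA give each Dask cluster mutually authenticated, encrypted channels without application code changes.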
We examine how the proposed architecture and implementation address these challenges by introducing a new Dask workload manager component, integrated through authenticated, secure communications with Dask clients and schedulers, while relying on an underlying resource manager to launch and reclaim CPU or GPU Dask workers.
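To make the reclamation idea concrete, the following is a hypothetical sketch of the kind of policy a workload manager might apply when the resource manager asks for capacity back. The `WorkerInfo` type and `select_workers_to_reclaim` function are illustrative assumptions, not part of Dask's API:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of a worker-reclamation policy. The types and
# function below are illustrative; they are not a real Dask interface.

@dataclass
class WorkerInfo:
    address: str       # worker contact address, e.g. "tcp://host:port"
    gpu: bool          # whether this worker holds a GPU (illustrative)
    executing: int     # tasks currently running on the worker
    stored_bytes: int  # bytes of task results held in worker memory

def select_workers_to_reclaim(workers: List[WorkerInfo], count: int) -> List[str]:
    """Pick `count` workers to hand back to the resource manager,
    preferring idle workers holding the least data, so reclamation
    disturbs the running application as little as possible."""
    idle_first = sorted(workers, key=lambda w: (w.executing, w.stored_bytes))
    return [w.address for w in idle_first[:count]]
```

In a real integration, the selected addresses could be passed to distributed's `Client.retire_workers`, which gracefully migrates in-memory results to surviving workers before the retired ones are returned to the resource manager.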
Through a live demonstration, we show how encapsulating the Dask workload manager, together with tightly integrated Jupyter servers, in a group of distributed long-running grid services allows them to be configured, deployed, and managed as a unit, simplifying the definition, onboarding, and administration of a dynamic multi-tenant Dask environment. The walk-through also illustrates how data scientists can work interactively with Dask on an HPC grid.