As workloads scale, the overhead of scheduling tasks can itself become the bottleneck. Improving scalability therefore requires not just a faster scheduler, but a coordinated effort across the entire Dask ecosystem: high performance computing is not about doing one thing well; it's about doing nothing poorly. This workshop covers an ongoing multi-institutional effort to improve the performance of Dask's distributed scheduler. As workloads scale to more data and more workers, the scheduler comes under significant strain, so much so that unlocking more performance often requires not faster computation, but faster coordination of the work.
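A back-of-envelope sketch makes the point concrete. The per-task overhead figure below is an assumption for illustration, not a measured Dask number:

```python
# Assumed coordination cost per task; real figures vary widely by
# workload, scheduler version, and cluster size.
OVERHEAD_PER_TASK_S = 1e-3  # ~1 ms of scheduling overhead per task

for ntasks in (10_000, 100_000, 1_000_000):
    overhead = ntasks * OVERHEAD_PER_TASK_S
    print(f"{ntasks:>9,} tasks -> {overhead:>7.1f} s of pure scheduling overhead")
```

At that rate, even if every task's computation were free, a million-task graph would spend on the order of a quarter of an hour on coordination alone.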
For example, an extremely common ETL workflow may involve setting a new index:
df.set_index('id'). This simply expressed statement triggers the construction of a large task graph: Dask must coordinate an all-to-all exchange of distributed data. As the dataframe grows, so does the number of partitions, and with it the size of the graph. For scheduling to scale without losing performance, we need to consider several domains within Dask's internals:
Where the graph is generated
How the graph is communicated between the client and the scheduler
How the graph is processed within the scheduler
How the scheduler communicates tasks to the workers
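To see why graph size is the common thread through all of these steps, consider a simplified model of an all-to-all shuffle. This is a sketch for intuition, not Dask's actual shuffle implementation (which uses staged, more sophisticated algorithms):

```python
def naive_shuffle_task_count(npartitions: int) -> int:
    """Simplified model of graph size for a naive all-to-all shuffle:
    each output partition receives a piece from every input partition,
    plus one split task per input and one concatenate task per output.
    A sketch only -- not Dask's real shuffle algorithm."""
    transfers = npartitions * npartitions  # n^2 pairwise exchanges
    splits = npartitions                   # split each input partition
    concats = npartitions                  # assemble each output partition
    return transfers + splits + concats

for n in (10, 100, 1000):
    print(f"{n:>5} partitions -> {naive_shuffle_task_count(n):>9,} tasks")
```

Under this model, growing from 10 to 1000 partitions grows the graph from roughly a hundred tasks to roughly a million, and every one of those tasks must be generated, serialized, scheduled, and dispatched.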
None of these steps can afford to be done poorly: any one of them, if slow, can and will degrade performance and increase task-scheduling overhead.
During the workshop we'll discuss scheduler internals, motivating problems where scaling breaks down, and how the Dask community is moving forward to improve performance. We'll also share several of the profiling techniques we used to measure our progress along the way.
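As one generic example of such a technique (standard-library profiling, not necessarily the exact tooling used in this effort), CPU hotspots in scheduler-style pure-Python code can be located with cProfile:

```python
import cProfile
import io
import pstats

def simulate_scheduling(ntasks: int) -> int:
    """Stand-in workload: a hypothetical loop mimicking per-task
    bookkeeping cost. Not real Dask scheduler code."""
    state = {}
    for i in range(ntasks):
        state[i] = i * i  # placeholder for per-task state updates
    return len(state)

profiler = cProfile.Profile()
profiler.enable()
simulate_scheduling(100_000)
profiler.disable()

# Report the top functions by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(3)
print(stream.getvalue())
```

The same approach applies to sampling profilers and to Dask's own dashboards; the key idea is measuring where scheduling time actually goes before and after each change.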