This talk presents recent and ongoing work that rethinks how Dask.distributed manages memory across the cluster. Data should be moved between workers automatically and transparently to optimize memory occupancy, prevent individual workers from hanging, and improve robustness across the board.
Today, Dask.distributed only moves data between workers when a task needs it, with no regard for worker memory pressure; this can easily cause large clumps of data to accumulate on individual workers and saturate their memory. There is functionality to rebalance memory across the cluster, but it is not designed to run frequently. There is also little consideration for memory usage not caused by Dask itself, e.g. due to memory leaks or fragmentation. This talk describes ongoing work to overhaul this situation and introduce a centralized, fully automated system that shuffles data between workers to reduce pressure on individual nodes, restarts nodes gracefully in case of memory creep, actively tracks replicated keys to make computations resilient to accidental worker death, and safely downsizes adaptive clusters even when the cluster is not completely idle.
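For context, the sketch below shows the manual APIs available today, which the work described above aims to supersede with an automated system: `Client.rebalance()` to even out memory and `Client.replicate()` to guard against worker loss, both requiring explicit user action. The scheduler address and data sizes are placeholders.

```python
import numpy as np
from distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# Scatter some data; it may end up concentrated on a few workers
data = [np.random.random((1000, 1000)) for _ in range(100)]
futures = client.scatter(data)

# Evening out memory currently requires an explicit call, and it is not
# designed to be run frequently while computations are in flight.
client.rebalance()

# Resilience to accidental worker death likewise requires manually
# replicating keys onto multiple workers.
client.replicate(futures, n=2)
```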