20 May 19:00 – 20 May 19:30 in Talks

Dask-on-Ray: Using Dask for Large-scale Data Processing on Ray

Clark Zinzow

Audience level:
Novice

Description

Ray is a distributed task execution system that provides a simple API for building distributed applications, and has a large ecosystem of libraries for training and serving machine learning models. As part of a recent effort to expand support for Ray-based data processing and data analytics, Dask-on-Ray was developed to allow users to run Dask workloads on Ray.
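For context, Ray's core task API is roughly as small as the following sketch (a minimal, illustrative example; the function name is arbitrary):

    import ray

    ray.init()  # start or connect to a Ray cluster

    @ray.remote
    def square(x):
        # Runs as a Ray task on any worker in the cluster.
        return x * x

    # Launch tasks in parallel; object refs (futures) are returned immediately.
    futures = [square.remote(i) for i in range(4)]
    print(ray.get(futures))  # [0, 1, 4, 9]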

Abstract

Ray is a high-performance task execution engine whose ecosystem has historically targeted reinforcement learning and the training and serving of machine learning models. Recently, the Ray community has focused on expanding the Ray-based data processing and data analytics story, including libraries and integrations such as Modin, Mars-on-Ray, and Spark-on-Ray, as well as substantial work on improving core Ray's performance and reliability for data-intensive workloads. Beyond being valuable in its own right, data processing and analytics on Ray can be coupled with Ray-based model training and model serving, so that an entire machine learning pipeline runs on top of Ray. That shared infrastructure is a big win for cluster operators who are used to stitching together disparate systems within a single pipeline, and some of Ray's system design choices, such as the distributed in-memory object store and fine-grained resource scheduling, also make this common Ray substrate a big win for performance and resource utilization.

Dask-on-Ray is an integration that allows users to run Dask workloads on Ray, providing a user-friendly option for Ray-based data processing using the familiar NumPy and DataFrame APIs, while also leveraging the Dask frontend's automatic data partitioning and task graph optimizations. The integration is achieved by implementing a Dask scheduler that farms Dask tasks out to Ray, where they are executed as Ray tasks. With this thin scheduler layer, the entire Dask ecosystem can be run on a Ray cluster.
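In user code this amounts to swapping in the Ray-backed scheduler when computing a Dask graph. The sketch below assumes the scheduler entry point exposed as ray.util.dask.ray_dask_get; exact names and import paths may vary across Ray versions:

    import ray
    import dask.array as da
    from ray.util.dask import ray_dask_get

    ray.init()

    # Build a Dask task graph with the familiar NumPy-like API.
    x = da.random.random((10000, 10000), chunks=(1000, 1000))
    result = (x + x.T).sum()

    # Execute the graph on Ray: the Dask-on-Ray scheduler submits each Dask
    # task as a Ray task, with intermediate results held in Ray's
    # distributed in-memory object store.
    print(result.compute(scheduler=ray_dask_get))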

This talk introduces Ray and its ecosystem, describes the Dask-on-Ray integration and a few use cases and examples, and dives deep into some Ray internals that help support large-scale data processing.