Dask-Yarn: Scaling Distributed Dask Applications on a Hadoop Cluster

Brad Miro

Audience level:
Novice

Description

Hadoop is an established framework for running distributed applications across large clusters of machines. In this talk, you’ll learn what the Hadoop ecosystem has to offer Dask, how to use Dataproc on Google Cloud to set up a Hadoop cluster configured with Dask-Yarn, and how Dask compares to other commonly used data processing tools in the Hadoop ecosystem, such as Spark.

Abstract

This talk covers the how and why of running Dask on a Hadoop cluster. It is aimed at two primary groups of developers: those who have experience using Hadoop, and those who are looking to distribute Dask applications.

The talk is structured as follows:

An introduction to the Hadoop ecosystem, including its history and role in big data processing.

An introduction to Dask-Yarn, including the features it carries over from the larger Dask ecosystem.

The advantages of using Dask over familiar Hadoop tools such as Spark.

How to configure a Hadoop cluster with Dask-Yarn using Dataproc on Google Cloud.

Code examples of using Dask-Yarn in a Dask application.