Processing a Petabyte in the Cloud

James Bourbeau

Audience level:
Novice

Description

Dask is a popular library for scalable computing. Using it effectively for large workloads takes extra care, though, particularly when the computations run in the cloud. In this talk, we'll walk through our attempts to scale up data engineering workloads to process a petabyte of data in the cloud, describe the pain points we encountered, and explain how we got around them.
