20 May 18:30 – 20 May 20:00 in Tutorials / Workshops 2

Bringing Dask Workloads to GPUs with RAPIDS

Benjamin Zaitlen, Nick Becker

Audience level:
Intermediate

Description

Data volumes and the computational complexity of analysis techniques have increased, but the need to quickly explore data and develop models is more important than ever. One of the key ways to achieve this has been GPU acceleration. This workshop introduces RAPIDS and illustrates how to use Dask and RAPIDS to accelerate ETL/ML workloads, increasing performance and decreasing total cost.

Abstract

This tutorial will introduce RAPIDS, then walk through several end-to-end workflows, demonstrating how RAPIDS is used to accelerate distributed ETL and ML use cases. These examples will include common ETL patterns: read, groupby-aggregate, merge, etc. Attendees will learn how to mix RAPIDS with other GPU-accelerated libraries such as XGBoost. Additionally, participants will see how GPUs can accelerate performance for numeric and NLP-based workloads.

It’s common knowledge that GPUs are fast. Traditionally, however, GPU-accelerating analytics workloads has required specialized knowledge of low-level C++ and CUDA programming. The open-source RAPIDS data science libraries allow data scientists to easily make use of GPU acceleration in common ETL, machine learning, and graph analytics workloads using familiar Python APIs (e.g. pandas and scikit-learn), natively scaled with Dask.

This means we can take existing Dask code:

import dask.dataframe as dd
df1 = dd.read_csv("customer.csv")
df2 = dd.read_csv("online_sales.csv")
df1.merge(df2, on=["id"]).groupby("customer_state").sale_price.mean()

And immediately start increasing performance with one or hundreds of GPUs:

import dask_cudf
df1 = dask_cudf.read_csv("customer.csv")
df2 = dask_cudf.read_csv("online_sales.csv")
df1.merge(df2, on=["id"]).groupby("customer_state").sale_price.mean()
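To make the shared-API point concrete, here is a minimal sketch of the same merge and groupby-aggregate pattern written once as a function and run on plain pandas. The helper name `mean_sale_price_by_state` and the toy rows are hypothetical stand-ins for `customer.csv` and `online_sales.csv`. The assumption is that the function body would work unchanged on `dask.dataframe` or `dask_cudf` inputs, since it uses only the pandas-style calls shown above.

```python
import pandas as pd


def mean_sale_price_by_state(customers, sales):
    """Join customers to sales on "id", then average sale_price per state.

    Uses only the pandas-style calls from the snippets above, so the same
    body is intended to accept pandas, dask.dataframe, or dask_cudf objects.
    """
    joined = customers.merge(sales, on=["id"])
    result = joined.groupby("customer_state").sale_price.mean()
    # Dask collections are lazy: materialize them if compute() is available.
    return result.compute() if hasattr(result, "compute") else result


# Hypothetical toy data standing in for customer.csv and online_sales.csv.
customers = pd.DataFrame(
    {"id": [1, 2, 3], "customer_state": ["CA", "NY", "CA"]}
)
sales = pd.DataFrame(
    {"id": [1, 1, 2, 3], "sale_price": [10.0, 20.0, 30.0, 40.0]}
)

means = mean_sale_price_by_state(customers, sales)
print(means)
```

Note that with Dask or dask_cudf inputs the chained expression only builds a task graph. Nothing runs until `.compute()` is called, which is why the sketch checks for that method before returning.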

Data volumes and computational complexity of data analysis techniques have increased, but the need to quickly explore data and develop models is more important than ever. Dask and RAPIDS provide the easiest path forward by leveraging common PyData APIs.