Time Zone: UTC

19 May 14:00 – 19 May 16:00 in Tutorials / Workshops 2

Dask SQL Query Engines

Nils Braun, Han Wang, Mike Klaczynski, Miles Adkins, Tom Drabas

Audience level:
Intermediate

Description

In this workshop, we will discuss the different ways to run SQL queries on and with Dask, using both CPUs and GPUs. Being able to query and transform data with SQL allows users to integrate the vast Dask and RAPIDS ecosystem into their BI workflows. We will survey the current state of PyData SQL query engines and SQL integrations, and together find out where to head next.

Abstract

Data is the new gold of this century, and being able to digest and analyze the growing amount of data is key. The Dask and RAPIDS ecosystems play a huge role in enabling this, and their Python APIs allow users to build complex distributed pipelines. However, not all users are able to write elaborate and optimized distributed pipelines, and many legacy applications can only connect to traditional SQL databases. SQL query engines combine the best of both worlds: they enable querying and transforming huge amounts of data efficiently and in a distributed way using standard SQL statements, all without a database.

The Dask ecosystem includes at least two Python-based SQL query engines: dask-sql for distributed computations on CPUs and BlazingSQL for GPU processing. Fugue SQL, on the other hand, is an abstraction layer built on top of these SQL engines. It changes the way SQL is used, turning it from single commands into an end-to-end workflow language. Dask also offers a large variety of integrations with SQL databases, the Snowflake Data Cloud being one of them.

In this workshop, we will take a detailed look at the landscape of SQL query engines and integrations available to Dask users, explore different projects and applications, get some hands-on experience, and discuss important features that are still missing.

Outline

14:00-14:10: BlazingSQL BlazingSQL is the SQL engine for the RAPIDS ecosystem and allows you to manipulate data using ANSI SQL in seconds rather than hours. BlazingSQL can query data in almost any format: cuDF DataFrames (or dask_cudf DataFrames) stored in GPU memory, as well as Apache Parquet, CSV/TSV, JSON, and Apache ORC files stored both locally and remotely.
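As a flavor of the API (a minimal sketch, not taken from the workshop materials, with a hypothetical file and column names), a BlazingSQL query over a local CSV might look like this:

    from blazingsql import BlazingContext

    bc = BlazingContext()                # set up the SQL engine (single GPU)
    bc.create_table("taxi", "taxi.csv")  # register a CSV file as a SQL table
    result = bc.sql("""
        SELECT passenger_count, AVG(trip_distance) AS avg_distance
        FROM taxi
        GROUP BY passenger_count
    """)                                 # returns a cuDF DataFrame in GPU memory
    print(result)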

14:10-14:20: dask-sql After quickly introducing dask-sql and explaining how it differs from the other frameworks, we will go a bit deeper into the tech stack behind both BlazingSQL and dask-sql and explore how a SQL query is turned into distributed computations on a cluster of GPUs (BlazingSQL) or CPUs (dask-sql).
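For comparison (again a minimal sketch with a hypothetical file and column names), the equivalent dask-sql code runs the query on CPUs through a regular Dask DataFrame:

    import dask.dataframe as dd
    from dask_sql import Context

    c = Context()                        # holds the registered tables
    df = dd.read_csv("taxi.csv")         # lazy Dask DataFrame on CPUs
    c.create_table("taxi", df)           # register it under a SQL name
    result = c.sql("""
        SELECT passenger_count, AVG(trip_distance) AS avg_distance
        FROM taxi
        GROUP BY passenger_count
    """)                                 # returns a lazy Dask DataFrame
    print(result.compute())              # trigger the distributed computation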

14:20-14:30: Fugue SQL This talk will introduce Fugue SQL. With additional syntax, it can invoke Python code from SQL as extensions, so you gain simplicity and elegance without losing flexibility or power. In addition, Fugue provides a notebook extension that enables SQL syntax highlighting and an interactive distributed computing experience.
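To give an idea of the workflow style (a minimal sketch with a hypothetical file, column names, and extension function), a Fugue SQL block can load data, run SQL, and call a Python extension, all executed on the Dask engine:

    import pandas as pd
    from fugue_sql import fsql

    # schema: passenger_count:int,avg_distance:double,flagged:bool
    def flag_long_trips(df: pd.DataFrame) -> pd.DataFrame:
        # plain pandas code, invoked from SQL as an extension
        df["flagged"] = df["avg_distance"] > 10
        return df

    fsql("""
    trips = LOAD "taxi.parquet"
    agg = SELECT passenger_count, AVG(trip_distance) AS avg_distance FROM trips GROUP BY passenger_count
    result = TRANSFORM agg USING flag_long_trips
    PRINT result
    """).run("dask")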

Short Break

14:35-15:25: Demo In this demo, we will show how BlazingSQL, dask-sql, and Fugue can be used, and where they differ and where they are similar. We will also use this time to answer any questions on the three frameworks and discuss where to go next as a community.

Break

15:35-16:00: Snowflake Learn how to bring the power of Dask to the Snowflake Data Cloud. We’ll show you how to connect Dask to Snowflake and go over upcoming feature improvements.
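As a rough sketch of what this can look like (assuming the dask-snowflake helper package, with placeholder credentials and table names, and not necessarily the exact API shown in the workshop), reading from and writing to Snowflake with Dask might be as simple as:

    from dask_snowflake import read_snowflake, to_snowflake

    connection_kwargs = {
        "user": "...",       # placeholder credentials
        "password": "...",
        "account": "...",
        "warehouse": "...",
        "database": "...",
        "schema": "...",
    }

    # partitions of the query result are fetched in parallel by the Dask workers
    ddf = read_snowflake("SELECT * FROM my_table", connection_kwargs=connection_kwargs)

    # ... transform ddf with the regular Dask DataFrame API ...

    to_snowflake(ddf, name="my_table_out", connection_kwargs=connection_kwargs)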