Presentation: Scaling Pandas using Dask: How to avoid all my mistakes

Time Zone: UTC

20 May 12:00 – 20 May 12:30 in Talks

Krishan Bhasin

Audience level: Novice

Description

Dask is a Python package that provides advanced parallelism for analytics, enabling performance at scale for the tools you love. People often think it’s magic: drop it in and your code scales. That will mostly work, but it will not scale well!

We would like to share what we’ve learned about using Dask to scale dataframes and computations, so that you can avoid making the same mistakes.

Abstract

In particular, this talk will cover:

how to explore your code's current performance
how to find performance bottlenecks
how configuring Dask can help improve your performance
how to contribute fixes and improvements back to Dask when you find something missing or incorrect