Join the Dask Community Slack Workspace
During the event, the Dask Slack will be a forum for announcements, discussions, and sharing of resources. Following the event, the Dask Slack will remain open to the community as a forum for continued communication with the Dask developers and fellow users.
Following each keynote plenary session, join the speaker in Gather for a fireside chat.
Gather is a video chat platform where you control your avatar on a virtual map. As you get close to other avatars, your videos will pop up and you will be able to chat.
We demonstrate a practical case study: loading and interacting with hundreds of Sentinel satellite images and the results of their analysis in near real-time. The project was part of research on interactive visualization of out-of-core images, in the context of the Monash VegMap land cover study.
Zoom: https://zoom.us/j/95688944722?pwd=eUNVMlRVTnZNVzk2N0F3NjBPREkwQT09
Video Recording: https://zoom.us/rec/share/p2Wdhl7COnIBJh3qaBfKtvb7h8XCnQuZHlenYe8YCAFeLNy_dQrIUMC550cC_Etr._DVaRX39-Yiu4VXM?startTime=1621386157000
This talk covers our DevOps journey with Dask at Geoscience Australia, the architecture of our processing cluster, as well as some of the very expensive lessons we learnt along the way.
Zoom: https://zoom.us/j/95688944722?pwd=eUNVMlRVTnZNVzk2N0F3NjBPREkwQT09
Video Recording: https://zoom.us/rec/share/p2Wdhl7COnIBJh3qaBfKtvb7h8XCnQuZHlenYe8YCAFeLNy_dQrIUMC550cC_Etr._DVaRX39-Yiu4VXM?startTime=1621387868000
We describe an approach for efficiently embedding non-Dask algorithms into a Dask processing pipeline. By constructing a large contiguous memory array incrementally from a Dask graph, we were able to achieve significant peak memory reductions. We have used this approach to generate cloud-free Sentinel-2 Geometric Median and Median Absolute Deviations mosaics over Africa (10 m resolution).
Zoom: https://zoom.us/j/95688944722?pwd=eUNVMlRVTnZNVzk2N0F3NjBPREkwQT09
Video Recording: https://zoom.us/rec/share/p2Wdhl7COnIBJh3qaBfKtvb7h8XCnQuZHlenYe8YCAFeLNy_dQrIUMC550cC_Etr._DVaRX39-Yiu4VXM?startTime=1621389492000
We hope this panel discussion will start a conversation about using Dask in Australia, how we build our community, contribute and stay in touch with the rest of the world.
Zoom: https://zoom.us/j/95688944722?pwd=eUNVMlRVTnZNVzk2N0F3NjBPREkwQT09
Video Recording: https://zoom.us/rec/share/p2Wdhl7COnIBJh3qaBfKtvb7h8XCnQuZHlenYe8YCAFeLNy_dQrIUMC550cC_Etr._DVaRX39-Yiu4VXM?startTime=1621395255000
Dask down under is a chance for everyone in Oceania to forge links and build community here in our backyard. Dask down under will feature talks, tutorials and panel discussions on using Dask to accelerate research. All levels from beginner to expert are encouraged to attend.
Zoom: https://zoom.us/j/95688944722?pwd=eUNVMlRVTnZNVzk2N0F3NjBPREkwQT09
Video Recording: https://zoom.us/rec/share/3AnbFtOiRIARD3A6MdM1F0PrMvpASQxJQNWbt6pppYpRrx33EFGQYy-wLQWVQZ-H.bl2gwN915ju9aidF?startTime=1621402478000
The past decade has shown there is a steep learning curve for organizations trying to scale and productionalize ML systems quickly. At Walmart, we have developed several principles over the years that allow us to address this challenge. In this talk, I will discuss these principles and the open-source tools that enable us today.
Zoom: https://zoom.us/j/98736677197?pwd=ZkFmY3pORm1YbFBPaWdmR3pNYUVHZz09
Video Recording: https://zoom.us/rec/share/PfWZiEfZCpFLsdIyhSj5KiDNn4N40e8If8dEBgYOKlohtTH0RvIrAKPa1Qy7-KE.iF9wnXU-mOUL8r7z?startTime=1621429232000
This tutorial is intended for working and aspiring data professionals. A working knowledge of the basics of Dask and/or distributed computing is required, though knowledge of Dask’s internals is not. Tutorial attendees should walk away with a deeper understanding of Dask’s internals, an introduction to more advanced features, and ideas of how they can apply these features effectively to their own work.
Zoom: https://zoom.us/j/96686816162?pwd=cmhzMXQ2TDhsWDFrdjlVMjB6VE11QT09
Video Recording: https://zoom.us/rec/share/WDj9pYZ_LQKNfM1gQNEfSbb5wY60UFJE1ffIGLXidzmqGR5z3QIJMo4ieem3HBO3.IPJxdRQDU_9GmQuZ?startTime=1621432838000
In this workshop, we will discuss the different ways to run SQL queries on and with Dask using CPUs and GPUs. Being able to write SQL commands to query and transform data allows users to integrate the vast Dask and RAPIDS ecosystem into their BI workflows. We will discuss the current state of PyData SQL query engines and SQL integrations, and together find out where to head next.
Zoom: https://zoom.us/j/94693106912?pwd=NlpFbVBISjlIRytpL3YySDZGSUdjZz09
Video Recording: https://zoom.us/rec/share/nNBf5l_BihmkUgMzO4xJ8up6PN4fpzmWhWs3OzBJXvEuAVMIcqYRZAHC7igQEZz9.bNSav9g8_S2j0N5I?startTime=1621432869000
We share challenges encountered deploying Dask with JupyterHub for HPC for the high-energy physics community at the University of Nebraska-Lincoln. We describe how we combine multiple ways of launching resources, allowing Dask to submit workers directly to a batch system, and investigate improvements in network connectivity that allow scaling to large numbers of simultaneous facility users.
Zoom: https://zoom.us/j/96201769159?pwd=c0pTckQrT3lWU2lTNlNzRUs3U3Y0dz09
Video Recording: https://zoom.us/rec/share/g9S2QIeFd7gZ5bF7vqUZkxp6ciyzO6b9yf5hOT7PCgEmBO_fGWF-qm7ZYd5zPB09.txV07riikza9E7Tq?startTime=1621432905000
In this talk, we describe the lessons learned from running modern computer-vision tasks in an always-available Dask cluster.
Zoom: https://zoom.us/j/96201769159?pwd=c0pTckQrT3lWU2lTNlNzRUs3U3Y0dz09
Video Recording: https://zoom.us/rec/share/g9S2QIeFd7gZ5bF7vqUZkxp6ciyzO6b9yf5hOT7PCgEmBO_fGWF-qm7ZYd5zPB09.txV07riikza9E7Tq?startTime=1621434754000
As part of the data science team at Ipsen, we study the effect of medical products on patients. Computing on terabytes of data is a challenge; we will explain why we opted for Dask along with Kubeflow to address it. We will share the issues we faced and how we fixed them. Come and spend some time with us and you will delve into a real Dask use case.
Zoom: https://zoom.us/j/96201769159?pwd=c0pTckQrT3lWU2lTNlNzRUs3U3Y0dz09
Video Recording: https://zoom.us/rec/share/g9S2QIeFd7gZ5bF7vqUZkxp6ciyzO6b9yf5hOT7PCgEmBO_fGWF-qm7ZYd5zPB09.txV07riikza9E7Tq?startTime=1621436384000
AI algorithms for financial portfolio management require distributed computation based on complex workflows. We are happy to share with you how we transform large amounts of financial data with Dask to reach this goal. Dask distributed on Kubernetes is a simple but powerful solution for that. Furthermore, we will present a thin wrapper on top of Dask futures to allow lazy and partial computations.
Zoom: https://zoom.us/j/96201769159?pwd=c0pTckQrT3lWU2lTNlNzRUs3U3Y0dz09
Video Recording: https://zoom.us/rec/share/g9S2QIeFd7gZ5bF7vqUZkxp6ciyzO6b9yf5hOT7PCgEmBO_fGWF-qm7ZYd5zPB09.txV07riikza9E7Tq?startTime=1621438215000
This workshop will cover the most common methods for deploying Dask today. Starting with an overview of all the moving pieces within a Dask cluster (client, cluster, scheduler, workers), we will then talk through various platforms and the tools used to deploy onto them along with benefits, common challenges, and pitfalls.
Zoom: https://zoom.us/j/95024741930?pwd=Z3RTSUZRTmVxY0dWNGxROFU5Z2xaZz09
Video Recording: https://zoom.us/rec/share/leSavBFcTI304aH3FighyDnhyUz9qda_-sQurRmUXAeciJi-qXYy49vvRsOJzHg.wsSFSp9pSq7lU70w?startTime=1621440209000
Memory spilling is an important feature that makes it possible to run Dask applications that would otherwise run out of memory. When low on memory, Dask moves data from GPU memory to main memory and/or data from main memory to disk automatically. In this talk, we will walk through how spilling works in general, its shortcomings, and introduce a new Dask-CUDA approach to overcome these shortcomings.
Zoom: https://zoom.us/j/96201769159?pwd=c0pTckQrT3lWU2lTNlNzRUs3U3Y0dz09
Video Recording: https://www.youtube.com/watch?v=mHWk7y2p-NM
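To make the spilling idea in the abstract above more concrete, here is a minimal sketch of the main-memory-to-disk half of it. It is not Dask's or Dask-CUDA's implementation: the `SpillingStore` class, its attribute names, and the least-recently-used eviction policy are all assumptions chosen for illustration, and sizes are only estimated with `sys.getsizeof`.

```python
import os
import pickle
import sys
import tempfile
from collections import OrderedDict


class SpillingStore:
    """Hypothetical key-value store that spills least recently used values to disk."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes       # memory budget for in-memory values
        self.fast = OrderedDict()            # key -> value, least recently used first
        self.sizes = {}                      # key -> estimated size of in-memory value
        self.slow = {}                       # key -> path of the pickled (spilled) value
        self.spill_dir = tempfile.mkdtemp()  # directory holding spilled values
        self._counter = 0                    # used to build unique file names

    def __setitem__(self, key, value):
        self._discard(key)
        self.fast[key] = value
        self.sizes[key] = sys.getsizeof(value)  # rough estimate, shallow size only
        self._evict()

    def __getitem__(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)  # mark as most recently used
            return self.fast[key]
        with open(self.slow[key], "rb") as f:  # raises KeyError for unknown keys
            value = pickle.load(f)
        self[key] = value  # bring it back into memory; may spill older keys
        return value

    def _discard(self, key):
        """Forget any previous copy of `key`, in memory or on disk."""
        self.fast.pop(key, None)
        self.sizes.pop(key, None)
        path = self.slow.pop(key, None)
        if path is not None:
            os.remove(path)

    def _evict(self):
        """Spill least recently used values until the budget is respected."""
        while self.fast and sum(self.sizes.values()) > self.limit_bytes:
            key, value = self.fast.popitem(last=False)
            del self.sizes[key]
            self._counter += 1
            path = os.path.join(self.spill_dir, f"{self._counter}.pkl")
            with open(path, "wb") as f:
                pickle.dump(value, f)
            self.slow[key] = path
```

Reading a spilled key loads it back transparently, which may in turn push another key out to disk. A GPU-aware variant would add a further tier (device memory to host memory) in front of this one.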
When it takes hours or days to run your computation, it can take a long time before you notice something has gone wrong. This means your feedback loop for fixes can be very slow. Learn how logging, and in particular the causal tracing library Eliot, can help debug inconsistent calculations and spot input-specific performance problems in your Dask application.
Zoom: https://zoom.us/j/96201769159?pwd=c0pTckQrT3lWU2lTNlNzRUs3U3Y0dz09
Video Recording: https://zoom.us/rec/share/z60xsuVG0Dy6zWqeXlNK4SQg3rPqUpSskZdeE_Q1QV_1KASVR-D0Ns8NEon-qXes.eGa8sSblNYC_U_KD?startTime=1621443637000
A group of geospatial experts from the Pangeo community share their experiences using Dask, xarray and other tools. They’ll share their best practices and the pain points they’ve run into.
Zoom: https://zoom.us/j/92413783445?pwd=NThUMk51Mm0rVUhGNFVFS0lMbGJ1QT09
Video Recording: https://zoom.us/rec/share/L0L2G6jimgHnx1dYERXuM4mtBthwPLRRRTjskAAw45b9h994fXuGdGlJ9x4KXnK-.TqZ87pGSZhXtx7Jv?startTime=1621447260000
In this talk, we will learn how Dask helps Bumblebee, an open-source, data wrangling web app, to provide the user with data insight and data transformation feedback in real-time using Dask sync and async task handling. Also, we will talk about our experience with Apache Spark, the shortcomings we found when creating Bumblebee, and how Dask helps to achieve the user experience we envision.
Zoom: https://zoom.us/j/93636527572?pwd=b0k4M0l4TllJZW1nd1dTekcwUXcyQT09
Video Recording: https://zoom.us/rec/share/U77aw65fF4JMj24LmJrE62FLAb_MqfiJF6hByeeROGgoovsqLhUGOg-plOFxH29_.3RgWOhxwsx9WTI6K?startTime=1621449036000
After a brief introduction to particle physics datasets, I will discuss how we adapted the scientific Python ecosystem to efficiently and conveniently process these data, from ingestion, through manipulation with novel array programming techniques, to reduction using histograms. Then, I will describe how we currently scale our processing with Dask and how we would like to improve our solution.
Zoom: https://zoom.us/j/93636527572?pwd=b0k4M0l4TllJZW1nd1dTekcwUXcyQT09
Video Recording: https://zoom.us/rec/share/U77aw65fF4JMj24LmJrE62FLAb_MqfiJF6hByeeROGgoovsqLhUGOg-plOFxH29_.3RgWOhxwsx9WTI6K?startTime=1621450694000
NVTabular is a recommender-system focused feature-engineering and preprocessing library for tabular data. This talk will describe how NVTabular was built entirely on Dask-Dataframe to both simplify and accelerate model-training pipelines. The primary goals are to (1) present a successful example of Dask integration and to (2) communicate important lessons learned.
Zoom: https://zoom.us/j/93636527572?pwd=b0k4M0l4TllJZW1nd1dTekcwUXcyQT09
Video Recording: https://zoom.us/rec/share/U77aw65fF4JMj24LmJrE62FLAb_MqfiJF6hByeeROGgoovsqLhUGOg-plOFxH29_.3RgWOhxwsx9WTI6K?startTime=1621452552000
Everything changed this year, including how we work together. In the last decade, the explosion of Python open-source tools has fundamentally changed science. Emerging cloud computing collaborative workspaces are opening up opportunities to realize a new vision of how science advances.
Zoom: https://zoom.us/j/97626591336?pwd=cW83b0c1S2lXdWNnT0VLWVNTMDVUQT09
Video Recording: https://zoom.us/rec/share/HSh9CXoGW4_LZzA9lp7d53D_F8otdO6VHycCLm_hBVv8QvRrz29dMtYuyFhO4o1S.Ml-80sFqEs63RvrH?startTime=1621454432000
Oríon is a framework for asynchronous hyperparameter optimization (HPO) built around two main principles: (1) HPO should be effortless to execute in common machine learning workflows, and (2) new HPO algorithms should be readily available for practitioners. In this talk we will present Oríon and its core design principles, followed by an integration example with Dask demonstrating its simplicity of use.
Zoom: https://zoom.us/j/93212239346?pwd=RWpCRmVFRDJ0MGZpeHZub2FtTGliZz09
Video Recording: https://zoom.us/rec/share/88DPZedHJaAUbhYyi1KbhX0Xp8l6CE4tvyRwzuI3xEY5gCqVm58zaeEH8iAPwLcC.uaAvEOfZVA3a4-qj?startTime=1621458039000
Metagraph is an experimental library designed to glue together a fragmented world of graph libraries. However, Metagraph extends Dask in ways that have broader potential. This talk will explore the components of Metagraph: a multiple dispatch system, a data translation system, and a plugin-based DAG compiler. These ideas will motivate a wishlist of possible enhancements to the Dask core.
Zoom: https://zoom.us/j/93212239346?pwd=RWpCRmVFRDJ0MGZpeHZub2FtTGliZz09
Video Recording: https://zoom.us/rec/share/88DPZedHJaAUbhYyi1KbhX0Xp8l6CE4tvyRwzuI3xEY5gCqVm58zaeEH8iAPwLcC.uaAvEOfZVA3a4-qj?startTime=1621459800000
Dask has transformed what is possible with Python and Data Science.
However, while Dask has solved many of the technical challenges of parallelism, there remain challenges for institutional adoption. How does this get deployed? Is it supported? Is it secure?
This talk describes Coiled, a company based around Dask, and how it strives to enable the use of Dask by everyone, everywhere.
Zoom: https://zoom.us/j/93212239346?pwd=RWpCRmVFRDJ0MGZpeHZub2FtTGliZz09
Video Recording: https://zoom.us/rec/share/88DPZedHJaAUbhYyi1KbhX0Xp8l6CE4tvyRwzuI3xEY5gCqVm58zaeEH8iAPwLcC.uaAvEOfZVA3a4-qj?startTime=1621461720000
An overview of experience using Dask in an HPC environment (SLURM) for an academic project.
Zoom: https://zoom.us/j/93212239346?pwd=RWpCRmVFRDJ0MGZpeHZub2FtTGliZz09
Video Recording: https://zoom.us/rec/share/88DPZedHJaAUbhYyi1KbhX0Xp8l6CE4tvyRwzuI3xEY5gCqVm58zaeEH8iAPwLcC.uaAvEOfZVA3a4-qj?startTime=1621463497000
Many image processing tasks are pixel-independent and embarrassingly parallel. In our pipeline, however, labelling continuous patches requires calculations across neighbourhoods of pixels, which makes parallelization more complex. Buffered tiles solve this by allowing non-parallel pixel operations within tiles to be run simultaneously across many tiles, and so provide tile-level parallelization.
Zoom: https://zoom.us/j/97179558778?pwd=dDF1VmRta1plbC8vWi9Oc1o4QTAxdz09
Video Recording: https://zoom.us/rec/share/5TA3atKA9GaUEvI4P8iGRooXZd2WasnxM5tgqpcoCaSUNUsKdEgHhv9lFAHLPFeP.mXYHn0JQIlo4Wsis?startTime=1621472512000
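The buffered-tile idea from the abstract above can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the speakers' pipeline: it uses a one-dimensional signal, a plain neighbourhood sum stands in for patch labelling, threads stand in for Dask workers, and all function names are hypothetical. Each tile is extended by a buffer as wide as the neighbourhood radius, processed independently, and then trimmed back to its core.

```python
from concurrent.futures import ThreadPoolExecutor


def neighbourhood_sum(values, radius):
    """Sum each element with its neighbours within `radius` (truncated at the edges)."""
    n = len(values)
    return [sum(values[max(0, i - radius):min(n, i + radius + 1)]) for i in range(n)]


def process_tile(args):
    """Run the neighbourhood operation on one buffered tile, then trim the buffer."""
    values, start, stop, radius = args
    lo = max(0, start - radius)            # extend the tile to the left by the buffer
    hi = min(len(values), stop + radius)   # extend the tile to the right by the buffer
    result = neighbourhood_sum(values[lo:hi], radius)
    offset = start - lo                    # position of the tile core inside the buffered tile
    return result[offset:offset + (stop - start)]


def buffered_tile_sum(values, tile_size, radius):
    """Apply the neighbourhood sum tile by tile, with tiles processed concurrently."""
    tiles = [
        (values, start, min(start + tile_size, len(values)), radius)
        for start in range(0, len(values), tile_size)
    ]
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(process_tile, tiles))  # results come back in tile order
    return [x for part in parts for x in part]
```

Because the buffer is at least as wide as the neighbourhood, the tiled result matches the result computed over the whole signal at once, while each tile can be handled by a separate worker.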
Dask offers a level of simplicity that makes distributed computing accessible to a broad scientific community. Because of this simplicity, users often overlook some key features of the various Dask schedulers. We will present an outline of the types of schedulers and their uses. We will then focus on AARNet’s SWAN environment, which provides a user with a single large node with 36 cores with 256 Gb
Zoom: https://zoom.us/j/97179558778?pwd=dDF1VmRta1plbC8vWi9Oc1o4QTAxdz09
Video Recording: https://zoom.us/rec/share/5TA3atKA9GaUEvI4P8iGRooXZd2WasnxM5tgqpcoCaSUNUsKdEgHhv9lFAHLPFeP.mXYHn0JQIlo4Wsis?startTime=1621474466000
Dask down under is a chance for everyone in Oceania to forge links and build community here in our backyard. Dask down under will feature talks, tutorials and panel discussions on using Dask to accelerate research. All levels from beginner to expert are encouraged to attend.
Zoom: https://zoom.us/j/97179558778?pwd=dDF1VmRta1plbC8vWi9Oc1o4QTAxdz09
Video Recording: https://zoom.us/rec/share/h17UHEn1XjiTgHYNhzVA-M9t8YQ0s-VigchtGYSU_fa9cycCJjSoRdgovStYUQZR.SMwWWnIj7IaFIRsj?startTime=1621489025000
The Python ecosystem provides a nice set of tools for working with geospatial vector data, including Shapely and GeoPandas. Efforts are popping up to improve the performance and scalability of those workflows, such as dask-geopandas and spatialpandas. This workshop will give an overview of work by the community, and foster discussion on improvements and interoperability between the libraries.
Zoom: https://zoom.us/j/98654378161?pwd=eTMvRCtoV3hIaEFyd1Y3Ulh6QVh5dz09
Video Recording: https://zoom.us/rec/share/WFRbJZNlsdg49Oh1XxYmtheG2KTbL4-5AOj-_GuyBMjlLSQpabb0k2pbWPDgczmw.-F5hnexY3fEZ77_Z?startTime=1621508432000
This talk illustrates recent and ongoing work that rethinks how Dask.distributed manages memory across the cluster. Data should be automatically and transparently moved around workers to optimize memory occupancy, prevent workers from hanging, and increase robustness all around.
Zoom: https://zoom.us/j/98879950901?pwd=M056a2gwTHVJeERFWEEycVFMMDNXUT09
Video Recording: https://zoom.us/rec/share/ppUZ0OX73xgf17rFRQJjQ79wFRjsPxtd6bnzXfpsNrdS2QrlO_kpDFpXdIqp3BJH.oKLsesiFjpqQEAOx?startTime=1621510195000
Dask is a Python package that provides advanced parallelism for analytics, enabling performance at scale for the tools you love. People think it’s magic – drop it in and it scales. This will mostly work, but it will not scale well!
We would like to share what we’ve learned about using Dask to scale dataframes and computations, to help you avoid making the same mistakes.
Zoom: https://zoom.us/j/98879950901?pwd=M056a2gwTHVJeERFWEEycVFMMDNXUT09
Video Recording: https://zoom.us/rec/share/M_BanH4yXBLTiZq6HS0BHU3iMdElpDfY9bv_IF_0fWah9JCn3wSTedL2lrU6hhLF.gZNIyeTVeFhLDLzB?startTime=1621512178000
Rubicon is an open source data science tool that captures and stores model training and execution information, like parameters and outcomes, in a repeatable and searchable way to ensure full auditability and reproducibility for developers and stakeholders alike. It was built for the PyData ecosystem and works well with Dask.
Zoom: https://zoom.us/j/98879950901?pwd=M056a2gwTHVJeERFWEEycVFMMDNXUT09
Video Recording: https://zoom.us/rec/share/M_BanH4yXBLTiZq6HS0BHU3iMdElpDfY9bv_IF_0fWah9JCn3wSTedL2lrU6hhLF.gZNIyeTVeFhLDLzB?startTime=1621514113000
Zoom: https://zoom.us/j/97487416320?pwd=L3c1cGZhWUQwYm5ObitteWJ1MWNDQT09
Video Recording: https://zoom.us/rec/share/_KQiagnZBnW0xPIHAHZZs553JuFsJXfkQpdwUmTca7-QE91safAQCefSYr6-Kz3R.EDHHYf_864Dw3v8Q?startTime=1621515755000
The Finance workshop aims to bring together Dask users and developers working in the financial industry to learn from each other's experiences and find ways to collaborate going forward. We will start with lightning talks to share experiences using Dask and generate discussion. We will then form breakout groups to focus on common topics and identify ways to work together.
Zoom: https://zoom.us/j/97343824192?pwd=Z1kydWxhbnpDOFR0K0gyclpNeHVXUT09
Video Recording: https://zoom.us/rec/share/LpPiwH---f6X5dJhliJkpoVU0YsAu6_rj7QRscY_0nBh4ocrAE9VVGkC6jabINeH.SLwgV_UmFiwQzDqd?startTime=1621519220000
The goal of this workshop is to bring together scientists, software developers and HPC center administrators to share their experiences with interactive supercomputing, using Dask in High Performance Computing (HPC) settings.
Zoom: https://zoom.us/j/91597434303?pwd=NC9UUllBYXc1a2pHMXVpTGU1UHFJUT09
Video Recording: https://zoom.us/rec/share/25oqsPdh_ZeZSFmvvbOLzJ0Sa8Q7py6GiLrVra8UHJmi0rZ1YLm69Jpt9NUiN4Cm.1MS7-WLlIPa3Ewku?startTime=1621519269000
In this talk, attendees will learn about LightGBM, a popular gradient boosting library. The talk offers details on distributed LightGBM training, and describes the main implementation of it using Dask. Attendees will learn which pieces of the Dask ecosystem LightGBM relies on, and what challenges LightGBM faces in using Dask to wrap existing distributed training code written in C++.
Zoom: https://zoom.us/j/97698388581?pwd=cGlPLzljUlNCWm9ZQ3U5T0JoYnNEZz09
Video Recording: https://zoom.us/rec/share/2MDNheUjidMT7EOcVuD0qnCph3OGnk9Wjf6QZo-8YLO95bzCEHaiDH6I5LmeqXE.Y87S6St0o2DuG29G?startTime=1621521150000
mlforecast is a framework to perform scalable machine learning based time series forecasting. It performs every step of the process in a distributed way, allowing you to scale to massive amounts of data. Dask is used for the parallelism, so you can use it either on a single machine or on remote clusters.
Zoom: https://zoom.us/j/97698388581?pwd=cGlPLzljUlNCWm9ZQ3U5T0JoYnNEZz09
Video Recording: https://zoom.us/rec/share/2MDNheUjidMT7EOcVuD0qnCph3OGnk9Wjf6QZo-8YLO95bzCEHaiDH6I5LmeqXE.Y87S6St0o2DuG29G?startTime=1621522979000
We’ll use a fundamental conservation workload, land use / land cover change detection, to demonstrate how Dask can scale geospatial workloads. We’ll use tools like STAC and GDAL to efficiently query a geospatial dataset and mosaic many images into a single xarray DataArray. We’ll then apply a PyTorch model to do the actual land use classification, and xarray to analyze the change.
Zoom: https://zoom.us/s/97698388581?pwd=cGlPLzljUlNCWm9ZQ3U5T0JoYnNEZz09
Video Recording: https://zoom.us/rec/share/2MDNheUjidMT7EOcVuD0qnCph3OGnk9Wjf6QZo-8YLO95bzCEHaiDH6I5LmeqXE.Y87S6St0o2DuG29G?startTime=1621524734000
Dash is a framework for developing analytic web apps in Python. This talk will describe Dash’s design, and how it enables efficient scaling to support large numbers of simultaneous users. Then, several architectures will be presented that can be used to combine the strengths of Dash with the strengths of Dask Distributed to create apps that scale to support large datasets and many users.
Zoom: https://zoom.us/s/97698388581?pwd=cGlPLzljUlNCWm9ZQ3U5T0JoYnNEZz09
Video Recording: https://zoom.us/rec/share/2MDNheUjidMT7EOcVuD0qnCph3OGnk9Wjf6QZo-8YLO95bzCEHaiDH6I5LmeqXE.Y87S6St0o2DuG29G?startTime=1621526599000
Do you use the Scikit-learn library to build machine learning models? In this tutorial, we’ll discuss how to avoid the traps that lead to hard-to-maintain code while implementing customizations to these algorithms. We will cover how building your own estimators can lead to easily scaling your model training with additional libraries like Dask and Dask-ML with much less code than you might think!
Zoom: https://zoom.us/j/91216667381?pwd=T201dnR0NVQ0Nms0WDJXK3JDRFhiQT09
Video Recording: https://zoom.us/rec/share/p_Pqd6IJ_YnMEGGsmnnGTxH4_w8FQpG4H_v_4BcsjEE5iZANhjZq_5qDNau5oUiV.IBwADS6g6LDF-ORg?startTime=1621526617000
RAPIDS supercharges data science with NVIDIA accelerated compute. Paired with Dask, data professionals can build highly-performant, distributed workloads with a comfortable toolset similar to favorites like pandas or scikit-learn. In this workshop, we’ll discuss how Dask + RAPIDS empower practitioners, how to start with these tools quickly, and how they’re used to solve common challenges.
Zoom: https://zoom.us/j/95362663568?pwd=N2tyU29RLzcycjFsWld0R2tYdmpVQT09
Video Recording: https://zoom.us/rec/share/X5h8Hp1Z2revdkiVjYcYImp2iV9d-abgyxB7ItrvRNIepAwgmlJnl3TINtAajyM6.P9lfQS35xyprfZv_?startTime=1621526701000
Zoom: https://zoom.us/j/97520043754?pwd=VXlIclBUSTJTWk9NSDN0TVBvYitOdz09
Video Recording: https://zoom.us/rec/share/glHDH22kLmNd5cGLxy-ep362J6YBPFP8Jq3LotJKpDBQq2wHtWaW6D0QkG3jAHQe.jxiB880gM8xb_eyY?startTime=1621532078000
Data volumes and computational complexity of analysis techniques have increased, but the need to quickly explore data and develop models is more important than ever. One of the key ways to achieve this has been through GPU acceleration. This workshop introduces RAPIDS, and illustrates how to use Dask and RAPIDS to accelerate ETL/ML workloads, increasing performance and decreasing total cost.
Zoom: https://zoom.us/j/94954946323?pwd=ZXZpRGxRMjFkanlJYlRSbXQ1cGpoQT09
Video Recording: https://zoom.us/rec/share/ecFv8qlWZPyfiGv4zjFNTYErrA5H79mluVWBRegCpeATQhrWQUWj0cTKex4MITQ.fTnVW38TuJ7gmwwz?startTime=1621535703000
This talk presents a blueprint for bringing Dask workloads to HPC grids. We implement an architecture which allows dynamic sharing of compute resources within and between multi-tenant environments where Dask clusters are defined in secure, flexible, and repeatable templates.
Zoom: https://zoom.us/j/96380136685?pwd=aXM5b0pUOUFCY01QTnFWRGcwOWMrdz09
Video Recording: https://zoom.us/rec/share/EaWoBAe_FGV0iTf_ma_qzbLDn5kEnD5rInMGLqg9bKT_ieNRn2JV1B49n7iPbUb4.IC6rqtc_bpEJneGi?startTime=1621535446000
Ray is a distributed task execution system that provides a simple API for building distributed applications, and has a large ecosystem of libraries for training and serving machine learning models. As part of a recent effort to expand support for Ray-based data processing and data analytics, Dask-on-Ray was developed to allow users to run Dask workloads on Ray.
Zoom: https://zoom.us/j/96380136685?pwd=aXM5b0pUOUFCY01QTnFWRGcwOWMrdz09
Video Recording: https://zoom.us/rec/share/EaWoBAe_FGV0iTf_ma_qzbLDn5kEnD5rInMGLqg9bKT_ieNRn2JV1B49n7iPbUb4.IC6rqtc_bpEJneGi?startTime=1621537347000
What makes a distributed system framework different than other libraries? How does one’s mental model need to change when thinking about writing code for distributed systems? What makes one distributed system framework different from another? What are some common trade-offs made amongst Dask, Spark, and Ray and how are these three different from other classes of distributed systems?
This talk will answer all of these questions, and include pictures of my amazing puppy dog for when you zone out.
Zoom: https://zoom.us/j/99885256199?pwd=UDJXeGp5S09JYjZNUUYwMTIzdVJVUT09
Video Recording: https://zoom.us/rec/share/lM49KwUGKdkDofST3Xn2R00UlzF5H0osg6nzwN2k8K8kdvlvY8dh1Tnthv2GMBvR.KgrRqPQNKS7_36xG?startTime=1621540826000
Astronomers are learning to use Dask to analyze terabyte to petabyte scale data from the upcoming Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST) to unlock the mysteries of dark energy leading to the accelerated expansion of the Universe. Anticipating and balancing memory usage during interactive analyses across a high-performance computing center is proving challenging.
Zoom: https://zoom.us/j/91901424359?pwd=YjF1MlQreWNRNnRnMU9GVjhyK1VMUT09
Video Recording: https://zoom.us/rec/share/FqQC3hrNop_S4txcjls2JSd42qjh6TT0Ecz-RGdNvP_ehdJAdeLmnKCtwZmVj6ag.7BsUwkF2xss0Jyx-?startTime=1621600204000
In early 2019, SymphonyRM started a very ambitious R&D phase to develop machine learning models across many disease areas on clinical healthcare data. To achieve our vision, we needed infrastructure to give our data scientists superpowers. The combination of Dask and Prefect turned out to be incredibly powerful and productive, allowing us to achieve excellent results.
Zoom: https://zoom.us/j/97590283314?pwd=czU3SURreUVZNEExdkRPV3VkRy9oUT09
Video Recording: https://zoom.us/rec/share/esxiBx7J6_0WlHox7o940oUpORfoGJmuaaYftPLXWY8NFAWojgqRw-Me0AzNPXxL.bsgoJLNypUBkEHOC?startTime=1621602077000
Next-generation radio telescopes generate vast and ever-increasing quantities of data, but current software is not designed to operate in a parallel, distributed paradigm. This workshop brings together three strands of distributed Dask radio astronomy development by SARAO, NRAO and SKAO, to provide a forum for the above challenge and to serve as a platform for future developments.
Zoom: https://zoom.us/j/99981786966?pwd=TGhRK2JRaStCeFpjc1EveGdrdlpxUT09
Video Recording: https://zoom.us/rec/share/yix_kJ3WHO1uGZdRl-x8aXrJmaO8YQ52vXw8aauBw40WD-Tk3A2zxY4gZmhlAEPT.9kCqC6d1gWrB-ODT?startTime=1621605674000
Dask contains many functions for data IO for arrays and dataframes. In this workshop, we will discuss the current status of various data format integrations for Dask and, more generally, the parallel/cloud-friendly data storage landscape.
Zoom: https://zoom.us/j/94963743297?pwd=TFBZanBwRGtxZHpMMVZjRURaS3VSUT09
Video Recording: https://zoom.us/rec/share/ICh-jVQ0BqFEAK4eNHM4_5ask04qALFz1N7AvEPFKD6heapvK2L9dE5QwEG0_lfd.MzvvmByibednS33f?startTime=1621605833000
Training machine learning (ML) models can easily take hours. A common parameter with big data is the “batch size,” the number of training examples used to approximate the gradient. Training time can be minimized by increasing the batch size with certain distributed systems. Our experimental results show that using our wrapper for a deep network requires less wall-clock time than standard SGD.
Zoom: https://zoom.us/j/93640713876?pwd=dm5zQm96S3NEZEpWc1ZLR0xVY1lUdz09
Video Recording: https://zoom.us/rec/share/FaL6Rd8HMmkzcbxUdm4gymKEUG2z1gtJHJy6HI109g8EdVj637JxG2fZyncckd85.Vw2veiLe45QKGsay?startTime=1621605605000
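As a small illustration of what the abstract above means by batch size, the sketch below estimates the gradient of a mean squared error loss from a random batch of examples. It is not the speakers' wrapper: the one-parameter model `y = w * x`, the function names, and the use of the standard library's random sampling are all assumptions made for the example.

```python
import random


def full_gradient(w, xs, ys):
    """Exact gradient of the mean squared error of y = w * x over all examples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def minibatch_gradient(w, xs, ys, batch_size, rng):
    """Approximate the same gradient from `batch_size` randomly chosen examples."""
    indices = rng.sample(range(len(xs)), batch_size)  # sample without replacement
    return sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in indices) / batch_size


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by w = 2
rng = random.Random(0)     # seeded so that runs are repeatable

# A small batch gives a noisy estimate; the full batch recovers the exact gradient.
estimate = minibatch_gradient(0.0, xs, ys, batch_size=2, rng=rng)
exact = full_gradient(0.0, xs, ys)
```

Larger batches average over more examples, so the estimate is less noisy, and the per-example terms are independent, which is what makes the work easy to spread across workers.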
Raster analysis sits at the core of work done in Geo domains, specifically GIS and Environmental Studies. Every domain, from finance to manufacturing, has a Geo component, so being able to wrangle large amounts of raster data is an asset. This talk will show you how Dask beefs up the geo-raster Python stack and how users from different domains can look to GIS to solve interesting problems.
Zoom: https://zoom.us/j/93640713876?pwd=dm5zQm96S3NEZEpWc1ZLR0xVY1lUdz09
Video Recording: https://zoom.us/rec/share/FaL6Rd8HMmkzcbxUdm4gymKEUG2z1gtJHJy6HI109g8EdVj637JxG2fZyncckd85.Vw2veiLe45QKGsay?startTime=1621607440000
Dask DataFrame groupby operations are very common and very powerful. However, due to the distributed nature of Dask DataFrames, they can fail in unexpected ways. This talk covers mitigation strategies for these problems, including using set_index to optimize data layout, and using split_out and split_every parameters to optimize computation.
Zoom: https://zoom.us/j/93640713876?pwd=dm5zQm96S3NEZEpWc1ZLR0xVY1lUdz09
Video Recording: https://zoom.us/rec/share/FaL6Rd8HMmkzcbxUdm4gymKEUG2z1gtJHJy6HI109g8EdVj637JxG2fZyncckd85.Vw2veiLe45QKGsay?startTime=1621609177000
I took a neural network I had trained on a single CPU to generate pet names and tried retraining it with tons of connected GPUs using Dask, PyTorch, and the package dask-pytorch-ddp. I learned a lot about when is the right time to use multiple GPUs and what the pitfalls can be. In this talk I’ll discuss what these lessons mean for training with GPUs and Dask.
Zoom: https://zoom.us/j/93640713876?pwd=dm5zQm96S3NEZEpWc1ZLR0xVY1lUdz09
Video Recording: https://zoom.us/rec/share/FaL6Rd8HMmkzcbxUdm4gymKEUG2z1gtJHJy6HI109g8EdVj637JxG2fZyncckd85.Vw2veiLe45QKGsay?startTime=1621610908000
During this workshop, we would like to discuss upcoming challenges and different Dask use cases in the high-energy physics community, as well as deployment methods across HPC, on-premise high-throughput clusters and cloud providers. A key challenge is the integration of interactive scale-out systems within the existing federated scientific grid computing infrastructure.
Zoom: https://zoom.us/j/94873155498?pwd=RUZFZjUyQWhXSWNGYXIzOVhraXk0Zz09
Video Recording: https://zoom.us/rec/share/nWFRmpyzXqhOuI4fRgv3csOKbaEVzgA3yq1pTYCyGCCK0xjUQ3FBTLVioqcHRW7S.9-vMMxrI_ZEvHP7f?startTime=1621612811000
Xarray provides metadata-rich data structures that wrap array-like objects such as Dask arrays. This two-part session will highlight recent exciting advances in Xarray’s capabilities, and present user stories of Xarray+Dask usage across a wide variety of domains.
Zoom: https://zoom.us/j/95161275110?pwd=eWhMTzFtd1NuMTAvSWNOOW5qVnIyQT09
Video Recording: https://zoom.us/rec/share/BREqZXbqFf3d4B3J9QHAlWCIOuHdRgTq8hBtXSNl6UaSJPneitf_ZLU3vhF5Cjoh.sKTiTOE2DW6qXHlF?startTime=1621612831000
Going from pandas or cudf to dask-cudf can unlock big and latency-sensitive analytics workloads… if done right. However, dask-cudf is quite new and multi-GPU computing faces NUMA hazards. This talk shares our experience with dask-cudf from two perspectives: A case study in tackling 100 GB/s for extracting an identity graph from big logs, and our top lessons in going to production.
Zoom: https://zoom.us/j/93640713876?pwd=dm5zQm96S3NEZEpWc1ZLR0xVY1lUdz09
Video Recording: https://zoom.us/rec/share/FaL6Rd8HMmkzcbxUdm4gymKEUG2z1gtJHJy6HI109g8EdVj637JxG2fZyncckd85.Vw2veiLe45QKGsay?startTime=1621612748000
Stream processing is experiencing exponential growth with businesses and services relying heavily on real-time analytics, inferencing, monitoring, and more. Reliable, cost-effective streaming at scale is paramount, but auto-scaling has hit cost-efficiency limits with CPUs. This talk will be about how NVIDIA is leveraging Dask to GPU-accelerate big data stream processing at scale in production.
Zoom: https://zoom.us/j/93640713876?pwd=dm5zQm96S3NEZEpWc1ZLR0xVY1lUdz09
Video Recording: https://zoom.us/rec/share/FaL6Rd8HMmkzcbxUdm4gymKEUG2z1gtJHJy6HI109g8EdVj637JxG2fZyncckd85.Vw2veiLe45QKGsay?startTime=1621614790000
The Dask JupyterLab extension provides integration between Dask and JupyterLab. I will show how to use the JupyterLab panel system to create custom layouts for the distributed dashboards, as well as how to use the integrated Dask cluster manager to start, stop, and customize your Dask clusters. I will finish with ideas for how people with either Dask or frontend experience could contribute.
Zoom: https://zoom.us/j/93640713876?pwd=dm5zQm96S3NEZEpWc1ZLR0xVY1lUdz09
Video Recording: https://zoom.us/rec/share/FaL6Rd8HMmkzcbxUdm4gymKEUG2z1gtJHJy6HI109g8EdVj637JxG2fZyncckd85.Vw2veiLe45QKGsay?startTime=1621616392000
Capital One uses Dask and its ecosystem to great effect and gains more internal users regularly. This survey talk outlines who is using Dask, why they use Dask, how they deploy Dask, and the challenges they encounter. We will also consider the future of Dask and its usage within Capital One.
Zoom: https://zoom.us/j/93640713876?pwd=dm5zQm96S3NEZEpWc1ZLR0xVY1lUdz09
Video Recording: https://zoom.us/rec/share/FaL6Rd8HMmkzcbxUdm4gymKEUG2z1gtJHJy6HI109g8EdVj637JxG2fZyncckd85.Vw2veiLe45QKGsay?startTime=1621618150000
As workloads scale, the overhead of processing tasks itself becomes a bottleneck. Improved scalability requires not only a faster scheduler, but a coordinated effort across the entire Dask ecosystem. High performance computing is not about doing one thing well; it’s about doing nothing poorly. In this workshop, we’ll cover an ongoing multi-institutional effort to accelerate Dask scheduling.
Zoom: https://zoom.us/j/98534272079?pwd=U2d6MEh6TWF3cGcwY0JzMnpROG5qZz09
Video Recording: https://zoom.us/rec/share/vO7Xx5HAupQHDEjTgmspp6VZoCY_zabi1nouOHCwl4mBSg2hVQpIFqTTpVQUNZA4.8cE5K7SjaMIUNINW?startTime=1621620212000
As data pipelines become increasingly complex and interconnected, workflow management systems are being used to schedule and monitor tasks. Prefect is an open-source workflow management system designed for large-scale data processes. We’ll show how to get started with Prefect and also cover how to run Prefect on top of Dask on the cloud to parallelize workflows.
Zoom: https://zoom.us/j/93640713876?pwd=dm5zQm96S3NEZEpWc1ZLR0xVY1lUdz09
Video Recording: https://zoom.us/rec/share/FaL6Rd8HMmkzcbxUdm4gymKEUG2z1gtJHJy6HI109g8EdVj637JxG2fZyncckd85.Vw2veiLe45QKGsay?startTime=1621620075000
Dask allows users to scale their Python code; however, it is not usually easy to provision the machines necessary to run that code. Google Cloud has a wide range of machine sizes (200+ CPUs, 600+ GB of memory) and types that can be provisioned in minutes. Add to that a wide range of GPUs, including the single-node 16 A100 GPU shape, and you can use Dask on the cluster of your dreams.
Zoom: https://zoom.us/j/93640713876?pwd=dm5zQm96S3NEZEpWc1ZLR0xVY1lUdz09
Video Recording: https://zoom.us/rec/share/FaL6Rd8HMmkzcbxUdm4gymKEUG2z1gtJHJy6HI109g8EdVj637JxG2fZyncckd85.Vw2veiLe45QKGsay?startTime=1621621817000