19 May 22:30 – 19 May 23:00 in Talks

Record linkage on a SLURM cluster with Dask

Sultan Orazbayev

Audience level:
Novice

Description

An overview of experience using Dask in an HPC environment (SLURM) for an academic project.

Abstract

This is a non-technical talk summarizing the experience of using Dask in an academic project that studies the long-term development of the United States using full-count Census data from 1850 to 1940 (provided by IPUMS USA).

Dask was used in an HPC environment (SLURM) to parallelise various computations, such as operations on very large sparse matrices, prediction tasks (XGBoost), and data merges. The talk will not cover the details of those computations, but will highlight some of the recurring patterns and challenges of working with Dask on a SLURM cluster.

No background knowledge of Dask or Census-related data is needed.