An overview of experience with Dask in an HPC environment (SLURM) for an academic project.
This is a non-technical talk summarizing the experience of using Dask in an academic project that studies the long-term development of the United States using full-count Census data from 1850 to 1940 (provided by IPUMS USA).
Dask was used in an HPC environment (SLURM) to parallelise various computations, such as operations on very large sparse matrices, prediction tasks (XGBoost), and data merges. The talk will not cover the details of those computations, but will highlight some of the recurring patterns and challenges of working with Dask on a SLURM cluster.
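To give a flavour of the setup, below is a minimal sketch of launching Dask workers as SLURM jobs via dask-jobqueue; the queue name and resource sizes are placeholders, not the project's actual configuration.

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Each SLURM job runs one Dask worker with the resources requested below.
# "general", 8 cores, 32GB, and the walltime are illustrative values only.
cluster = SLURMCluster(
    queue="general",       # hypothetical SLURM partition name
    cores=8,               # CPU cores per worker job
    memory="32GB",         # memory per worker job
    walltime="02:00:00",   # per-job time limit
)

# Request 10 worker jobs; they join the cluster as SLURM starts them.
cluster.scale(jobs=10)

# Connect a client so subsequent Dask computations run on those workers.
client = Client(cluster)
```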
No background knowledge of Dask or Census-related data is needed.