Time Zone: UTC

21 May 16:00 – 21 May 18:00 in Tutorials / Workshops 1

Dask in High-Energy Physics community

Oksana Shadura, Lukas Heinrich

Audience level:
Novice

Description

During this workshop, we would like to discuss upcoming challenges and different Dask use cases in the High-Energy Physics community, as well as deployment methods across HPC, on-premise high-throughput clusters, and cloud providers. A key challenge is the integration of interactive scale-out systems within the existing federated scientific grid computing infrastructure.

Abstract

Short description

The High-Energy Physics community will face major challenges during the upcoming High-Luminosity LHC (HL-LHC) era. The number of interesting particle collisions - "events" - that need to be processed for analysis will increase by about a factor of thirty, resulting in multi-petabyte datasets that should be analyzed as interactively as possible. For this reason, the community is evaluating the new, scalable data analysis tools that have become available.

Traditionally, HEP has used C++-based data analysis tools, but recently a vibrant Python-based ecosystem has emerged. This new Python ecosystem provides a diverse suite of analysis tools that are in broad use in the scientific community. Many scientists are taking advantage of Jupyter notebooks, which provide a straightforward way of documenting a data analysis. Column-wise data analysis, in which a single operation on a vector of events replaces serial calculations on individual events, is seen as a way for the field to take advantage of vector processing units in modern CPUs and GPUs, leading to significant speed-ups in throughput. Declarative programming paradigms can also make analyses simpler for physicists to write intuitively and easier to reproduce. As the use of Machine Learning-based methods increases, large-scale training of models on distributed systems is a key requirement.
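The column-wise idea can be illustrated with a minimal NumPy sketch, assuming hypothetical event data with momentum components px and py: one vectorized expression over whole columns replaces the per-event Python loop.

```python
import numpy as np

# Hypothetical event data: momentum components for many events.
rng = np.random.default_rng(42)
n_events = 100_000
px = rng.normal(0.0, 10.0, n_events)
py = rng.normal(0.0, 10.0, n_events)

# Serial, per-event style: one calculation per event in a Python loop.
def pt_loop(px, py):
    out = np.empty(len(px))
    for i in range(len(px)):
        out[i] = (px[i] ** 2 + py[i] ** 2) ** 0.5
    return out

# Column-wise style: a single operation on whole columns of events,
# which maps onto the vector units of modern CPUs and GPUs.
def pt_columnar(px, py):
    return np.sqrt(px ** 2 + py ** 2)

pt = pt_columnar(px, py)
```

Both functions compute the same transverse-momentum-like quantity; the column-wise form is typically orders of magnitude faster because the loop runs in compiled code rather than in the Python interpreter.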

Recently, the High-Energy Physics community has started to explore Dask as one of the modern, flexible libraries for parallel computing that bridges easily with the Python ecosystem and can be used in physics data analysis frameworks. Another point of interest is exploring the limits of Dask's scalability compared to the traditional batch approach. During this workshop, we would like to discuss upcoming challenges and different Dask use cases in our community, as well as deployment methods across HPC, on-premise high-throughput clusters, and cloud providers. A key challenge is the integration of interactive scale-out systems within the existing federated scientific grid computing infrastructure.
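A minimal sketch of how Dask composes with plain Python code: dask.delayed builds a task graph from ordinary functions, and the same graph can later be executed on a distributed cluster. The per-file function here is a hypothetical stand-in for a real analysis step.

```python
from dask import delayed

# Hypothetical per-file analysis step; in a real HEP workflow this would
# read one file of events and return a partial result such as a histogram.
@delayed
def process_file(n):
    return sum(range(n))  # stand-in for an event-selection computation

# Build a lazy task graph over several "files" and combine the partials.
partials = [process_file(n) for n in (10, 20, 30)]
total = delayed(sum)(partials)

# .compute() runs on a local scheduler by default; attaching a
# dask.distributed Client (backed by an HPC batch system, an HTCondor
# pool, or a cloud deployment) runs the same graph at scale.
result = total.compute()
```

The separation between graph construction and execution is what makes the deployment question interesting: the analysis code stays the same whether it runs on a laptop or across a federated grid site.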