The High-Energy Physics (HEP) community faces significant challenges in the upcoming High-Luminosity LHC (HL-LHC) era. The number of interesting particle collisions, or "events", that need to be processed for analysis will increase by about a factor of thirty, resulting in multi-petabyte datasets that should be analyzed as interactively as possible. For this reason, we are evaluating the new, scalable data analysis tools that have recently become available.
Traditionally, HEP has relied on C++-based data analysis tools, but a vibrant Python-based ecosystem has recently emerged. This ecosystem provides a diverse suite of analysis tools in broad use across the scientific community. Many scientists take advantage of Jupyter notebooks, which offer a straightforward way of documenting a data analysis. Column-wise data analysis, in which a single operation on a vector of events replaces serial calculations on individual events, is seen as a way for the field to exploit the vector processing units in modern CPUs and GPUs, leading to significant throughput speed-ups (a sketch of this style follows below). Declarative programming paradigms can also make analysis code simpler to write and easier to reproduce. And as the use of Machine Learning-based methods grows, large-scale training of models on distributed systems becomes a key requirement.
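As a minimal illustration of the column-wise style, here is a NumPy sketch with toy data; the column names and values are made up and stand in for real event columns:

```python
import numpy as np

# Hypothetical columns for a batch of events: transverse momenta (GeV)
# of two muons per event. Toy data for illustration only.
rng = np.random.default_rng(42)
n_events = 1_000_000
pt1 = rng.exponential(scale=30.0, size=n_events)
pt2 = rng.exponential(scale=20.0, size=n_events)

# Event-loop style: one calculation per event, executed serially.
def pt_sum_loop(a, b):
    out = np.empty(len(a))
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out

# Column-wise style: a single vectorized operation over all events,
# letting NumPy use the CPU's vector processing units.
pt_sum = pt1 + pt2

# Event selections become boolean masks instead of per-event if-statements.
selected = pt_sum[pt_sum > 50.0]
```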
Recently, the HEP community has started to explore Dask as a modern, flexible library for parallel computing that integrates easily with the Python ecosystem and can be embedded in physics data analysis frameworks. Another point of interest is probing the scalability limits of Dask compared to the traditional batch approach. During this workshop, we would like to discuss upcoming challenges and different Dask use cases in our community, as well as deployment methods across HPC, on-premise high-throughput clusters, and cloud providers. A key challenge is the integration of interactive scale-out systems within the existing federated scientific grid computing infrastructure.
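As a minimal sketch of one such deployment, assuming a dask-jobqueue HTCondorCluster on an on-premise high-throughput pool (the resource requests, worker count, and array sizes below are illustrative, not recommendations):

```python
import dask.array as da
from dask.distributed import Client
from dask_jobqueue import HTCondorCluster  # for on-premise HTCondor pools

# Hypothetical resource requests; tune to the local cluster's policies.
cluster = HTCondorCluster(cores=1, memory="2GB", disk="1GB")
cluster.scale(50)          # ask the batch system for 50 worker jobs
client = Client(cluster)   # attach a scheduler client, e.g. from Jupyter

# Stand-in for reading real event columns: a large chunked array that
# Dask partitions across the workers.
pt = da.random.exponential(scale=30.0, size=1_000_000_000,
                           chunks=10_000_000)

# The same column-wise idioms as NumPy, now built lazily and executed
# in parallel; compute() triggers the distributed execution.
mean_selected = pt[pt > 50.0].mean().compute()
```

The same analysis code runs unchanged against other cluster backends (e.g. a local machine or a cloud-based deployment); only the cluster object changes, which is part of Dask's appeal for the heterogeneous facilities discussed above.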