Astronomers are learning to use Dask to analyze terabyte- to petabyte-scale data from the upcoming Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST) to unlock the mysteries of the dark energy driving the accelerated expansion of the Universe. Anticipating and balancing memory usage during interactive analyses across a high-performance computing center is proving challenging.
The Vera C. Rubin Observatory Legacy Survey of Space and Time Dark Energy Science Collaboration (LSST DESC) seeks to unlock the mysteries of the accelerated expansion of the Universe. To do this we need to understand large datasets and analyze them both for the scientific signal and for systematic noise. We are exploring Dask to help analyze the terabyte- to petabyte-scale catalogs that will be created by the 10-year LSST survey of half the sky, scheduled to start in 2023. DESC is currently working on a Data Challenge to prepare our systems, analyses, and thinking for the upcoming onslaught of data. This talk will describe how we are using Dask to analyze the Data Challenge and how we are planning for and identifying our future needs.
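To give a flavor of the workflow, the following is a minimal sketch of a catalog analysis with dask.dataframe; the file path and column names are illustrative placeholders, not DESC's actual data model or pipeline.

```python
import dask
import dask.dataframe as dd

# Lazily read a sharded object catalog stored as Parquet; nothing is
# loaded into memory until compute() is called.
catalog = dd.read_parquet(
    "/path/to/object_catalog/*.parquet",          # hypothetical path
    columns=["tract", "ra", "dec", "mag_i"],      # hypothetical columns
)

# Example analysis: object counts and median i-band magnitude per tract.
counts = catalog.groupby("tract").ra.count()
median_mag = catalog.groupby("tract").mag_i.apply(
    lambda s: s.median(), meta=("mag_i", "f8"))

counts, median_mag = dask.compute(counts, median_mag)
```

The same lazy graph scales from a laptop-sized subset to the full catalog on a multi-node cluster, which is what makes Dask attractive for the Data Challenge.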
Status: We have found that Dask provides significant speedups when running on large datasets across many nodes, but memory usage is difficult for most users to predict, and planning is required for users to correctly request the needed computational resources.

Problem: We seek to efficiently enable analyses by hundreds to thousands of individual users in a coordinated way. The data we wish to persist in an analysis is often shared across users, but naive approaches require a separate allocation of independent nodes to hold the data in memory for each user.

Dream: A shared-memory, multi-user system that can still use the power of Dask to efficiently calculate and persist more specialized per-user analyses would dramatically increase our ability to provide full Dask access to the entire dataset.
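The sketch below illustrates the current per-user pattern and the memory planning it requires, assuming a SLURM-based HPC center and dask_jobqueue; the cluster parameters and paths are illustrative, not DESC's actual configuration.

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
import dask.dataframe as dd

# Each user must estimate memory needs up front when requesting resources.
cluster = SLURMCluster(
    cores=32,              # cores per batch job (assumed value)
    memory="128GB",        # memory per batch job, estimated in advance
    walltime="04:00:00",
    queue="regular",       # hypothetical queue name
)
cluster.scale(jobs=8)      # 8 jobs -> roughly 1 TB of aggregate worker memory
client = Client(cluster)

# The persisted catalog is pinned in this user's workers only. Another
# user running the same analysis needs a second, independent allocation
# holding an identical copy of the data -- the core of the problem above.
catalog = dd.read_parquet("/path/to/object_catalog/*.parquet")
catalog = catalog.persist()
```

A shared, multi-user deployment would let many users point their Dask computations at a single persisted copy of the common data, while still persisting their own specialized intermediate results.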