After a brief introduction to particle physics datasets, I will discuss how we adapted the scientific Python ecosystem to efficiently and conveniently process these data, from ingestion, through manipulation with novel array programming techniques, to reduction using histograms. I will then describe how we currently scale our processing with dask and how we would like to improve our solution.
In this talk, I will briefly introduce the science domain (high-energy particle physics) and the structure of the processed "analysis-level" datasets that we most frequently interact with. I will then discuss how we adapted the scientific Python ecosystem to process these data efficiently and conveniently, tracing the processing path from ingestion of our custom binary data format with the uproot library; through in-memory manipulation using the novel array-programming techniques of the general-purpose awkward-array library and the domain-specific algorithms of the coffea library; to reduction of the data and construction of the models from which we extract physical quantities, which relies on extensive use of multidimensional histograms built with the hist library. Next, I will discuss our current solution for scaling this processing with dask, including how we distribute user code, how we divide the work, how we lazily load data columns during processing, and how we access existing compute resources. Lastly, I will discuss how we would like to improve our integration with dask.
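As a concrete, if greatly simplified, illustration of that processing path, the sketch below strings the core libraries together on a toy input. The file name, tree name, and Muon_* branch names are placeholders, and coffea's higher-level conveniences are omitted for brevity; this is not the analysis code discussed in the talk.

    import awkward as ak
    import hist
    import uproot

    # Ingest two jagged columns from a placeholder file/tree (names are illustrative).
    tree = uproot.open("events.root")["Events"]
    muons = ak.zip(
        {
            "pt": tree["Muon_pt"].array(),
            "eta": tree["Muon_eta"].array(),
        }
    )

    # Array-programming selection on jagged data: keep muons with pT > 20 GeV,
    # then keep only events that still contain at least two such muons.
    good = muons[muons.pt > 20]
    dimuon_events = good[ak.num(good, axis=1) >= 2]

    # Reduce to a 2D histogram of the first selected muon's kinematics per event.
    h = (
        hist.Hist.new
        .Reg(50, 0, 200, name="pt", label="muon pT [GeV]")
        .Reg(50, -2.5, 2.5, name="eta", label="muon eta")
        .Double()
    )
    first = dimuon_events[:, 0]
    h.fill(pt=ak.to_numpy(first.pt), eta=ak.to_numpy(first.eta))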
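The dask-based scale-out can likewise be sketched, purely schematically and with placeholder names, as mapping a per-file version of that pipeline over an input file list with dask.distributed and merging the partial histograms. The actual scheme discussed in the talk additionally distributes user code, divides work into chunks finer than whole files, and loads columns lazily; none of that is shown here.

    import functools
    import operator

    import awkward as ak
    import hist
    import uproot
    from dask.distributed import Client


    def process_file(path):
        """Open one input file, apply a trivial selection, return a partial histogram."""
        # Tree/branch names are placeholders; a real task would read many columns.
        pt = uproot.open(path)["Events"]["Muon_pt"].array()
        h = hist.Hist.new.Reg(50, 0, 200, name="pt", label="muon pT [GeV]").Double()
        h.fill(pt=ak.to_numpy(ak.flatten(pt[pt > 20])))
        return h


    if __name__ == "__main__":
        files = ["file1.root", "file2.root"]       # placeholder input list
        client = Client("tcp://scheduler:8786")    # hypothetical existing cluster
        futures = client.map(process_file, files)  # one task per input file
        partials = client.gather(futures)          # collect the partial histograms
        total = functools.reduce(operator.add, partials)  # merge into the final result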