We share challenges encountered while deploying Dask with JupyterHub on HPC resources for the needs of the high-energy physics community at the University of Nebraska-Lincoln. We describe how we combine multiple ways of launching resources, allowing Dask to submit workers directly to a batch system, and we investigate improvements in network connectivity that allow scaling to large numbers of simultaneous facility users.
High-energy physics (HEP) experiments, such as those at the Large Hadron Collider, collect many statistically independent “events”, each of which represents the information from a single particle collision. Data analysis in HEP has traditionally relied on batch systems and event loops: users are given a non-interactive interface to computing resources and consider the data event by event. We have been developing a prototype analysis facility that provides users with alternate mechanisms to access computing resources and enables access to the Python ecosystem and to novel programming paradigms that treat the collection of events with “big data” techniques. We designed the analysis facility to build on a combination of Nebraska's multi-tenant Kubernetes resource and the available HTCondor pool.

As a strategy for launching resources, we rely on the “dask-jobqueue” mechanisms that allow Dask to submit workers directly to the batch system; this feature provides users with resources from the much larger Nebraska computing facility instead of limiting them to a fixed-size analysis pod. Together with the dedicated worker, this provides an interactive experience on top of batch resources. In this mode, Dask can easily auto-scale, acquiring more resources based on the current task queue length and shutting down unnecessary workers when all tasks complete; a sketch of both steps is given below. Another solution of interest to us is the Dask labextension for JupyterLab, which integrates Dask cluster controls into the JupyterLab web UI and simplifies management of the resources. We are testing both approaches and have identified possible bottlenecks for our use case.

In our deployment, Dask's data transport is secured with TLS by default, allowing workers to join securely from separate networks and external clients to connect directly to the scheduler. We also expose the scheduler to the Internet for users who prefer to connect to the Dask cluster directly from a Python session on their laptop, instead of using the notebook-based interface; such a connection is sketched below. Because Dask uses TLS, we can leverage SNI (Server Name Indication) to uniquely identify the requested hostname and employ a TLS proxy such as Traefik to route requests from outside the analysis facility to the correct Dask cluster. This lets all Dask clusters in the analysis facility share the same IP address and port instead of requiring a unique IP per Dask scheduler; an illustrative proxy configuration is shown below. One of the network challenges was to allow the scheduler to preserve worker hostnames, as opposed to the current default behavior of resolving hostnames immediately to IP addresses. All of these features were tested by a group of initial users, who provided feedback and identified potential areas for further development.
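To illustrate the dask-jobqueue approach, the following minimal sketch shows how a Dask cluster can submit workers as HTCondor jobs and auto-scale with the task queue. The resource requests and adaptive bounds here are hypothetical placeholders, not our production settings.

```python
from dask_jobqueue import HTCondorCluster
from dask.distributed import Client

# Each Dask worker is submitted to the pool as a regular HTCondor job.
# The resource requests below are illustrative, not facility defaults.
cluster = HTCondorCluster(
    cores=4,
    memory="8 GB",
    disk="4 GB",
)

# Adaptive scaling: the scheduler requests more workers while the task
# queue grows and retires idle workers when all tasks complete.
cluster.adapt(minimum=0, maximum=50)

client = Client(cluster)
```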
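For the externally exposed scheduler, a user on their laptop could connect over TLS roughly as follows, using Dask's standard `Security` object. The certificate paths and scheduler hostname are assumptions for illustration.

```python
from dask.distributed import Client
from distributed.security import Security

# Credentials issued by the facility; the file paths are hypothetical.
security = Security(
    tls_ca_file="ca.pem",
    tls_client_cert="usercert.pem",
    tls_client_key="userkey.pem",
    require_encryption=True,
)

# The tls:// scheme wraps the connection in TLS; the hostname is what
# the TLS proxy inspects via SNI to pick the right Dask cluster.
client = Client("tls://user1.dask.example.edu:8786", security=security)
```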
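The SNI-based routing could be expressed in a Traefik dynamic configuration along these lines. This is a sketch in Traefik v2 syntax with hypothetical hostnames and addresses, not our actual facility configuration; `passthrough` keeps TLS termination at the Dask scheduler so the proxy only inspects the SNI field.

```yaml
# Route incoming TLS connections by SNI hostname to the matching Dask scheduler.
tcp:
  routers:
    dask-user1:
      entryPoints:
        - dask                 # single shared IP/port for all Dask clusters
      rule: "HostSNI(`user1.dask.example.edu`)"
      service: dask-user1
      tls:
        passthrough: true      # TLS terminates at the scheduler, not the proxy
  services:
    dask-user1:
      loadBalancer:
        servers:
          - address: "10.0.12.34:8786"   # this user's scheduler inside the facility
```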