21 May 16:00 – 21 May 16:30 in Talks

100GB/s GPU log analytics at Graphistry: A case study and production lessons on tuning dask-cudf

Leo Meyerovich

Audience level:


Going from pandas or cudf to dask-cudf can unlock big and latency-sensitive analytics workloads... if done right. However, dask-cudf is quite new and multi-GPU computing faces NUMA hazards. This talk shares our experience with dask-cudf from two perspectives: A case study in tackling 100 GB/s for extracting an identity graph from big logs, and our top lessons in going to production.


From security and fraud to sales and marketing, quickly making sense of the relationships across big log files is important but difficult. The rise of GPU cloud computing has changed what is practical here, so for the last 4 years, Graphistry, a visual computing startup, alongside Nvidia and other startups, have been collaborating on RAPIDS, a GPU computing ecosystem that shares a common core of GPU dataframes. The dask-cudf library is poised to unlock more speed and scale by enabling multi-GPU acceleration and out-of-core processing. However, dask-cudf can be difficult in practice, and one fundamental reason is that non-uniform memory access (NUMA) costs can quickly derail any potential benefits of such an approach. This talk shares Graphistry's lessons in going from in-memory single-GPU log & graph analytics RAPIDS stack (cudf, cugraph, cuml, blazingsql, and our proprietary layers) to multi-GPU and bigger-than-memory processing with dask-cudf.

First, we'll share a case study of tackling Splunk's Boss of the SOC challenge by scaling a Streamlit/Jupyter interacive visual log analyzer to go from handling 2 GB of logs in cudf on 1 GPU, to now handling 300 GB in dask_cudf on 8 GPUs. This was difficult from a performance and architecture perspective. The challenge is handling NUMA scheduling hazards around multi-GPU and bigger-than-memory tasks. We'll cover: * The simple version single GPU version * The simple multi-GPU version * Optimizing out-of-core reads with GDS * Optimizing out-of-core reads with NUMA-aware parallel IO These are fundamental dask-cudf programming concepts to prevent dask from being slower than non-dask, and ideally, getting close to the performance that your hardware investments are capable of.

Second, from more of a production perspective, we'll share our top 10 lessons on adopting dask-cudf. These are things we generally wish someone would have told us. They range from architecture to debugging to tuning to ops. Many of dask-cudf basics are currently not documented or are under active development, so this talk should help teams smooth their own journey.

We will share code examples in a public open source repository for use after the talk.