Time Zone: UTC

19 May 19:30 – 19 May 20:00 in Talks

NVTabular: Building a Dask-based Library for Recommender-System Data Pipelines

Richard Zamora

Audience level:
Intermediate

Description

NVTabular is a feature-engineering and preprocessing library for tabular data, focused on recommender systems. This talk will describe how NVTabular was built entirely on Dask DataFrame to both simplify and accelerate model-training pipelines. The primary goals are to (1) present a successful example of Dask integration and (2) communicate important lessons learned.

Abstract

Recommender systems are the engine of the modern internet. Top technology companies use these models extensively to rank the products and content most appropriate for a given user. Predictive performance depends strongly on feature-engineering operations that are both compute- and memory-intensive, such as group-based normalization and categorical encoding. The data-preparation requirements are so significant, in fact, that end-to-end model training can easily become dominated by ETL time. Given the huge financial incentives for faster turnaround in model development, high-performance data-handling technologies are in high demand.
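For readers unfamiliar with the two operations named above, the following is a minimal pure-Python sketch of what group-based normalization and categorical encoding compute. It is an illustration only and does not use or represent the NVTabular API; the toy rows, the column names (`user`, `item`, `price`), and the helper functions `group_normalize` and `categorical_encode` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical toy interaction log; column names are illustrative only.
rows = [
    {"user": "a", "item": "x", "price": 10.0},
    {"user": "a", "item": "y", "price": 30.0},
    {"user": "b", "item": "x", "price": 5.0},
    {"user": "b", "item": "z", "price": 5.0},
]


def group_normalize(rows, key, value):
    """Standardize `value` within each `key` group: (v - mean) / std."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r[value])
    # Per-group mean and population standard deviation.
    stats = {k: (mean(v), pstdev(v)) for k, v in groups.items()}
    out = []
    for r in rows:
        mu, sigma = stats[r[key]]
        # A group with zero variance maps to 0.0 to avoid division by zero.
        out.append((r[value] - mu) / sigma if sigma > 0 else 0.0)
    return out


def categorical_encode(values):
    """Map each distinct category to an integer ID in first-seen order."""
    ids = {}
    codes = [ids.setdefault(v, len(ids)) for v in values]
    return codes, ids


normalized = group_normalize(rows, key="user", value="price")
codes, vocab = categorical_encode([r["item"] for r in rows])
```

Both operations need a full pass over the data to gather statistics (group means, the category vocabulary) before any row can be transformed, which is part of why they become expensive on datasets that do not fit in memory.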

This talk will discuss the ongoing development of NVTabular, a scalable Python library for recommender-system data pipelines. To handle large-scale datasets, NVTabular uses Dask to map all data to an internal DataFrame collection. The library is also tightly integrated with cuDF/Dask-cuDF (NVIDIA RAPIDS), enabling GPU acceleration of all IO and transformation operations.

To keep the discussion self-contained, the talk will begin with a brief introduction to recommender-system data preparation and to the general NVTabular API. With this background in place, the remainder of the session will focus on the high- and low-level details of Dask integration. The target audience includes intermediate and advanced Dask users who are interested in walking through the process of building a high-level library on top of an existing collection module. Although some benchmark results will be used to illustrate favorable scaling behavior, the primary goals are to (1) present a successful example of Dask integration and (2) communicate important lessons learned.