20 May 15:00 – 20 May 15:30 in Talks

mlforecast: Scalable machine learning based time series forecasting

José Morales

Audience level:
Novice

Description

mlforecast is a framework for scalable machine learning-based time series forecasting. It performs every step of the process in a distributed way, allowing you to scale to massive amounts of data. dask is used for the parallelism, so you can run it either on a single machine or on a remote cluster.

Abstract

Background

Gradient Boosted Decision Trees (GBDT) can achieve strong performance on time series forecasting, as shown by the M5 competition, where LightGBM was used in some of the best-scoring solutions (1st and 4th place).

Computing lag-based features for training is embarrassingly parallel and fairly straightforward: you partition your dataset by the series id and run the preprocessing on each partition in parallel. However, most of the time you then have to concatenate the partitions back together (which can be expensive) to train a model, and once the model is trained you have to somehow update the features in order to predict the next timestep, which can be hard to do efficiently.
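
As a minimal sketch of that first step (assuming a long-format table with unique_id, ds and y columns, which are just the names used here for illustration), per-series lag features can be computed with a plain pandas groupby; since each series is independent, the same work can be mapped over dask partitions:

```python
import pandas as pd

# Toy long-format data: one row per (series id, timestamp) with the target value.
df = pd.DataFrame({
    "unique_id": ["a"] * 4 + ["b"] * 4,
    "ds": list(pd.date_range("2021-01-01", periods=4)) * 2,
    "y": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

# Lag features only look at past values of the same series, so a groupby is enough;
# each group could just as well live on a different dask partition/worker.
df["lag1"] = df.groupby("unique_id")["y"].shift(1)
df["expanding_mean_lag1"] = (
    df.groupby("unique_id")["y"]
    .transform(lambda y: y.shift(1).expanding().mean())
)
```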

mlforecast approach

mlforecast uses dask.dataframe to load the data in partitions, performs the preprocessing independently for each partition, trains a single model (either XGBoost or LightGBM) in a distributed way using all the data, then sends the trained model to every worker and produces the forecasts independently for each partition. This avoids data movement; the only step that requires communication between workers is training, so it scales very well.
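
The snippet below is a rough illustration of that pattern using plain dask and LightGBM's dask interface, not mlforecast's actual internals; the file path, column names and single lag feature are assumptions made for the example:

```python
import dask.dataframe as dd
from dask.distributed import Client
from lightgbm.dask import DaskLGBMRegressor


def add_lags(part):
    # Runs on a single partition; series never cross partitions,
    # so this step needs no communication between workers.
    part = part.sort_values(["unique_id", "ds"])
    part["lag1"] = part.groupby("unique_id")["y"].shift(1)
    return part


if __name__ == "__main__":
    client = Client()  # local cluster here; point it at a remote scheduler to scale out
    series = dd.read_parquet("series/")  # hypothetical path, partitioned by series id
    train = series.map_partitions(add_lags).dropna()
    model = DaskLGBMRegressor(n_estimators=100)
    model.fit(train[["lag1"]], train["y"])  # the only step that moves data between workers
```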

The required inputs from the user are:

- the data in long format, with one row per series id and timestamp,
- the lag-based features/transformations to compute, and
- the model to train (XGBoost or LightGBM).

For those who don't have much experience with dask, mlforecast provides a CLI which can be used to run experiments using configuration files, so you can get away with only knowing pandas to set up your data in the correct format and let mlforecast do all the heavy lifting.
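
For example, starting from a wide table with one column per series, a few pandas calls are enough to produce the long format and write it partitioned by series id so dask can read it in parallel (the column names and output path below are illustrative assumptions):

```python
import pandas as pd

# Hypothetical wide table: one column per store, one row per date.
wide = pd.DataFrame(
    {"store_1": [10.0, 12.0, 13.0], "store_2": [5.0, 7.0, 6.0]},
    index=pd.date_range("2021-01-01", periods=3, name="ds"),
)

# Reshape to long format: one row per (series id, timestamp, target).
long = (
    wide.reset_index()
    .melt(id_vars="ds", var_name="unique_id", value_name="y")
    .sort_values(["unique_id", "ds"])
)

# Write it partitioned by series id (requires pyarrow).
long.to_parquet("series/", partition_cols=["unique_id"])
```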

For users who want more control, for example to apply custom transformations, there is also a programmatic API, which is just as easy to use. The lag-based transformations are defined through numba-compiled functions that take an array as input and return an array of the same size containing the transformed values.
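
A minimal example of such a transformation (a sketch of the kind of function described, not necessarily one shipped with the library) is an expanding mean compiled with numba:

```python
import numpy as np
from numba import njit


@njit
def expanding_mean(x):
    # Same contract as described above: take an array,
    # return an array of the same size with the transformed values.
    out = np.empty_like(x)
    total = 0.0
    for i in range(x.size):
        total += x[i]
        out[i] = total / (i + 1)
    return out


expanding_mean(np.array([1.0, 2.0, 3.0, 4.0]))  # -> array([1. , 1.5, 2. , 2.5])
```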

Example

I want to present two examples. The first is a small one: setting up some sample data in the format that mlforecast requires, running it from the CLI, and watching the dask dashboard as it goes to explain the process and show the compute graph.

The second example will be more realistic: training on a real dataset, using the programmatic API to define custom transformations, and running on a remote cluster.