Training machine learning (ML) models can easily take hours. A common hyperparameter when training on big data is the "batch size," the number of training examples used to approximate the gradient. On certain distributed systems, training time can be reduced by increasing the batch size. Our experimental results show that using our wrapper to train a deep network requires less wall-clock time than standard SGD.
Recent empirical and theoretical results provide strong motivation for increasing the batch size: with a larger batch size, fewer model updates are needed to train a model.
At first, that seems like an internal detail. However, if each Dask worker processes a constant number of gradients, the time required for each model update can be made independent of the batch size. The wall-clock time required is then proportional to the number of model updates, not the number of floating point operations.
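To make that argument concrete, here is a minimal sketch of the cost model in Python. It is not the software presented in this talk. It assumes an idealized cluster with no communication overhead, and the function names (`wall_clock_time`, `total_gradients`) and all of the numbers are hypothetical, chosen only for illustration.

```python
import math


def wall_clock_time(n_updates, batch_size, n_workers, seconds_per_gradient=1.0):
    """Idealized wall-clock time (assumption: no communication overhead).

    Each model update waits for the busiest worker, which computes
    ceil(batch_size / n_workers) gradients.
    """
    grads_per_worker = math.ceil(batch_size / n_workers)
    return n_updates * grads_per_worker * seconds_per_gradient


def total_gradients(n_updates, batch_size):
    """Total gradient computations, used here as a proxy for floating point operations."""
    return n_updates * batch_size


# Hypothetical scenario: every worker always computes 32 gradients per update,
# so the number of workers grows with the batch size.
small = {"n_updates": 1000, "batch_size": 32, "n_workers": 1}
large = {"n_updates": 125, "batch_size": 256, "n_workers": 8}

for name, cfg in [("small batch", small), ("large batch", large)]:
    t = wall_clock_time(**cfg)
    work = total_gradients(cfg["n_updates"], cfg["batch_size"])
    print(f"{name}: wall-clock = {t:.0f} s, total gradients = {work}")
```

In this made-up scenario both configurations compute the same total number of gradients, but the large-batch run needs one eighth of the model updates and therefore one eighth of the wall-clock time. How far the number of updates actually falls as the batch size grows depends on the model and optimizer, and real clusters add communication cost that this sketch ignores.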
In this talk, I'll present software that combines those two facts (and also show the details). The main benefit: the wall-clock time required to train a model can be reduced from 120 minutes to 45 minutes. This isn't a free lunch, but luckily it will not require any more floating point operations.