Training machine learning (ML) models can easily take hours. A key hyperparameter in large-scale training is the batch size: the number of training examples used to compute each approximate gradient. On certain distributed systems, wall-clock training time can be reduced by increasing the batch size, since the work of computing a larger batch can be spread across workers. Our experimental results show that training a deep network with our wrapper requires less wall-clock time than standard SGD.
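To make the role of the batch size concrete, the sketch below is a toy mini-batch SGD loop for linear regression, not the wrapper or distributed system evaluated here; the dataset, learning rate, and function names are illustrative assumptions. It shows how the batch size sets how many per-example gradients are averaged into each update.

```python
# Minimal sketch (illustrative only): mini-batch SGD for linear regression,
# where `batch_size` controls how many examples are averaged into each
# gradient estimate. Names (loss_grad, batch_size, lr) are assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                  # toy dataset
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)

def loss_grad(w, Xb, yb):
    """Gradient of mean squared error over one mini-batch."""
    residual = Xb @ w - yb
    return Xb.T @ residual / len(yb)             # average over the batch

w = np.zeros(10)
batch_size, lr = 64, 0.1                         # larger batches -> smoother gradient estimates
for step in range(200):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    w -= lr * loss_grad(w, X[idx], y[idx])
```

With a larger batch, each step's gradient is a better approximation of the full-data gradient, and in a distributed setting the per-example gradients within a batch can be computed in parallel before being averaged.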