Stop burning money on the wrong batch size
Once your training runs become material in terms of wall-clock time and hardware budget, it's time...
With sub-optimal hyperparameters, deep neural networks can take a very long time to converge. For certain datasets and model architectures, they may never converge at all. Finding the right hyperparameters can cost you many training runs and a lot of money. Recently, libraries such as Keras, PyTorch Lightning, and Masterful have added hyperparameter AutoTuners to help engineers improve their models without searching for these values by trial and error. In this post, we compare the performance of these AutoTuners.
Training a deep neural network involves selecting hyperparameters, such as batch size and initial learning rate, which the optimization algorithm uses to learn the model parameters that produce the best accuracy. Picking the right values for these hyperparameters can be very painful. It's an arduous journey of trial and error: you try different values, run training for many epochs, and finally pick the values that achieved the best validation accuracy. This approach is often intractable, because if you start from sub-optimal values, DNNs may get stuck at local minima or saddle points. You can waste a lot of time and money on compute trying to overcome these challenges on the way to the best model accuracy.
The optimal batch size is roughly where the noise and the signal of the gradient are balanced: where the variance of the batch gradient is at the same scale as the true gradient itself. Heuristically, the noise scale measures the variation in the data as seen by the model, that is, the variance of the gradient taken over individual data points. In the previous post, we talked about optimizing batch size automatically and its impact on training results.
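To make that balance concrete, here is a minimal sketch that estimates a simple gradient noise scale from per-example gradients: the sum of the per-coordinate variances divided by the squared norm of the mean gradient. The function name `simple_noise_scale` and the toy gradients are assumptions made for illustration. This is not Masterful's implementation, only the ratio the paragraph above describes.

```python
from statistics import fmean


def simple_noise_scale(per_example_grads):
    """Ratio of gradient noise to gradient signal.

    per_example_grads: list of equal-length lists, one gradient per example.
    Returns (sum of per-coordinate variances) / (squared norm of mean gradient).
    """
    n = len(per_example_grads)
    if n < 2:
        raise ValueError("need at least two per-example gradients")
    dims = len(per_example_grads[0])

    # Mean gradient across examples: the "signal".
    mean_grad = [fmean(g[d] for g in per_example_grads) for d in range(dims)]

    # Unbiased per-coordinate variance, summed over coordinates: the "noise".
    total_var = sum(
        sum((g[d] - mean_grad[d]) ** 2 for g in per_example_grads) / (n - 1)
        for d in range(dims)
    )

    signal = sum(m * m for m in mean_grad)
    if signal == 0.0:
        raise ValueError("mean gradient is zero; noise scale is undefined")
    return total_var / signal


# Toy per-example gradients for a 2-parameter model (made-up numbers).
grads = [[1.0, 0.0], [3.0, 0.0], [1.0, 2.0], [3.0, 2.0]]
noise_scale = simple_noise_scale(grads)
```

Averaging a batch of B independent gradients divides the variance by B, so the batch-gradient variance matches the squared gradient norm when B is about equal to this ratio. In practice the per-example gradients would come from the model partway through training, and the estimate changes as training progresses.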
Initial learning rate
Initial learning rate is tightly coupled with batch size. We find the optimal initial learning rate by employing different techniques based on the batch size, model architecture, and the cardinality of the dataset. In particular, low cardinality datasets can be very tricky when finding the optimal learning rate. We’ll soon write a detailed blog post on our optimal learning rate finder.
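One widely used rule of thumb for this coupling is to scale a reference learning rate with the batch size, either linearly or with its square root. The sketch below shows both. It is a generic heuristic with hypothetical names and reference values, not the technique Masterful's learning rate finder uses.

```python
import math


def scale_learning_rate(base_lr, base_batch_size, batch_size, rule="linear"):
    """Scale a reference learning rate to a new batch size."""
    if base_lr <= 0 or base_batch_size <= 0 or batch_size <= 0:
        raise ValueError("learning rate and batch sizes must be positive")
    ratio = batch_size / base_batch_size
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule!r}")


# Hypothetical reference point: learning rate 0.1 tuned at batch size 256.
lr_linear = scale_learning_rate(0.1, 256, 1024)        # batch is 4x larger
lr_sqrt = scale_learning_rate(0.1, 256, 1024, "sqrt")  # sqrt(4) = 2x
```

Either rule only gives a starting point. It says nothing about the model architecture or the size of the dataset, which is why a dedicated learning rate finder is still useful.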
We use the same model architecture and dataset across different AutoTuners. To isolate the impact of batch size and learning rate on the training results, first we run the AutoTuners to predict these hyperparameter values, and we then run training on the respective library.
We will use the SVHN (Street View House Numbers) dataset, a digit recognition dataset built from real-world images of house numbers. Images are cropped to 32x32. The dataset consists of 73,257 images for training and 26,032 for testing, but we only use 5% of the dataset to keep the training runtime short.
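As a rough sketch of how such a 5% subset could be drawn reproducibly, the snippet below samples example indices with a fixed seed. The post does not say how the subset was chosen, so uniform random sampling, the seed, and the helper name `sample_indices` are all assumptions.

```python
import random

TRAIN_SIZE = 73_257   # SVHN training images (from the text)
TEST_SIZE = 26_032    # SVHN test images (from the text)
FRACTION = 0.05


def sample_indices(dataset_size, fraction, seed=0):
    """Return a sorted, reproducible random subset of example indices."""
    k = int(dataset_size * fraction)
    rng = random.Random(seed)
    return sorted(rng.sample(range(dataset_size), k))


train_idx = sample_indices(TRAIN_SIZE, FRACTION)
test_idx = sample_indices(TEST_SIZE, FRACTION)
```

With these sizes the subset holds 3,662 training and 1,301 test images. Reusing the same indices for every AutoTuner keeps the comparison on identical data.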
We will use a model inspired by the simple CNN used in this TensorFlow tutorial. This is a toy model for demonstration purposes only, and should not be used in a production environment. It has a few convolutional layers and outputs logits directly, rather than using a softmax layer at the end. Below is the model summary in TensorFlow (left) and PyTorch (right). We run both KerasTuner and Masterful on the TensorFlow model.
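Because the model outputs logits, the loss has to apply the softmax itself. In Keras that means constructing the cross-entropy loss with `from_logits=True`, while PyTorch's `nn.CrossEntropyLoss` already expects logits. The library-free sketch below shows what such a loss computes for a single example. The function names and the example logits are illustrative assumptions, not code from the post.

```python
import math


def cross_entropy_from_logits(logits, label):
    """Softmax cross-entropy for one example, computed from raw logits."""
    # Subtract the max logit so exp() cannot overflow (log-sum-exp trick).
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # Equals -log(softmax(logits)[label]).
    return log_sum_exp - logits[label]


def predict(logits):
    """Softmax preserves the argmax, so predictions need no softmax."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Made-up logits for a 10-class digit classifier.
logits = [0.0] * 10
logits[3] = 1000.0  # a naive math.exp(1000.0) would raise OverflowError
loss = cross_entropy_from_logits(logits, 3)
pred = predict(logits)
```

Working from logits lets the loss use the log-sum-exp form, which stays numerically stable even for extreme values like the one above.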
Callbacks to apply during training
We set the following callbacks during training for all the AutoTuners across different libraries
We compare different AutoTuners based on the amount of time taken to find the optimal hyperparameter values, training time to run 100 epochs (with above mentioned callbacks), and accuracy on Training and Testing datasets.
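The measurements themselves are simple to reproduce. Below is a minimal sketch of a wall-clock timer and an accuracy function. `fake_tuner` is a hypothetical stand-in for a real AutoTuner call, and the hyperparameter values it returns are placeholders, not results from this comparison.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical stand-in for a tuner; the returned values are placeholders.
def fake_tuner():
    return {"batch_size": 64, "learning_rate": 1e-3}


hparams, tune_seconds = timed(fake_tuner)
test_acc = accuracy([1, 7, 3, 3], [1, 7, 3, 9])
```

Wrapping both the tuning call and the training call in `timed` keeps tuning time and training time separate, the same split used in the comparison.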
In this blog post, we are only demonstrating Masterful's automatic hyperparameter value selection, which is just part of our platform, and the performance benefit we achieve from the right hyperparameter values alone in terms of training time and accuracy. We find Masterful achieves significantly better accuracy on both training and test datasets when trained using its auto-tuned batch size and learning rate.
The Masterful platform learns how to optimally train your model by focusing on five core organizational principles in deep learning: architecture, data, optimization, regularization, and semi-supervision. Masterful's metalearner for optimization hyperparameters is tailored to your model architecture and data. Check out our Optimization Guide and sample code to try it yourself. You can also use our platform to train your model end-to-end - here are the full docs.
We’d love to hear what you think about this post! Please reach out at firstname.lastname@example.org, ping us on Twitter, or join the Masterful Slack community to continue the discussion.
Machine Learning Engineer and Researcher, Masterful AI