
Hyperparameters that can save your AWS bills

Nikhil Gajendrakumar


With sub-optimal hyperparameters, deep neural networks can take a very long time to converge. For certain datasets and model architectures, they may never converge at all. Finding the right hyperparameters can cost you many training runs and a lot of money. Recently, libraries such as Keras, PyTorch Lightning, and Masterful have incorporated hyperparameter AutoTuners into their platforms to help engineers improve their models without having to find the right values by trial and error. In this post, we compare the performance of these AutoTuners.




Training a deep neural network involves selecting hyperparameters, such as batch size and initial learning rate, which the optimization algorithm uses to learn the model parameters that produce the best accuracy. Picking the right values for these hyperparameters can be very painful. It's an arduous process of trial and error: you try different values, run the training for many epochs, and finally pick the values that achieved the best validation accuracy. This approach is often intractable because, if we start with sub-optimal values for these hyperparameters, DNNs may get stuck at local minima or saddle points. You can waste a lot of time and money on compute trying to overcome these challenges to achieve the best model accuracy.

How do we find the optimal values for hyperparameters?

Batch size

The optimal batch size is roughly where the noise and signal of the gradient are balanced, that is, where the variance of the batch gradient is at the same scale as the true gradient itself. Heuristically, the noise scale measures the variation in the data as seen by the model: it is the variance of the gradient taken over individual data points. In a previous post, we discussed optimizing batch size automatically and its impact on training results.
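As a rough illustration of that balance, the sketch below estimates a simple noise scale: the per-coordinate variance of per-example gradients, summed over coordinates, divided by the squared norm of the mean gradient. The function name and the toy gradients are assumptions made for this example, and this is not necessarily how Masterful selects a batch size.

```python
def simple_noise_scale(per_example_grads):
    """Summed per-coordinate gradient variance divided by squared mean-gradient norm."""
    n = len(per_example_grads)
    if n < 2:
        raise ValueError("need at least two per-example gradients")
    dim = len(per_example_grads[0])

    # Mean gradient: a stand-in for the "true" gradient (the signal).
    mean = [sum(g[j] for g in per_example_grads) / n for j in range(dim)]

    # Unbiased variance per coordinate, summed over coordinates (the noise).
    noise = sum(
        sum((g[j] - mean[j]) ** 2 for g in per_example_grads) / (n - 1)
        for j in range(dim)
    )
    signal = sum(m * m for m in mean)
    if signal == 0.0:
        raise ValueError("mean gradient is zero; noise scale is undefined")
    return noise / signal


# Toy per-example gradients for a two-parameter model (illustrative values only).
toy_grads = [[1.0, 0.0], [3.0, 0.0]]
print(simple_noise_scale(toy_grads))  # 0.5
```

A larger value means individual examples disagree more about the gradient direction, so averaging over a larger batch removes more noise.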

Initial learning rate

The initial learning rate is tightly coupled with batch size. We find the optimal initial learning rate by employing different techniques based on the batch size, model architecture, and the cardinality of the dataset. In particular, low-cardinality datasets can be very tricky when finding the optimal learning rate. We’ll soon write a detailed blog post on our optimal learning rate finder.
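To make the coupling concrete, here is one widely used heuristic, the linear scaling rule, which keeps the ratio of learning rate to batch size constant. It is shown only as an illustration; it is not the method used by Masterful's learning rate finder, which this post does not describe, and the function name is made up for this example.

```python
def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Linear scaling heuristic: keep learning_rate / batch_size constant."""
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch_size / base_batch_size


# Doubling the batch size doubles the suggested learning rate.
print(scale_learning_rate(base_lr=0.1, base_batch_size=256, new_batch_size=512))
```

Heuristics like this tend to break down at very large batch sizes and on small datasets, which is one reason a tuner that looks at the model and data can do better.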

Experimental setup

We use the same model architecture and dataset across the different AutoTuners. To isolate the impact of batch size and learning rate on the training results, we first run each AutoTuner to select these hyperparameter values, and then train with those values in the respective library.
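The two-stage procedure can be sketched as a small harness that times tuning and training separately. Everything here is a hypothetical stand-in: `fake_tune` and `fake_train` take the place of a library's tuner and training loop, and the hyperparameter values are placeholders rather than values from our experiments.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def run_experiment(tune, train):
    """Stage 1: the tuner selects hyperparameters. Stage 2: train with them."""
    hparams, tuning_seconds = timed(tune)
    metrics, training_seconds = timed(train, **hparams)
    return {
        "hparams": hparams,
        "tuning_seconds": tuning_seconds,
        "training_seconds": training_seconds,
        "metrics": metrics,
    }


# Hypothetical stand-ins; real versions would wrap each library's tuner and fit loop.
def fake_tune():
    return {"batch_size": 32, "learning_rate": 1e-3}


def fake_train(batch_size, learning_rate):
    return {"batch_size_used": batch_size, "learning_rate_used": learning_rate}


report = run_experiment(fake_tune, fake_train)
print(report["hparams"], report["tuning_seconds"], report["training_seconds"])
```

Keeping the two timers separate is what lets tuning cost and training cost be reported independently.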


We use the SVHN (Street View House Numbers) dataset, a digit recognition dataset built from real-world images of house numbers. Images are cropped to 32x32. The dataset consists of 73,257 images for training and 26,032 for testing, but we use only 5% of the dataset to keep the training runtime short.
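A minimal sketch of drawing a reproducible 5% subset of the training indices follows. The uniform random sampling, the fixed seed, and rounding down are assumptions for illustration; the post does not specify how the subset was chosen.

```python
import random

TRAIN_SIZE = 73_257  # SVHN training images, as stated above
FRACTION = 0.05      # share of the dataset used in this post


def subset_indices(n_examples, fraction, seed=0):
    """Return a sorted, reproducible random sample of example indices."""
    k = int(n_examples * fraction)  # rounds down (an assumption)
    rng = random.Random(seed)       # local generator; global random state is untouched
    return sorted(rng.sample(range(n_examples), k))


train_indices = subset_indices(TRAIN_SIZE, FRACTION)
print(len(train_indices))  # 3662
```

Using the same seed for every AutoTuner run keeps the comparison on identical data.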


Model architecture

We use a model inspired by the simple CNN used in this TensorFlow tutorial. This is a toy model for demonstration purposes only, and should not be used in a production environment. It has a few convolutional layers and outputs logits directly, rather than using a softmax layer at the end. Below are the model summaries in TensorFlow (left) and PyTorch (right). We run both KerasTuner and Masterful on the TensorFlow model.

[Figure: model summaries for keras_model (TensorFlow, left) and plt_model (PyTorch, right)]

Callbacks to apply during training

We set the following callbacks during training for all the AutoTuners across the different libraries:

  • ReduceLROnPlateau : monitor('val_loss'), factor(0.5), patience(8)
  • EarlyStopping: monitor('val_loss'), patience(12)
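To show how these two settings interact, the sketch below replays a validation-loss history through simplified versions of both rules. It is a pure-Python approximation written for this post, not the library implementations: the Keras callbacks of the same names (`ReduceLROnPlateau` and `EarlyStopping` in `tf.keras.callbacks`) have additional options such as `min_delta` and `cooldown` that are ignored here, and the starting learning rate is an arbitrary placeholder.

```python
def simulate_callbacks(val_losses, lr=1e-3, factor=0.5, lr_patience=8, stop_patience=12):
    """Replay val_loss values through simplified plateau and early-stop rules.

    Returns (epochs_run, final_lr).
    """
    best = float("inf")
    lr_wait = 0    # epochs without improvement since the last LR cut
    stop_wait = 0  # epochs without improvement overall
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            lr_wait = 0
            stop_wait = 0
        else:
            lr_wait += 1
            stop_wait += 1
        if lr_wait >= lr_patience:
            lr *= factor  # ReduceLROnPlateau: halve the LR after 8 stale epochs
            lr_wait = 0
        if stop_wait >= stop_patience:
            return epoch, lr  # EarlyStopping: stop after 12 stale epochs
    return len(val_losses), lr


# Synthetic history: three improving epochs, then a flat plateau.
history = [1.0, 0.9, 0.8] + [0.8] * 20
print(simulate_callbacks(history))  # (15, 0.0005)
```

With these patience values, a stalled run gets one learning-rate cut after 8 stale epochs and, if that does not help within 4 more epochs, training stops well short of 100 epochs.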




We compare the AutoTuners on three measures: the time taken to find the optimal hyperparameter values, the training time to run 100 epochs (with the callbacks above), and the accuracy on the training and test datasets.


  1. To find the optimal values for batch size and learning rate, Masterful takes 93% less time than PyTorch Lightning Tuner and 95% less time than KerasTuner.
  2. To train with auto-tuned hyperparameters for 100 epochs (with ReduceLROnPlateau and EarlyStopping callbacks), Masterful takes 71% less time than PyTorch Lightning Tuner and 98% less time than KerasTuner.
  3. Training accuracy: When trained with its respective auto-tuned hyperparameters, Masterful achieves 79% better accuracy than PyTorch Lightning Tuner and 48% better accuracy than KerasTuner on the training dataset.
  4. Test accuracy: When trained with its respective auto-tuned hyperparameters, Masterful achieves 40% better accuracy than PyTorch Lightning Tuner and 27% better accuracy than KerasTuner on the test dataset.
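For readers who want to compute comparisons like these on their own runs, the helpers below show the arithmetic. Reading "X% better" as a relative change against the baseline is an assumption, and the inputs are made-up numbers chosen to keep the arithmetic simple, not measurements from this post.

```python
def percent_less(baseline, value):
    """Relative reduction of value against baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


def percent_better(baseline, value):
    """Relative improvement of value over baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


# Made-up inputs: 60 minutes vs. 15 minutes, and 0.50 vs. 0.75 accuracy.
print(percent_less(60.0, 15.0))    # 75.0
print(percent_better(0.50, 0.75))  # 50.0
```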

In this blog post, we demonstrate only Masterful's automatic hyperparameter value selection, which is just one part of our platform, and the benefit in training time and accuracy that comes from the right hyperparameter values alone. We find that Masterful achieves significantly better accuracy on both the training and test datasets when trained using its auto-tuned batch size and learning rate.

Get started

The Masterful platform learns how to optimally train your model by focusing on five core organizational principles in deep learning: architecture, data, optimization, regularization, and semi-supervision. Masterful's metalearner tailors the optimization hyperparameters to your model architecture and data. Check out our Optimization Guide and sample code to try it yourself. You can also use our platform to train your model end-to-end; here are the full docs.

Join our community

We’d love to hear what you think about this post! Please reach out, ping us on Twitter, or join the Masterful Slack community to continue the discussion.
