Once your training runs become material in terms of wall-clock time and hardware budget, it's time to look at improving your batch size. If your batch size is too small, your training runs are taking longer than necessary and you are wasting money. And if your batch size is too large, you are training with more expensive hardware than you need.

Training time and compute cost are primarily determined by the number of optimization steps and the number of training examples processed, respectively (Source: Figure 1 in the paper “An Empirical Model of Large-Batch Training”).

*How do we find the right batch size?*

Finding the right batch size by trial and error is nearly intractable because it depends on the cardinality and entropy of the dataset, the model architecture, GPU specifications and learning rate.

Instead of trial and error, we have built an automatic learner for the optimal batch size based on the gradient noise scale and hardware analysis. This approach boils down dataset characteristics, model architecture, and GPU hardware into a single number. Later in the post we’ll show an example where training with an optimized batch size achieved equivalent results in **75% less wall clock time than the baseline**.

Before we learn about the gradient noise scale, let's understand what batch size is and why it is important. Batch size refers to the number of training examples utilized in one iteration of neural network training.

During training, our goal is to find the best parameters `\theta` of a neural network. "Best" means that these parameters will yield the minimum loss value, calculated by the loss function `J(\theta)`. The typical loss function for classification, for example, is cross entropy.

Gradient

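During training, we use back-propagation to compute the gradient of the loss with respect to the parameters `\theta`, which gives both the direction and the relative magnitude of the change we should make. The gradient itself points in the direction of steepest ascent of the loss, so we move the parameters in the opposite direction, which is the direction of steepest descent.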
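As a quick sanity check of the direction convention, here is a minimal sketch on a made-up two-parameter loss. The functions `loss` and `gradient` are hypothetical and chosen only because the gradient is easy to write by hand. A small step against the gradient lowers the loss, and the same step along the gradient raises it.

```python
import numpy as np

def loss(theta):
    """Hypothetical loss J(theta) = theta_0^2 + 3 * theta_1^2."""
    return theta[0] ** 2 + 3.0 * theta[1] ** 2

def gradient(theta):
    """Analytic gradient of the hypothetical loss above."""
    return np.array([2.0 * theta[0], 6.0 * theta[1]])

theta = np.array([1.0, 1.0])
g = gradient(theta)
step = 0.01

loss_down = loss(theta - step * g)  # move against the gradient
loss_up = loss(theta + step * g)    # move along the gradient
print(loss(theta), loss_down, loss_up)
```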

Batch size

The gradient for each parameter `\theta` could be calculated as the average of the per-example gradients over each and every example in the training data. But generally in deep learning, gradients are not calculated across the entire training dataset. Instead, batches (also known as mini-batches) of data of a fixed cardinality, such as 16, are generated, and gradients and updates are calculated as the average across a batch.

Optimizer

An optimizer translates the gradient into a specific update. The simplest optimizer is Stochastic Gradient Descent (SGD), which simply multiplies the gradient by a learning rate to generate the update. Other optimizers like RMSProp, Adam, and even LARS and LAMB include some statistical information about the prior gradients to dynamically adjust the update for faster convergence.
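The plain SGD update described above can be sketched in a few lines of NumPy. The helper name `sgd_step` and the example values are illustrative assumptions, not part of any library.

```python
import numpy as np

def sgd_step(theta, batch_gradient, learning_rate):
    """Return the parameters after one plain SGD update."""
    # Move against the gradient, scaled by the learning rate.
    return theta - learning_rate * batch_gradient

theta = np.array([1.0, -2.0])
batch_gradient = np.array([0.5, -1.0])
new_theta = sgd_step(theta, batch_gradient, learning_rate=0.1)
print(new_theta)
```

Optimizers such as Adam or LAMB replace the single line inside `sgd_step` with an update that also depends on running statistics of earlier gradients.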

A crucial implementation detail is that the batches must be selected randomly. Unfortunately, the shuffle method of TensorFlow's Dataset API only shuffles within a fixed-size buffer, so it is insufficiently random whenever that buffer is smaller than the dataset. We will discuss that in a future post.

*True Gradient*

The best gradient possible would be one that averages the gradients from each and every training example. This gradient would factor in all possible information. Let's call this the True Gradient.

*Batch Gradient*

When calculating the gradient for each parameter `\theta` as the average of just a batch of data, rather than the full training data, we are in some sense approximating the true gradient. Let's call this the Batch Gradient.

*Small batch size vs large batch size*

When the batch size is very small, the batch gradients will have very high variance, and the resulting update will be noisy and stochastic. Doubling the batch size will smooth out the noise, resulting in a Batch Gradient that is a better approximation of the True Gradient. By contrast, when the batch size is very large, the Batch Gradient will almost exactly match the True Gradient, and correspondingly, two randomly sampled batches will have almost the same gradient. As a result, doubling the batch size will barely improve the update – we will use twice as much computation for little gain.
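To make the effect of batch size on gradient noise concrete, the following sketch uses a toy least-squares problem (an assumption made purely for illustration, not the setup used in this post). It measures how far a randomly sampled Batch Gradient lands from the True Gradient for several batch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: per-example loss is 0.5 * (x @ theta - y) ** 2.
n_examples, n_params = 4096, 5
X = rng.normal(size=(n_examples, n_params))
y = X @ rng.normal(size=n_params) + rng.normal(size=n_examples)
theta = np.zeros(n_params)

# Per-example gradients, shape (n_examples, n_params).
per_example_grads = (X @ theta - y)[:, None] * X
true_gradient = per_example_grads.mean(axis=0)

def batch_gradient_error(batch_size, n_trials=2000):
    """Mean squared distance between a random Batch Gradient and the True Gradient."""
    errors = []
    for _ in range(n_trials):
        # Sample a batch uniformly at random, with replacement.
        idx = rng.integers(0, n_examples, size=batch_size)
        batch_gradient = per_example_grads[idx].mean(axis=0)
        errors.append(np.sum((batch_gradient - true_gradient) ** 2))
    return float(np.mean(errors))

errors = {b: batch_gradient_error(b) for b in (8, 16, 32, 64)}
for b, err in errors.items():
    print(f"batch size {b:3d}: mean squared error {err:.4f}")
```

In this toy setting the error shrinks roughly in proportion to `1 / batch_size`. Once the error is already small compared with the squared norm of the True Gradient, halving it again changes the update very little, which is the diminishing-returns regime described above.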

*Intuition*

The transition between the first regime (where increasing the batch size leads to almost perfectly linear speedups) and the second regime (where increasing the batch size mostly wastes computation) should occur roughly where the noise and signal of the gradient are balanced, that is, where the variance of the gradient is at the same scale as the gradient itself.

Comparison between large and small batch training (Source: Figure 2 in the paper “An Empirical Model of Large-Batch Training”)

In the paper “An Empirical Model of Large-Batch Training”, the authors distilled the above intuition into a specific algorithm and demonstrated its usefulness across many domains and applications, including MNIST, SVHN, CIFAR10, ImageNet, and Billion Word. It starts by calculating a statistic called the **gradient noise scale.**

*Loss function*

Formally, the loss function is given by an average over a distribution `\rho(x)` over data points `x`. Each data point `x` has an associated loss function `L_x(\theta)`, and the full loss is given by `L(\theta) = \mathbb{E}_{x \sim \rho}[L_x(\theta)]`.

To train the model, we would like to minimize `L(\theta)` using the True Gradient `G(\theta) = \nabla L(\theta)`. However, as discussed above, calculating the True Gradient is time consuming since it requires analyzing every data point. Instead, we obtain an estimate of the True Gradient, called the Batch Gradient, by averaging over a batch, that is, a sample of fixed cardinality drawn from `\rho`.
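Written out in the notation of “An Empirical Model of Large-Batch Training”, the Batch Gradient for a batch of size `B` is the average of the per-example gradients. The symbol `G_{est}` for the estimate is taken from the paper rather than from this post:

\[G_{est}(\theta) = \frac{1}{B} \sum_{i=1}^{B} \nabla_\theta L_{x_i}(\theta), \qquad x_i \sim \rho\]

Because the `x_i` are independent samples from `\rho`, the estimate is unbiased and its covariance shrinks in proportion to `1/B`:

\[\mathbb{E}\left[G_{est}(\theta)\right] = G(\theta), \qquad \mathrm{cov}\left(G_{est}(\theta)\right) = \frac{1}{B} \Sigma(\theta)\]

Here `\Sigma(\theta)` is the covariance matrix of the per-example gradient `\nabla_\theta L_x(\theta)` under `x \sim \rho`.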

*Gradient as function of the batch size*

We are interested in how useful the gradient is for optimization purposes as a function of the batch size `B`. During stochastic gradient descent, weight updates are performed using the Batch Gradient. If the variance in the Batch Gradient between batches is high, then we should try to improve the accuracy of the stochastic gradient by increasing the batch size.

*Gradient noise scale*

\[B_{simple} = \frac{\mathrm{tr}(\Sigma)}{|G|^2} = \frac{\sum_{i=1}^{n} \mathrm{Var}(G_i)}{|G|^2}\]

This says that the gradient noise scale is equal to the sum of the variances of the individual gradient components, divided by the squared global norm of the gradient: essentially a measure of how large the variance of the gradient is compared to the gradient itself. Here `G` is the True Gradient and `\Sigma` is the covariance matrix of the per-example gradients. Heuristically, the noise scale measures the variation in the data as seen by the model, through the variance of the gradient across individual data points. When the noise scale is small, looking at a lot of data in parallel quickly becomes redundant, whereas when it is large, we can still learn a lot from huge batches of data.
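As an illustration of the formula, the following sketch computes the noise scale directly from a matrix of per-example gradients. The function name `simple_noise_scale` and the tiny hand-built inputs are assumptions for this example. This direct estimate is naive: the paper describes a cheaper estimator based on gradient norms measured at two batch sizes, which is not shown here, and nothing below should be read as the implementation used in the Masterful platform.

```python
import numpy as np

def simple_noise_scale(per_example_grads):
    """Naive estimate of tr(Sigma) / |G|^2 from per-example gradients.

    per_example_grads: array of shape (n_examples, n_params).
    """
    # The mean over examples stands in for the True Gradient G.
    mean_gradient = per_example_grads.mean(axis=0)
    # Sum of per-component sample variances, i.e. the trace of Sigma.
    trace_sigma = per_example_grads.var(axis=0, ddof=1).sum()
    return trace_sigma / np.sum(mean_gradient ** 2)

# Hand-built example: the same mean gradient with two levels of noise.
mean = np.array([1.0, 1.0])
deviations = np.array([[1.0, -1.0], [-1.0, -1.0], [1.0, 1.0], [-1.0, 1.0]])

quiet = simple_noise_scale(mean + deviations)
noisy = simple_noise_scale(mean + 2.0 * deviations)
print(quiet, noisy)
```

Doubling the spread of the per-example gradients around the same mean quadruples the estimate, which matches the intuition that noisier gradients justify larger batches.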

We trained EfficientNetB0 on CIFAR-10 with the LAMB optimizer at two different batch sizes. The starting learning rate differs between the two batch sizes because a larger batch size requires a higher starting learning rate; otherwise the model won’t be able to learn as much. We’ll soon write a blog post on the optimal starting learning rate for a specific batch size.

The two batch sizes we chose were:

- **64**: A common default value.
- **1024**: The batch size predicted by running the above gradient noise scale algorithm.

| Batch size | Learning rate | Training time | Epochs completed within training time | Training accuracy | Validation accuracy |
| --- | --- | --- | --- | --- | --- |
| 64 | 1e-3 | 1928 sec | 21 | 90.98% | 67.67% |
| 64 | 1e-3 | 7416 sec | 82 | 97.92% | 70.68% |
| 1024 | 7e-3 | 1914 sec | 81 | 98.64% | 71.51% |

As we can see from the above table:

- The baseline setup achieved 70.68% validation accuracy in 82 epochs over 7416 seconds.
- The experimental setup, trained with an optimized batch size, achieved equivalent results in 1914 seconds, or roughly **75% less wall clock time than the baseline**.
- Training the baseline setup for only the optimized wall clock time yields a less accurate model (67.67% validation accuracy).
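The wall clock comparison can be checked with a couple of lines of arithmetic on the training times reported in the table. The variable names are only for illustration.

```python
baseline_seconds = 7416   # batch size 64, 82 epochs
optimized_seconds = 1914  # batch size 1024, 81 epochs

reduction = 1 - optimized_seconds / baseline_seconds
speedup = baseline_seconds / optimized_seconds
print(f"wall clock reduction: {reduction:.1%}, speedup: {speedup:.2f}x")
```

The reduction works out to a little over 74%, which this post rounds to 75%, or a bit less than a 4x speedup.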

Finding the right batch size can be tedious, especially detecting the maximum batch size that will fit into GPU memory, running training cycles to see how well the model performs, and picking the best performing batch size. That's a lot of time and resources just to find the right batch size.

The Masterful platform employs meta-learning algorithms such as gradient noise scale to automatically find the right batch size for your specific dataset and model.

We’d love to hear what you think of our automated optimal batch size finder. Please reach out at learn@masterfulai.com or join the Masterful community at https://www.masterfulai.com/community. If you’d like to try our platform, with automatic solutions for unlabeled data, optimization, and regularization, just run "pip install masterful" and visit our developer docs at www.masterfulai.com/docs.

- Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. “An Empirical Model of Large-Batch Training”. https://arxiv.org/pdf/1812.06162.pdf
- David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. “Learning representations by back-propagating errors”. https://www.iro.umontreal.ca/~vincentp/ift3395/lectures/backprop_old.pdf
- T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
- Diederik P. Kingma, and Jimmy Ba. “Adam: A Method for Stochastic Optimization.” https://arxiv.org/pdf/1412.6980.pdf
- Ilya Loshchilov, and Frank Hutter. “Decoupled Weight Decay Regularization.” https://arxiv.org/pdf/1711.05101.pdf
- Yang You, Igor Gitman, and Boris Ginsburg. “Large Batch Training of Convolutional Networks.” https://arxiv.org/pdf/1708.03888.pdf
- Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. “Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.” https://arxiv.org/pdf/1904.00962.pdf
- Sam McCandlish, Jared Kaplan, and Dario Amodei. “How AI Training Scales” (blog post). https://openai.com/blog/science-of-ai/
- Ian Goodfellow, Yoshua Bengio, and Aaron Courville. “Deep Learning”. MIT Press, 2016.

Machine Learning Engineer and Researcher, Masterful AI