Four buckets of deep learning innovation: architecture, data, regularization, and optimization

Yaoshiang Ho

Intro

It’s hard to stay current and maintain competency in deep learning. It’s a young, fast-growing field, so groundbreaking research and innovations appear at a rapid pace. But at Masterful, we don’t have a choice: we have to stay current because the promise we make to developers is that our platform automatically delivers state-of-the-art approaches for computer vision (CV) models in a robust and scalable way.

The Buckets

One lens (among many) that we use to try to understand new developments is to put them into one of four buckets.

Deciding on the correct bucket can help yield insights. And some developments touch multiple buckets, in which case seeking out ablations of the various buckets can separate the core innovation from the incremental innovation.

This approach has been helpful for us as we develop our platform, and we hope it's a useful addition to your toolkit as you consider adding innovations to your models.

Architecture 

Architecture is the structure of weights, biases, and activations that define a model.

For example, a perceptron or logistic regression model’s architecture is a single multiplier weight per input, followed by a sum, followed by the addition of one bias weight, followed by a sigmoid activation.
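That description maps almost line for line onto code. The following is a minimal sketch in plain Python; the function name `logistic_unit` and the example numbers are illustrative assumptions, not taken from any library.

```python
import math


def logistic_unit(inputs, weights, bias):
    """Single-unit architecture: one weight per input, a sum, a bias, a sigmoid."""
    if len(inputs) != len(weights):
        raise ValueError("need exactly one weight per input")
    # One multiplier weight per input, followed by a sum.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    # Addition of one bias weight.
    pre_activation = weighted_sum + bias
    # Sigmoid activation squashes the result into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-pre_activation))


# Illustrative values only: the weighted sum is 0.5 * 1.0 + (-0.25) * 2.0 = 0.0.
output = logistic_unit(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
```

Everything that makes this a particular architecture is in the structure of `logistic_unit`; the values of `weights` and `bias` are what training would later fill in.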

In computer vision, most practitioners are familiar with the basic ideas behind AlexNet, VGG, Inception, ResNet, EfficientNets, MobileNets, and Vision Transformers, as well as task-specific designs for detection and segmentation like YOLO, SSD, U-Net, and Mask R-CNN.

Architectures have arguably been the main source of excitement in deep learning. 

Data

Data has been perhaps the least studied element, possibly because it was assumed to be a fixed input to a CV problem.

Recently, dramatic advances in semi-supervised learning (SSL), the ability of a CV model to extract information from both labeled and unlabeled images, have unlocked a vast new source of information for training models. And the move towards Data-Centric AI has brought fresh thinking to data [1].

Regularization 

Regularization means helping a model generalize to data it has not yet seen. Another way of saying this is that regularization is about fighting overfitting.

As a thought experiment, it is actually quite easy to achieve 100% accuracy (or mAP or other goodness of fit measure) on training data: just memorize a lookup table. But that would be an extreme example of overfitting: such a model would have absolutely zero ability to generalize to data that the model has not seen.
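The thought experiment can be sketched in a few lines of Python. The toy task (label = parity of the feature sum), the helper names, and the fallback guess of 0 are all assumptions made purely for illustration.

```python
import random


def label_of(features):
    # Toy ground truth, assumed for illustration: parity of the feature sum.
    return sum(features) % 2


def make_examples(rng, n, low, high):
    examples = []
    for _ in range(n):
        features = (rng.randint(low, high), rng.randint(low, high))
        examples.append((features, label_of(features)))
    return examples


def fit_lookup_table(examples):
    # "Training" is just memorizing every (input, label) pair verbatim.
    return dict(examples)


def predict(table, features, default_label=0):
    # Memorized inputs are answered exactly; anything else gets a fixed guess.
    return table.get(features, default_label)


def accuracy(table, examples):
    correct = sum(predict(table, x) == y for x, y in examples)
    return correct / len(examples)


rng = random.Random(0)  # fixed seed so the toy data is reproducible
train = make_examples(rng, 200, 0, 999)
unseen = make_examples(rng, 200, 1000, 1999)  # disjoint range, so never memorized

table = fit_lookup_table(train)
train_accuracy = accuracy(table, train)    # perfect by construction
unseen_accuracy = accuracy(table, unseen)  # only as good as the constant fallback
```

The lookup table scores perfectly on `train`, yet on `unseen` it can only return the fallback label, so its accuracy there is whatever a constant guess would achieve.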

Common methods for regularization include Dropout, L2 kernel regularization, data augmentation, stochastic depth, decoupled weight decay, early stopping, and, some would even argue, batch normalization.

Optimization

Finally, optimization. Optimization just means finding the best weights for a model and training data.

Optimization is different from regularization because optimization does not consider generalization to unseen data. The central challenge of optimization is speed - find the best weights faster.

Plain old gradient descent with a very low learning rate is sufficient to find the best weights, but it takes too long. So basically every innovation in optimizers, including momentum, RMSProp, Adam, LARS, and LAMB, is essentially about getting to the best weights faster by calculating weight updates from not only the current gradient but also statistical information about past gradients.
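To make the role of past gradients concrete, here is a small sketch comparing plain gradient descent with classical momentum on a toy quadratic loss. The loss, the learning rate, the momentum coefficient, and the helper name `steps_to_converge` are illustrative assumptions, not settings from any of the optimizers named above.

```python
def gradient(w, curvatures):
    # Gradient of the toy loss 0.5 * sum(c_i * w_i ** 2).
    return [c * wi for c, wi in zip(curvatures, w)]


def loss(w, curvatures):
    return 0.5 * sum(c * wi * wi for c, wi in zip(curvatures, w))


def steps_to_converge(momentum, lr=0.01, tol=1e-6, max_steps=10_000):
    curvatures = [1.0, 100.0]  # one shallow direction and one steep direction
    w = [1.0, 1.0]
    velocity = [0.0, 0.0]      # decaying running sum of past gradient steps
    for step in range(1, max_steps + 1):
        g = gradient(w, curvatures)
        # With momentum == 0.0 this reduces to plain gradient descent.
        velocity = [momentum * v - lr * gi for v, gi in zip(velocity, g)]
        w = [wi + v for wi, v in zip(w, velocity)]
        if loss(w, curvatures) < tol:
            return step
    return max_steps


plain_steps = steps_to_converge(momentum=0.0)
momentum_steps = steps_to_converge(momentum=0.9)
```

The learning rate is capped by the steep direction, so plain gradient descent crawls along the shallow one. The velocity term accumulates past gradients in that shallow direction, which is why `momentum_steps` comes out smaller than `plain_steps` on this toy problem.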

Another strain of optimization is applying GPUs, given that they are great at the matrix math that underlies neural networks. Most recently, parallel clusters of GPUs allow very large batch sizes. TensorFlow and PyTorch Lightning allow almost arbitrary horizontal scaling across GPUs, and the main method for taking advantage of parallel hardware is to increase your batch size and learning rate. Those of us who got started scrounging P100s on Google Colab with 16 GIGABYTES of RAM need to adjust our thinking now that it’s possible to rent a DGX-A100 from a cloud provider with 1.36 TERABYTES of RAM: enough room for roughly 100x larger batch sizes.
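How much to raise the learning rate is a judgment call that this post does not specify. One commonly used heuristic is to scale it linearly with the global batch size; the sketch below assumes that rule, and the function name and numbers are hypothetical.

```python
def scale_for_parallel_hardware(base_lr, base_batch_size,
                                num_devices, per_device_batch_size):
    """Return (global_batch_size, scaled_lr) under a linear scaling heuristic.

    Assumption: the learning rate grows in proportion to the batch size.
    This is one common rule of thumb, not the only option.
    """
    global_batch_size = num_devices * per_device_batch_size
    scaled_lr = base_lr * global_batch_size / base_batch_size
    return global_batch_size, scaled_lr


# Hypothetical setup: a recipe tuned at batch size 256 moved onto 8 devices.
global_batch, scaled_lr = scale_for_parallel_hardware(
    base_lr=0.1, base_batch_size=256, num_devices=8, per_device_batch_size=256
)
```

In practice a scaled learning rate like this is usually treated as a starting point to be tuned rather than a guaranteed setting.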

Multiple Buckets

AlexNet was a single paper, but it fits both the architecture and optimization buckets, since it was both a novel architecture and one of the first uses of GPUs. AlexNet also checks the regularization box, since it introduced a simple data augmentation scheme and used weight decay. When VGG was introduced, it faithfully reproduced the augmentation scheme of AlexNet to ensure that data augmentation was ablated away. Checking multiple boxes has the advantage of achieving the best accuracy, as in the ILSVRC challenge. But it has the downside of muddling the true driver of performance. For example, a quick read of the EfficientNet paper suggests that EfficientNets are the state-of-the-art architecture. However, a recent high-quality paper suggests that using modern regularization methods on plain old ResNets can achieve the same performance [2]!

A single algorithm may attempt to do multiple things. It has been hypothesized that some optimizer algorithms like SGD not only optimize (find the best weights for a model and training data) but also regularize the model (help it generalize to unseen data) [3]. 

Finally, for a programmer, a single functional area may touch on multiple buckets. In both TensorFlow and PyTorch, you implement decoupled weight decay regularization by setting hyperparameters of the optimizer, like AdamW, SGDW, LARS, and LAMB, meaning your “optimizer object” handles both optimization and regularization. And Dropout goes into your model “architecture” and yet is a regularization method, not really part of the architecture.
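A toy optimizer makes the overlap visible. The class below is a hypothetical sketch written for this post, not the TensorFlow or PyTorch API; it only illustrates how a single object can carry both an optimization knob (`lr`) and a regularization knob (`weight_decay`).

```python
class TinySGDW:
    """Hypothetical toy optimizer: SGD with decoupled weight decay."""

    def __init__(self, lr, weight_decay):
        self.lr = lr                      # optimization hyperparameter
        self.weight_decay = weight_decay  # regularization hyperparameter

    def step(self, weights, grads):
        updated = []
        for w, g in zip(weights, grads):
            gradient_step = self.lr * g                   # follow the loss gradient
            decay_step = self.lr * self.weight_decay * w  # shrink the weight directly
            updated.append(w - gradient_step - decay_step)
        return updated


optimizer = TinySGDW(lr=0.1, weight_decay=0.5)
# With zero gradients, only the regularization part of the update acts.
new_weights = optimizer.step(weights=[1.0, -2.0], grads=[0.0, 0.0])
```

Even with zero gradients the weights move toward zero, which shows the decay term is applied to the weights themselves rather than passing through the loss gradient. For plain SGD this coincides with an L2 penalty; the distinction matters mainly for adaptive optimizers such as Adam.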

Non-orthogonal interactions within buckets

We’ve learned the hard way here at Masterful that the innovations within each bucket interact with each other a great deal. In regularization, for example, strong use of data augmentation can reduce the need for other forms of regularization like weight decay. To get it right, you can’t just pick and choose hyperparameters from individual papers: you need to find the ones that work best together for your specific situation [4].

Conclusion

I’d love to hear from you if you think these buckets make sense, and even if you think they don’t! Please reach out at learn@masterfulai.com. If you’d like to try our platform, with automatically metalearned solutions for unlabeled data, optimization, and regularization, you can get an analysis of your model at https://www.masterfulai.com/product

References

  1. Andrew Ng, “MLOps: From Model-centric to Data-centric AI”, (2021). https://www.deeplearning.ai/wp-content/uploads/2021/06/MLOps-From-Model-centric-to-Data-centric-AI.pdf
  2. Irwan Bello, William Fedus, Xianzhi Du, Ekin D. Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, and Barret Zoph. "Revisiting resnets: Improved training and scaling strategies." arXiv preprint arXiv:2103.07579 (2021).
  3. Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. "Geometry of optimization and implicit regularization in deep learning." arXiv preprint arXiv:1705.03071 (2017).
  4. Alex Hernández-García and Peter König. "Data augmentation instead of explicit regularization." arXiv preprint arXiv:1806.03852 (2018).

 

