It’s no secret that just about any deep learning computer vision task can be improved with transfer learning. Transfer learning is a method where a model developed for one task is reused as the starting point for a model for another task. This leads to the question: how can we best reuse weights from existing models trained on very large datasets to improve performance and cut down development time for a new model on a smaller target dataset? In this post we’ll compare several approaches, provide experimental results, and show you how to easily incorporate transfer learning into your model development.
Before formulating an effective strategy for applying transfer learning in a training pipeline, it will help to establish an intuitive understanding of how our input images are being processed at varying depths of a convolutional neural network. The paper Visualizing and Understanding Convolutional Networks [1] provides a classic demonstration of what that signal processing looks like.
First Two Deconvolved Layer Activations of a ConvNet
The convolutional network is responsible for turning the pixels of the input image into features, which is why it is often called the feature extractor or encoder. The above image shows a side-by-side comparison of the first two layer activations within a feature extractor and their corresponding reconstructed images. Reconstruction is possible by feeding the activations into a deconvolutional network, which roughly applies the signal processing operations in reverse to decode the features back into a natural image.
The first layer learns to extract little more than colors and linear edges from its input. The second introduces more complicated textures and curves while remaining usefully generalizable to almost any dataset. For those curious, the same trend is also observed in the layer activations of a vision transformer [2], which suggests that the transfer learning strategies discussed here overlap with transformers, even though convolutional networks are the focus of this article.
Last Three Deconvolved Layer Activations of a ConvNet
The signal output by each layer becomes the input to the next. Feature complexity continues to increase in these later layers, but the features also become more task specific and therefore less useful to the target dataset. ImageNet pretraining, for example, lets the model detect animal faces, which is not so relevant if we aim to do geospatial analysis on satellite imagery, as seen in the upcoming benchmarks.
Once pretrained weights have been loaded into a newly instantiated model, the easiest approach is to compile the model and run the training loop with the same hyperparameter configuration that would be appropriate for random weight initialization. Let’s call this the naive transfer method, as nothing special is being done to accommodate starting with pretrained weights.
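As a concrete illustration, here is a minimal sketch of what naive transfer looks like in Keras, assuming a 64x64 RGB, 10-class target dataset like EuroSAT; the architecture, optimizer, and learning rate are placeholders rather than the benchmark’s exact configuration.

```python
import tensorflow as tf

# Naive transfer sketch: load ImageNet weights, attach a new classification
# head, and train exactly as you would from random initialization. Nothing is
# done to protect the pretrained weights: full learning rate, every layer
# trainable, batch normalization left in its default training-mode behavior.
base = tf.keras.applications.ResNet50V2(
    include_top=False, weights="imagenet", input_shape=(64, 64, 3), pooling="avg"
)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(10, activation="softmax"),  # 10 EuroSAT classes
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(train_ds, validation_data=val_ds, epochs=100)
```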
The biggest drawback to this method is that large gradient updates destroy the generalizable feature representations we want to preserve in order to perform better on the target dataset. This happens when the learning rate is set too high, and is made worse when batch normalization operates in training mode or when the pre-processing differs from what originally produced the pretrained weights. In the best-case scenario the model’s weights change dramatically and the knowledge they once possessed has to be re-learned. In the worst-case scenario the model fails to converge altogether.
The benchmarks provided in the results section are only semi-naive in that they at least use compatible pre-processing. Batch normalization’s two operating modes are tested side by side to demonstrate the harm of leaving it in training mode. Experiments that failed because the learning rate was too high were omitted.
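For reference, here is roughly what those two safeguards look like in Keras: calling the encoder with `training=False` keeps batch normalization in inference mode, and the architecture’s matching `preprocess_input` handles compatible pre-processing. This is a generic sketch, not the benchmark code itself.

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(64, 64, 3))
# Use the pre-processing that matches the pretrained ResNet50V2 weights.
x = tf.keras.applications.resnet_v2.preprocess_input(inputs)

encoder = tf.keras.applications.ResNet50V2(
    include_top=False, weights="imagenet", input_shape=(64, 64, 3)
)
# training=False keeps batch normalization in inference mode (frozen moving
# statistics) even while the encoder's weights are being updated.
x = encoder(x, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
```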
Another approach is to freeze some or all of the layers of the encoder. Deciding how many of the earlier layers to leave undisturbed when updating weights during training would have to be done by trial and error.
However, models produced with this method will often score lower on metrics during evaluations on validation and test sets.
If too many layers in the encoder are frozen, there may not be enough trainable parameters to adequately fit the unique features of the target dataset. This method also fails to acquire knowledge that can only be gained from co-adaptation [3], which is when neighboring layers learn from one another in ways that cannot be replicated if one neighbor is frozen. To give credit where it is due, fewer trainable parameters cut down on the model’s memory footprint on the GPU during training and also reduce training time.
The benchmark featured in the results section freezes the entire encoder. Its output is fed to a global average pooling layer followed by a fully connected dense layer, which is trained on the target dataset. Performance would improve if fewer layers were frozen.
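A sketch of this setup, assuming the same 64x64, 10-class target dataset (the benchmark’s actual training script may differ):

```python
import tensorflow as tf

encoder = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", input_shape=(64, 64, 3)
)
encoder.trainable = False  # freeze every layer in the feature extractor

inputs = tf.keras.Input(shape=(64, 64, 3))
x = encoder(inputs, training=False)            # batch norm in inference mode
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)  # only these weights train
model = tf.keras.Model(inputs, outputs)

# To freeze only the earlier layers instead, unfreeze a tail chosen by trial and error:
# for layer in encoder.layers[-20:]:
#     layer.trainable = True
```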
A work-around for the cost of using a frozen encoder is to break the training process into two phases. The model is first trained to convergence with a frozen encoder, and then the encoder is unfrozen so the entire model can be fine-tuned with a small learning rate. The second phase allows some recovery of the knowledge gained from co-adaptation.
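Continuing the frozen-encoder sketch above, the two phases could look something like this; the epoch counts, learning rates, and the train_ds/val_ds pipelines are placeholders, not the benchmark’s settings.

```python
# Phase 1: train the new head to convergence with the encoder frozen.
encoder.trainable = False
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(train_ds, validation_data=val_ds, epochs=50)

# Phase 2: unfreeze everything and fine-tune with a much smaller learning rate.
# Recompile so the new trainable flags take effect.
encoder.trainable = True
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.0005, momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(train_ds, validation_data=val_ds, epochs=50)
```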
The approach would appear to have many advantages. The frozen encoder phase allows the early layers to preserve their representations, the fine-tuning phase allows the layers to co-adapt, and the approach is faster than the learning rate warm-up technique described below.
Unfortunately, based on the experiments in the results section, the lower accuracy scores suggest that the freeze then fine-tune technique is not particularly successful. We speculate that the resulting model could be stuck in a local minimum, or perhaps the frozen phase produces a final layer that is overfit to the target domain, preventing co-adaptation during fine-tuning.
What if each layer had its own optimizer, initialized with an appropriate learning rate depending on its location in the network? This is not only possible with TensorFlow Addons’ MultiOptimizer, it’s also very effective if you can guess how many optimizers are appropriate for the selected model architecture, which layers to assign to each, and what the initial learning rate should be for each. That’s a lot of hyperparameters to manually tune! You will also have to create custom classes for any callback that references a learning rate so that each learning rate within the MultiOptimizer can be handled accordingly. An example of this would be a callback that reduces the learning rate after the monitored metric plateaus.
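As an illustration, a layer-wise setup with tfa.optimizers.MultiOptimizer from TensorFlow Addons might look like the sketch below. The split point and the three learning rates are guesses for illustration only; each is a hyperparameter that would need tuning.

```python
import tensorflow as tf
import tensorflow_addons as tfa

encoder = tf.keras.applications.ResNet50V2(
    include_top=False, weights="imagenet", input_shape=(64, 64, 3)
)
inputs = tf.keras.Input(shape=(64, 64, 3))
x = encoder(inputs, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

# Assign smaller learning rates to the earlier, more generalizable layers and a
# larger one to the freshly initialized head.
split = len(encoder.layers) // 2
optimizers_and_layers = [
    (tf.keras.optimizers.SGD(1e-4, momentum=0.9), encoder.layers[:split]),  # early layers
    (tf.keras.optimizers.SGD(1e-3, momentum=0.9), encoder.layers[split:]),  # later layers
    (tf.keras.optimizers.SGD(1e-2, momentum=0.9), model.layers[-2:]),       # pooling + new head
]
model.compile(
    optimizer=tfa.optimizers.MultiOptimizer(optimizers_and_layers),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```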
Saving one of the most effective methods for last, we can introduce a learning rate scheduler that starts at a low value typically reserved for fine-tuning and then linearly increases it over the first few epochs of training. By slowly ramping up the learning rate, the model avoids large gradient updates, preserves co-adaptation, trains relatively quickly, and matches or exceeds the other methods on the final metrics. Since none of the layers have to be frozen, the machine learning engineer or data scientist doesn’t have to deal with the added complexity of preparing the model for training.
In many ways it is just like using the naive transfer method, but recall that experiments with a high learning rate were omitted from the results! For example, if you run naive transfer learning on a ResNet50V2 model with the learning rate set to 0.005, training will crash due to divergence after the first epoch, but the same setup will succeed if a warm-up scheduler is applied for the first 5 epochs of training, making the warm-up strategy a safer option. You can verify this outcome with the test code we made publicly available.
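A linear warm-up can be expressed as an ordinary Keras LearningRateScheduler callback. The sketch below mirrors the example above (a target rate of 0.005 reached over 5 epochs), with an arbitrarily chosen starting rate; it is an illustration, not Masterful’s implementation.

```python
import tensorflow as tf

def make_warmup_schedule(start_lr=1e-4, target_lr=5e-3, warmup_epochs=5):
    """Linearly ramp the learning rate from start_lr to target_lr, then hand
    control back to any other callbacks (e.g. ReduceLROnPlateau) in play."""
    def schedule(epoch, lr):
        if epoch <= warmup_epochs:
            return start_lr + (target_lr - start_lr) * (epoch / warmup_epochs)
        return lr
    return schedule

warmup = tf.keras.callbacks.LearningRateScheduler(make_warmup_schedule(), verbose=1)
# model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[warmup])
```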
Masterful comes equipped with a learning rate warm-up scheduler and doesn’t require the user to set any of its hyperparameters, such as the epoch count or the initial and final learning rates. See our Guide on Transfer Learning, which provides runnable code to try it out.
Let's review the performance of models that classify EuroSAT geospatial data [4] using ImageNet pretrained weights. Though both datasets contain RGB images, there is no class overlap, and ImageNet does not contain any geospatial data taken from a satellite. This makes the task of transferring knowledge more challenging, as the pretrained weights in the later layers of the network will be less useful for classifying images in the target dataset. The class imbalance of this dataset isn’t severe, so we can use accuracy to fairly judge model performance.
The dataset is split so that 80% is used for training, 10% for validation, and 10% for testing on a final holdout set after the model has finished training. All metrics reported here are derived from this test set to avoid the over-optimistic outcomes that occur when evaluating on a validation set the model has overfit to. For this reason the metrics are not comparable to the original EuroSAT paper, which features an 80/20 split.
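One way to reproduce such a split with TensorFlow Datasets, which ships EuroSAT as a single 'train' split, is shown below; the exact slicing is our assumption, not necessarily the split used for the benchmarks.

```python
import tensorflow_datasets as tfds

# Carve 80/10/10 train/validation/test splits out of the single EuroSAT split.
(train_ds, val_ds, test_ds), info = tfds.load(
    "eurosat/rgb",
    split=["train[:80%]", "train[80%:90%]", "train[90%:]"],
    as_supervised=True,  # yields (image, label) pairs
    with_info=True,
)
```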
There is no data augmentation or other fancy regularization at play here. The point of using a barebones training pipeline is to isolate transfer learning from unrelated influences. Each experiment uses stochastic gradient descent with momentum for optimization. Early stopping after 20 epochs of no improvement in validation accuracy and a callback that reduces the learning rate after 8 epochs of no improvement in validation accuracy were also applied. Both EfficientNetB0 and ResNet50V2 models are used to see whether any trend holds across architectures. The benchmarks were produced with an NVIDIA Tesla K80 GPU.
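In Keras terms, that training configuration corresponds roughly to the following; the learning rate, momentum value, reduction factor, and restore_best_weights flag are placeholders not stated above.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
callbacks = [
    # Stop after 20 epochs without improvement in validation accuracy.
    tf.keras.callbacks.EarlyStopping(
        monitor="val_accuracy", patience=20, restore_best_weights=True
    ),
    # Reduce the learning rate after 8 epochs without improvement.
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_accuracy", patience=8, factor=0.5
    ),
]
# model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy",
#               metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=500, callbacks=callbacks)
```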
Test Set Accuracy and Train Time Scatter Plots
The above scatter plots show the performance trends for each method of transfer learning previously covered. Each method was benchmarked with 5 trials that used various combinations of learning rate and batch size to make trends more visible in the face of suboptimal hyperparameter configuration and random noise.
The table above reports the average test set accuracy and train times of the EfficientNetB0 experiments featured in the scatter plot. It also isolates the best accuracy from the 5 trials and reports the corresponding train time of the highest scoring experiment. Studying it reveals that naive transfer learning with batch normalization running in inference mode is almost identical in accuracy to the learning rate warm-up strategy when the learning rate is restricted to values that do not blow up naive transfer training. The 5 epochs of warm-up naturally slow down the overall train time of the warm-up method. The layer-wise method also proves competitive if configured correctly for each model architecture it is applied to.
Despite the gradual drop in performance observed across the other methods, all of them achieve significantly higher accuracy on the holdout dataset than random initialization. It is not surprising to see high train times for unfrozen fine-tuning. The same can be said for freeze then fine-tune, given that both stages of training were given 20 epochs of patience until convergence. Other than these exceptions, transfer learning allows the model to converge faster.
When studying the results for ResNet50V2, we see that the warm-up method comes out on top in comparison to the naive transfer method. Another notable difference is how poorly the frozen encoder performs, even in comparison to random initialization. Admittedly, not all layers in the encoder have to be frozen to apply this method in practice, and freezing an encoder can allow large models to train on affordable GPUs. Outcomes will also vary depending on how different the target dataset is from the one that produced the pretrained weights.
If you’re building computer vision models and are not taking advantage of transfer learning, we recommend adding it, especially considering how readily available pretrained weights are for a large number of publicly available model architectures. If you need assistance building a model that effectively uses a pretrained encoder, please see our Guide on Transfer Learning. We also recommend training with Masterful, which comes pre-loaded with the top-performing learning rate warm-up scheduler featured in this benchmark, so that you don’t have to waste time implementing and configuring it yourself. Training the same EfficientNetB0 model with Masterful yields a test set accuracy of 98.4815%, exceeding the best performing experiment featured in the results section, thanks to the extra regularization and hyperparameter tuning techniques featured in the Masterful train loop. The ResNet50V2 model also improves when trained with Masterful, earning an accuracy of 97.4815% on the test set. View our Quickstart Tutorial to learn how to install and begin training your models with Masterful. Chat with us on our Community Slack channel if you have any questions!