Curation is the unsung hero of academic datasets: many common benchmarks in vision are class-balanced, well labeled, and feature distinct train and test sets.
When moving from academic benchmarks to data “in the wild”—for example, training a model on a dataset you create yourself—curation transforms from an assumption to a responsibility.
A variety of common ML mistakes emerge from skipping or bungling one of the components of data curation. At Masterful, we've encountered these challenges while working with companies focused on geospatial analytics, and we thought it would be helpful to share how we think through them. In this post, we’ll focus on creating distinct train, val, and test splits, or, more specifically, on the ways a well-meaning ML engineer can accidentally introduce data leakage.
When training machine learning models, we aim for generalization: we want our model to correctly process data it hasn’t seen during training.
Learning without generalization is just memorizing the training data, which isn't useful for deploying models into production in the real world.
To measure generalization, we evaluate a trained model on a "test" set that the model has never been trained on. (We can also create a "validation" set, which is not used for learning model parameters but can be used to search hyperparameters.) The key here is that the model must really never see the data before, or else we have data leakage.
Data leakage means information about the test data leaks into the training process, opening the model up to memorization and/or overfitting. Data leakage will result in a model that appears to generalize well during training, but later disappoints in production on new data.
Unfortunately there are many ways to induce data leakage, some of them sneakier than others. We'll cover a few in this post.
Generating a dataset for geospatial machine learning is significantly different from generating a dataset of natural images. Unlike natural images, which are captured by photographers, geospatial imagery is typically collected by satellites, which image vast regions of the world at regular intervals. For many geospatial data collections, a single region of interest (for example, San Francisco) might consist of millions of pixels. One common approach to pare these large acquisitions down into model-friendly sizes is called chipping: sample smaller windows from the regions of interest, either randomly or in a grid.
Chips of San Francisco sampled randomly, with replacement.
When creating train and test splits of chipped data, there are multiple ways to introduce data leakage. The easiest to avoid is what we’ll call chip overlap: sampling chips from overlapping extents (regions) in your train and test data. It's easy to see how this could happen: let's say we’re generating RGB imagery over San Francisco, pulling chips of 10m imagery from the Sentinel-2 satellite. In fact, using an API like Google’s Earth Engine or SentinelHub, we can do this quite easily; a sketch is shown below.
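For instance, here is a minimal sketch of pulling a cloud-filtered Sentinel-2 RGB composite over San Francisco with the Earth Engine Python API. The bounding box, date range, cloud threshold, and export settings are illustrative placeholders, and SentinelHub offers an analogous workflow:

```python
import ee

ee.Initialize()

# Approximate bounding box over San Francisco (illustrative coordinates).
sf = ee.Geometry.Rectangle([-122.52, 37.70, -122.35, 37.83])

# Sentinel-2 surface reflectance: cloud-filtered median composite, RGB bands at 10 m.
composite = (
    ee.ImageCollection('COPERNICUS/S2_SR')
    .filterBounds(sf)
    .filterDate('2021-06-01', '2021-07-01')
    .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 10))
    .median()
    .select(['B4', 'B3', 'B2'])  # red, green, blue
)

# Export the composite at 10 m resolution; chips can then be sampled from the saved raster.
task = ee.batch.Export.image.toDrive(
    image=composite,
    description='sf_sentinel2_rgb',
    region=sf,
    scale=10,
    maxPixels=1e9,
)
task.start()
```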
When manually chipping large regions, it might seem simplest to sample random windows with replacement: pick a small window within an overall bounding extent, grab the appropriate imagery, and repeat. Then, after you’ve downloaded a large collection of chips, you can make a train/test split by dividing them up.
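As a rough sketch (assuming the full region has already been downloaded into a numpy array), random sampling with replacement might look like the following. Note that nothing prevents two sampled windows from covering the same ground:

```python
import numpy as np

def sample_chips_with_replacement(image, chip_size, num_chips, seed=0):
    """Randomly sample square chips from an (H, W, C) array, with replacement.

    Returns the chips and their (row, col) offsets. Because each window is drawn
    independently, windows can (and eventually will) overlap.
    """
    rng = np.random.default_rng(seed)
    height, width = image.shape[:2]
    chips, offsets = [], []
    for _ in range(num_chips):
        row = int(rng.integers(0, height - chip_size + 1))
        col = int(rng.integers(0, width - chip_size + 1))
        chips.append(image[row:row + chip_size, col:col + chip_size])
        offsets.append((row, col))
    return chips, offsets

# Naive split: divide the downloaded chips 80/20. Each saved chip is unique,
# but the ground it covers may not be -- see below.
region = np.zeros((5000, 5000, 3), dtype=np.uint8)  # placeholder imagery
chips, offsets = sample_chips_with_replacement(region, chip_size=256, num_chips=1000)
split = int(0.8 * len(chips))
train_chips, test_chips = chips[:split], chips[split:]
```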
But there’s a catch: though your saved chips will be unique, the area they cover might not. For example, if we follow the above data-collection approach and keep track of the extents our train and test sets cover (here, red and blue), we see the emergence of overlap:
Train and Test chips of San Francisco, some of which overlap.
In this case, a model trained on our data will probably exhibit artificially low generalization error, because it’s seen the same areas before, as parts of different chips in the training data. We aren’t measuring true generalization ability.
This is most easily avoided by chipping without replacement. When sampling random windows, a simple overlap check can determine acceptance or rejection of each candidate; the more common approach is simply to chip on a grid, which offers less variance in the chipped data per split but has no chance of overlap between chips (a code sketch of both approaches follows the figure below):
Chips generated by grid sampling. Notably, there is no overlap.
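Continuing the numpy sketch from above, here is one hedged take on both variants, grid chipping and rejection-sampled random windows (the function names are illustrative):

```python
import numpy as np

def sample_chips_on_grid(image, chip_size):
    """Chip an (H, W, C) array on a regular grid: no two chips overlap."""
    height, width = image.shape[:2]
    chips, offsets = [], []
    for row in range(0, height - chip_size + 1, chip_size):
        for col in range(0, width - chip_size + 1, chip_size):
            chips.append(image[row:row + chip_size, col:col + chip_size])
            offsets.append((row, col))
    return chips, offsets

def sample_chips_without_overlap(image, chip_size, num_chips, seed=0):
    """Rejection-sample random windows, discarding any that overlap an accepted one.

    Caution: this loops forever if num_chips exceeds what fits without overlap.
    """
    rng = np.random.default_rng(seed)
    height, width = image.shape[:2]
    chips, offsets = [], []
    while len(chips) < num_chips:
        row = int(rng.integers(0, height - chip_size + 1))
        col = int(rng.integers(0, width - chip_size + 1))
        # Two axis-aligned square windows overlap iff both offsets differ by
        # less than the chip size.
        overlaps = any(
            abs(row - r) < chip_size and abs(col - c) < chip_size
            for r, c in offsets
        )
        if not overlaps:
            chips.append(image[row:row + chip_size, col:col + chip_size])
            offsets.append((row, col))
    return chips, offsets
```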
Even proper chipping can leave room for a softer kind of leakage.
Let's say you're a geospatial MLE interested in land-cover classification. You gather imagery for three regions of California and chip them without replacement; you construct your train and test sets by mixing these chips together and creating two splits.
Here, your test set is measuring your model's ability to generalize to unseen chips from a familiar region; however, what if you want to deploy your model to a new region of California? This measure of generalization might not be predictive of your model's performance in the new region.
A stronger form of train/test split would be a pre-chipping split, i.e. holding out entire regions as a test set. (This is done in the popular So2Sat classification dataset [2].) Of course, this has disadvantages—with only one test region and without a k-fold approach, your test metrics are region-specific—but it ensures the model isn't overfitting to specific regions of the world.
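A minimal sketch of a pre-chipping (region-level) split, reusing the grid sampler from the earlier sketch; the region names and array sizes are made-up placeholders:

```python
import numpy as np

# Placeholder imagery for three California regions (illustrative names and sizes);
# in practice each would be a large downloaded raster.
regions = {
    'central_valley': np.zeros((4096, 4096, 3), dtype=np.uint8),
    'bay_area': np.zeros((4096, 4096, 3), dtype=np.uint8),
    'sierra_nevada': np.zeros((4096, 4096, 3), dtype=np.uint8),
}

# Region-level holdout: every chip from the held-out region goes to the test set,
# so the test metric measures generalization to an unseen region.
test_region = 'sierra_nevada'

train_chips, test_chips = [], []
for name, image in regions.items():
    chips, _ = sample_chips_on_grid(image, chip_size=256)  # grid sampler from above
    if name == test_region:
        test_chips.extend(chips)
    else:
        train_chips.extend(chips)
```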
This gets tricky, as the difficulty of regional generalization varies with the regions in question. For example, the SpaceNet-2 building segmentation dataset covers four cities: Paris, Las Vegas, Shanghai, and Khartoum [1]. The train and test splits per city aren’t regional, in that chips from the different splits are (grid) sampled from shared regions (below, test chips in blue within the total extent in yellow):
SpaceNet-2 Paris test-set chips, with the total dataset extent in yellow.
But if we train on one city and test on another, we might find that a standard model generalizes poorly. In Masterful's work at CCAI@ICML 2021, we demonstrated that generalizing to an unseen city is actually quite difficult, requiring creative domain-adaptive measures [3]:
Example predictions on Khartoum for models trained on Las Vegas, from [3].
Of course, you could argue that, if your test regions are significantly different from your train regions, you’ve turned a standard machine-learning task into a domain adaptation problem. The two concepts are related—a perfectly general model is also perfectly domain-adaptive—but you’ll want to keep this in mind when constructing your train and test sets: when are split regions too different to be useful? When are they too similar?
One of the ways geospatial data differs from natural imagery is its regularity: many satellites image the same spatial locations at repeating intervals. However, some kinds of geospatial data don’t change much over time; sampling such data repeatedly (at different times) and splitting the temporal copies separately creates the possibility of leakage. The risk lies in assuming things will change more than they do.
In some situations, the issue with pulling data over a range of times is intuitive. A single chipped region might have the same land-cover for years; if you pull from that region over multiple years, splitting the different temporal copies separately, you could end up showing your model information during training that’s highly similar to what it will see for that region at test-time.
Sometimes it’s less intuitive. For example, let’s look at the National Interagency Fire Center’s 2021 wildfires to date. Maybe you’re interested in burned-area classification, burn risk assessment, or even fire spread prediction. In such a situation you might download data for multiple fires, splitting it per fire. The two fires below, from different times, overlap spatially; if you split them separately, the model might be able to access information during training that is very similar, if not identical, to some contained in the test set—areas that burned for each fire could yield similar chipped samples.
Spatially overlapping wildfire burned areas from 2021.
This is just one overlap present in one region, in one year; the likelihood of similar overlaps increases greatly as you pull data from multiple years, especially for regions that burn often. Of course, this could be preempted by training and testing on separate regions, as discussed above; it’s simply something to consider when you elect not to split by region.
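One way to guard against this when splitting per fire is to check the burned-area perimeters for spatial intersection before assigning them to different splits. Here is a hedged sketch using shapely, with made-up rectangles standing in for real fire perimeters:

```python
from itertools import combinations
from shapely.geometry import Polygon

# Made-up rectangular stand-ins for real burned-area perimeters; in practice
# these would be loaded from fire-perimeter shapefiles or GeoJSON.
fires = {
    'fire_a_2021': Polygon([(0, 0), (10, 0), (10, 10), (0, 10)]),
    'fire_b_2021': Polygon([(8, 8), (18, 8), (18, 18), (8, 18)]),
    'fire_c_2021': Polygon([(30, 30), (40, 30), (40, 40), (30, 40)]),
}

# Flag any pair of fires whose perimeters overlap spatially: putting one in
# train and the other in test risks near-duplicate chips across the split.
for (name_a, poly_a), (name_b, poly_b) in combinations(fires.items(), 2):
    if poly_a.intersects(poly_b):
        overlap = poly_a.intersection(poly_b).area
        print(f'{name_a} and {name_b} overlap by {overlap:.1f} square units; '
              'keep them in the same split (or drop the overlapping area).')
```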
Sometimes you might think data is changing significantly over time, but it isn’t. When creating an airplane-detection dataset, an airport might seem like a safe place to sample from repeatedly, as planes will continually arrive at and depart from it. But if an airport shuts down, planes won’t move, and repeated observations will yield something closer to copies than truly unique samples.
If you’re used to ready-made datasets with predefined or easily randomized splits, leakage might not seem like a realistic issue. But once you wade into the waters of creating your own dataset, things can complicate quickly. In general, thinking carefully about how you’re constructing train and test splits—considering the sources of data and the specific problem you’ll use them to solve—can save headaches (and, in the case of spurious generalization performance, heartaches) in the future.
[1] Adam Van Etten, Dave Lindenbaum, and Todd M. Bacastow. "SpaceNet: A Remote Sensing Dataset and Challenge Series." arXiv preprint arXiv:1807.01232 (2018).
[2] Xiao Xiang Zhu et al. "So2Sat LCZ42: A Benchmark Dataset for Global Local Climate Zones Classification." arXiv preprint (2019).
[3] Jack Lynch and Samuel Wookey. "Leveraging Domain Adaptation for Low-Resource Geospatial Machine Learning." In ICML 2021 Workshop: Tackling Climate Change with Machine Learning (2021).