
Is More Data Labeling Worth it?

Yaoshiang Ho

Our mission at Masterful AI is to bring the power and efficiency of modern software development to machine learning. One of the most archaic and error-prone aspects of ML development is getting accurately labeled training data. Through our work with many other ML engineers, we've seen a common fear: no one really knows if simply throwing more labeled training data at their model is going to deliver the accuracy they need. This has big implications, since labeling is slow and expensive. In this post, we'll share a framework and online calculator you can use to evaluate the ROI of spending more money on labeling.

The story usually goes something like this. When an ML engineer first embarked on the deep learning adventure, the results were super promising. With an open-source architecture, pretrained weights, and a few hundred dollars in labeling, a model achieved decent results, let’s say something like 65% accuracy on a classification task. Early improvements came easily and rapidly: after a little more labeling, the accuracy improved to something like 75%.

But fast-forward to today and improvements have plateaued. Every batch of labeling delivers less and less impact. Worse yet, the ML engineer can’t even predict how much labeling budget they’ll need to improve their model for the next level of accuracy they want to promise to their customers. 

At Masterful, we’ve helped our customers think through this issue with a simple but well-grounded framework.

First, think in terms of error rather than accuracy. 

Accuracy (like other goodness-of-fit measures such as mean average precision, F1, or recall/precision) is by definition capped at 1.0, so a goal like “increase accuracy by 5 percentage points” means something different if your starting accuracy is 50% vs 90%, and of course it’s mathematically impossible if your starting accuracy is already 99%. It’s more helpful to think about reducing the error rate. For example, if your classifier has an accuracy of 90%, its error rate is 10%. Reducing that by half would get you an error rate of 5%.
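To make the point concrete, here is a minimal Python sketch that compares the same 5-point accuracy gain from two starting points. The function names are illustrative, not part of any library.

```python
def error_rate(accuracy: float) -> float:
    """Convert an accuracy in [0, 1] to an error rate."""
    return 1.0 - accuracy


def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fraction of the original error removed when accuracy improves."""
    before = error_rate(acc_before)
    after = error_rate(acc_after)
    return (before - after) / before


# The same 5-point accuracy gain removes very different shares of the error.
low_start = relative_error_reduction(0.50, 0.55)   # about 10% of the error
high_start = relative_error_reduction(0.90, 0.95)  # about 50% of the error
```

Going from 90% to 95% accuracy halves the error, while going from 50% to 55% removes only a tenth of it.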

Second, error rate can be predicted from the size of your training data.

A series of papers has trained models multiple times with varying training data cardinalities (the number of training data points) to attempt to find a predictable relationship [1,2,3]. This body of work suggests that error rate and training data set cardinality are related by a power law. Similar work has found a power law relationship in natural language processing models [3], object detection problems [4], and transfer learning [5]. Here at Masterful, we have confirmed this power law ourselves with our own study of CIFAR10.

A simplified version of a power law equation takes the form (Eq 1):

E = C^k


E is error rate

C is cardinality

k is a constant.


Figure 1: Plotting top-1 error of a CIFAR10 classifier against cardinality appears to fit a power law: plotting on a log-log scale yields straight lines. This relationship appears to hold for multiple model sizes, represented by different colors. Reproduced from [2]. 


Figure 2: In our own experiments with CIFAR10 and a simple convnet, we observed that the relationship between error and cardinality appears to follow a power law. Left shows the error and cardinality plotted on a linear scale, showing some convexity. Right shows the same data with both the x and y axes on a logarithmic scale (a “log-log plot”). In the log-log plot, the relationship appears linear, implying a power law relationship.
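If you have error measurements at several training set sizes, the straight line on the log-log plot can be estimated directly. The sketch below is one possible way to do it, assuming the simplified form in Eq 1: taking logs gives log(E) = k x log(C), a line through the origin, so the slope k has a closed-form least-squares estimate. Real measurements may fit better with an extra multiplicative constant, which this sketch does not model. The function name and the data are hypothetical; the data is generated synthetically and is not an experimental result.

```python
import math


def fit_power_law_exponent(cardinalities, error_rates):
    """Least-squares estimate of k in E = C**k from several measurements.

    Taking logs gives log(E) = k * log(C), a line through the origin, so the
    best-fit slope is sum(x * y) / sum(x * x) with x = log(C), y = log(E).
    """
    xs = [math.log(c) for c in cardinalities]
    ys = [math.log(e) for e in error_rates]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Synthetic measurements generated from E = C**-0.2 (not real results).
sizes = [1_000, 5_000, 10_000, 50_000]
errors = [c ** -0.2 for c in sizes]
k_hat = fit_power_law_exponent(sizes, errors)
```

Because the synthetic points lie exactly on a power law, the fit recovers the exponent used to generate them; noisy real measurements would only approximate it.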

Third, you can predict your future error rate using a simple calculation.

Using some algebraic manipulation of Eq 1, we can estimate a model’s future error rate with more labeled training data. 

First, divide the log of your error rate (expressed as a fraction, such as 0.10 rather than 10) by the log of your data set cardinality. This is the k coefficient you’ll need later. The result should be between -1.0 and 0. The base of the log does not matter as long as you use the same base for both logs: log_2, natural log, and log_10 all give the same ratio.
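For readers who want the algebra spelled out, this step follows from taking logs of both sides of the power law E = C^k:

```latex
E = C^{k}
\;\Longrightarrow\; \log E = k \log C
\;\Longrightarrow\; k = \frac{\log E}{\log C}
```

With E a fraction below 1 and C greater than 1, log E is negative and log C is positive, so k comes out negative. Changing the base of the log scales the numerator and denominator by the same factor, which is why the ratio is unaffected.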

Next, to estimate your error rate if you were to increase your training data size, just calculate:

Predicted error rate @ N times the training data = (N x cardinality) ^ k, where k is the coefficient from the previous step.
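The two steps can be sketched in a few lines of Python using only the standard library. The function names are illustrative, and the code assumes the simplified power law E = C^k with the error rate given as a fraction.

```python
import math


def estimate_k(error_rate: float, cardinality: int) -> float:
    """Solve E = C**k for k; error_rate is a fraction such as 0.10."""
    return math.log(error_rate) / math.log(cardinality)


def predict_error(error_rate: float, cardinality: int, multiplier: float) -> float:
    """Predicted error rate with `multiplier` times the current training data."""
    k = estimate_k(error_rate, cardinality)
    return (multiplier * cardinality) ** k
```

A quick sanity check is that a multiplier of 1 reproduces your current error rate, since k was solved from exactly that data point.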


Let’s work through a simple example. Suppose that on CIFAR10, your model achieves a 90% top-1 accuracy rate. That’s an error rate of 10%. And we know CIFAR10 has 50,000 training data points.

First, we calculate log(0.10) / log(50000). That results in a k coefficient of -0.2128.

Second, we predict the error rate if we had more CIFAR-10 training data:

Predicted error rate @ 2 times the training data =  (100,000) ^ -0.2128 = 0.086.

So with twice as much data, your error rate is predicted to drop from 10% to 8.6%. 
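The same relationship can be inverted to ask how much data a target error rate would require: setting (N x cardinality) ^ k equal to the target and solving gives N = target ^ (1/k) / cardinality. The sketch below applies that to the worked example above. The function name is hypothetical, and the result inherits all the simplifications of the power law estimate.

```python
import math


def data_multiplier_for_target(error_rate: float, cardinality: int,
                               target_error: float) -> float:
    """How many times the current data the power law says is needed."""
    k = math.log(error_rate) / math.log(cardinality)
    needed_cardinality = target_error ** (1.0 / k)
    return needed_cardinality / cardinality


# Worked example above: 10% error at 50,000 examples; target is 5% error.
multiplier = data_multiplier_for_target(0.10, 50_000, 0.05)  # roughly 26x
```

Under these assumptions, halving the error from 10% to 5% calls for roughly 26 times the current training data, which illustrates why each batch of labeling feels less impactful than the last.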


We’ve created an online calculator here, so you can apply this framework to different scenarios without pulling out NumPy and Matplotlib. As part of the output, you can see how much accuracy more labeling budget will buy you.


Figure 3: Plot of predicted error rate at various cardinalities, available in the online calculator. 
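If you would rather reproduce this kind of curve yourself, the following is a rough sketch, not the calculator's actual implementation. It tabulates predicted error and extra labeling cost at several data multipliers. The function name, the scenario, and the flat cost_per_label price are all hypothetical.

```python
import math


def labeling_scenarios(error_rate, cardinality, multipliers, cost_per_label):
    """Predicted error and extra labeling cost for each data multiplier.

    cost_per_label is a hypothetical flat price per labeled example.
    """
    k = math.log(error_rate) / math.log(cardinality)
    rows = []
    for m in multipliers:
        new_size = int(m * cardinality)
        predicted = new_size ** k
        extra_cost = (new_size - cardinality) * cost_per_label
        rows.append((new_size, predicted, extra_cost))
    return rows


# Hypothetical scenario: 10% error at 50,000 examples, $0.05 per label.
rows = labeling_scenarios(0.10, 50_000, [1, 2, 4, 8, 16], cost_per_label=0.05)
for size, err, cost in rows:
    print(f"{size:>9,d} examples  predicted error {err:.3f}  extra cost ${cost:,.0f}")
```

Reading down the rows shows the cost doubling at each step while the predicted error shrinks by a smaller and smaller amount.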


This approach is just an estimate and makes some simplifying assumptions:

  • Assumes Bayes Error is 0%. 
  • Assumes your model is big enough to be able to learn from more data. 
  • Assumes no data drift. 
  • Assumes no labeler drift.

All four of these assumptions make the estimate optimistic about the value of more labeling, so treat the predicted improvement as an upper bound.

Next Steps

Give the online calculator a spin and let us know if you have any questions or feedback at or @masterful_ai. 


Aaron Sabin and Ray Tawil contributed to this post with literature reviews and experiments. 


[1] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.

[2] Jonathan S. Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales. arXiv preprint arXiv:1909.12673, 2019.

[3] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

[4] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.

[5] Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. Scaling laws for transfer. arXiv preprint arXiv:2102.01293, 2021.
