Does smart selection actually work?

Before asking anyone to pay for ALaaS, we measured it. This page summarizes five controlled studies: label efficiency on real photos, robustness across three image domains, a head-to-head comparison with published active learning methods, a scale test with real model training on 20,000-image pools, and a segmentation test with pixel masks. It includes the results that did not go our way.

How we test

The protocol is the standard one from the active learning literature: take a labelled dataset and hide the labels. Grow a training set at matched budgets, either by asking ALaaS which images to label next or by picking at random, then train the same classifier on each set and score it on a held-out test split. The selection never sees any labels. Every number below is averaged over several random splits.
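The protocol can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the harness used for the studies: `selector` is a hypothetical stand-in for whatever picks the images (only a random baseline is shown, and no ALaaS API call appears here), the classifier is a toy nearest-centroid model, the data is synthetic, and each budget is selected independently rather than grown incrementally.

```python
import random
from statistics import mean


def nearest_centroid_accuracy(train, test):
    """Fit a nearest-centroid classifier on (vector, label) pairs and score it on test."""
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def predict(x):
        return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

    return mean(1.0 if predict(x) == y else 0.0 for x, y in test)


def random_selector(pool_vectors, budget, rng):
    """Baseline strategy: pick `budget` pool indices uniformly at random."""
    return rng.sample(range(len(pool_vectors)), budget)


def evaluate(dataset, selector, budgets, n_splits=5, test_fraction=0.3, seed=0):
    """Return {budget: accuracy averaged over random splits} for one strategy."""
    results = {b: [] for b in budgets}
    for split in range(n_splits):
        rng = random.Random(seed + split)
        items = dataset[:]
        rng.shuffle(items)
        n_test = int(len(items) * test_fraction)
        test, pool = items[:n_test], items[n_test:]
        pool_vectors = [x for x, _ in pool]  # the selector sees vectors only, never labels
        for b in budgets:
            chosen = selector(pool_vectors, b, rng)
            train = [pool[i] for i in chosen]  # labels are revealed only for chosen items
            results[b].append(nearest_centroid_accuracy(train, test))
    return {b: mean(v) for b, v in results.items()}


def toy_dataset(n_per_class=100, seed=0):
    """Synthetic two-class data: two well-separated 2-D Gaussian clusters."""
    rng = random.Random(seed)
    data = []
    for label, (cx, cy) in enumerate([(0.0, 0.0), (4.0, 4.0)]):
        for _ in range(n_per_class):
            data.append(([rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)], label))
    return data


curve = evaluate(toy_dataset(), random_selector, budgets=[4, 16, 64])
```

Comparing two strategies means calling `evaluate` twice with the same dataset, budgets and seed, once per selector, so both are scored on identical splits.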

Study 1: Same accuracy, 2.5x fewer labels

Real photos, 6 classes. ALaaS reached 98% accuracy with 60 labels; random sampling needed about 150 labels for the same result. The gap is largest exactly where labelling budgets hurt most: at 16 to 40 labels, ALaaS was 5 to 11 accuracy points ahead.

[Chart: accuracy (25% to 100%) against number of labelled images (0 to 160), ALaaS selection vs random sampling.]

Classifier accuracy vs labelling budget. Mean over 5 splits.

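| Labels | ALaaS | Random | Difference |
|--------|-------|--------|------------|
| 16     | 67.0% | 55.9%  | +11.1 pp   |
| 24     | 81.1% | 70.9%  | +10.3 pp   |
| 40     | 93.2% | 88.0%  | +5.2 pp    |
| 60     | 98.2% | 96.4%  | +1.8 pp    |
| 150    | 98.3% | 98.2%  | +0.1 pp    |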
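The 2.5x figure in the heading can be reproduced from the table above. The sketch below is one way to do the arithmetic, assuming "labels needed" means the smallest measured budget at which a strategy reaches the target accuracy. The helper name `labels_to_reach` is illustrative, and the true crossing point for random sampling may lie somewhere between the measured budgets of 60 and 150.

```python
# (labels, ALaaS accuracy %, random accuracy %) rows copied from the table above.
ROWS = [
    (16, 67.0, 55.9),
    (24, 81.1, 70.9),
    (40, 93.2, 88.0),
    (60, 98.2, 96.4),
    (150, 98.3, 98.2),
]


def labels_to_reach(target, column):
    """Smallest tabulated budget whose accuracy meets `target`, or None if never reached."""
    for row in ROWS:
        if row[column] >= target:
            return row[0]
    return None


alaas_budget = labels_to_reach(98.0, column=1)
random_budget = labels_to_reach(98.0, column=2)
savings = random_budget / alaas_budget
print(alaas_budget, random_budget, savings)  # 60 150 2.5
```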

Study 2: Three domains, one honest answer

The same protocol on three very different datasets: natural photos, satellite land-use tiles and blood cell microscopy. The table shows the mean accuracy gain over random sampling, averaged across all budgets.

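| Domain             | Dataset             | Gain vs random |
|--------------------|---------------------|----------------|
| Natural photos     | 6 classes           | +5.1 pp        |
| Satellite imagery  | 10 land-use classes | +5.6 pp        |
| Medical microscopy | 8 blood cell types  | +2.1 pp        |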

The honest part: on blood microscopy the gain is small and noisy. General-purpose vision systems transfer poorly to that domain, and when the underlying signal carries little structure, no selection strategy can do much with it. We say this openly because it defines where ALaaS helps today: natural imagery, aerial and satellite data, retail, industrial inspection and similar domains. Support for domain-specific adaptation is on our roadmap.

A second finding from this study: our explore strategy (which deliberately includes unusual images) beat the default on satellite data, where classes vary a lot internally, but cost accuracy on cleaner data. That is why coverage is the default for training efficiency and explore is opt-in for edge-case hunting and dataset QA.

Study 3: Against published methods

We ran a fair batch comparison against the strongest published active learning baselines: BADGE (Ash et al. 2020), entropy and margin uncertainty sampling, and k-center greedy coreset selection (Sener & Savarese 2018). Same data, same seeds, same classifier, batches of 16, across all three domains.

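| Method           | Mean accuracy (3 domains) | Works from zero labels |
|------------------|---------------------------|------------------------|
| ALaaS (coverage) | 65.6%                     | yes                    |
| BADGE            | 65.5%                     | no                     |
| Random           | 64.7%                     | yes                    |
| Margin           | 63.1%                     | no                     |
| Entropy          | 58.3%                     | no                     |
| k-center greedy  | 48.0%                     | yes                    |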

ALaaS ties BADGE, the strongest published method we tested, while needing no trained model in the loop: BADGE and the uncertainty methods require an existing model to score images, so they cannot even start at the cold-start point where most labelling projects begin. We also repeated the comparison with a real network fine-tuned from scratch each round instead of a fixed classifier; the ranking held there too, with ALaaS and random leading and the model-in-the-loop methods trailing.
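The cold-start distinction is easy to see in code. Below is a minimal sketch of three of the published baselines in their textbook form. It is not the implementation used in the study, and it says nothing about how ALaaS coverage selection works internally. Entropy and margin take predicted class probabilities, which only a trained model can supply. k-center greedy takes feature vectors only, which is why it can run with zero labels. BADGE is omitted because it needs gradient embeddings from a trained network.

```python
import math


def entropy_score(probs):
    """Shannon entropy of a predicted class distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def margin_score(probs):
    """Gap between the two most likely classes (smaller = more uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def k_center_greedy(features, k, first=0):
    """Greedy k-center: repeatedly add the point farthest from the chosen set."""
    chosen = [first]
    # Distance from every point to its nearest chosen point so far.
    dist = [math.dist(x, features[first]) for x in features]
    while len(chosen) < k:
        nxt = max(range(len(features)), key=dist.__getitem__)
        chosen.append(nxt)
        dist = [min(d, math.dist(x, features[nxt])) for d, x in zip(dist, features)]
    return chosen


# Illustrative inputs only: five 2-D feature vectors and two probability vectors.
features = [(0, 0), (0.1, 0), (5, 5), (5, 5.1), (10, 0)]
picked = k_center_greedy(features, k=3)       # needs no model and no labels
uncertain = entropy_score([0.5, 0.5])         # needs a model's predicted probabilities
confident = entropy_score([1.0, 0.0])
```

In a batch setting, the uncertainty methods would label the images with the highest entropy or the smallest margin first. The starting index `first=0` in `k_center_greedy` is an arbitrary choice made for this sketch.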

Study 4: Does it hold at scale, with real training?

The studies above use a linear probe on a few hundred images. This one is the toughest test we run: pools of 20,000+ images, and at every budget a full neural network fine-tuned from scratch (not a probe), evaluated on a held-out test set. Two domains, 3 seeds, compared against the same published methods. Mean accuracy across budgets (50 to 1,600 labels):

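| Method           | Photos (20k pool) | Satellite (22k pool) |
|------------------|-------------------|----------------------|
| ALaaS (coverage) | 69.4%             | 83.2%                |
| BADGE            | 66.6%             | 83.3%                |
| Margin           | 67.8%             | 82.0%                |
| ALaaS (explore)  | 66.9%             | 83.1%                |
| Random           | 66.2%             | 83.0%                |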

On photos, coverage is the best method overall and beats BADGE by 2.8 points - while using no model in the loop and about half the compute. On satellite imagery everything is a statistical tie (coverage, BADGE and random all within 0.3 points): with a strong pretrained backbone on an easy, saturating task, no method pulls ahead - but coverage never falls behind either.

The honest read across all four studies: coverage is the best or tied-best method everywhere we tested, and never worse than random, with the clearest wins where labelling budgets are smallest. It matches or beats state-of-the-art active learning without needing a trained model - so it works from the very first label, at lower cost.

Study 5: Does it work for segmentation, not just classification?

Everything above is image classification. But the most expensive labels are pixel masks for segmentation - minutes per image, not seconds. So this is where selecting the right images to annotate matters most. We tested on skin-lesion photos (a medical domain, where our representations are weakest): pick images with ALaaS or at random, fully mask them, train the same U-Net, measure mean IoU. 3 seeds.

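| Masks labelled | ALaaS (coverage) | Random | Difference (points) |
|----------------|------------------|--------|---------------------|
| 20             | 68.0%            | 65.5%  | +2.5                |
| 80             | 76.8%            | 75.7%  | +1.1                |
| 160            | 80.4%            | 78.4%  | +2.0                |
| 320            | 82.4%            | 81.2%  | +1.1                |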

Mean IoU (%), 3 seeds.
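For readers unfamiliar with the metric, here is a minimal sketch of mean IoU for binary masks. It assumes the score is the foreground intersection-over-union averaged over images; the study may instead have averaged over classes. Treating two empty masks as a perfect match is also a convention chosen for this sketch.

```python
def iou(pred, truth):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds, truths):
    """Average per-image IoU over a set of predicted and ground-truth masks."""
    scores = [iou(p, t) for p, t in zip(preds, truths)]
    return sum(scores) / len(scores)


# Two tiny 2x2 examples: one half-right prediction, one exact match.
preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
truths = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
score = mean_iou(preds, truths)
print(score)  # 0.75
```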

Coverage wins here too - a consistent edge, largest where labels are scarcest (+2.5 points of mean IoU at just 20 masks). The gain is smaller than on natural-image tasks, as expected in a medical domain, but it holds: when a labelled mask costs minutes, even a couple of points for free is real money saved. And it confirms the pattern across every task type we have tested: coverage is best or tied-best, never worse than random.

Limitations

Try it on your own data

10,000 free credits are enough to run a study like these on your own images. No card required.