Does smart selection actually work?
Before asking anyone to pay for ALaaS, we measured it. This page summarizes five controlled studies: label efficiency on real photos, robustness across three image domains, a head-to-head comparison with published active learning methods, a scale test with real model training on 20,000-image pools, and a segmentation test with pixel masks. We include the results that did not go our way.
How we test
The protocol is the standard one from the active learning literature: take a labelled dataset and hide the labels. Grow a training set at matched budgets, either by asking ALaaS which images to label next or by picking at random, then train the same classifier on each set and score it on a held-out test split. The selection never sees any labels. Every number below is averaged over several random splits.
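The protocol can be sketched in a few lines. This is a minimal illustration, not the harness used for the studies: the nearest-centroid classifier, the synthetic two-class data and every function name (`run_protocol`, `random_select`, `make_blobs`) are assumptions made for the example. The point it shows is structural: the selector receives features only, labels are revealed only for the items it picks, and every selector is scored with the same classifier on the same held-out split.

```python
import random
from statistics import mean

def fit_centroids(X, y):
    """Toy classifier (an assumption for this sketch): one mean vector per class."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return {c: [mean(col) for col in zip(*rows)] for c, rows in groups.items()}

def accuracy(centroids, X, y):
    """Share of points whose nearest centroid carries the true label."""
    def predict(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
    return mean(1.0 if predict(x) == label else 0.0 for x, label in zip(X, y))

def random_select(pool_X, chosen, k, rng):
    """Baseline selector: k not-yet-chosen pool indices, uniformly at random."""
    remaining = [i for i in range(len(pool_X)) if i not in chosen]
    return rng.sample(remaining, k)

def run_protocol(pool_X, pool_y, test_X, test_y, select, budgets, seed=0):
    """Grow the labelled set to each budget and score on the held-out split."""
    rng = random.Random(seed)
    chosen, scores = [], {}
    for budget in budgets:
        # The selector sees features only; pool_y is never passed to it.
        chosen += select(pool_X, set(chosen), budget - len(chosen), rng)
        # Labels are revealed only for the items that were selected.
        model = fit_centroids([pool_X[i] for i in chosen],
                              [pool_y[i] for i in chosen])
        scores[budget] = accuracy(model, test_X, test_y)
    return scores

def make_blobs(n, rng):
    """Synthetic two-class data standing in for image features."""
    X, y = [], []
    for _ in range(n):
        label = rng.randrange(2)
        X.append([rng.gauss(4.0 * label, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y

data_rng = random.Random(42)
pool_X, pool_y = make_blobs(200, data_rng)
test_X, test_y = make_blobs(100, data_rng)
scores = run_protocol(pool_X, pool_y, test_X, test_y, random_select, budgets=[4, 16])
print(scores)  # accuracy per budget for the random baseline
```

To compare two strategies under this sketch, call `run_protocol` once per selector with the same budgets, repeat over several seeds and data splits, and average the scores per budget.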
Study 1: Same accuracy, 2.5x fewer labels
Real photos, 6 classes. ALaaS reached 98% accuracy with 60 labels; random sampling needed about 150 labels for the same result. The gap is largest exactly where labelling budgets hurt most: at 16 to 40 labels, ALaaS was 5 to 11 accuracy points ahead.
Classifier accuracy vs labelling budget. Mean over 5 splits.
| Labels | ALaaS | Random | Difference |
|---|---|---|---|
| 16 | 67.0% | 55.9% | +11.1 pp |
| 24 | 81.1% | 70.9% | +10.3 pp |
| 40 | 93.2% | 88.0% | +5.2 pp |
| 60 | 98.2% | 96.4% | +1.8 pp |
| 150 | 98.3% | 98.2% | +0.1 pp |
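The 2.5x figure can be read off the table as the ratio of the smallest budgets at which each method reaches the same accuracy. The sketch below reproduces that arithmetic from the table values; the helper name `labels_needed` is an assumption for the example. Because only five budgets were measured, the answer is limited to that grid, which is why random sampling is described as needing "about" 150 labels.

```python
def labels_needed(curve, target):
    """Smallest measured budget whose accuracy reaches `target`, or None."""
    for budget in sorted(curve):
        if curve[budget] >= target:
            return budget
    return None

# Accuracy (%) per labelling budget, copied from the table above.
alaas = {16: 67.0, 24: 81.1, 40: 93.2, 60: 98.2, 150: 98.3}
random_sampling = {16: 55.9, 24: 70.9, 40: 88.0, 60: 96.4, 150: 98.2}

target = 98.2
ratio = labels_needed(random_sampling, target) / labels_needed(alaas, target)
print(ratio)  # 2.5
```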
Study 2: Three domains, one honest answer
The same protocol on three very different datasets: natural photos, satellite land-use tiles and blood cell microscopy. The table shows the mean accuracy gain over random sampling, averaged across all budgets.
| Domain | Dataset | Gain vs random |
|---|---|---|
| Natural photos | 6 classes | +5.1 pp |
| Satellite imagery | 10 land-use classes | +5.6 pp |
| Medical microscopy | 8 blood cell types | +2.1 pp |
The honest part: on blood microscopy the gain is small and noisy. General-purpose vision systems transfer poorly to that domain, and when the underlying signal carries little structure, no selection strategy can do much with it. We say this openly because it defines where ALaaS helps today: natural imagery, aerial and satellite data, retail, industrial inspection and similar domains. Support for domain-specific adaptation is on our roadmap.
A second finding from this study: our explore strategy (which deliberately includes unusual images) beat the default on satellite data, where classes vary a lot internally, but cost accuracy on cleaner data. That is why coverage is the default for training efficiency and explore is opt-in for edge-case hunting and dataset QA.
Study 3: Against published methods
We ran a fair batch comparison against the strongest published active learning baselines: BADGE (Ash et al. 2020), entropy and margin uncertainty sampling, and k-center greedy coreset selection (Sener & Savarese 2018). Same data, same seeds, same classifier, batches of 16, across all three domains.
| Method | Mean accuracy (3 domains) | Works from zero labels |
|---|---|---|
| ALaaS (coverage) | 65.6% | yes |
| BADGE | 65.5% | no |
| Random | 64.7% | yes |
| Margin | 63.1% | no |
| Entropy | 58.3% | no |
| k-center greedy | 48.0% | yes |
ALaaS ties BADGE, the strongest published method we tested, while needing no trained model in the loop. BADGE and the uncertainty methods require an existing model to score images, so they cannot start from zero labels, which is where most labelling projects begin. We also repeated the comparison with a real network fine-tuned from scratch each round instead of a fixed classifier; the ranking held there too, with ALaaS and random leading and the model-in-the-loop methods trailing.
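The "works from zero labels" column follows from what each baseline takes as input. The sketch below uses textbook formulations of entropy scoring, margin scoring and k-center greedy selection, not the code run in the study: the uncertainty scores need a model's predicted class probabilities, while k-center greedy needs only feature vectors. The `start` parameter is an assumption for the cold-start case, where there is no labelled set to seed the selection.

```python
import math

def entropy_score(probs):
    """Shannon entropy of one predicted class distribution; higher = more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def margin_score(probs):
    """Gap between the two most likely classes; smaller = more uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def k_center_greedy(features, k, start=0):
    """Pick k indices; each new pick is the point farthest from all picks so far."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    chosen = [start]
    # Squared distance from every point to its nearest chosen point.
    nearest = [dist2(f, features[start]) for f in features]
    while len(chosen) < k:
        nxt = max(range(len(features)), key=nearest.__getitem__)
        chosen.append(nxt)
        nearest = [min(d, dist2(f, features[nxt])) for d, f in zip(nearest, features)]
    return chosen

# Uncertainty sampling: requires model outputs, so it needs some labels first.
print(entropy_score([0.5, 0.5]), margin_score([0.7, 0.2, 0.1]))

# k-center greedy: runs on features alone, so it can start from zero labels.
print(k_center_greedy([[0, 0], [1, 0], [10, 0], [0, 9]], k=3))  # [0, 2, 3]
```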
Study 4: Does it hold at scale, with real training?
The studies above use a linear probe on a few hundred images. This one is the toughest test we run: pools of 20,000+ images, and at every budget a full neural network fine-tuned from scratch (not a probe), evaluated on a held-out test set. Two domains, 3 seeds, compared against the same published methods. Mean accuracy across budgets (50 to 1,600 labels):
| Method | Photos (20k pool) | Satellite (22k pool) |
|---|---|---|
| ALaaS (coverage) | 69.4% | 83.2% |
| BADGE | 66.6% | 83.3% |
| Margin | 67.8% | 82.0% |
| ALaaS (explore) | 66.9% | 83.1% |
| Random | 66.2% | 83.0% |
On photos, coverage is the best method overall and beats BADGE by 2.8 points - while using no model in the loop and about half the compute. On satellite imagery everything is a statistical tie (coverage, BADGE and random all within 0.3 points): with a strong pretrained backbone on an easy, saturating task, no method pulls ahead - but coverage never falls behind either.
The honest read across all four studies: coverage is the best or tied-best method everywhere we tested, and never worse than random, with the clearest wins where labelling budgets are smallest. It matches or beats state-of-the-art active learning without needing a trained model - so it works from the very first label, at lower cost.
Study 5: Does it work for segmentation, not just classification?
Everything above is image classification. But the most expensive labels are pixel masks for segmentation - minutes per image, not seconds. So this is where selecting the right images to annotate matters most. We tested on skin-lesion photos (a medical domain, where our representations are weakest): pick images with ALaaS or at random, fully mask them, train the same U-Net, measure mean IoU. 3 seeds.
| Masks labelled | ALaaS (coverage) | Random | Difference |
|---|---|---|---|
| 20 | 68.0% | 65.5% | +2.5 pp |
| 80 | 76.8% | 75.7% | +1.1 pp |
| 160 | 80.4% | 78.4% | +2.0 pp |
| 320 | 82.4% | 81.2% | +1.1 pp |
Mean IoU (%), 3 seeds.
Coverage wins here too - a consistent edge, largest where labels are scarcest (+2.5 mean IoU at just 20 masks). The gain is smaller than on natural-image tasks, as expected in a medical domain, but it holds: when a labelled mask costs minutes, even a couple of points for free is real money saved. And it confirms the pattern across every task type we have tested: coverage is best or tied-best, never worse than random.
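For readers unfamiliar with the metric, IoU (intersection over union) is the overlap between the predicted and true mask divided by their combined area. The sketch below is a minimal version for binary masks. Two details are assumptions, since the study description does not state them: IoU is averaged per image, and a pair of empty masks scores 1.0.

```python
def iou(pred, truth):
    """Intersection over union for two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0

def mean_iou(preds, truths):
    """Average per-image IoU over a test set (per-image averaging is an assumption)."""
    return sum(iou(p, t) for p, t in zip(preds, truths)) / len(preds)

pred = [[1, 1], [0, 0]]
truth = [[1, 0], [0, 0]]
print(iou(pred, truth))                          # 0.5
print(mean_iou([pred, truth], [truth, truth]))   # 0.75
```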
Limitations
- Object detection labelling is not evaluated yet; that experiment is planned. Segmentation is covered (Study 5), but on a single medical dataset so far.
- Engineering study, not a benchmark paper: pools from a few hundred (probe studies) up to 20,000+ images (scale study), 3 to 5 seeds. Trends are consistent and reproducible, but modest in scope.
- Gains depend on the data: largest at small budgets and on domains the service handles well, smaller on domains far from everyday imagery, such as microscopy.
Try it on your own data
10,000 free credits are enough to run a study like these on your own images. No card required.