ECCV · 2020 · Springer Lecture Notes · 7 min read

Active learning for connectomics: two streams are better than one when the annotator is a radiologist.

Annotating electron-microscopy brain volumes is agonising — dense, tiny structures, experts paid by the hour. Active learning tries to pick the most informative unlabelled samples for the human to label next. We proposed a two-stream scheme that combines what the classifier already knows with what the data itself looks like.
[Diagram: a huge unlabelled EM volume (~100k patches) feeds two streams. Stream A (task uncertainty, supervised embedder) asks what the classifier is uncertain about; Stream B (data structure, unsupervised embedder) asks what the data looks like in the raw. Both feed a small labelled pool annotated by the radiologist. Stream A says "the model is wrong about this one." Stream B says "and it looks nothing like anything you've labelled."]
Fig. 1 — Two independent signals decide what to label next. The combination is better than either alone.

Connectomics — the attempt to trace every neuron and every synapse in a tissue block — is a labour-intensive field. A cubic millimetre of mouse cortex is on the order of 10^15 electron-microscopy voxels, and meaningful segmentation requires expert-level annotation on enough of them to train a model. Nobody has enough annotator-hours. Active learning is one of the few viable escape valves: instead of labelling randomly, let the model tell you which examples to label next.

Why one-stream active learning falls short

Most active-learning methods use uncertainty from the downstream task model — the most uncertain patch is queried first. It's a reasonable heuristic, but it fails in a specific mode: if the unlabelled pool contains outliers that look nothing like your labelled set, the task model will be highly uncertain about them for the wrong reason (no prior), and you will burn expensive annotation budget on patches that turn out to be noise or artifacts.

The fix is to complement task uncertainty with a second signal computed from unlabelled data alone — something that tells you where the unlabelled example sits in the underlying data distribution. If a patch is highly uncertain and lives in a dense region of the unlabelled data (i.e. it's typical, not exotic), it's worth labelling. If it's highly uncertain and lives in an outlier region, it's probably a waste.
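The paper's exact acquisition rule is more involved, but the idea can be sketched in a few lines. Everything here is illustrative: `task_uncertainty`, `density_score`, and `query_indices` are hypothetical names, and multiplying the two signals is one simple way to combine them, not the paper's formula.

```python
import numpy as np

def task_uncertainty(probs):
    """Stream A: predictive entropy of the task model's softmax outputs."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def density_score(embeddings, k=10):
    """Stream B: how 'typical' each patch is in the unsupervised embedding.

    Uses the mean distance to the k nearest neighbours; a small distance
    means a dense region. Returned as a score in (0, 1], 1 = most typical.
    """
    # All-pairs distances are fine for a sketch; use a KD-tree at scale.
    d = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    knn = np.sort(d, axis=1)[:, 1:k + 1]  # drop the zero self-distance
    return 1.0 / (1.0 + knn.mean(axis=1))

def query_indices(probs, embeddings, budget, k=10):
    """Query patches that are both uncertain (Stream A) AND typical (Stream B)."""
    score = task_uncertainty(probs) * density_score(embeddings, k)
    return np.argsort(score)[::-1][:budget]
```

The product down-weights exactly the failure mode above: an outlier patch can have maximal entropy, but its near-zero density score keeps it out of the query set.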

[Interactive chart: segmentation accuracy after k label queries (0–1000 labels) for random sampling, task-uncertainty only, and the two-stream method. Early in the budget, two-stream blows past the single-stream baseline; late budgets converge, because once you have enough labels, every method is fine. Simulated data, qualitatively matching Fig. 5.]

A detail I like

The unsupervised encoder is trained once, on the full unlabelled volume, using a standard contrastive objective. It is not retrained as more labels arrive. This means the "data-structure" stream is a fixed reference against which the evolving task stream can be compared. We tried retraining it jointly; it gave up a little accuracy and a lot of simplicity. Fixed wins.
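The post says only "a standard contrastive objective" without naming one, so here is a minimal NumPy sketch of a common choice, InfoNCE, assuming paired augmented views of the same patches. The function name and setup are mine, not the paper's; the point is that this objective needs no labels, so the encoder can be trained once on the full volume and then frozen.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Contrastive (InfoNCE) objective, sketched in NumPy.

    z_a, z_b: (N, D) embeddings of two augmented views of the same N patches.
    Row i of z_a should attract row i of z_b and repel all other rows.
    """
    # L2-normalise so the dot product is cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    # Cross-entropy with the diagonal (the true pair) as the target class.
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))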

Active learning fails in the wild not because the uncertainty signal is wrong but because it is one-dimensional. Two dimensions is nearly always enough; what matters is that the second dimension is independent of the first.
−40% labels to reach the same accuracy · Published at ECCV 2020 · Domain: bio-CV

My role

I was a minor collaborator on this one, at the junior-PhD tail of a Harvard connectomics team. Zudi Lin drove the method; I contributed to the experiment design and ran some of the ablations. The project is the reason I believe active-learning research is more about decomposing the uncertainty signal into complementary parts than about inventing new acquisition functions. Most new acquisition functions are rearrangements of the same single signal.
