Nature Communications · 2024 · 11 min read

A subject-agnostic transformer for speech decoding that generalises

Most brain-decoders memorise the brain they were trained on. Every new patient starts at zero. We built a model that doesn't — one that handles surface grids and depth probes and regions it was never shown, without retraining, and talks back in a voice that still sounds like the person.
[Fig. 1 diagram — subject A: 8×8 surface grid · subject B: sEEG depth probe · subject C: mixed · … n subjects, n geometries, n regions … → subject-agnostic transformer → any subject's voice]
Fig. 1 — Three geometries, one model, one voice. The model learns a representation of cortex, not a particular cortex.

Here is a problem I've watched BCI groups quietly live with for a decade. You spend six months collecting ECoG data from one patient, you train a brilliant decoder, you get a great Nature paper, and then a new patient arrives with an electrode grid in a slightly different place, oriented slightly differently, and your six months of work is worth roughly nothing. The decoder starts from scratch. The patient starts from scratch. The field starts from scratch.

This is the paper about the decoder that doesn't.

The problem, stated precisely

Neural recordings are almost pathologically non-shared. Two patients implanted for clinical reasons will have different electrode counts, different hardware (a surface grid here, depth probes there), electrodes in slightly different places at slightly different orientations, and different cortical regions under coverage.

A standard neural network expects a fixed input shape. So almost every decoder in the literature is implicitly single-subject — trained per-patient, discarded at end of monitoring, re-trained on the next one. This is not a research gap. This is a crater.

Our answer, in one diagram

We stopped thinking of the inputs as "images" of cortex. We started thinking of them as sets of tokens, each token representing one electrode's activity and carrying, alongside the signal itself, the electrode's approximate location in a common MNI-style reference space. A transformer ingests tokens natively; a transformer does not care how many tokens you hand it; and a transformer can learn to treat location as a feature rather than a given.

Once you frame it that way, subject-specific shape goes away. The model learns a language of electrode-tokens, and the language is shared across all subjects.
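To make the framing concrete, here is a minimal sketch of the electrode-token idea. The function name and feature dimensions are my own illustration, not the paper's API: each electrode becomes one token carrying its signal features plus its approximate MNI coordinates, and the token count is free to vary between subjects.

```python
import numpy as np

def electrodes_to_tokens(signals, mni_coords):
    """Turn one subject's recording into a variable-length token set.

    signals:    (n_electrodes, n_features) per-electrode activity features
    mni_coords: (n_electrodes, 3) approximate locations in a shared
                MNI-style reference space
    Returns (n_electrodes, n_features + 3) tokens. n_electrodes may
    differ freely between subjects -- a transformer ingests any count.
    """
    return np.concatenate([signals, mni_coords], axis=1)

rng = np.random.default_rng(0)
# Subject A: an 8x8 surface grid -> 64 tokens
tok_a = electrodes_to_tokens(rng.standard_normal((64, 16)), rng.standard_normal((64, 3)))
# Subject B: a sparse depth probe -> 12 tokens, in the same token language
tok_b = electrodes_to_tokens(rng.standard_normal((12, 16)), rng.standard_normal((12, 3)))
```

Both subjects now speak the same language of tokens; only the number of tokens differs, and that is something a transformer handles natively.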

Interactive · drop electrodes, watch the model still decode
[Reconstructed spectrogram · quality score: 0.81]
Slide the electrode count, toggle depth probes. The model produces a usable output at any count > 8, and quality degrades gracefully instead of collapsing.
Simulated · matches the paper's degradation profile qualitatively

Three things that mattered in training

One — coordinate embeddings. Each electrode token gets a small positional embedding derived from its MNI coordinates. This is what lets the model "know where" an input is coming from even when it has never seen that exact location before. We briefly tried learned per-electrode embeddings; that version generalised worse, because it memorised rather than inferred.
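The paper says only that the embedding is "derived from MNI coordinates"; one plausible instantiation (my assumption, not the paper's stated design) is sinusoidal Fourier features of the 3-D coordinate, which map nearby locations to nearby embeddings and so force interpolation rather than memorisation:

```python
import numpy as np

def coord_embedding(mni_xyz, n_freqs=4):
    """Sinusoidal embedding of one electrode's 3-D MNI coordinate.

    A hypothetical instantiation of a 'small positional embedding
    derived from MNI coordinates': Fourier features at n_freqs
    frequencies per axis. Output dim = 3 axes * n_freqs * 2 (sin, cos).
    """
    xyz = np.asarray(mni_xyz, dtype=float) / 100.0   # MNI coords span roughly +/-100 mm
    freqs = 2.0 ** np.arange(n_freqs)                # 1, 2, 4, 8 cycles over the brain
    angles = xyz[:, None] * freqs[None, :] * np.pi   # (3, n_freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1).ravel()

emb = coord_embedding([-48.0, -12.0, 30.0])          # a location near sensorimotor cortex
```

The key property is continuity: an electrode at a never-seen coordinate still gets a meaningful embedding, which is exactly what a learned per-electrode table cannot provide.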

Two — masked pretraining. We pretrained on a reconstruction task — given a random subset of tokens, predict the rest — before ever touching speech. This trained the model to treat the geometry as interpolable. It also made the downstream task data-efficient by an embarrassing margin.
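A BERT/MAE-style masking step for electrode tokens might look like the sketch below (names and the 50% mask fraction are illustrative assumptions; the loss would be computed only on the hidden positions):

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_tokens(tokens, mask_frac=0.5):
    """Masked-reconstruction pretraining: hide a random subset of
    electrode tokens; the model must predict them from the visible rest.

    Returns (visible, targets, mask). Training on this task teaches the
    model that electrode geometry is interpolable.
    """
    n = tokens.shape[0]
    mask = rng.random(n) < mask_frac
    return tokens[~mask], tokens[mask], mask

tokens = rng.standard_normal((64, 19))   # one subject's token set
visible, targets, mask = mask_tokens(tokens)
```

Because the task never mentions speech, it can be run on any pooled neural data before the (scarcer) speech-labelled data is touched.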

Three — grid-dropout during training. We randomly dropped entire rows or depth-probe segments during training. The model learned to gracefully degrade rather than fail, which is exactly the thing you need when the test-time patient doesn't have the same coverage as training.
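Unlike the i.i.d. channel dropout most pipelines use, the dropout here is structured: whole rows or probe segments vanish at once. A minimal sketch, with hypothetical names and a 25% row-drop rate of my choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

def grid_dropout(tokens, row_ids, p_drop=0.25):
    """Structured electrode dropout: with probability p_drop, remove ALL
    electrodes sharing a row (or depth-probe segment) id, so the model
    trains on realistic coverage gaps, not scattered missing channels."""
    rows = np.unique(row_ids)
    keep_rows = rows[rng.random(rows.size) >= p_drop]
    keep = np.isin(row_ids, keep_rows)
    # never drop everything: fall back to the full set
    return tokens[keep] if keep.any() else tokens

row_ids = np.repeat(np.arange(8), 8)   # an 8x8 grid: 8 rows of 8 electrodes
tokens = rng.standard_normal((64, 19))
kept = grid_dropout(tokens, row_ids)
```

The test-time benefit is exactly the graceful degradation in the interactive figure above: the model has already seen thousands of partial geometries during training.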

43 subjects pooled · 0.78 PCC on an unseen subject · 0 retraining required

What "subject-agnostic" does not mean

It does not mean the model ignores the subject. It means the model adapts to the subject at inference time, without weight updates, by consuming the subject's electrode coordinates as part of the input. An analogy: a fluent English reader doesn't need to re-learn English for each new page of text, but they do attend to the specific words in front of them. Our transformer does the same thing with electrodes.

"Subject-agnostic" is a marketing name. "Subject-conditioned without finetuning" is the accurate technical description. I argued for the accurate name. I lost to the marketing name.
PCC by held-out subject · higher = better
Per-subject (train from scratch): 0.52
Finetune shared backbone: 0.68
Ours · zero-shot on held-out subject: 0.78
Ours · with 10 min of subject data: 0.90
Fig. 2 — Zero-shot beats finetuning a standard shared backbone. A small calibration window closes the remaining gap.

Why I think this changes the BCI data calculus

If decoders generalise, data stops being subject-private and starts being pooled. That changes everything downstream. Collection economics improve because every subject's data benefits every future subject. Clinical deployment improves because a new patient gets a working decoder on day one. And — here is the part I have spent a lot of time thinking about — patients have less reason to worry about their private neural data being "used" in a way that only helps strangers; under a pooling model, their data helps them too.

None of this is automatic. It depends on the pooled model being maintained, updated, and governed; and the governance question is harder than the technical question. But the technical question had been the bottleneck, and I think with this paper the bottleneck has moved.
