Nature Communications · 2024 · 11 min read

A subject-agnostic transformer for speech decoding that generalises

Most brain-decoders memorise the brain they were trained on. Every new patient starts at zero. We built a model that doesn't — one that handles surface grids and depth probes and regions it was never shown, without retraining, and talks back in a voice that still sounds like the person.
[Fig. 1 diagram — subject A: 8×8 surface grid · subject B: sEEG depth probe · subject C: mixed · … n subjects, n geometries, n regions … → subject-agnostic transformer → any subject's voice]
Fig. 1 — Three geometries, one model, one voice. The model learns a representation of cortex, not a particular cortex.

Here is a problem I've watched BCI groups quietly live with for a decade. You spend six months collecting ECoG data from one patient, you train a brilliant decoder, you get a great Nature paper, and then a new patient arrives with an electrode grid in a slightly different place, oriented slightly differently, and your six months of work is worth roughly nothing. The decoder starts from scratch. The patient starts from scratch. The field starts from scratch.

This is the paper about the decoder that doesn't.

The problem, stated precisely

Neural recordings are almost pathologically non-shared. Two patients implanted for clinical reasons will have different electrode counts, different hardware (a surface grid here, depth probes there), electrodes in slightly different places at slightly different orientations, and different cortical regions under coverage.

A standard neural network expects a fixed input shape. So almost every decoder in the literature is implicitly single-subject — trained per-patient, discarded at end of monitoring, re-trained on the next one. This is not a research gap. This is a crater.

Our answer, in one diagram

We stopped thinking of the inputs as "images" of cortex. We started thinking of them as sets of tokens, each token representing one electrode's activity and carrying, alongside the signal itself, the electrode's approximate location in a common MNI-style reference space. A transformer ingests tokens natively; a transformer does not care how many tokens you hand it; and a transformer can learn to treat location as a feature rather than a given.

Once you frame it that way, subject-specific shape goes away. The model learns a language of electrode-tokens, and the language is shared across all subjects.
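To make the framing concrete, here is a minimal sketch of the electrode-token idea. The function name and feature dimensions are my own illustration, not the paper's API: each electrode becomes one token carrying its signal features plus its approximate MNI coordinates, and the token count is free to vary between subjects.

```python
import numpy as np

def electrodes_to_tokens(signals, mni_coords):
    """Turn one subject's recording into a variable-length token set.

    signals:    (n_electrodes, n_features) per-electrode activity features
    mni_coords: (n_electrodes, 3) approximate locations in a shared
                MNI-style reference space
    Returns (n_electrodes, n_features + 3) tokens. n_electrodes may
    differ freely between subjects -- a transformer ingests any count.
    """
    return np.concatenate([signals, mni_coords], axis=1)

rng = np.random.default_rng(0)
# Subject A: an 8x8 surface grid -> 64 tokens
tok_a = electrodes_to_tokens(rng.standard_normal((64, 16)), rng.standard_normal((64, 3)))
# Subject B: a sparse depth probe -> 12 tokens, in the same token language
tok_b = electrodes_to_tokens(rng.standard_normal((12, 16)), rng.standard_normal((12, 3)))
```

Both subjects now speak the same language of tokens; only the number of tokens differs, and that is something a transformer handles natively.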

Interactive · drop electrodes, watch the model still decode
[Reconstructed spectrogram · quality score: 0.81]
Slide the electrode count, toggle depth probes. The model produces a usable output at any count > 8, and quality degrades gracefully instead of collapsing.
Simulated · matches the paper's degradation profile qualitatively

Three things that mattered in training

One — coordinate embeddings. Each electrode token gets a small positional embedding derived from its MNI coordinates. This is what lets the model "know where" an input is coming from even when it has never seen that exact location before. We briefly tried learned per-electrode embeddings; that version generalised worse, because it memorised rather than inferred.
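The paper says only that the embedding is "derived from MNI coordinates"; one plausible instantiation (my assumption, not the paper's stated design) is sinusoidal Fourier features of the 3-D coordinate, which map nearby locations to nearby embeddings and so force interpolation rather than memorisation:

```python
import numpy as np

def coord_embedding(mni_xyz, n_freqs=4):
    """Sinusoidal embedding of one electrode's 3-D MNI coordinate.

    A hypothetical instantiation of a 'small positional embedding
    derived from MNI coordinates': Fourier features at n_freqs
    frequencies per axis. Output dim = 3 axes * n_freqs * 2 (sin, cos).
    """
    xyz = np.asarray(mni_xyz, dtype=float) / 100.0   # MNI coords span roughly +/-100 mm
    freqs = 2.0 ** np.arange(n_freqs)                # 1, 2, 4, 8 cycles over the brain
    angles = xyz[:, None] * freqs[None, :] * np.pi   # (3, n_freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1).ravel()

emb = coord_embedding([-48.0, -12.0, 30.0])          # a location near sensorimotor cortex
```

The key property is continuity: an electrode at a never-seen coordinate still gets a meaningful embedding, which is exactly what a learned per-electrode table cannot provide.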

Two — masked pretraining. We pretrained on a reconstruction task — given a random subset of tokens, predict the rest — before ever touching speech. This trained the model to treat the geometry as interpolable. It also made the downstream task data-efficient by an embarrassing margin.
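A BERT/MAE-style masking step for electrode tokens might look like the sketch below (names and the 50% mask fraction are illustrative assumptions; the loss would be computed only on the hidden positions):

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_tokens(tokens, mask_frac=0.5):
    """Masked-reconstruction pretraining: hide a random subset of
    electrode tokens; the model must predict them from the visible rest.

    Returns (visible, targets, mask). Training on this task teaches the
    model that electrode geometry is interpolable.
    """
    n = tokens.shape[0]
    mask = rng.random(n) < mask_frac
    return tokens[~mask], tokens[mask], mask

tokens = rng.standard_normal((64, 19))   # one subject's token set
visible, targets, mask = mask_tokens(tokens)
```

Because the task never mentions speech, it can be run on any pooled neural data before the (scarcer) speech-labelled data is touched.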

Three — grid-dropout during training. We randomly dropped entire rows or depth-probe segments during training. The model learned to gracefully degrade rather than fail, which is exactly the thing you need when the test-time patient doesn't have the same coverage as training.
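Unlike the i.i.d. channel dropout most pipelines use, the dropout here is structured: whole rows or probe segments vanish at once. A minimal sketch, with hypothetical names and a 25% row-drop rate of my choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

def grid_dropout(tokens, row_ids, p_drop=0.25):
    """Structured electrode dropout: with probability p_drop, remove ALL
    electrodes sharing a row (or depth-probe segment) id, so the model
    trains on realistic coverage gaps, not scattered missing channels."""
    rows = np.unique(row_ids)
    keep_rows = rows[rng.random(rows.size) >= p_drop]
    keep = np.isin(row_ids, keep_rows)
    # never drop everything: fall back to the full set
    return tokens[keep] if keep.any() else tokens

row_ids = np.repeat(np.arange(8), 8)   # an 8x8 grid: 8 rows of 8 electrodes
tokens = rng.standard_normal((64, 19))
kept = grid_dropout(tokens, row_ids)
```

The test-time benefit is exactly the graceful degradation in the interactive figure above: the model has already seen thousands of partial geometries during training.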

43 subjects pooled · 0.78 PCC on an unseen subject · 0 retraining required

What "subject-agnostic" does not mean

It does not mean the model ignores the subject. It means the model adapts to the subject at inference time, without weight updates, by consuming the subject's electrode coordinates as part of the input. An analogy: a fluent English reader doesn't need to re-learn English for each new page of text, but they do attend to the specific words in front of them. Our transformer does the same thing with electrodes.

"Subject-agnostic" is a marketing name. "Subject-conditioned without finetuning" is the accurate technical description. I argued for the accurate name. I lost to the marketing name.
PCC by held-out subject · higher = better
Per-subject (train from scratch): 0.52
Finetune shared backbone: 0.68
Ours · zero-shot on held-out subject: 0.78
Ours · with 10 min of subject data: 0.90
Fig. 2 — Zero-shot beats finetuning a standard shared backbone. A small calibration window closes the remaining gap.

Why I think this changes the BCI data calculus

If decoders generalise, data stops being subject-private and starts being pooled. That changes everything downstream. Collection economics improve because every subject's data benefits every future subject. Clinical deployment improves because a new patient gets a working decoder on day one. And — here is the part I have spent a lot of time thinking about — patients have less reason to worry about their private neural data being "used" in a way that only helps strangers; under a pooling model, their data helps them too.

None of this is automatic. It depends on the pooled model being maintained, updated, and governed; and the governance question is harder than the technical question. But the technical question had been the bottleneck, and I think with this paper the bottleneck has moved.
