If you have watched fMRI visual decoding evolve over the last two years, you have watched an arms race. Every new model is bigger, has more per-subject parameters, and claims a couple more points on retrieval accuracy. The state of the art could reconstruct strikingly good images — at the cost of training a roughly GPT-2-sized network per subject, usually on 30+ hours of carefully collected scan data.
VoxelFormer is my co-authors' attempt to ask whether the arms race is, technically, necessary.
The idea, smaller than it looks
Standard pipelines treat fMRI decoding as a per-subject adapter problem. Voxels for subject A form a fixed vector of some length N_A; a big neural network maps that vector into CLIP space, and images are retrieved from there. Voxels for subject B form a different fixed vector of length N_B, so you need a different adapter.
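Concretely, the per-subject baseline reduces to a learned map from a fixed-length voxel vector into CLIP space, followed by cosine retrieval. A minimal numpy sketch, with a single linear map standing in for the big per-subject network; all sizes, weights, and names below are illustrative placeholders, not any paper's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: subject A has N_A voxels; CLIP space has d dimensions.
N_A, d, n_images = 5000, 512, 100

# The per-subject "adapter". In practice a large trained network; here a
# random linear map, enough to show the shapes involved.
W_A = rng.normal(size=(d, N_A)) / np.sqrt(N_A)

def decode_and_retrieve(voxels, image_bank):
    """Map one subject's voxel vector into CLIP space, retrieve by cosine."""
    z = W_A @ voxels                       # adapter: voxels -> CLIP space
    z /= np.linalg.norm(z)
    bank = image_bank / np.linalg.norm(image_bank, axis=1, keepdims=True)
    sims = bank @ z                        # cosine similarity to each image
    return int(np.argmax(sims))

image_bank = rng.normal(size=(n_images, d))   # precomputed image embeddings
voxels = rng.normal(size=N_A)                 # one scan's voxel activations
best = decode_and_retrieve(voxels, image_bank)
```

The structural problem is visible in the first line of the function: `W_A` is welded to subject A's voxel count, so every new subject means a new adapter.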
The pattern should be familiar from our subject-agnostic transformer. Stop treating the input as a fixed-shape vector. Treat it as a set. Each voxel is a token, each token has a positional embedding (MNI coordinates + ROI identity), the transformer eats sets natively, and a single shared model handles everyone.
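A minimal sketch of the set view, in plain numpy rather than a real training framework. Every weight below is shared across subjects, and nothing in the forward pass depends on the voxel count; the sizes, ROI vocabulary, and single attention layer are illustrative placeholders, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                        # model width (illustrative)

# Shared weights: none of these depend on any subject's voxel count.
W_val = rng.normal(size=(d, 1)) * 0.1         # embeds a voxel's activation
W_pos = rng.normal(size=(d, 3)) * 0.1         # embeds its MNI (x, y, z)
roi_table = rng.normal(size=(50, d)) * 0.1    # one learned vector per ROI id
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

def encode(values, coords, roi_ids):
    """values: (N,), coords: (N, 3), roi_ids: (N,) -- any N works."""
    # One token per voxel: activation + positional + ROI embeddings.
    tokens = values[:, None] @ W_val.T + coords @ W_pos.T + roi_table[roi_ids]
    # A single self-attention layer over the set.
    q, k, v = tokens @ W_q, tokens @ W_k, tokens @ W_v
    att = q @ k.T / np.sqrt(d)
    att = np.exp(att - att.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)
    return (att @ v).mean(axis=0)             # pool the set to one embedding

# Subjects with different voxel counts go through identical weights.
z_a = encode(rng.normal(size=3000),
             rng.normal(size=(3000, 3)), rng.integers(0, 50, 3000))
z_b = encode(rng.normal(size=4200),
             rng.normal(size=(4200, 3)), rng.integers(0, 50, 4200))
```

The point the sketch makes: subjects with 3,000 and 4,200 voxels flow through the same parameters and come out as embeddings of the same shape, which is what lets one model serve everyone.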
Why this is a real gain, not a smaller-model stunt
The headline number in the paper — competitive retrieval, ~30× fewer parameters — could easily sound like a compression result. It isn't. The interesting property is not that the model is small; it's that the model is shared. Three consequences follow that compression alone would not buy:
- Data from subject A helps subject B. A pooled model internalises generalities — visual cortex organisation, retinotopy, category-selective regions — that no single-subject adapter can learn. Adding more subjects improves everyone's decode.
- New subjects need less calibration. With 1–2 hours of scan time, a new subject's retrieval is already usable. The per-subject pipeline would need 10× that.
- The model is portable across scanners. Because coordinates are an explicit input, a subject scanned in a different protocol with different TR and slightly different coverage is still in-distribution.
Pooling is not a compression trick. It is the thing that makes visual decoding feasible for people who have access to one hour of 7T time, not one month. That is most labs, most patients, most of the future.
What we had to be careful about
The cross-subject generalisation is not free. Three failure modes we had to engineer around:
Category bias. The NSD dataset, like most fMRI visual datasets, is unevenly distributed across image categories. A pooled model will happily overfit the categories that dominate — faces, indoor scenes — and underperform on the rare ones. We added a class-balanced contrastive loss; it cost a little absolute accuracy and bought a lot of calibration.
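The shape of a class-balanced contrastive loss, sketched in numpy: a standard InfoNCE term over matched (brain, image) pairs, with each pair's term reweighted by the inverse frequency of its image category so that dominant categories cannot swamp the gradient. This is an illustration of the idea under my own naming, not the paper's exact loss:

```python
import numpy as np

def balanced_info_nce(brain_z, image_z, labels, tau=0.07):
    """InfoNCE over matched (brain, image) pairs; each pair's loss term
    is weighted by the inverse frequency of its image category."""
    b = brain_z / np.linalg.norm(brain_z, axis=1, keepdims=True)
    v = image_z / np.linalg.norm(image_z, axis=1, keepdims=True)
    logits = b @ v.T / tau                           # (n, n) similarities
    m = logits.max(axis=1, keepdims=True)            # stabilise the softmax
    log_p = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    per_pair = -np.diag(log_p)                       # NLL of the matched image
    _, inv, counts = np.unique(labels, return_inverse=True, return_counts=True)
    w = 1.0 / counts[inv]                            # rare categories weigh more
    w /= w.sum()
    return float(w @ per_pair)

rng = np.random.default_rng(0)
brain = rng.normal(size=(8, 16))
image = rng.normal(size=(8, 16))
cats = np.array([0, 0, 0, 0, 0, 1, 1, 2])            # imbalanced categories
loss = balanced_info_nce(brain, image, cats)
```

With uniform categories the weights collapse to 1/n and this reduces to ordinary InfoNCE, which is the sanity check to run before trusting any reweighting.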
Subject leakage. A transformer trained on many subjects can, in principle, learn a subject-recognition shortcut from voxel statistics alone. We tested for this carefully: a classifier trained to predict subject identity from VoxelFormer's internal representations only reaches chance+3%, which is reassuring. We didn't expect it to reach chance; a shared voxel-space model that is completely subject-invariant is probably not decoding anything useful.
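The leakage test is a probe on frozen representations. A toy version, with a nearest-centroid classifier standing in for whatever probe one would actually train, and all names and sizes hypothetical:

```python
import numpy as np

def subject_probe_accuracy(train_z, train_sub, test_z, test_sub):
    """Nearest-centroid probe: how well do frozen embeddings reveal
    subject identity? Near chance means little subject leakage."""
    subs = np.unique(train_sub)
    centroids = np.stack([train_z[train_sub == s].mean(axis=0) for s in subs])
    d2 = ((test_z[:, None, :] - centroids[None]) ** 2).sum(axis=-1)
    pred = subs[np.argmin(d2, axis=1)]               # closest subject centroid
    return float((pred == test_sub).mean())

rng = np.random.default_rng(0)
# Embeddings with no subject signal at all: the probe should hover near
# chance (1/4 for four subjects), never at zero.
z = rng.normal(size=(400, 32))
sub = rng.integers(0, 4, size=400)
acc = subject_probe_accuracy(z[:300], sub[:300], z[300:], sub[300:])
```

Running the same probe on raw voxel features, where subject identity is trivially decodable, gives the upper reference point; chance gives the lower one, and a shared model should sit close to the latter.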
Anatomical variability. MNI coordinates are a shared space in principle, but individual cortex is folded differently. We projected each subject's voxels onto a common surface mesh before tokenising. That cost a week of preprocessing and paid for itself several times over.
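A toy version of the surface step: snap each voxel to its nearest vertex on a shared mesh and average collisions. Real pipelines use proper projection tools (e.g. FreeSurfer's mri_vol2surf) rather than this brute-force nearest-neighbor pass, which exists only to show the data flow:

```python
import numpy as np

def project_to_surface(voxel_coords, voxel_vals, mesh_vertices):
    """Assign each voxel's value to its nearest mesh vertex, averaging
    when several voxels land on the same vertex. Returns the per-vertex
    values and a mask of vertices that received any voxel."""
    # Pairwise squared distances: (n_voxels, n_vertices).
    d2 = ((voxel_coords[:, None, :] - mesh_vertices[None]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)                   # voxel -> vertex index
    out = np.zeros(len(mesh_vertices))
    hits = np.zeros(len(mesh_vertices))
    np.add.at(out, nearest, voxel_vals)           # accumulate values
    np.add.at(hits, nearest, 1.0)                 # count collisions
    mask = hits > 0
    out[mask] /= hits[mask]                       # average per vertex
    return out, mask

rng = np.random.default_rng(0)
vox_xyz = rng.normal(size=(200, 3))               # toy voxel coordinates
vox_val = rng.normal(size=200)                    # toy activations
mesh = rng.normal(size=(50, 3))                   # toy shared mesh
surf, covered = project_to_surface(vox_xyz, vox_val, mesh)
```

Once every subject's data lives on the same mesh, vertex index doubles as a shared positional identity, which is what makes the pooled tokenisation coherent across differently folded brains.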
What I wish we had done differently
We did not, in this paper, do the obvious thing: train VoxelFormer on all the public fMRI visual datasets at once. We trained on NSD. The "pooling across datasets" experiment is in the roadmap and, if it works, I expect the next paper after this one to dissolve the last major single-dataset assumption in fMRI visual decoding. If you are reading this in 2026 and someone has done it, they probably did a better job than we would have.
I did not lead this work — Chenqian did, and did it beautifully. What I contributed was the transformer-of-sets framing, inherited from the ECoG paper, and the dogged insistence that "parameter-efficient" is not a headline on its own; it has to change what experiments are possible. That insistence got the paper written, I think, in a usefully different shape than it otherwise might have been.