IEEE ISBI · 2020 · 7 min read

Stimulus speech decoding with GAN transfer learning: the paper where I learned to cheat honestly.

Reconstructing heard speech from cortex is a chicken-and-egg problem: you don't have enough ECoG-paired audio to train a good decoder, and you can't pretrain on something else because "something else" doesn't exist. We used a GAN trained on natural speech as a prior, then borrowed its decoder. Small ECoG dataset, credible reconstructions.
[Diagram] Stage 1: pretrain a GAN (G, D) on lots of speech. Stage 2: freeze G, attach an ECoG encoder — Enc → G (frozen) → decoded spectrogram. The trick: the hard, data-heavy part of the model was learned from free data.
Fig. 1 — GAN on speech first, freeze, then learn the tiny ECoG→latent encoder on what little ECoG you have.

This was my first paper as a PhD student, and it is the one that taught me the single most useful technical habit I have: check whether the task your data supports is actually the task you are trying to solve. ECoG-paired-with-speech datasets are small. You cannot train a good speech generator on a few hours of data. But you do not need to train a speech generator on ECoG — you need to train a map from ECoG to a speech space that already exists.

So we built the speech space with a GAN on freely-available natural speech, took its generator's latent manifold as our target, and then trained only the small ECoG→latent encoder on the paired ECoG data we actually had. Instead of data-hungry end-to-end training, we had two small regression problems and one large pretraining problem solved for us by the audio community.
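The two-stage recipe is easy to sketch. Below is a minimal, hypothetical numpy stand-in (not the paper's actual model): a frozen random linear map plays the role of the pretrained GAN generator, the paired data is synthetic, and only the small ECoG→latent encoder is fit by gradient descent — G never updates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 stand-in: a frozen "generator" G mapping latent -> spectrogram.
# In the paper G is a GAN generator pretrained on abundant natural speech;
# here a fixed random linear map plays that role.
latent_dim, spec_dim, ecog_dim = 8, 32, 16
G = rng.normal(size=(spec_dim, latent_dim))          # frozen after "pretraining"

# Tiny paired dataset, as in the real setting: ECoG features -> heard spectrogram.
# (Synthetic: targets are generated through G, so a good encoder exists.)
n = 200
true_map = rng.normal(size=(latent_dim, ecog_dim))   # unknown ECoG->latent map
ecog = rng.normal(size=(n, ecog_dim))
target = ecog @ true_map.T @ G.T + 0.01 * rng.normal(size=(n, spec_dim))

# Stage 2: gradient-descend ONLY the encoder; G stays constant throughout.
Enc = np.zeros((latent_dim, ecog_dim))
lr = 1e-3
for _ in range(2000):
    spec = ecog @ Enc.T @ G.T                        # ECoG -> latent -> speech
    err = spec - target
    grad = (2.0 / n) * G.T @ err.T @ ecog            # dMSE/dEnc (G is a constant)
    Enc -= lr * grad

mse = float(np.mean((ecog @ Enc.T @ G.T - target) ** 2))
print(f"reconstruction MSE after stage-2 training: {mse:.5f}")
```

The design choice the sketch makes visible: all the trainable parameters live in the small encoder, so the amount of paired ECoG data only has to support a narrow regression, not speech synthesis itself.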

Why this worked

The generator's latent space is organised by acoustic content. It is already smooth: nearby points in latent space produce acoustically similar speech. That smoothness is exactly the property you need downstream — a small imprecision in the ECoG encoder's output produces a small imprecision in the synthesised speech, not a catastrophic jump to a different utterance. This was not guaranteed, but it held empirically and it is why the paper even had numbers worth publishing.
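That smoothness claim can be probed directly: sample a latent point, nudge it, and check that the decoded output moves proportionally. A toy numpy sketch, with a single tanh layer standing in for the real generator (whose smoothness we only verified empirically):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the generator: z -> tanh(W z). A real GAN generator is a
# deep nonlinear net; the property probed is the same one we relied on:
# nearby latents should decode to nearby spectrograms, with no sudden jumps.
W = rng.normal(size=(32, 8))

def generate(z):
    return np.tanh(z @ W.T)

# |tanh'| <= 1, so ||generate(z+dz) - generate(z)|| <= ||W||_2 * ||dz||.
lipschitz_bound = np.linalg.norm(W, 2)

z = rng.normal(size=8)
ratios = []
for eps in (1e-1, 1e-2, 1e-3):
    dz = eps * rng.normal(size=8)
    ratio = np.linalg.norm(generate(z + dz) - generate(z)) / np.linalg.norm(dz)
    ratios.append(ratio)
    print(f"||dz||={np.linalg.norm(dz):.4f}  output/input change ratio={ratio:.3f}")

# Every ratio stays below the bound: a small encoder error in latent space can
# only cause a bounded error in the decoded speech.
```

For a real generator you would estimate this empirically along many random directions, since no closed-form Lipschitz constant is available.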

Interactive · how much ECoG data would a fully end-to-end model need?
[Interactive plot] reconstruction quality (PCC) vs paired ECoG data available (hours), comparing end-to-end (scratch) against GAN-pretrained (ours).
Below ~5h of paired data — the regime most labs live in — pretrained models crush end-to-end ones. Above 15h it stops mattering.
Qualitative shape · matches Fig. 4

What I got wrong

Two things. First, I over-indexed on generator quality as the limiting factor. In reality the ECoG encoder was the bottleneck — the generator was already better than my encoder could exploit. Second, I assumed speaker identity would come "for free" from the generator's latent space. It did not; speaker identity lives along latent directions we had no supervision to disentangle. That problem got solved properly four years later in paper 01.

The best thing I did in this paper was pretend I had 1000x more data than I had, by outsourcing the expensive half of the learning to a task other people had already solved.
Numbers: 4h of paired ECoG used · +32% PCC vs scratch · 2020, year of suffering.

What it taught me about research

In every subsequent paper I have written, I start by asking: is there a free version of the hard part? Can I borrow an embedding space, a generator, a classifier, a representation — and then only learn the narrow subject-specific mapping into it? More often than not the answer is yes, and that answer is why our lab's downstream work became feasible.
