This was my first paper as a PhD student, and it is the one that taught me the single most useful technical habit I have: check whether the task your data supports is actually the task you are trying to solve. ECoG-paired-with-speech datasets are small. You cannot train a good speech generator on a few hours of data. But you do not need to train a speech generator on ECoG — you need to train a map from ECoG to a speech space that already exists.
So we built the speech space with a GAN on freely-available natural speech, took its generator's latent manifold as our target, and then trained only the small ECoG→latent encoder on the paired ECoG data we actually had. Instead of data-hungry end-to-end training, we had two small regression problems and one large pretraining problem solved for us by the audio community.
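The split described above can be sketched end to end with stand-ins. Everything here is hypothetical: a fixed random map plays the role of the pretrained GAN generator, simulated trials play the role of paired ECoG recordings, and the "small regression problem" is ridge regression. The point is the shape of the pipeline, not the models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (stand-in): a frozen "generator" mapping 32-d latent codes to
# 128-d speech features. In the paper this is a GAN generator pretrained
# on plentiful natural speech; here a fixed smooth random map suffices.
G = rng.normal(size=(32, 128)) / np.sqrt(32)

def generate(z):
    return np.tanh(z @ G)          # frozen; never sees any ECoG data

# Stage 2: the small paired dataset. Simulate ECoG as a noisy linear
# view of the true latent code (a loud assumption, for illustration).
A = rng.normal(size=(64, 32))      # hidden ECoG mixing, unknown to us
z_true = rng.normal(size=(200, 32))            # 200 paired trials
ecog = z_true @ A.T + 0.1 * rng.normal(size=(200, 64))

# Train ONLY the ECoG -> latent encoder, here via ridge regression.
lam = 1e-2
W = np.linalg.solve(ecog.T @ ecog + lam * np.eye(64), ecog.T @ z_true)

# Decode: push predicted latents through the untouched generator.
z_hat = ecog @ W
err = np.mean((generate(z_hat) - generate(z_true)) ** 2)
```

The division of labour is the whole trick: the expensive object (`generate`) is fixed before any paired data is seen, so the paired data only has to pin down the comparatively tiny encoder `W`.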
Why this worked
The generator's latent space is organised by acoustic content. It is already smooth: nearby points in latent space produce acoustically similar speech. That smoothness is exactly the property you need downstream — a small imprecision in the ECoG encoder's output produces a small imprecision in the synthesised speech, not a catastrophic jump to a different utterance. This was not guaranteed, but it held empirically and it is why the paper even had numbers worth publishing.
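The smoothness claim can be checked numerically. Below, a smooth random map again stands in for the pretrained generator (a hypothetical substitute, not the paper's model): perturbing a latent code slightly should change the output by an amount comparable to the perturbation, which is what makes encoder error degrade speech gracefully rather than jump to a different utterance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a pretrained generator: a smooth map from a 32-d latent
# space to 128-d speech features (hypothetical; the real generator is a
# GAN trained on natural speech).
G = rng.normal(size=(32, 128)) / np.sqrt(32)

def generate(z):
    return np.tanh(z @ G)

# Small latent perturbation -> small output change.
z = rng.normal(size=(32,))
eps = 1e-3 * rng.normal(size=(32,))
drift = np.linalg.norm(generate(z + eps) - generate(z))
ratio = drift / np.linalg.norm(eps)   # local sensitivity estimate
```

A bounded `ratio` is exactly the property the downstream encoder relies on; a generator without it would turn small ECoG-decoding errors into arbitrary utterances.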
What I got wrong
Two things. First, I over-indexed on generator quality as the limiting factor. In reality the ECoG encoder was the bottleneck — the generator's output was already better than anything my encoder could exploit. Second, I assumed speaker identity would come "for free" from the generator's latent space. It did not; speaker identity lives in a disentangled direction we didn't have supervision for. That problem got solved properly four years later in paper 01.
The best thing I did in this paper was pretend I had 1000x more data than I did, by outsourcing the expensive half of the learning to a task other people had already solved.
What it taught me about research
In every subsequent paper I have written, I start by asking: is there a free version of the hard part? Can I borrow an embedding space, a generator, a classifier, a representation — and then only learn the narrow subject-specific mapping into it? More often than not the answer is yes, and that answer is why our labs' downstream work became feasible.