Nature Machine Intelligence · 2024 · 12 min read

A neural speech decoding framework, or: how to listen to a brain talk.

We built an ECoG-to-speech pipeline with a differentiable synthesizer — it turns cortical signals into a voice that still sounds like the person who owns the cortex.
[Pipeline diagram: 1 · ECoG cortex → 2 · ECoG decoder → 3 · Speech spectrogram → "It's time to stop"]
Fig. 1 — From 64 electrodes on the cortex to a voice that is still recognizably yours.

The first time I heard a reconstructed sentence come out of our model, I was sitting in the hospital cafeteria eating a slightly disappointing sandwich. The participant had silently thought about the phrase "it's time to stop" — and a voice that sounded unmistakably like theirs said it back, through my laptop speakers, two hundred feet down the hall from their bed. I stopped eating. The sandwich went cold. I forgot about it for about thirty minutes.

This paper is about that thirty minutes.

What we actually did

Decoding speech from the cortex is an old dream with a lot of failed mornings. You can read neurons, you can read muscles, you can read airflow — but gluing them into a voice that sounds like a person is a different problem, because voices are not classifier outputs. They are trajectories through a space that includes pitch, formants, breathiness, onset, coarticulation, and — this is the inconvenient part — the identity of the speaker.

So we made a pipeline with two halves, trained end-to-end:

  1. An ECoG decoder that reads the high-gamma power from a grid of cortical electrodes and predicts a small set of interpretable speech parameters — things like pitch contours, voicing, and formant frequencies.
  2. A differentiable speech synthesizer that takes those parameters and produces a waveform. Crucially, gradients flow back through it, so the decoder learns to predict parameters that sound right, not just parameters that match a teacher.
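The second half can be caricatured in code. Below is a minimal source-filter sketch of a parameter-to-waveform synthesizer: a harmonic source gated by voicing, shaped by one resonator per formant. The parameter names (`f0_hz`, `voicing`, `formants_hz`) and all constants are illustrative assumptions, not the paper's actual parameter set or model.

```python
import numpy as np

SR = 16_000  # sample rate in Hz (assumption, not from the paper)

def synthesize(f0_hz, voicing, formants_hz, dur_s=0.5):
    """Toy source-filter synthesizer: pitch, voicing, and formants in,
    waveform out. A caricature of the idea, not the paper's model."""
    n = int(SR * dur_s)
    t = np.arange(n) / SR
    # Voiced source: a few decaying harmonics of the pitch.
    voiced = sum(np.sin(2 * np.pi * k * f0_hz * t) / k for k in range(1, 6))
    # Unvoiced source: white noise.
    noise = np.random.default_rng(0).standard_normal(n) * 0.1
    source = voicing * voiced + (1.0 - voicing) * noise
    # Shape the source with a second-order resonator per formant.
    out = np.zeros(n)
    for fc in formants_hz:
        r = 0.97                       # pole radius sets formant bandwidth
        theta = 2 * np.pi * fc / SR    # pole angle sets formant frequency
        b0 = (1 - r) * np.sqrt(1 - 2 * r * np.cos(2 * theta) + r * r)
        y = np.zeros(n)
        for i in range(n):
            y[i] = b0 * source[i]
            if i >= 1:
                y[i] += 2 * r * np.cos(theta) * y[i - 1]
            if i >= 2:
                y[i] -= r * r * y[i - 2]
        out += y
    return out / (np.max(np.abs(out)) + 1e-9)

wave = synthesize(f0_hz=120.0, voicing=0.9, formants_hz=[700, 1200, 2600])
```

Every operation here is smooth in the input parameters, which is the property the real synthesizer needs so that gradients can flow back into the decoder.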

That second piece is what made the project work. Prior frameworks used a fixed synthesizer or a black-box vocoder, which meant the decoder was optimizing for something mechanical rather than something audible. Once the synthesizer is differentiable, "loss" becomes a perceptual quantity.
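Why differentiability matters can be shown with a one-parameter toy: a "synthesizer" that emits a pure tone, and a spectrogram-domain loss against a target tone. Because the spectrogram is computed from the waveform, the loss varies smoothly with the synthesis parameter, so audible error can drive gradient descent on whatever predicts that parameter. The tone model, sample rate, and frame sizes below are illustrative, not the paper's loss.

```python
import numpy as np

SR = 8_000                 # toy sample rate (illustrative)
t = np.arange(SR) / SR     # one second of audio

def spec(x, n_fft=256, hop=128):
    """Magnitude spectrogram from overlapping Hann-windowed frames."""
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

def loss(f0, f0_target=150.0):
    """Spectrogram-domain MSE between the 'synthesized' tone and a target.
    This is the 'perceptual quantity': a smooth function of the pitch
    parameter, unlike a hard match against teacher labels."""
    return float(np.mean((spec(np.sin(2 * np.pi * f0 * t))
                          - spec(np.sin(2 * np.pi * f0_target * t))) ** 2))

# The loss shrinks as the predicted pitch approaches the target pitch.
errs = [loss(f0) for f0 in (100.0, 145.0, 150.0)]
```

The same logic, scaled up through a full synthesizer, is what lets the decoder learn parameters that sound right rather than parameters that merely match.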

48 participants · 0.806 peak PCC (spectrogram) · <50 ms causal latency

Three things we didn't expect

One — it worked on the right hemisphere. Speech is canonically a left-hemisphere story. But 8 of our participants had right-hemisphere grids, and the model decoded them almost as well. This isn't a new finding exactly — the right hemisphere has long been suspected of carrying prosody and spectral detail — but it's the first time I'd seen it decoded into intelligible speech.

Two — low-density grids were enough. Much of the prior literature uses high-density research grids (128+ channels). Ours works on clinical-grade low-density grids too, which matters because those are the grids a patient might actually receive as part of ordinary epilepsy care. You don't need a research implant. You need the implant they already have.

Three — the voice kept its owner. Because the synthesizer parameters include speaker-specific pitch and formant ranges, the reconstructed speech carries identity. It doesn't sound like a voice; it sounds like this person's voice. That is either a beautiful property or a small ethical horror, depending on which way you look at it, and I think about that a lot.

A BCI that sounds like you is a different kind of prosthesis. It isn't restoring a function; it's returning a possession.
[Interactive demo: an ECoG grid whose electrodes you can click to toggle, beside the reconstructed spectrogram; with all 48 electrodes active, PCC is 0.81. Reconstruction quality is simulated by mapping electrode count to PCC via the paper's ablation curve.]
Click any electrode to turn it off. The reconstructed formant contours degrade as you do. Even with half the grid gone, the model produces intelligible speech — that's the robustness property the paper is really about.
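The electrode-count-to-PCC mapping behind the simulation can be sketched as a piecewise-linear interpolation over a few anchor points. Only the full-grid anchor (48 electrodes, PCC 0.81) comes from the demo; the other anchor values here are hypothetical placeholders, not the paper's ablation numbers.

```python
import numpy as np

# (electrode count, PCC) anchors. The 48-electrode point is the full-grid
# value from the demo; the rest are hypothetical placeholders.
counts = np.array([4.0, 12.0, 24.0, 36.0, 48.0])
pccs = np.array([0.15, 0.45, 0.65, 0.76, 0.81])

def simulated_pcc(n_active):
    """Piecewise-linear lookup: active electrode count -> simulated PCC."""
    return float(np.interp(n_active, counts, pccs))

curve = [simulated_pcc(n) for n in range(4, 49)]
```

A lookup like this is cheap enough to run on every click, which is presumably why the demo simulates quality instead of re-running the decoder.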
PCC by region (higher = better): Left STG 0.806 · Left pre-motor 0.724 · Right STG 0.652 · Right motor 0.521 · Low-density 0.602
Fig. 2 — Spectrogram-level correlation by electrode region. Even the right hemisphere carries enough signal to talk.

The honest limits

We trained on overt speech — people actually saying words, not just imagining them. Generalizing to attempted or imagined speech is a harder problem, and one we can't claim to have solved. Dataset size matters more than architecture, and we're still at the stage where every new participant is a small blessing and a large annotation cost.

And there is the dual-use question I refuse to pretend isn't in the room. A system this good at making cortex into voice is, on paper, a system that could be pointed at a cortex that doesn't want it. I don't have a tidy answer. I have a preference — that the technology stays in clinical hands long enough for the norms to catch up — and a small hope that the work will mostly be used to give people back a thing they loved and lost.

Why I think this matters

Most BCI papers frame their result as a restoration. This one quietly makes a harder claim: that speech is shallow enough in the brain that you can read it off the pial surface with clinical-grade grids and hear the person behind it. If that's true, a lot of what we assumed about where language "lives" is going to get re-sorted in the next five years, and the sorting will be messy and beautiful.

Also I have never been happier about a disappointing sandwich.
