Stimulus Speech Decoding From Human Cortex With Generative Adversarial Network Transfer Learning
Ran Wang
Xupeng Chen
Amirhossein Khalilian-Gourtani
Zhaoxi Chen
Leyao Yu
Adeen Flinker
Yao Wang

[Demo]
[Talk]
[Slides]
[Paper]


Decoding results on the HD and HB testing sets. GT and SW denote ground truth and SpecWaveNet, respectively.

Abstract

Decoding auditory stimuli from neural activity can enable neuroprosthetics and direct communication with the brain. Recent studies have shown successful speech decoding from intracranial recordings using deep learning models. However, the scarcity of training data leads to low-quality speech reconstruction, which prevents a complete brain-computer interface (BCI) application. In this work, we propose a transfer learning approach with a pre-trained GAN that disentangles the representation and generation layers for decoding. We first pre-train a generator to produce spectrograms from a representation space using a large corpus of natural speech data. With a small amount of paired data containing the stimulus speech and the corresponding ECoG signals, we then transfer the generator into a larger network by attaching an encoder in front of it, which maps the neural signal to the representation space. To further improve the network's generalization ability, we introduce a Gaussian prior distribution regularizer on the latent representation during the transfer phase. With at most 150 training samples per tested subject, we achieve state-of-the-art decoding performance. By visualizing the attention mask embedded in the encoder, we observe brain dynamics that are consistent with findings from previous studies investigating dynamics in the superior temporal gyrus (STG), pre-central gyrus (motor), and inferior frontal gyrus (IFG). Our findings demonstrate high reconstruction accuracy using deep learning networks, together with the potential to elucidate interactions across different brain regions during a cognitive task.
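
As an illustration of the Gaussian prior regularizer mentioned in the abstract, here is a minimal sketch. It assumes the regularizer penalizes the deviation of the per-dimension batch statistics of the latent code from a standard normal; the exact form used in the paper may differ, and all names are placeholders.

```python
import torch

def gaussian_prior_penalty(z: torch.Tensor) -> torch.Tensor:
    """Hypothetical regularizer: KL divergence between the empirical
    per-dimension statistics of a latent batch z (batch x dim) and a
    standard normal N(0, I)."""
    mu = z.mean(dim=0)
    var = z.var(dim=0, unbiased=False) + 1e-8
    # KL( N(mu, var) || N(0, 1) ) summed over latent dimensions
    return 0.5 * (var + mu ** 2 - 1.0 - torch.log(var)).sum()

# During the transfer phase, the penalty would be added to the decoding loss:
# z = encoder(ecog)
# loss = reconstruction_loss + lambda_prior * gaussian_prior_penalty(z)
```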


Generated Audio Demo




Videos

ISBI 2020 Talk



Paper and Supplementary Material

Wang, Ran and Chen, Xupeng and Khalilian-Gourtani, Amirhossein and Chen, Zhaoxi and Yu, Leyao and Flinker, Adeen and Wang, Yao.
Stimulus Speech Decoding From Human Cortex With Generative Adversarial Network Transfer Learning.
In ISBI, 2020. (hosted here)

[Bibtex]

Performance comparisons




Quantitative comparison of transfer-GAN (proposed), SpecWaveNet, and a linear model in MSE (lower is better) and correlation coefficient (CC, higher is better) on the test data. "-" indicates that the number was not reported.
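
For reference, the two metrics can be computed as in the sketch below; the paper may average them per frequency bin or per test sample, so these functions are only illustrative.

```python
import numpy as np

def mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error between decoded and ground-truth spectrograms."""
    return float(np.mean((pred - target) ** 2))

def cc(pred: np.ndarray, target: np.ndarray) -> float:
    """Pearson correlation coefficient between the flattened spectrograms."""
    return float(np.corrcoef(pred.ravel(), target.ravel())[0, 1])
```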


Model Architecture


Overview of the transfer-GAN framework.

Overview of the generator network. A total of K = 5 residual blocks are used. BN, DO, and 1 × 1 in the figure denote batch normalization, dropout, and temporal convolution with filter width 1, respectively.
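
A rough PyTorch sketch of a residual block built from the ingredients named in the caption (BN, dropout, 1 × 1 temporal convolution) is given below. The layer ordering, channel count, and dropout rate are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    """Illustrative residual block with batch norm (BN), dropout (DO),
    and 1x1 temporal convolutions; details are placeholders."""
    def __init__(self, channels: int, p_drop: float = 0.5):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Conv1d(channels, channels, kernel_size=1),  # 1x1 temporal conv
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Conv1d(channels, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Skip connection around the convolutional body
        return x + self.body(x)

# Generator trunk with K = 5 residual blocks (the channel count is a placeholder)
generator_trunk = nn.Sequential(*[ResBlock1D(64) for _ in range(5)])
```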

The transfer-GAN framework contains an encoder that maps an ECoG signal to a representation space with a prescribed distribution, followed by a generator that produces a spectrogram from the representation vector (the output of the encoder). Finally, the spectrogram is converted to a sound waveform using another network (a vocoder). Both the generator and the vocoder can be pre-trained on any large corpus of speech data. To encourage the generation of realistic spectrograms, a GAN loss is applied during generator pre-training. The encoder and the generator are then refined together using the paired data.
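
The sketch below illustrates the two training phases in a simplified form: GAN pre-training of the generator on speech spectrograms, followed by joint refinement of the encoder and generator on paired ECoG-spectrogram data. All modules, shapes, and losses are stand-in placeholders (the vocoder stage is omitted, and the pre-training here simply samples latents from the prior); they indicate the overall structure rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in modules; architectures and shapes are placeholders, not the paper's.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 100, 128))   # ECoG -> latent
generator = nn.Sequential(nn.Linear(128, 128 * 100))              # latent -> spectrogram (flattened)
discriminator = nn.Sequential(nn.Linear(128 * 100, 1))            # real/fake score

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

# --- Phase 1: pre-train the generator with a GAN loss on a large speech corpus ---
def pretrain_step(real_spec: torch.Tensor):
    batch = real_spec.size(0)
    z = torch.randn(batch, 128)                  # latent sampled from the prior
    fake_spec = generator(z)

    # Discriminator update
    d_loss = (F.binary_cross_entropy_with_logits(discriminator(real_spec),
                                                 torch.ones(batch, 1)) +
              F.binary_cross_entropy_with_logits(discriminator(fake_spec.detach()),
                                                 torch.zeros(batch, 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update (non-saturating GAN loss)
    g_loss = F.binary_cross_entropy_with_logits(discriminator(fake_spec),
                                                torch.ones(batch, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

# --- Phase 2: attach the encoder and refine encoder + generator on paired data ---
eg_opt = torch.optim.Adam(list(encoder.parameters()) + list(generator.parameters()), lr=1e-4)

def transfer_step(ecog: torch.Tensor, target_spec: torch.Tensor):
    z = encoder(ecog)
    pred_spec = generator(z)
    loss = F.mse_loss(pred_spec, target_spec)
    # A Gaussian prior regularizer on z (see the sketch after the abstract)
    # would be added to this loss during the transfer phase.
    eg_opt.zero_grad(); loss.backward(); eg_opt.step()
```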


Averaged evolution of the attention mask



The plot shows the averaged evolution of the attention mask for a subject with the hybrid grid. The color of each electrode indicates the value of the attention mask, following the color bar. The white square shows the 8×8 grid used in the experiment. A similar dynamic is also observed in the other HB subject.
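
As a purely illustrative aside, an attention mask on an 8×8 electrode grid could be rendered over several time steps with a short matplotlib script like the one below; the values here are random placeholders, not the subject's data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder attention values: (time steps, grid rows, grid cols)
attention = np.random.rand(4, 8, 8)

fig, axes = plt.subplots(1, attention.shape[0], figsize=(12, 3))
for t, ax in enumerate(axes):
    im = ax.imshow(attention[t], cmap="viridis", vmin=0, vmax=1)
    ax.set_title(f"t = {t}")
    ax.set_xticks([]); ax.set_yticks([])
fig.colorbar(im, ax=axes.ravel().tolist(), shrink=0.8)
plt.show()
```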