arXiv · 2509.08703 · 2025 · 9 min read

Predicting speech arrest, before anyone touches the scalpel.

Before epilepsy surgery, neurologists stimulate each implanted electrode and ask the patient to speak. An electrode that freezes the patient mid-word is "eloquent" — do not resect. The procedure is slow, painful, and runs out of time. We built a model that predicts which electrodes will freeze speech, from resting-state data alone.
[Figure: (A) electrode grid coloured by predicted P(arrest), from 0 to 1. (B) ROC curve, AUC = 0.87. With the model's prioritisation, the electrodes are tested in about 45 minutes instead of 4 hours.]
Fig. 1 — Predicted arrest probability (A) and binary classifier performance (B). The red cluster is the do-not-resect region.

This paper is about the most stressful 45 minutes in a neurology service I have ever watched.

A patient is on the operating table, awake, with the skull open. A neurologist has a stimulator in hand and is pulsing each implanted electrode in turn, asking the patient to count, or name objects, or repeat a sentence. The question is simple: does this pulse freeze the patient's speech? If yes, the tissue directly under that electrode is eloquent — language-critical — and it must not be resected, no matter how seizure-onset-friendly it looks on the imaging.

Some electrodes freeze speech. Most don't. The ones that do tell you where not to cut. Miss them and the patient wakes up unable to speak.

Why this matters beyond the surgery

Stimulation mapping is a gold standard, but it is a bad one in three ways. It is time-bounded — a surgeon has an OR slot, not an afternoon. It is painful — stimulation at some sites elicits sensations the patient does not like. And it is coarse — with a 128-electrode grid, you often run out of time before you finish testing, and the neurologist has to use clinical judgement to prioritise.

So: a prior on which electrodes are likely to be eloquent — even a probabilistic one — would be enormously useful. It lets the surgeon test the high-prior sites first. It lets them skip the ones the prior says are almost certainly safe. And on the rare run where they cannot test everything, the prior backfills the uncertainty with something better than a random guess.

What the model sees, what it predicts

Our input is ten minutes of resting-state ECoG — the patient is not doing anything, just lying in bed on the epilepsy monitoring unit. For each electrode, we compute a bundle of signals: spectral power in six bands, several measures of local connectivity (coherence, directed transfer function) to the rest of the grid, and a small handful of anatomical covariates (MNI coordinates, distance to superior temporal sulcus, etc.).
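As a sketch of the first feature family, per-band spectral power can be computed from a Welch PSD. The six band edges below are my guess at a conventional split, not the paper's actual definitions:

```python
import numpy as np
from scipy.signal import welch

# Hypothetical band edges (Hz); the paper's exact six bands are not listed here.
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 70), "high_gamma": (70, 150)}

def band_powers(trace, fs):
    """Mean spectral power of one electrode's resting trace in each band."""
    freqs, psd = welch(trace, fs=fs, nperseg=int(2 * fs))
    return {name: psd[(freqs >= lo) & (freqs < hi)].mean()
            for name, (lo, hi) in BANDS.items()}

# One electrode, ten minutes at 512 Hz (synthetic noise for illustration).
fs = 512
trace = np.random.default_rng(0).standard_normal(10 * 60 * fs)
powers = band_powers(trace, fs)
```

The same per-electrode loop would then append the connectivity measures and anatomical covariates to build the full feature vector.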

The output is a single probability per electrode: given the stimulation currents used in this patient, what is the probability that stimulating this electrode will arrest speech?

[Interactive: simulated ROC operating curve, shape matching the paper's Fig. 4, with a draggable classifier threshold trading sensitivity (miss-rate matters most) against specificity (false positives cost time). At the default threshold, sensitivity is high enough that no eloquent electrode is missed; dragging the threshold lower is even safer, at the cost of a longer test list.]

The asymmetry of errors

The single most important design decision in this paper was not the model. It was the loss.

A naïve accuracy objective treats false positives and false negatives symmetrically. In this domain they are not symmetric. A false negative is an eloquent electrode the model says is safe, that the surgeon now tests last or skips entirely, and the patient wakes up with a speech deficit. A false positive is a safe electrode the model says is eloquent, that the surgeon tests first, finds to be safe, and moves on from; cost ≈ two minutes of OR time.

False negatives cost language. False positives cost seconds. The loss must know this. Ours does.
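A minimal sketch of what "the loss must know this" can look like: a binary cross-entropy that up-weights the false-negative term. The weight of 50 is mine, chosen only to make the asymmetry visible; the paper's actual weighting is not stated here.

```python
import numpy as np

def asymmetric_bce(p, y, fn_weight=50.0):
    """Binary cross-entropy where missing an eloquent site (false negative)
    is penalised fn_weight times more heavily than a false alarm.
    p: predicted P(arrest); y: 1 if stimulation actually arrested speech."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(fn_weight * y * np.log(p) + (1 - y) * np.log(1 - p))

# A confidently missed eloquent site hurts far more than a confident false alarm.
loss_miss = asymmetric_bce(np.array([0.1]), np.array([1]))
loss_alarm = asymmetric_bce(np.array([0.9]), np.array([0]))
```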

The right metric here is not ROC-AUC, and certainly not accuracy. It is the fraction of eloquent sites captured by the top-k highest-probability sites, for a k a surgeon can realistically test. Everything we optimised was downstream of that metric.
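That metric is simple enough to state in a few lines. A sketch, assuming one probability and one stimulation label per electrode:

```python
import numpy as np

def top_k_capture(probs, labels, k):
    """Fraction of truly eloquent electrodes ranked in the model's top-k.
    probs: predicted P(arrest) per electrode; labels: 1 = eloquent."""
    order = np.argsort(probs)[::-1][:k]          # k highest-probability sites
    n_eloquent = labels.sum()
    return labels[order].sum() / n_eloquent if n_eloquent else 1.0

probs = np.array([0.9, 0.8, 0.1, 0.7, 0.2])
labels = np.array([1, 0, 0, 1, 0])
top_k_capture(probs, labels, k=2)   # tests sites 0 and 1, catching 1 of 2 eloquent sites
```

For a surgeon, k is set by the OR slot, not by the model, which is why the curve of capture versus k matters more than any single operating point.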

Features that surprised us

We did the usual: trained an XGBoost baseline, a small MLP, and a graph neural network that uses the electrode-to-electrode connectivity graph. The GNN won, by a little, and was the one we deployed. But the feature-importance analysis produced three non-trivial findings.

High-gamma power at rest is predictive. Eloquent sites are, on average, more active at rest. Not surprising in hindsight. Surprising that it was quite this informative — nearly as useful on its own as the entire connectivity feature family.

Functional connectivity to STG matters more than distance. Anatomical distance to the classical language regions is weakly predictive. Functional connectivity to STG — how much resting activity the electrode shares with the superior temporal gyrus — is strongly predictive. This matches the distributed-processing story from our PNAS paper: eloquence travels along functional networks, not just geometric neighbourhoods.
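A minimal version of that connectivity feature, assuming we average the STG channels into one reference signal and take mean magnitude-squared coherence in high-gamma (both choices are mine, not necessarily the paper's):

```python
import numpy as np
from scipy.signal import coherence

def stg_coherence(electrode, stg_traces, fs, band=(70, 128)):
    """Mean magnitude-squared coherence between one electrode's resting trace
    and the channel-averaged STG signal, restricted to a frequency band."""
    stg_mean = stg_traces.mean(axis=0)
    freqs, cxy = coherence(electrode, stg_mean, fs=fs, nperseg=int(2 * fs))
    mask = (freqs >= band[0]) & (freqs < band[1])
    return cxy[mask].mean()

# Synthetic example: one candidate electrode, four STG channels, 60 s at 256 Hz.
fs = 256
rng = np.random.default_rng(0)
elec = rng.standard_normal(60 * fs)
stg = rng.standard_normal((4, 60 * fs))
c = stg_coherence(elec, stg, fs)    # in [0, 1]; near 0 for independent noise
```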

The model is robust to subject variability. Trained on 30 subjects, tested on 10 unseen, the AUC dropped from 0.89 (within-sample) to 0.87 (held-out). This is small enough to be clinically tolerable. I would have forgiven a much larger drop given the heterogeneity of the patient population.
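That held-out number only means something if whole patients, not individual electrodes, are held out. A sketch of that split with scikit-learn, using hypothetical array shapes (one row per electrode, a `subject` array saying which patient it came from):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(1)
X = rng.standard_normal((400, 16))     # 400 electrodes, 16 features each
y = rng.integers(0, 2, 400)            # 1 = stimulation arrested speech
subject = rng.integers(0, 40, 400)     # which of 40 patients each electrode belongs to

# Hold out whole patients so the test set mimics a 30-train / 10-unseen split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=subject))
```

Splitting electrodes randomly instead would leak each patient's grid geometry and baseline spectra into the test set and inflate the AUC.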

The deployment question

We have not deployed this model in a clinical decision-support role. The paper is a research artefact. Real deployment would require a trial, a regulatory pathway, a careful study of what happens when a surgeon's priors are explicitly modified by a machine's priors, and — the thing nobody in ML wants to talk about — a whole workflow for the cases where the model is wrong in the direction that costs the patient something.

I think that trial will happen in the next three years, with a version of this model that is not ours. That is fine. If the paper moves the pre-surgical-planning workflow from "guess what order to test electrodes in" to "consult a probabilistic prior, test in priority order, reclaim 30% of the OR slot," it has done its job even if it never becomes a product.

A personal aside

Working on this project taught me a thing I should have known earlier: ML for clinical settings is not ML for research settings scaled down. The metrics are different. The deployment pressures are different. The acceptable failure modes are different. The "gold standard" you are trying to beat is a human expert doing a procedure whose definition is itself evolving. You have to take the expert seriously, and you also have to take seriously the question of what the expert would do with a tool like this if it existed. That is a different kind of modeling problem, and it lives above the loss function, not inside it.
