This paper is about the most stressful 45 minutes in a neurology service I have ever watched.
A patient is on the operating table, awake, with the skull open. A neurologist has a stimulator in hand and is pulsing each implanted electrode in turn, asking the patient to count, or name objects, or repeat a sentence. The question is simple: does this pulse freeze the patient's speech? If yes, the tissue directly under that electrode is eloquent — language-critical — and it must not be resected, no matter how seizure-onset-friendly it looks on the imaging.
Some electrodes freeze speech. Most don't. The ones that do tell you where not to cut. Miss them and the patient wakes up unable to speak.
Why this matters beyond the surgery
Stimulation mapping is a gold standard, but it is a bad one in three ways. It is time-bounded — a surgeon has an OR slot, not an afternoon. It is painful — stimulation at some sites elicits sensations the patient does not like. And it is coarse — with a 128-electrode grid, you often run out of time before you finish testing, and the neurologist has to use clinical judgement to prioritise.
So: a prior on which electrodes are likely to be eloquent — even a probabilistic one — would be enormously useful. It lets the surgeon test the high-prior sites first. It lets them skip the ones the prior says are almost certainly safe. And on the runs where they cannot test everything, the prior backfills the uncertainty with something better than a random guess.
What the model sees, what it predicts
Our input is ten minutes of resting-state ECoG — the patient is not doing anything, just lying in bed on the epilepsy monitoring unit. For each electrode, we compute a bundle of signals: spectral power in six bands, several measures of local connectivity (coherence, directed transfer function) to the rest of the grid, and a small handful of anatomical covariates (MNI coordinates, distance to superior temporal sulcus, etc.).
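The per-electrode feature bundle can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the six band definitions are canonical assumptions, and mean absolute correlation stands in for the actual coherence and directed-transfer-function measures, which are more involved to compute.

```python
import numpy as np

# Illustrative frequency bands (Hz); the paper's exact six bands are not
# given here, so these canonical definitions are an assumption.
BANDS = {
    "delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
    "beta": (13, 30), "low_gamma": (30, 70), "high_gamma": (70, 150),
}

def band_powers(signal, fs):
    """Mean spectral power of one electrode's signal in each band."""
    psd = np.abs(np.fft.rfft(signal)) ** 2 / len(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return np.array([psd[(freqs >= lo) & (freqs < hi)].mean()
                     for lo, hi in BANDS.values()])

def electrode_features(ecog, fs, coords):
    """ecog: (n_electrodes, n_samples); coords: (n_electrodes, 3) MNI.

    Returns one feature row per electrode: six band powers, mean
    absolute correlation to the rest of the grid (a crude stand-in for
    coherence / directed transfer function), and MNI coordinates.
    """
    corr = np.corrcoef(ecog)            # grid-wide pairwise connectivity
    np.fill_diagonal(corr, 0.0)         # ignore self-correlation
    rows = []
    for i in range(ecog.shape[0]):
        rows.append(np.concatenate([
            band_powers(ecog[i], fs),   # spectral features
            [np.abs(corr[i]).mean()],   # local connectivity summary
            coords[i],                  # anatomical covariates
        ]))
    return np.vstack(rows)
```

The real feature set also includes derived anatomical covariates such as distance to sulcal landmarks; the point of the sketch is the shape of the design matrix, one row per electrode.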
The output is a single probability per electrode: given the stimulation currents used in this patient, what is the probability that stimulating this electrode will arrest speech?
The asymmetry of errors
The single most important design decision in this paper was not the model. It was the loss.
A naïve accuracy objective treats false positives and false negatives symmetrically. In this domain they are not symmetric. A false negative is an eloquent electrode the model says is safe, that the surgeon now tests last or skips entirely, and the patient wakes up with a speech deficit. A false positive is a safe electrode the model says is eloquent, that the surgeon tests first, finds to be safe, and moves on from; cost ≈ two minutes of OR time.
False negatives cost language. False positives cost minutes. The loss must know this. Ours does.
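One way to make a loss know this is to up-weight the positive class. The following is a minimal numpy sketch of that idea, not the paper's actual objective; the weight of 20.0 is illustrative.

```python
import numpy as np

def asymmetric_bce(p, y, fn_weight=20.0):
    """Binary cross-entropy that penalises false negatives harder.

    p: predicted probability of eloquence; y: 1 if stimulation
    arrested speech.  fn_weight up-weights the loss on true-eloquent
    electrodes, so calling an eloquent site safe costs far more than
    calling a safe site eloquent.  The value 20.0 is an assumption,
    not a number from the paper.
    """
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)        # avoid log(0)
    per_site = -(fn_weight * y * np.log(p) + (1 - y) * np.log(1 - p))
    return per_site.mean()
```

With this weighting, a confident miss on an eloquent site is twenty times as expensive as the mirror-image mistake on a safe one.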
The right metric here is not ROC-AUC, and certainly not accuracy. It is the fraction of eloquent sites captured by the top-k highest-probability sites, for a k a surgeon can realistically test. Everything we optimised was downstream of that metric.
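The metric itself is simple enough to state in a few lines. A sketch, assuming boolean ground-truth labels from the stimulation session:

```python
import numpy as np

def top_k_capture(probs, eloquent, k):
    """Fraction of truly eloquent sites among the k highest-probability
    electrodes: the surgeon tests those k sites first.

    probs: (n,) model probabilities; eloquent: (n,) boolean truth.
    """
    order = np.argsort(probs)[::-1]          # highest probability first
    hits = eloquent[order[:k]].sum()
    return hits / max(eloquent.sum(), 1)     # captured / total eloquent
```

The choice of k is clinical, not statistical: roughly the number of sites a surgeon can test in the available OR time.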
Features that surprised us
We did the usual: trained an XGBoost baseline, a small MLP, and a graph neural network that uses the electrode-to-electrode connectivity graph. The GNN won, by a little, and is the one we carried forward. But the feature-importance analysis produced three non-trivial findings.
High-gamma power at rest is predictive. Eloquent sites are, on average, more active at rest. Not surprising in hindsight. Surprising that it was quite this informative — nearly as useful on its own as the entire connectivity feature family.
Functional connectivity to STG matters more than distance. Anatomical distance to the classical language regions is weakly predictive. Functional connectivity to STG — how much resting activity the electrode shares with the superior temporal gyrus — is strongly predictive. This matches the distributed-processing story from our PNAS paper: eloquence travels along functional networks, not just geometric neighbourhoods.
The model is robust to subject variability. Trained on 30 subjects and tested on 10 unseen ones, the model's AUC dropped from 0.89 (within-sample) to 0.87 (held-out). This is small enough to be clinically tolerable. I would have forgiven a much larger drop given the heterogeneity of the patient population.
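The split that makes this number honest is by patient, never by electrode. A minimal sketch, assuming a per-row `subject_ids` array (a hypothetical name, not from the paper):

```python
import numpy as np

def subject_split(subject_ids, n_held_out=10, seed=0):
    """Split electrode rows by patient, never within a patient.

    Electrodes from the same subject share anatomy and recording
    conditions, so a row-level split would leak information;
    held-out performance must be measured on entirely unseen
    patients.  Returns boolean (train_mask, test_mask).
    """
    rng = np.random.default_rng(seed)
    subjects = np.unique(subject_ids)
    held_out = rng.choice(subjects, size=n_held_out, replace=False)
    test_mask = np.isin(subject_ids, held_out)
    return ~test_mask, test_mask
```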
The deployment question
We have not deployed this model in a clinical decision-support role. The paper is a research artefact. Real deployment would require a trial, a regulatory pathway, a careful study of what happens when a surgeon's priors are explicitly modified by a machine's priors, and — the thing nobody in ML wants to talk about — a whole workflow for the cases where the model is wrong in the direction that costs the patient something.
I think that trial will happen in the next three years, with a version of this model that is not ours. That is fine. If the paper moves the pre-surgical-planning workflow from "guess what order to test electrodes in" to "consult a probabilistic prior, test in priority order, reclaim 30% of the OR slot," it has done its job even if it never becomes a product.
A personal aside
Working on this project taught me a thing I should have known earlier: ML for clinical settings is not ML for research settings scaled down. The metrics are different. The deployment pressures are different. The acceptable failure modes are different. The "gold standard" you are trying to beat is a human expert doing a procedure whose definition is itself evolving. You have to take the expert seriously, and you also have to take seriously the question of what the expert would do with a tool like this if it existed. That is a different kind of modeling problem, and it lives above the loss function, not inside it.