This paper is about a research pattern I have come to love: a model that is slightly less accurate than the benchmark but much more useful. In CP severity assessment — and, I suspect, most clinical CV — usefulness is the thing we are actually optimising; accuracy is a proxy that is only sometimes faithful to it.
The task and the trap
Cerebral palsy severity is graded on the Gross Motor Function Classification System (GMFCS), levels I through V. The grade is assigned by a clinician who watches the child walk. It is a holistic call, and you cannot train a perfect automated grader without a lot of clinician-labeled video — which, as you would expect, is both scarce and protected.
The obvious trap, and one a lot of papers fall into, is to treat this as a black-box classification problem. Train a big model on raw video, optimise for grade accuracy, publish 72%, move on. I do not mean to disparage those papers — they are important baselines. But they are not a tool a clinician will use, because the clinician's job isn't to output a number. The clinician's job is to write a note that says "level III, walking with truncal sway and impaired knee flexion on the left." A 72% classifier doesn't help with that note.
The fusion, in detail
The skeleton stream is a spatiotemporal GCN eating 17 body keypoints per frame; it's a workhorse architecture, bog-standard, chosen because it is well-understood and fast. The clinical-descriptor stream is a small MLP eating six expert-curated gait features per video clip — step length asymmetry, cadence, knee flexion peak, stance-to-swing ratio, trunk sway, minimum toe clearance. Each of these is named, clinically interpretable, and individually motivated by the literature on CP gait.
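To make the descriptor stream concrete, here is a minimal sketch of how two of the six features might be computed from per-frame keypoints. The COCO-17 index layout and the exact descriptor definitions are my assumptions for illustration, not the paper's stated pipeline.

```python
import numpy as np

# Assumed COCO-17 keypoint indices (illustrative, not the paper's layout).
L_SHOULDER, R_SHOULDER, L_HIP, R_HIP = 5, 6, 11, 12

def trunk_sway(kps):
    """Lateral trunk sway: std. dev. of the mid-shoulder x-offset from the
    mid-hip x-position. kps: (T, 17, 2) array of per-frame (x, y) keypoints."""
    mid_shoulder_x = kps[:, [L_SHOULDER, R_SHOULDER], 0].mean(axis=1)
    mid_hip_x = kps[:, [L_HIP, R_HIP], 0].mean(axis=1)
    return float(np.std(mid_shoulder_x - mid_hip_x))

def symmetry_index(left, right):
    """Classic gait symmetry index, e.g. for left/right step lengths.
    0 means perfectly symmetric; larger values mean more asymmetry."""
    return 2.0 * abs(left - right) / (left + right)
```

Each descriptor reduces a whole clip to one number a clinician already reasons about, which is what makes the stream's attributions legible.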
We fuse the two streams with a gated-attention layer. During prediction, the gate tells us how much of each stream drove the final call: when the skeleton stream is confident, the gate weights it heavily; for borderline cases, it shifts weight onto the clinical descriptors. This matches the clinical intuition that when the automated grader is unsure, measured descriptors (knee flexion range, toe clearance) are more diagnostic than end-to-end features.
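A minimal sketch of the gating idea, assuming a single scalar gate over two same-sized stream embeddings; the weights here are random stand-ins, not the paper's trained parameters or exact architecture.

```python
import numpy as np

def gated_fusion(h_skel, h_clin, w_gate, b_gate):
    """Scalar-gate fusion sketch. h_skel, h_clin: stream embeddings of equal
    dimension d; w_gate: (2d,) gate weights; b_gate: scalar bias.
    Returns the fused embedding and the gate value g in (0, 1):
    g near 1 means the skeleton stream dominated the call,
    g near 0 means the clinical descriptors did."""
    z = w_gate @ np.concatenate([h_skel, h_clin]) + b_gate
    g = 1.0 / (1.0 + np.exp(-z))             # sigmoid gate
    fused = g * h_skel + (1.0 - g) * h_clin  # convex mix of the two streams
    return fused, float(g)
```

The per-prediction g is the attribution the text describes: logged next to the predicted level, it says which stream the model leaned on for that child.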
A clinician rereading the report can see not just the predicted level but which descriptors drove it — and, critically, can disagree with the model on the grounds that one descriptor is off. That is the real product.
Results, honestly
Our accuracy (70.86% five-way) is not state of the art. The best pure-skeleton method gets ~73%. We lose ~2 points. We gain something the pure method cannot offer: a per-prediction attribution to features a clinician recognises. In the paper we frame this as an explicit trade-off. In conversation I frame it more strongly: the pure method is a research artefact; the fused method is a tool.
What we stumbled on
The most interesting finding is not the headline accuracy. It is that the clinical-descriptor stream alone — a six-dimensional MLP — reaches 62% accuracy. Six hand-crafted numbers, no deep learning on video, and you recover nearly 90% of the fused model's 70.86% (against a 20% five-way chance level). This is a finding I want people to sit with. The gap between six well-chosen features and a big end-to-end model is roughly nine accuracy points. Those nine points are real and valuable, but the shape of the problem is much more "pick the right features" than the current literature wants to admit.
I suspect this generalises. Whenever a clinical CV task has a century of domain expertise behind it, the hand-crafted features probably get you 70–80% of the way to the ML answer on their own. The ML is for the last twenty points, not the first seventy. We have it backwards.
Ethics and caution
A tool that grades severity could be used to allocate care — and in the wrong institutional hands, it could ration it. The paper is careful to say that the output is a decision-support tool, not an arbiter. The clinician remains the grader of record. We repeat that in the limitations and again in the discussion; it is not in my power to prevent the tool from being misused, but it is in my power to make it very hard for a misuser to claim good faith.
Separately — and this is a note more than an argument — the dataset we trained on is overwhelmingly North American and European, and underrepresents the regions where CP prevalence is highest. Any deployment outside that distribution needs its own validation. We do not claim otherwise. I write this note because I wish more papers wrote it explicitly.