IEEE · 2024 · 6 min read

Heart-rate prediction, from ARIMA to transformers — and why the newer isn't always better.

Short-horizon heart-rate forecasting is a canonical wearables problem: cheap data, clinically useful output, and an embarrassingly wide variety of models competing on it. We ran the gauntlet — ARIMA, LSTM, CNN, transformer, TSMixer — and found an ordering that surprised us a little, and that should surprise the field more.
[Chart: heart rate over a 60-minute window, 1 s sampling, 60 BPM baseline, with an exercise bout; forecast horizon 10 min (600 s).]
Fig. 1 — An hour of heart rate with an exercise bout in the middle. We predict the last 10 minutes from the first 50.

Heart-rate forecasting is the kind of problem that turns up in every wearables-adjacent paper as a benchmark, which means every new model gets compared on it, which means we have a lot of numbers and very little narrative about which models work and which ones just look fashionable. This paper is the narrative.

The contest, rigged fairly

We ran ARIMA (the old guard), LSTM (the 2015 guard), a small 1D-CNN (the "hey, we tried" guard), a transformer (the 2022 guard), and TSMixer (the "please stop training transformers on tabular data" guard). Same data, same validation split, same hyperparameter search budget, same forecast horizon (10 min ahead from 50 min of context).
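The paper's harness itself isn't reproduced here, but the shared setup is easy to sketch. A minimal numpy version of the windowing every model saw (function and variable names are ours, not the paper's): 1 Hz samples, 50 minutes of context, 10 minutes of target.

```python
import numpy as np

def make_windows(hr, context_s=50 * 60, horizon_s=10 * 60, stride_s=60):
    """Slice a 1 Hz heart-rate series into (context, target) pairs.

    hr: 1-D array sampled at 1 Hz. Each window uses 50 min of context
    to predict the next 10 min, matching the shared forecast setup.
    """
    X, y = [], []
    for start in range(0, len(hr) - context_s - horizon_s + 1, stride_s):
        X.append(hr[start : start + context_s])
        y.append(hr[start + context_s : start + context_s + horizon_s])
    return np.array(X), np.array(y)

# One hour of synthetic data yields exactly one 50 min -> 10 min window.
hr = np.full(3600, 60.0)
X, y = make_windows(hr)
```

Every model then consumes `X` and is scored on RMSE against `y`, so the comparison differs only in the model, never in the data plumbing.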

The result, in one line: ARIMA is surprisingly competitive on short horizons, TSMixer wins on medium horizons, transformers win nowhere, LSTMs are fine. CNNs are the dark-horse runner-up on exercise-transition windows.

[Interactive chart — RMSE (BPM), lower is better: ARIMA, LSTM, Transformer, TSMixer across forecast horizons.]
At a 1-min horizon ARIMA leads. Around 8-12 min TSMixer overtakes. Past 20 min everything degrades and the ordering becomes noise. The shape matches Table 2 in the paper.

Why transformers lost

This is, we think, the most interesting finding, and the one we spent the most words on. Transformers need more data than physiological time series typically offer. They are also architecturally unbiased: they have no inductive prior for temporal locality, so they end up reconstructing it from data instead of being given it for free.

For heart rate, this is doubly bad: the signal is heavily autocorrelated at short lags and weakly autocorrelated at long ones. TSMixer and LSTM get the first property for free; ARIMA gets it by construction; transformers have to learn it, and with ~10000 samples of training data they don't learn it well enough. This is a model-mismatch story, not a compute-budget story. Throwing more GPUs at the transformer did not fix it.
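The autocorrelation structure is easy to see on a toy signal. A sketch, assuming an AR(1) stand-in for resting heart rate (the `phi` and noise values are illustrative, not fitted to real data): correlation is near 1 at a 1-second lag and collapses by a 10-minute (600 s) lag.

```python
import numpy as np

def acf(x, lag):
    """Sample autocorrelation of x at a given lag."""
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# AR(1) toy heart rate around a 60 BPM baseline: strong lag-1
# correlation that decays geometrically with lag.
rng = np.random.default_rng(0)
phi, n = 0.99, 20_000
hr = np.empty(n)
hr[0] = 60.0
for t in range(1, n):
    hr[t] = 60.0 + phi * (hr[t - 1] - 60.0) + rng.normal(0, 0.5)

short, long_ = acf(hr, 1), acf(hr, 600)  # short-lag strong, long-lag weak
```

An AR model (or any architecture with a locality prior) encodes exactly this decay; a transformer has to rediscover it from the data.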

Not every problem needs an attention mechanism. Some problems need a recurrence. Some problems just need a Box-Jenkins diagnostic.
TSMixer: medium-horizon winner
ARIMA: short-horizon winner
Transformer: winner nowhere

What I'd do in v2

Condition on activity. We treated the exercise-bout label as noise; it isn't. A model that knows the subject just stopped running will predict a sensible decay back to baseline; a model that is guessing will get the decay wrong. Per-bout conditioning is a half-page change to the training loop and probably worth the next 15% of error reduction.
