Heart-rate forecasting is the kind of problem that turns up in every wearables-adjacent paper as a benchmark, which means every new model gets compared on it, which means we have a lot of numbers and very little narrative about which models work and which ones just look fashionable. This paper is the narrative.
The contest, rigged fairly
We ran ARIMA (the old guard), LSTM (the 2015 guard), a small 1D-CNN (the "hey, we tried" guard), a transformer (the 2022 guard), and TSMixer (the "please stop training transformers on tabular" guard). Same data, same validation split, same hyperparameter search budget, same forecast horizon (10 min ahead from 50 min of context).
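The shared evaluation setup amounts to a sliding-window split, which can be sketched as follows. The function and the toy series are illustrative (the paper's sampling rate and stride aren't reproduced here; this assumes one sample per minute and stride 1):

```python
import numpy as np

def make_windows(series, context=50, horizon=10, stride=1):
    """Slice a 1-D series into (context, horizon) pairs.

    Every model in the comparison sees the same windows: 50 steps of
    context, 10 steps of target, at a fixed stride.
    """
    X, y = [], []
    for start in range(0, len(series) - context - horizon + 1, stride):
        X.append(series[start:start + context])
        y.append(series[start + context:start + context + horizon])
    return np.asarray(X), np.asarray(y)

# toy stand-in for a heart-rate trace: 200 synthetic samples
hr = 70 + 10 * np.sin(np.linspace(0, 12, 200))
X, y = make_windows(hr)
# X.shape == (141, 50), y.shape == (141, 10)
```

The point of fixing the windowing once is that any accuracy difference is then attributable to the model, not to how each paper happened to cut its data.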
The result, in one line: ARIMA is surprisingly competitive on short horizons, TSMixer wins on medium horizons, transformers win nowhere, LSTMs are fine. CNNs are the dark-horse runner-up on exercise-transition windows.
Why transformers lost
This is, I think, the most interesting finding, and the one we spend the most words on. Transformers need more data than physiological time series typically offer. They are also architecturally unbiased: they have no inductive prior for temporal locality, so they have to reconstruct it from data instead of being given it for free.
For heart rate, this is doubly bad: the signal is heavily autocorrelated at short lags and weakly autocorrelated at long ones. TSMixer and LSTM get the first property for free; ARIMA gets it by construction; transformers have to learn it, and with ~10000 samples of training data they don't learn it well enough. This is a model-mismatch story, not a compute-budget story. Throwing more GPUs at the transformer did not fix it.
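The autocorrelation claim is easy to check on any trace. A minimal sketch, using a synthetic AR(1) process as a stand-in for resting heart rate (the real dataset isn't reproduced here): strong correlation at lag 1, near zero at lag 50.

```python
import numpy as np

def acf(x, lag):
    """Sample autocorrelation of x at the given lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

# AR(1) stand-in: x[t] = 0.9 * x[t-1] + noise, so autocorrelation
# decays geometrically with lag, mimicking the short-lag-heavy
# structure described above.
rng = np.random.default_rng(0)
n, phi = 5000, 0.9
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

short_lag, long_lag = acf(x, 1), acf(x, 50)
# short_lag is close to 0.9; long_lag is close to 0
```

A model with a locality prior (convolution, recurrence, an AR term) starts from this shape; a transformer has to discover it from attention weights alone.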
Not every problem needs an attention mechanism. Some problems need a recurrence. Some problems just need a Box-Jenkins diagnostic.
What I'd do in v2
Condition on activity. We treated the exercise-bout label as noise; it isn't. A model that knows the subject just stopped running is going to predict a sensible decay back to baseline. A model that is guessing will get the decay wrong. Per-bout conditioning is a half-page change to the training loop and probably the next 15% of the error.
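The "half-page change" could be as simple as appending the bout label as a second input channel, so the model is told rather than left to infer that a bout just ended. A hypothetical sketch (the function name, shapes, and 0/1 labeling are assumptions, not the paper's actual training loop):

```python
import numpy as np

def add_activity_channel(X, bout_labels):
    """Append a per-timestep activity flag as a second input channel.

    X           : (n_windows, context) heart-rate windows
    bout_labels : (n_windows, context) 0/1 flags, 1 = in an exercise bout

    The model then receives (context, 2) inputs instead of (context, 1),
    so a window that ends just after a bout carries that fact explicitly.
    """
    return np.stack([X, bout_labels], axis=-1)

# toy example: one 50-step window where the bout ends at step 40
X = np.full((1, 50), 150.0)    # elevated HR while exercising
labels = np.zeros((1, 50))
labels[:, :40] = 1.0           # exercising for the first 40 steps
Xc = add_activity_channel(X, labels)
# Xc.shape == (1, 50, 2)
```

Everything downstream stays the same; only the input layer's channel count changes, which is why this is cheap relative to the error it plausibly removes.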