IEEE · 2024 · 6 min read

Explainable exchange-rate forecasting, or: when TSMixer quietly beats the transformers.

We put LSTM, CNN, transformer and TSMixer on RMB/USD daily data with a rigorous feature-selection pipeline. TSMixer won. The paper's real value isn't the winner — it's the why, which we spent more time on than the benchmark.
[Chart: RMB/USD, actual vs predicted on the validation window — actual RMB/USD, LSTM (baseline), TSMixer (ours).]
Fig. 1 — TSMixer hugs the actual trajectory more tightly than the LSTM baseline, particularly on the turning points.

Exchange rates are famously hard. The random-walk baseline is embarrassingly good. Any model worth reporting has to beat "yesterday's price" by a margin that survives transaction costs, and that is a higher bar than the DL literature tends to acknowledge. This paper is a careful study of what does and doesn't clear that bar for RMB/USD daily returns.
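The bar described above is easy to state in code. A minimal sketch of the random-walk baseline (predict today's rate as yesterday's) and the relative-improvement measure a model must clear; the function names here are illustrative, not from the paper:

```python
import numpy as np

def rw_rmse(prices):
    """RMSE of the random-walk baseline: predict today's price as yesterday's."""
    prices = np.asarray(prices, dtype=float)
    preds = prices[:-1]   # yesterday's price
    actual = prices[1:]   # today's price
    return float(np.sqrt(np.mean((actual - preds) ** 2)))

def rel_improvement(model_rmse, baseline_rmse):
    """Fractional RMSE improvement over the random walk (positive = better)."""
    return 1.0 - model_rmse / baseline_rmse
```

Any reported gain should be computed against `rw_rmse` on the same validation window, before transaction costs are even considered.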

What we actually did

Feature selection was the project. We ran four models — an LSTM, a 1D-CNN, an attention-only transformer, and TSMixer — with three feature sets: price only; price + China-US trade volumes; price + trade + cross-rates (EUR/RMB, JPY/USD). The best model wasn't the flashiest. TSMixer — which is basically two MLPs and an intuition about time-channel mixing — was the one that generalised.
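The three feature sets and the sliding windows that feed them into the sequence models can be sketched as follows. The column names are hypothetical (the paper does not list its exact identifiers), and the lookback of 30 days is an assumption:

```python
import numpy as np

# Hypothetical column names -- the paper's exact identifiers are not given.
FEATURE_SETS = {
    "price":             ["rmb_usd"],
    "price+trade":       ["rmb_usd", "cn_us_trade_volume"],
    "price+trade+cross": ["rmb_usd", "cn_us_trade_volume", "eur_rmb", "jpy_usd"],
}

def make_windows(X, lookback=30, horizon=1):
    """Turn a (T, C) feature matrix into (N, lookback, C) model inputs and
    (N,) next-step targets on channel 0 (the exchange rate itself)."""
    n = len(X) - lookback - horizon + 1
    inputs = np.stack([X[i:i + lookback] for i in range(n)])
    targets = X[lookback + horizon - 1: lookback + horizon - 1 + n, 0]
    return inputs, targets
```

All four models consume the same `(lookback, channels)` windows, so the comparison isolates architecture rather than input handling.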

The reason is unglamorous. Exchange-rate series have long but shallow dependencies: persistence over weeks, not deep multi-step causal chains. Transformers over-fit the attention patterns in training data. TSMixer's architectural prior — "mix across time with a channel-wise MLP" — happens to match the data.
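The "two MLPs and an intuition" description can be made concrete. A minimal sketch of one TSMixer-style block, with layer normalization and dropout omitted for brevity; dimensions and hidden size are illustrative, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

class MixerBlock:
    """One TSMixer-style block: a time-mixing MLP (shared across channels)
    followed by a channel-mixing MLP (shared across time steps), each with
    a residual connection."""
    def __init__(self, lookback, channels, hidden=16, scale=0.1):
        self.Wt1 = rng.normal(0, scale, (lookback, hidden))
        self.Wt2 = rng.normal(0, scale, (hidden, lookback))
        self.Wc1 = rng.normal(0, scale, (channels, hidden))
        self.Wc2 = rng.normal(0, scale, (hidden, channels))

    def __call__(self, x):  # x: (lookback, channels)
        # Time mixing: transpose so the MLP acts along the time axis.
        t = x + (relu(x.T @ self.Wt1) @ self.Wt2).T
        # Channel mixing: the MLP acts along the feature axis.
        return t + relu(t @ self.Wc1) @ self.Wc2

block = MixerBlock(lookback=30, channels=5)
out = block(np.zeros((30, 5)))  # shape preserved: (30, 5)
```

The prior is visible in the structure: the time-mixing MLP applies the same weights to every channel, which is exactly the "persistence over weeks, not deep causal chains" assumption.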

[Interactive chart: validation-set RMSE (lower is better) for the random walk, LSTM, transformer, and TSMixer (ours), one bar group per feature set. Simulated data; the shape matches §4 of the paper.]
Click through the feature sets. TSMixer's advantage grows with features; the transformer saturates. That's the story.

Explainability

A forecast you can't explain is useless to a trader. We used SHAP over the time-channel mixing weights to produce per-feature attributions for each prediction. China-US trade volume was the largest contributor by a wide margin. EUR/RMB co-moved enough to be useful. CPI surprises showed up only at macro-event windows — sparse but high-magnitude contributions, which SHAP captures naturally.
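The paper's attributions come from SHAP over the mixing weights. As a rough, self-contained proxy for that idea (not the paper's pipeline), permutation importance measures how much a model's error grows when one feature's values are shuffled; sparse but high-magnitude contributors like CPI surprises show up the same way:

```python
import numpy as np

def permutation_importance(predict, X, y, rng=None):
    """Per-feature score: increase in MSE when that feature's column is
    shuffled across samples. A crude stand-in for per-feature SHAP
    attributions, not the paper's method."""
    if rng is None:
        rng = np.random.default_rng(0)
    base = np.mean((predict(X) - y) ** 2)
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        rng.shuffle(Xp[:, j])  # break the feature-target link for column j
        scores.append(float(np.mean((predict(Xp) - y) ** 2) - base))
    return np.array(scores)
```

A feature that the model ignores scores near zero; a dominant driver like trade volume scores high, matching the ranking the SHAP analysis produced.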

The transformer's attention map was a mess of fake signals. TSMixer's channel weights were legible. That, more than the RMSE number, is why this is the model we'd ship.
Headline numbers: −18% RMSE vs the random walk · TSMixer is the winning model · SHAP provides per-feature attribution.

What this wasn't

It wasn't a trading paper. We did not backtest a strategy. We forecast the rate, explained the forecast, and stopped. The gap between a good forecaster and a profitable trader is real and involves risk management we did not solve. A reader who is about to deploy this in production should read paper 16's caveats; they apply here verbatim.
