arXiv · cs.MM · 2403.11155 · 7 min read

Interactive 360° video streaming, with FoV-adaptive coding and temporal prediction.

VR video is expensive because 90% of what you stream is behind the viewer. FoV-adaptive streaming sends premium quality only where the eye is looking — which works beautifully until the viewer turns their head. We built a scheme that gets temporal prediction back into the picture without blowing the latency budget.
[Fig. 1 graphic — 360° video, equirectangular tiling (18×9): FoV (premium · temporal pred), rotation margin (spatial only · intra), low-quality fallback]
Fig. 1 — Three-tier quality map: predicted FoV gets inter-coded premium tiles, a rotation margin gets intra-only tiles (so a sudden head turn doesn't freeze), the rest is low-quality.

A 4K 360° video has about 25 megapixels per frame. At 60fps that's 1.5 gigapixels per second. Nobody is actually streaming that; everyone cheats. The cheat is field-of-view-adaptive streaming — send high quality only where the eye is looking, send low quality everywhere else, and pray the user doesn't turn their head too fast. The question this paper asks is: what does "adaptive" mean when the adaptation has to happen in real time, frame by frame, for an interactive application?
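The arithmetic is worth seeing written out (the 24 bits/pixel for uncompressed RGB is my assumption, just to show the scale of the raw stream):

```python
# Back-of-envelope throughput for the numbers in the text.
# Assumption: 25 MP/frame, 60 fps, 24 bits/pixel uncompressed RGB.
pixels_per_frame = 25e6
fps = 60
bits_per_pixel = 24

pixels_per_second = pixels_per_frame * fps            # 1.5 gigapixels/s
raw_gbps = pixels_per_second * bits_per_pixel / 1e9   # uncompressed rate

print(f"{pixels_per_second / 1e9:.1f} gigapixels/s")
print(f"{raw_gbps:.0f} Gbit/s uncompressed")
```

Even with video compression doing its usual 100–1000× work, that raw number is why everyone cheats.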

Why interactive is the hard case

In a pre-recorded VR video, you can afford to wait — encode everything with inter-frame prediction, use big GOPs, serve tiles on demand. If the user turns unexpectedly, you serve them a low-quality version for a few frames while the new FoV buffers. People tolerate this.

Interactive 360° (telepresence, live streaming, cloud-rendered VR) does not tolerate this. The round trip is too short for buffering. You have to decide, this frame, which tiles to send at which quality. Prior work solved this by abandoning inter-prediction entirely — all-intra coding, which is robust to head turns but costs 3–5× the bandwidth. Our paper tries to win the inter-prediction bandwidth savings back without losing the robustness.

[Interactive figure — quality, bandwidth, and latency as head-turn speed rises]
- All-intra (baseline): 1.00× bandwidth
- FoV-only inter: 0.40× bandwidth
- Ours (FoV inter + margin intra): 0.46× bandwidth
- PSNR drop on sudden turns (ours): 0 dB
Slow turns cost nothing. As turn speed rises, the FoV-only scheme drops PSNR sharply. Ours absorbs the turn via the intra-coded margin.
Simulated · matches Fig. 7 of the paper

The three-zone scheme

Our trick is a zoned quality policy: a premium core (current FoV prediction), which uses temporal+spatial prediction like ordinary inter coding; a rotation margin around the core, which uses intra-only coding so it can be used immediately if the viewer rotates; and a low-quality fallback everywhere else. The margin width is the key knob — too thin and sudden turns hit the fallback; too wide and you give up most of the bandwidth savings.
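A minimal sketch of the policy, in yaw only for readability (the real map is 2D over the 18×9 tile grid). The FoV span, margin width, and per-tier cost ratios below are my illustrative assumptions, not the paper's fitted numbers, so the resulting bandwidth figure won't match the paper's 0.46×:

```python
# Three-tier tile policy on the 18-column equirectangular grid from Fig. 1,
# yaw axis only. Tiles are classified by their center's angular distance to
# the predicted FoV center.
TILES_X = 18
TILE_DEG = 360 / TILES_X          # 20 degrees of yaw per tile column

def tier_map(fov_center_deg, fov_width_deg=90, margin_deg=15):
    """Per-column tier: 'premium' (inter), 'margin' (intra-only), 'fallback'."""
    tiers = []
    for col in range(TILES_X):
        center = (col + 0.5) * TILE_DEG
        # shortest angular distance from tile center to FoV center
        d = abs((center - fov_center_deg + 180) % 360 - 180)
        if d <= fov_width_deg / 2:
            tiers.append("premium")
        elif d <= fov_width_deg / 2 + margin_deg:
            tiers.append("margin")
        else:
            tiers.append("fallback")
    return tiers

# Hypothetical per-tile costs relative to an all-intra premium tile:
COST = {"premium": 0.30, "margin": 1.00, "fallback": 0.05}

tiers = tier_map(fov_center_deg=180)
rel_bw = sum(COST[t] for t in tiers) / TILES_X   # vs. all-intra everywhere
print(tiers.count("premium"), tiers.count("margin"), tiers.count("fallback"))
print(f"relative bandwidth: {rel_bw:.2f}x")
```

The knob the text describes is `margin_deg`: widen it and margin (intra) tiles eat the savings; narrow it and fast turns land on fallback tiles.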

We derived the optimal margin width as a function of the expected head-turn distribution (empirically log-normal, with a heavy tail) and the ratio of intra-to-inter bitrate cost. The optimal margin is roughly 15° on each side of the FoV for typical 360° content — which sounds small, but covers the 95th-percentile turn within any single decoder latency window.
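I won't reproduce the derivation here, but the quantile idea behind it fits in a few lines: size the margin to cover the p-th percentile head turn that can occur within one latency window, where p is set by the intra/inter cost trade-off. The distribution parameters and latency below are illustrative stand-ins, not the paper's fitted values:

```python
import math
from statistics import NormalDist

def margin_deg(mu, sigma, latency_s, p=0.95):
    """Margin (deg/side) covering the p-th percentile turn in one window.

    Head-turn speed (deg/s) is modelled as log-normal(mu, sigma); the
    parameters and latency window below are illustrative assumptions.
    """
    # log-normal quantile via the inverse CDF of the underlying normal
    speed_p = math.exp(mu + sigma * NormalDist().inv_cdf(p))   # deg/s
    return speed_p * latency_s

# e.g. median speed ~50 deg/s with a heavy tail, 100 ms end-to-end latency
m = margin_deg(mu=math.log(50), sigma=0.8, latency_s=0.100)
print(f"margin ~ {m:.0f} deg per side")
```

With plausible numbers this lands in the same ballpark as the 15° result; the paper's actual optimum also folds in the intra-to-inter bitrate cost ratio, which effectively moves p.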

The paper's contribution isn't the idea that margins help. It is the derivation of what the margin should be, and the demonstration that getting it right is worth ~40% bandwidth savings over the naive intra-only approach.
−40% bandwidth vs all-intra · 0 dB PSNR drop on typical turns · 15° optimal margin width

What I'd love to see built on this

The margin width should in principle be personalised. Some viewers are exploratory — they turn often, sometimes fast. Some viewers are "passive consumers" who barely move. A single fixed margin is a compromise across the population. A model that adapts the margin per viewer from their recent head-motion history should pick up another 10–15% bandwidth savings on the passive tail, which is most users. I don't think anyone has done this yet, and it is a small, good master's thesis for someone.
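To be concrete about what I mean, here is a hypothetical control loop (my sketch, not anything from the paper, and every parameter is an assumption): keep a window of recent head speeds and size the margin to their empirical p-th percentile, clamped to sane bounds.

```python
from collections import deque

class AdaptiveMargin:
    """Hypothetical per-viewer margin controller (not the paper's method):
    track recent head-turn speeds and size the margin to cover their p-th
    percentile within one latency window, clamped to [min_deg, max_deg]."""

    def __init__(self, latency_s=0.100, p=0.95, window=600,
                 min_deg=5.0, max_deg=30.0, default_deg=15.0):
        self.latency_s, self.p = latency_s, p
        self.min_deg, self.max_deg = min_deg, max_deg
        self.default_deg = default_deg
        self.speeds = deque(maxlen=window)   # recent |yaw speed|, deg/s

    def observe(self, yaw_speed_deg_s):
        self.speeds.append(abs(yaw_speed_deg_s))

    def margin_deg(self):
        if len(self.speeds) < 30:            # cold start: population default
            return self.default_deg
        s = sorted(self.speeds)
        q = s[min(len(s) - 1, int(self.p * len(s)))]   # empirical quantile
        return min(self.max_deg, max(self.min_deg, q * self.latency_s))

# A passive consumer who barely moves collapses toward the minimum margin:
ctrl = AdaptiveMargin()
for _ in range(100):
    ctrl.observe(10.0)                       # slow, steady head motion
print(ctrl.margin_deg())
```

The interesting engineering question is the cold start and the clamp: you want the margin to widen fast when an exploratory viewer shows up, but shrink slowly, since undershooting costs visible quality while overshooting only costs bits.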

I was a co-author, not the lead. The core coding-theory work was Mao's. My role was the user-study component — we ran a small VR-user study with ~30 participants to validate the quality-robustness claims beyond synthetic metrics. The user study was the part reviewers cared about most, and, honestly, the part I cared about most. Engineering metrics that don't cash out in viewer experience are fan fiction.
