arXiv · robotics · 2603.04277 · 8 min read

VANGUARD: recovering absolute scale from parked cars.

A drone in a GPS-denied alley knows what it is looking at but not how big anything in the image is. Vision-language models invent scales. We cheated nobly — used the one object whose size is culturally standardised — the sedan — as a ruler.
[Fig. 1 graphic] Sedan length prior ≈ 4.62 m · GSD (m/px) = real_width_m / detected_box_width_px. Whatever the drone's altimeter says, a parked sedan is the Rosetta stone.
Fig. 1 — Pipeline: detect vehicles, look up the canonical size, solve for the pixels-to-metres ratio.

GSD — ground sample distance — is the ratio of real-world metres to image pixels. A drone that knows its GSD can measure, count, navigate, and hand meaningful coordinates to whoever is downstream. A drone that doesn't, can't.

In an outdoor field with sky all around, the altimeter and the camera intrinsics tell you GSD. In a GPS-denied alley, an indoor warehouse, or a cluttered construction site, they don't. Your altitude estimate drifts. Your intrinsics are correct, but without an additional metric cue the scale is underdetermined. This paper is about the world's cheapest metric cue.

The sedan as a ruler

Vehicles are almost uniquely well-suited to the problem: they are everywhere in built environments, standardised in size within each class, and easy to detect with off-the-shelf detectors.

Given a detected car in the image, we recover its pixel width. Given the class-conditional real-world width prior, we solve for GSD directly. It is ugly and simple and correct.
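The solve really is one division. A minimal sketch (not the paper's code; the 4.62 m prior is the value from Fig. 1, and the detection width is illustrative):

```python
SEDAN_LENGTH_M = 4.62  # canonical sedan length prior, as in Fig. 1

def gsd_from_detection(box_width_px: float, real_width_m: float = SEDAN_LENGTH_M) -> float:
    """Ground sample distance in metres per pixel, from one detected vehicle."""
    if box_width_px <= 0:
        raise ValueError("detected box width must be positive")
    return real_width_m / box_width_px

# A sedan spanning 231 px along its long axis implies 2 cm per pixel.
print(gsd_from_detection(231.0))  # → 0.02
```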

[Interactive chart: GSD estimation error (% of true) vs. number of vehicles in frame, comparing a generic "vehicle" class against fine-grained (sedan/SUV/truck) priors. Simulated; matches the paper's Fig. 3 qualitatively.]
Even one vehicle gives a usable GSD. Fine-grained priors cut the error roughly in half.

Why vision-language models hallucinate scale

There is a failure mode that motivated the paper. Modern VLMs are excellent at what questions ("what is in this image?") and unreliable at how big questions ("how far is it?", "how wide?", "what is the GSD?"). They confidently invent answers. We tested this on a held-out set of UAV images and found that SOTA VLMs produced scale estimates with median error >35% — often with confident, fluent explanations attached.

The reason, once you look at it, is structural. A language model has no mechanism to recover metric scale from pixel content alone; the pixels-to-metres mapping is genuinely not in the image. The "answer" the VLM returns is a language-prior answer — a plausible-sounding number consistent with the class of scene. It is fluent, but it is not a measurement.

A drone does not need fluent guesses about scale. A drone needs correct numbers or honest uncertainty. VANGUARD provides both, by outsourcing the metric question to an object whose size is actually known.

What we had to get right

Class-conditional priors, not a single number. We maintain separate priors for sedan, SUV/crossover, pickup, van, and truck. The detector outputs class probabilities; we marginalise. Using a single "vehicle" size gives error ~11%; using class-conditional gives 5–6%.
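The marginalisation is a one-line expectation over the detector's class posterior. A hedged sketch — the per-class lengths below are illustrative placeholders, not the paper's prior table:

```python
# Illustrative class-conditional size priors (metres); not the paper's values.
CLASS_PRIOR_M = {"sedan": 4.62, "suv": 4.70, "pickup": 5.35, "van": 5.10, "truck": 7.20}

def marginal_width_prior(class_probs: dict[str, float]) -> float:
    """Expected real-world extent under the detector's class probabilities."""
    total = sum(class_probs.values())
    return sum(CLASS_PRIOR_M[c] * p / total for c, p in class_probs.items())

def gsd_class_conditional(box_width_px: float, class_probs: dict[str, float]) -> float:
    """GSD solve with the prior marginalised over vehicle classes."""
    return marginal_width_prior(class_probs) / box_width_px

# An ambiguous sedan/SUV detection blends the two priors:
print(marginal_width_prior({"sedan": 0.7, "suv": 0.3}))  # → 4.644
```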

Orientation matters. The pixel-width of a car depends on its orientation relative to the camera. We ignored this in v1 and it cost us a lot of accuracy. In v2, we use the detector's orientation estimate to apply a cosine correction. This was the single biggest win in the paper.
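First-order version of that correction, under stated assumptions: a nadir view, and a detector yaw in radians measured between the vehicle's long axis and the image x-axis (the paper's exact convention may differ). An axis-aligned box mixes length and width, so this sketch simply rescales the length prior by |cos(yaw)| and clamps near 90°, where the long axis is unobservable:

```python
import math

def effective_prior_m(length_m: float, yaw_rad: float, min_cos: float = 0.2) -> float:
    """Length prior projected onto the image axis; clamped to avoid blow-up near 90°."""
    c = max(abs(math.cos(yaw_rad)), min_cos)
    return length_m * c

# Head-on (yaw = 0) keeps the full length; at 60° only half of it projects.
print(effective_prior_m(4.62, 0.0))          # → 4.62
print(effective_prior_m(4.62, math.pi / 3))  # → ~2.31
```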

Confidence-weighted averaging. With multiple vehicles in frame, we average GSD estimates weighted by each detection's confidence. This gives soft robustness to the occasional misdetection.
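The fusion step is a plain weighted mean. A minimal sketch (the detection tuples are illustrative):

```python
def fuse_gsd(estimates: list[tuple[float, float]]) -> float:
    """Fuse per-detection (gsd_m_per_px, detection_confidence) pairs in one frame."""
    den = sum(w for _, w in estimates)
    if den == 0:
        raise ValueError("no detections to fuse")
    return sum(g * w for g, w in estimates) / den

# One confident inlier and one shaky outlier: the result stays near the inlier.
print(fuse_gsd([(0.020, 0.95), (0.035, 0.20)]))  # ≈ 0.0226
```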

Ground-plane assumption. GSD as we define it assumes the cars sit on a flat ground plane. For drones operating near structures, we extended this to work with a coarse plane estimate from consecutive frames. It works; we don't love it; it's in the limitations section.

GSD ERROR ON UAVID-GSD BENCHMARK (mean absolute error on held-out scenes; lower is better)
  VLM zero-shot (GPT-4V) ......... 37.2%
  Altimeter + fixed intrinsics ... 14.8%
  VANGUARD (coarse) .............. 10.1%
  VANGUARD (fine-grained) ........  5.6%
Fig. 2 — Six-percent error is inside what most downstream tasks tolerate. VLM scale hallucinations are not.

What the method doesn't do

It doesn't work in environments without vehicles. A forest-fire drone cannot use VANGUARD. It doesn't work in non-terrestrial environments at all. And it has a graceful-degradation property we like: in an occasional vehicle-less frame, the last good GSD estimate persists — the drone keeps flying until the next car shows up.

I am a co-author on this one, not the driver. Yifei led the vehicle-prior work; I contributed the calibration analysis and some of the robustness-ablation framing. The project is a small reminder that the right choice of "anchor" — the thing whose size you happen to know — can collapse a hard problem into an easy one, and that anchors don't have to be exotic. Sometimes they are a Honda.
