arXiv · robotics · 2603.04277 · 8 min read

VANGUARD: recovering absolute scale from parked cars.

A drone in a GPS-denied alley knows what it is looking at but not how big anything in the image is. Vision-language models invent scales. We cheated nobly — used the one object whose size is culturally standardised — the sedan — as a ruler.
[Fig. 1 graphic] Sedan length prior ≈ 4.62 m · GSD (m/px) = real_width_m / detected_box_width_px. Whatever the drone's altimeter says, a parked sedan is the Rosetta stone.
Fig. 1 — Pipeline: detect vehicles, look up the canonical size, solve for the pixels-to-metres ratio.

GSD — ground sample distance — is the ratio of real-world metres to image pixels. A drone that knows its GSD can measure, count, navigate, and hand meaningful coordinates to whoever is downstream. A drone that doesn't, can't.

In an outdoor field with sky all around, the altimeter and the camera intrinsics tell you GSD. In a GPS-denied alley, an indoor warehouse, or a cluttered construction site, they don't. Your altitude estimate drifts. Your intrinsics are correct, but without an additional metric cue the scale is underdetermined. This paper is about the world's cheapest metric cue.

The sedan as a ruler

Vehicles are almost uniquely well-suited to the problem: they are everywhere in built environments, standardised in size within each class, and easy to detect with off-the-shelf detectors.

Given a detected car in the image, we recover its pixel width. Given the class-conditional real-world width prior, we solve for GSD directly. It is ugly and simple and correct.
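The solve really is one division. A minimal sketch (not the paper's code; the 4.62 m prior is the value from Fig. 1, and the detection width is illustrative):

```python
SEDAN_LENGTH_M = 4.62  # canonical sedan length prior, as in Fig. 1

def gsd_from_detection(box_width_px: float, real_width_m: float = SEDAN_LENGTH_M) -> float:
    """Ground sample distance in metres per pixel, from one detected vehicle."""
    if box_width_px <= 0:
        raise ValueError("detected box width must be positive")
    return real_width_m / box_width_px

# A sedan spanning 231 px along its long axis implies 2 cm per pixel.
print(gsd_from_detection(231.0))  # → 0.02
```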

[Interactive chart: GSD estimation error (% of true) vs. number of vehicles in frame, comparing a generic "vehicle" class against fine-grained (sedan/SUV/truck) priors. Simulated; matches the paper's Fig. 3 qualitatively.]
Even one vehicle gives a usable GSD. Fine-grained priors cut the error roughly in half.

Why vision-language models hallucinate scale

There is a failure mode that motivated the paper. Modern VLMs are excellent at what questions ("what is in this image?") and unreliable at how big questions ("how far is it?", "how wide?", "what is the GSD?"). They confidently invent answers. We tested this on a held-out set of UAV images and found that SOTA VLMs produced scale estimates with median error >35% — often with confident, fluent explanations attached.

The reason, once you look at it, is structural. A language model has no mechanism to recover metric scale from pixel content alone; the pixels-to-metres mapping is genuinely not in the image. The "answer" the VLM returns is a language-prior answer — a plausible-sounding number consistent with the class of scene. It is fluent, but it is not a measurement.

A drone does not need fluent guesses about scale. A drone needs correct numbers or honest uncertainty. VANGUARD provides both, by outsourcing the metric question to an object whose size is actually known.

What we had to get right

Class-conditional priors, not a single number. We maintain separate priors for sedan, SUV/crossover, pickup, van, and truck. The detector outputs class probabilities; we marginalise. Using a single "vehicle" size gives error ~11%; using class-conditional gives 5–6%.
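The marginalisation is a one-line expectation over the detector's class posterior. A hedged sketch — the per-class lengths below are illustrative placeholders, not the paper's prior table:

```python
# Illustrative class-conditional size priors (metres); not the paper's values.
CLASS_PRIOR_M = {"sedan": 4.62, "suv": 4.70, "pickup": 5.35, "van": 5.10, "truck": 7.20}

def marginal_width_prior(class_probs: dict[str, float]) -> float:
    """Expected real-world extent under the detector's class probabilities."""
    total = sum(class_probs.values())
    return sum(CLASS_PRIOR_M[c] * p / total for c, p in class_probs.items())

def gsd_class_conditional(box_width_px: float, class_probs: dict[str, float]) -> float:
    """GSD solve with the prior marginalised over vehicle classes."""
    return marginal_width_prior(class_probs) / box_width_px

# An ambiguous sedan/SUV detection blends the two priors:
print(marginal_width_prior({"sedan": 0.7, "suv": 0.3}))  # → 4.644
```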

Orientation matters. The pixel-width of a car depends on its orientation relative to the camera. We ignored this in v1 and it cost us a lot of accuracy. In v2, we use the detector's orientation estimate to apply a cosine correction. This was the single biggest win in the paper.
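First-order version of that correction, under stated assumptions: a nadir view, and a detector yaw in radians measured between the vehicle's long axis and the image x-axis (the paper's exact convention may differ). An axis-aligned box mixes length and width, so this sketch simply rescales the length prior by |cos(yaw)| and clamps near 90°, where the long axis is unobservable:

```python
import math

def effective_prior_m(length_m: float, yaw_rad: float, min_cos: float = 0.2) -> float:
    """Length prior projected onto the image axis; clamped to avoid blow-up near 90°."""
    c = max(abs(math.cos(yaw_rad)), min_cos)
    return length_m * c

# Head-on (yaw = 0) keeps the full length; at 60° only half of it projects.
print(effective_prior_m(4.62, 0.0))          # → 4.62
print(effective_prior_m(4.62, math.pi / 3))  # → ~2.31
```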

Confidence-weighted averaging. With multiple vehicles in frame, we average GSD estimates weighted by each detection's confidence. This gives soft robustness to the occasional misdetection.
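The fusion step is a plain weighted mean. A minimal sketch (the detection tuples are illustrative):

```python
def fuse_gsd(estimates: list[tuple[float, float]]) -> float:
    """Fuse per-detection (gsd_m_per_px, detection_confidence) pairs in one frame."""
    den = sum(w for _, w in estimates)
    if den == 0:
        raise ValueError("no detections to fuse")
    return sum(g * w for g, w in estimates) / den

# One confident inlier and one shaky outlier: the result stays near the inlier.
print(fuse_gsd([(0.020, 0.95), (0.035, 0.20)]))  # ≈ 0.0226
```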

Ground-plane assumption. GSD as we define it assumes the cars sit on a flat ground plane. For drones operating near structures, we extended this to work with a coarse plane estimate from consecutive frames. It works; we don't love it; it's in the limitations section.

GSD ERROR ON UAVID-GSD BENCHMARK (mean absolute error on held-out scenes; lower is better)
  VLM zero-shot (GPT-4V) ......... 37.2%
  Altimeter + fixed intrinsics ... 14.8%
  VANGUARD (coarse) .............. 10.1%
  VANGUARD (fine-grained) ........  5.6%
Fig. 2 — Six-percent error is inside what most downstream tasks tolerate. VLM scale hallucinations are not.

What the method doesn't do

It doesn't work in environments without vehicles. A forest-fire drone cannot use VANGUARD. It doesn't work in non-terrestrial environments at all. And it has a graceful-degradation property we like: in an occasional vehicle-less frame, the last good GSD estimate persists — the drone keeps flying until the next car shows up.

I am a co-author on this one, not the driver. Yifei led the vehicle-prior work; I contributed the calibration analysis and some of the robustness-ablation framing. The project is a small reminder that the right choice of "anchor" — the thing whose size you happen to know — can collapse a hard problem into an easy one, and that anchors don't have to be exotic. Sometimes they are a Honda.
