Here is a problem that drove me mildly crazy for three months in 2024. You give a medical vision-language model a CT slice and ask, "what organ is this lesion in?" The model reads the entire image, the model produces a confident answer, and the model is wrong — not because it doesn't know what a liver is, but because it looked at the whole abdomen and averaged. Radiologists don't do that. Radiologists squint.
R-LLaVA is the smallest possible intervention that teaches a general-purpose VLM to squint.
The core idea in one paragraph
We take LLaVA and add a second visual signal: a region-of-interest crop, CLIP-encoded, prepended to the ordinary token stream as a kind of prior. The ROI is cheap — it can come from a doctor's quick box, a segmentation model, or, in principle, the question itself. The LLM learns, during instruction-tuning, that when an ROI is present, it probably marks the answer's neighborhood.
That's the whole architectural change. Everything else is training mix and ablations.
Why this beats fancier approaches
There is a line of Med-VQA work that retrains the visual encoder from scratch on medical data, and another line that bolts a segmentation head onto the VLM so it learns to localize end-to-end. Both work. Both are expensive. Both require you to give up a lot of the general-purpose prior the original VLM spent millions of dollars acquiring.
R-LLaVA takes the other path: keep the generalist, give it a hint. Because the hint is a small number of additional tokens, the original model's language fluency is preserved. Because those tokens are aligned with the CLIP visual space the LLM already speaks, no adapter retraining is needed.
The unglamorous part I am proudest of
You can spend a lot of time chasing accuracy points. The unglamorous truth is that in medical VQA, the dataset is the model. Our biggest win came not from architecture but from an instruction-tuning mix where roughly a third of examples had an ROI present and two thirds didn't. The model learned both to use an ROI and to gracefully ignore it when one wasn't provided — which matters because in the wild, most questions arrive without a box.
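That mix is easy to state as code. The sketch below drops the ROI from roughly two-thirds of examples; the ~1/3 ratio is from the text, but the field names and dataset schema are my invention.

```python
import random

def make_mix(examples, roi_fraction=1/3, seed=0):
    """Keep the ROI on ~roi_fraction of examples and drop it elsewhere,
    so the model learns both to use a box and to cope without one."""
    rng = random.Random(seed)
    mixed = []
    for ex in examples:
        keep_roi = ex.get("roi") is not None and rng.random() < roi_fraction
        mixed.append({**ex, "roi": ex["roi"] if keep_roi else None})
    return mixed

# Hypothetical examples, all of which start out with a box.
examples = [{"question": f"q{i}", "roi": (0, 0, 64, 64)} for i in range(3000)]
mix = make_mix(examples)
n_roi = sum(ex["roi"] is not None for ex in mix)
print(round(n_roi / len(mix), 2))  # roughly a third
```

The point of sampling per-example rather than partitioning the dataset is that the same question can appear in either condition across epochs, which discourages the model from associating content with the presence of a box.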
"Robustness to the absence of a hint" is the property that makes an intervention usable. It is also the property that does not make for a good figure in a paper. I remain bitter about this.
What I'd change if I were starting over
I would spend more of the budget on error-correcting data — pairs where the ROI is wrong on purpose. Teaching a model to use a hint is easy; teaching it to distrust a hint is the harder, more clinically useful skill. That's in the roadmap of whoever picks this up next, and it's a better PhD project than it is an evening project, which is why I didn't finish it.
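Generating such pairs is mechanically simple, which is part of why the idea is tempting. A minimal sketch, assuming an (x, y, w, h) box format — the jitter scheme and thresholds here are my assumptions, not anything from the project: displace each box by at least half its own extent, so it no longer covers the true target, while the answer stays tied to the true region.

```python
import random

def corrupt_roi(roi, image_size, min_shift=0.5, seed=None):
    """Shift a box far enough that it misses its original target.

    min_shift is the minimum displacement as a fraction of box size;
    the result is clamped so the box stays inside the image.
    """
    rng = random.Random(seed)
    x, y, w, h = roi
    img_w, img_h = image_size
    dx = rng.choice([-1, 1]) * rng.uniform(min_shift, 1.5) * w
    dy = rng.choice([-1, 1]) * rng.uniform(min_shift, 1.5) * h
    nx = min(max(x + dx, 0), img_w - w)
    ny = min(max(y + dy, 0), img_h - h)
    return (nx, ny, w, h)

roi = (100, 120, 40, 40)
bad = corrupt_roi(roi, image_size=(512, 512), seed=0)
print(bad)  # a same-size box displaced away from (100, 120)
```

The training target for such a pair would keep the original correct answer, so the loss pushes the model to notice when the box and the image evidence disagree — the "distrust" skill, rather than the "use" skill.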
The other thing: I'd try this in non-medical vertical domains. Legal-document VQA with a highlighted clause. Engineering-diagram VQA with a circled component. The "small token, big effect" pattern seems to generalize, and the domains where it matters most are the domains where experts already circle things with a pen.