"VLN-BERT: Improving Vision-And-Language Navigation With Image-Text Pairs from the Web", 2020-04-30:
Following a navigation instruction such as "Walk down the stairs and stop at the brown sofa" requires embodied AI agents to ground scene elements referenced via language (e.g., "stairs") to visual content in the environment (pixels corresponding to "stairs").
We ask the following question: can we leverage abundant "disembodied" web-scraped vision-and-language corpora (e.g., Conceptual Captions) to learn visual groundings (what do "stairs" look like?) that improve performance on a relatively data-starved embodied perception task (Vision-and-Language Navigation)?
Specifically, we develop VLN-BERT, a visio-linguistic transformer-based model for scoring the compatibility between an instruction ("…stop at the brown sofa") and a sequence of panoramic RGB images captured by the agent.
We demonstrate that pretraining VLN-BERT on image-text pairs from the web before fine-tuning on embodied path-instruction data improves performance on VLN, outperforming the prior state of the art in the fully observed setting by 4 absolute percentage points in success rate.
Ablations of our pretraining curriculum show each stage to be impactful, with their combination yielding further positive synergistic effects.
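The core idea of scoring path-instruction compatibility can be illustrated with a minimal sketch. This is not the authors' model: VLN-BERT learns the score end-to-end with a cross-modal transformer, whereas the toy encoders, embedding dimension, and cosine-similarity scorer below are stand-in assumptions chosen only to make the pipeline concrete.

```python
# Toy sketch of path-instruction compatibility scoring (hypothetical shapes
# and encoders; VLN-BERT itself uses a pretrained cross-modal transformer).
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # assumed embedding size, not from the paper

def encode_instruction(tokens, dim=DIM):
    # Stand-in for a language encoder: map each token to a fixed random
    # vector and mean-pool into one instruction embedding.
    vecs = [np.random.default_rng(abs(hash(t)) % 2**32).standard_normal(dim)
            for t in tokens]
    return np.mean(vecs, axis=0)

def encode_path(panorama_feats):
    # Stand-in for a visual encoder: mean-pool per-panorama RGB features
    # (e.g., CNN features, one row per viewpoint) along the trajectory.
    return np.mean(panorama_feats, axis=0)

def compatibility(instr_emb, path_emb):
    # Cosine similarity as the alignment score; the real model instead
    # learns this score via cross-modal attention over tokens and regions.
    denom = np.linalg.norm(instr_emb) * np.linalg.norm(path_emb) + 1e-8
    return float(instr_emb @ path_emb / denom)

instr = encode_instruction(
    "walk down the stairs and stop at the brown sofa".split())
# Three candidate trajectories, each a sequence of 5 panorama feature vectors.
candidate_paths = [rng.standard_normal((5, DIM)) for _ in range(3)]
scores = [compatibility(instr, encode_path(p)) for p in candidate_paths]
best = int(np.argmax(scores))  # the agent follows the highest-scoring path
print(best, scores)
```

In the fully observed setting the paper evaluates, candidate trajectories are ranked exactly like this: the path whose compatibility score with the instruction is highest is selected, and success rate measures how often that path reaches the goal.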