Poster in Workshop: 7th Robot Learning Workshop: Towards Robots with Human-Level Abilities
Teaching Visual Language Models to Navigate using Maps
Tigran Galstyan · Hakob Tamazyan · Narek Nurijanyan
Visual Language Models (VLMs) have shown impressive abilities in understanding and generating multimodal content by integrating visual and textual information. Recently, language-guided aerial navigation benchmarks have emerged, presenting a novel challenge for VLMs. In this work, we focus on the utilization of navigation maps, a critical component of the broader aerial navigation problem. We analyze the CityNav benchmark, a recently introduced dataset for language-goal aerial navigation that incorporates navigation maps and 3D point clouds of real cities to simulate environments for drones. We demonstrate that existing open-source VLMs perform poorly in understanding navigation maps in a zero-shot setting. To address this, we fine-tune one of the top-performing VLMs, Qwen2-VL, on map data, achieving near-perfect performance on a landmark-based navigation task. Notably, our fine-tuned Qwen2-VL model, using only the landmark map, achieves performance on par with the best baseline model in the CityNav benchmark. This highlights the potential of leveraging navigation maps for enhancing VLM capabilities in aerial navigation tasks.
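The abstract mentions zero-shot evaluation of open-source VLMs on navigation maps before fine-tuning. As a rough illustration only, the sketch below shows how one might prompt Qwen2-VL with a map image using the standard Hugging Face interface; the map path, prompt wording, and landmark question are illustrative assumptions, not the authors' actual CityNav evaluation protocol.

```python
# Minimal sketch: zero-shot query of Qwen2-VL on a navigation map image.
# The image path and prompt are placeholders, not the CityNav setup.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Chat-style message pairing a map image with a landmark question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/landmark_map.png"},  # placeholder path
            {"type": "text", "text": "Which landmark on this map is closest to the drone's start position?"},
        ],
    }
]

# Build the prompt, extract image inputs, and run generation.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

Fine-tuning on map data (as reported in the abstract) would start from the same model and processor, with map-question-answer pairs as supervision; the exact training recipe is not specified here.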