Poster in Workshop: 7th Robot Learning Workshop: Towards Robots with Human-Level Abilities
Teaching Visual Language Models to Navigate using Maps
Tigran Galstyan · Hakob Tamazyan · Narek Nurijanyan
Visual Language Models (VLMs) have shown impressive abilities in understanding and generating multimodal content by integrating visual and textual information. Recently, language-guided aerial navigation benchmarks have emerged, presenting a novel challenge for VLMs. In this work, we focus on the utilization of navigation maps, a critical component of the broader aerial navigation problem. We analyze the CityNav benchmark, a recently introduced dataset for language-goal aerial navigation that incorporates navigation maps and 3D point clouds of real cities to simulate environments for drones. We demonstrate that existing open-source VLMs perform poorly in understanding navigation maps in a zero-shot setting. To address this, we fine-tune one of the top-performing VLMs, Qwen2-VL, on map data, achieving near-perfect performance on a landmark-based navigation task. Notably, our fine-tuned Qwen2-VL model, using only the landmark map, achieves performance on par with the best baseline model in the CityNav benchmark. This highlights the potential of leveraging navigation maps for enhancing VLM capabilities in aerial navigation tasks.
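The abstract mentions zero-shot evaluation of open-source VLMs on navigation maps before fine-tuning. As a rough illustration only, the sketch below shows how one might prompt Qwen2-VL with a map image using the standard Hugging Face interface; the map path, prompt wording, and landmark question are illustrative assumptions, not the authors' actual CityNav evaluation protocol.

```python
# Minimal sketch: zero-shot query of Qwen2-VL on a navigation map image.
# The image path and prompt are placeholders, not the CityNav setup.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Chat-style message pairing a map image with a landmark question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/landmark_map.png"},  # placeholder path
            {"type": "text", "text": "Which landmark on this map is closest to the drone's start position?"},
        ],
    }
]

# Build the prompt, extract image inputs, and run generation.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

Fine-tuning on map data (as reported in the abstract) would start from the same model and processor, with map-question-answer pairs as supervision; the exact training recipe is not specified here.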