Enhancing Aerial Vision-Language Navigation with Map Grounding and History Awareness
Abstract
Vision-Language Navigation (VLN) for urban UAVs is frequently hindered by landmark blindness, in which target landmarks are not visible from the agent's initial viewpoint. We address this problem by fine-tuning small Vision-Language Models (VLMs) with a ``Map-in-Pixel'' approach that interleaves 16 steps of egocentric visual frames with global geographic snapshots. To mitigate the data scarcity inherent in VLN datasets, we propose a synthetic augmentation strategy that generates diverse, causally consistent trajectories from randomized starting points. Through granular evaluation and targeted trajectory synthesis, we demonstrate that this history-rich training significantly improves the agent's ability to navigate toward distant objects. Our approach achieves a success rate of 12.5\% on the CityNav unseen test set, nearly doubling the baseline (6.4\%), while also reducing navigation error relative to that baseline. These results highlight the value of pixel-encoded maps, temporal history, and targeted data-centric design in enabling small-scale multimodal agents to carry out long-horizon missions.
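To make the interleaving idea concrete, the following is a minimal sketch of how a 16-step history of egocentric frames and map snapshots might be assembled into a single input sequence. It is an illustration under stated assumptions, not the implementation used in this work: the function name \texttt{interleave\_history}, the one-map-per-step pairing, the ego-then-map ordering, and the truncation of shorter histories without padding are all hypothetical choices.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")

HISTORY_LEN = 16  # number of history steps stated in the abstract


def interleave_history(
    ego_frames: Sequence[Frame],
    map_snapshots: Sequence[Frame],
    history_len: int = HISTORY_LEN,
) -> List[Tuple[str, Frame]]:
    """Return the most recent steps as [ego, map, ego, map, ...].

    Assumption (not from the source): each step has exactly one map
    snapshot, placed directly after that step's egocentric frame.
    Histories shorter than `history_len` are returned unpadded.
    """
    if len(ego_frames) != len(map_snapshots):
        raise ValueError("expected one map snapshot per egocentric frame")
    start = max(0, len(ego_frames) - history_len)
    sequence: List[Tuple[str, Frame]] = []
    for ego, geo in zip(ego_frames[start:], map_snapshots[start:]):
        sequence.append(("ego", ego))
        sequence.append(("map", geo))
    return sequence


# Placeholder strings stand in for image arrays.
ego = [f"ego_{t}" for t in range(20)]
maps = [f"map_{t}" for t in range(20)]
history = interleave_history(ego, maps)
print(len(history))  # 32
print(history[:2])   # [('ego', 'ego_4'), ('map', 'map_4')]
```

With 20 recorded steps, only the last 16 are kept, giving 32 interleaved entries that would then be passed to the model's image processor together with the navigation instruction.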