

Poster in Workshop: 3rd ICLR Workshop on Machine Learning for Remote Sensing

Large Language Models for Captioning and Retrieving Remote Sensing Images

Joao Daniel Silva · Joao Magalhaes · Devis Tuia · Bruno Martins


Abstract:

Remote sensing tasks such as image captioning and cross-modal retrieval enable non-expert users to extract relevant information from Earth observation data by integrating visual and linguistic information. In this work, we propose RS-CapRet, a Vision and Language model for remote sensing data that addresses image captioning and text-image retrieval tasks. We combine a large language model with an image encoder adapted to remote sensing imagery through contrastive language-image pre-training. To bridge the image encoder and the language decoder, we train lightweight linear layers on examples drawn from a combination of different remote sensing image captioning datasets, keeping all other parameters frozen. RS-CapRet can generate descriptions for remote sensing images and retrieve images from textual descriptions, achieving performance competitive with existing methods.
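The abstract describes training only lightweight linear layers to connect a frozen, remote-sensing-adapted image encoder to a frozen language model decoder. Below is a minimal sketch, in PyTorch, of what such a projection bridge could look like; the class name, feature dimensions, and number of visual prefix tokens are illustrative assumptions and not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class VisualPrefixBridge(nn.Module):
    """Sketch: map frozen image-encoder features into the embedding
    space of a frozen language model via a trainable linear projection
    (the only trained parameters in this setup)."""

    def __init__(self, vision_dim: int, llm_dim: int, num_prefix_tokens: int = 4):
        super().__init__()
        self.num_prefix_tokens = num_prefix_tokens
        # Lightweight linear layer: the only trainable component.
        self.proj = nn.Linear(vision_dim, llm_dim * num_prefix_tokens)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, vision_dim), produced by the frozen,
        # remote-sensing-adapted image encoder.
        batch = image_features.shape[0]
        prefix = self.proj(image_features)
        # Reshape into a short sequence of visual "tokens" that would be
        # prepended to the caption embeddings fed to the frozen LLM decoder.
        return prefix.view(batch, self.num_prefix_tokens, -1)


if __name__ == "__main__":
    # Hypothetical dimensions: a CLIP-style encoder output (1024-d) and an
    # LLM with 4096-d token embeddings.
    bridge = VisualPrefixBridge(vision_dim=1024, llm_dim=4096)
    dummy_features = torch.randn(2, 1024)
    visual_tokens = bridge(dummy_features)
    print(visual_tokens.shape)  # torch.Size([2, 4, 4096])
```

During training, only `bridge.proj` would receive gradients, while the image encoder and language model remain frozen, which keeps the number of trained parameters small.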
