

Poster in Workshop: Machine Learning for Remote Sensing (ML4RS)

Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment

Tengjun Huang


Abstract:

With the rise of Vision-and-Language Pretraining (VLP), a growing number of downstream tasks adopt the paradigm of pretraining followed by fine-tuning. Although this paradigm has shown promise across a variety of multimodal downstream tasks, its application in the remote sensing domain faces a specific obstacle: the tendency of same-modality embeddings to cluster together, which impedes efficient transfer learning. To address this, we revisit the goal of multimodal transfer learning for downstream tasks from a unified perspective and rethink the optimization process in terms of three distinct objectives. We propose Harmonized Transfer Learning and Modality Alignment (HarMA), a method that simultaneously satisfies task constraints, modality alignment, and single-modality uniform alignment, while minimizing training overhead through parameter-efficient fine-tuning. HarMA can be integrated into almost all existing multimodal pretraining models. Remarkably, using the pretrained weights of GeoRSCLIP, HarMA achieves state-of-the-art performance on two popular multimodal retrieval tasks in remote sensing without requiring any external training data. Even with minimal parameter adjustments, HarMA outperforms the fully fine-tuned GeoRSCLIP on image-text retrieval for the RSICD and RSITMD benchmarks. Code will be released at https://anonymous.4open.science/r/HarMA-62BF/.
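The abstract describes combining a cross-modal alignment objective with a single-modality uniformity objective. The sketch below is a minimal illustration of that general idea, not the authors' actual HarMA objective: it pairs a symmetric InfoNCE-style image-text alignment loss with the hypersphere uniformity loss of Wang & Isola (2020) to discourage same-modality embeddings from collapsing into a tight cluster. The function names, the weighting `lam`, and the temperatures are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_modal_alignment_loss(img, txt, tau=0.07):
    """Symmetric InfoNCE: matched image-text pairs sit on the diagonal."""
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / tau

    def nll(l):
        # Row-wise log-softmax; the positive pair is the diagonal entry.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (nll(logits) + nll(logits.T))

def uniformity_loss(x, t=2.0):
    """Wang & Isola uniformity: spread one modality's embeddings apart."""
    x = l2_normalize(x)
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(x), k=1)  # distinct pairs only
    return np.log(np.mean(np.exp(-t * sq_dists[iu])))

def total_loss(img, txt, lam=0.5):
    """Alignment plus intra-modal uniformity for both modalities."""
    return cross_modal_alignment_loss(img, txt) + lam * (
        uniformity_loss(img) + uniformity_loss(txt)
    )
```

In this toy form, lowering the alignment term pulls matched pairs together across modalities, while the uniformity terms counteract the same-modality clustering the abstract identifies as the main obstacle; how HarMA actually balances these objectives under parameter-efficient fine-tuning is detailed in the paper, not here.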
