Oral presentation at the 3rd ICLR Workshop on Machine Learning for Remote Sensing
Using multiple input modalities can improve data-efficiency for ML with satellite imagery
Arjun Rao · Esther Rolf
A large corpus of diverse geospatial data layers is available around the world, ranging from remotely sensed raster data such as satellite imagery, digital elevation maps, and predicted land cover maps, to human-annotated data such as OpenStreetMap, to data derived from environmental sensors such as air temperature or wind speed measurements. The large majority of geospatial machine learning (GeoML) models, however, are designed primarily for optical input modalities such as multi-spectral satellite imagery. We show improved GeoML model performance on classification and segmentation tasks when these geospatial inputs are fused with optical input imagery as additional contextual cues, either as additional input bands or as auxiliary tokens passed to a Vision Transformer, within a supervised learning setting. Benefits are largest when labeled data are limited and in geographic out-of-sample settings, suggesting that multi-modal inputs may be especially valuable for the data efficiency and out-of-sample performance of GeoML models.
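To make the two fusion strategies concrete, below is a minimal PyTorch sketch of (1) stacking an auxiliary geospatial layer as an extra input band of the optical image, and (2) passing an auxiliary modality as an extra token to a small Vision Transformer. This is an illustration under assumed shapes and hyperparameters, not the paper's implementation; the module names (ChannelFusionCNN, AuxTokenViT) and all dimensions here are hypothetical.

```python
# Minimal sketch of the two fusion strategies from the abstract.
# Not the authors' code: architectures, names, and sizes are hypothetical.
import torch
import torch.nn as nn

class ChannelFusionCNN(nn.Module):
    """Strategy 1: concatenate an auxiliary geospatial layer (e.g. a
    digital elevation map) as an additional input band of the optical image."""
    def __init__(self, n_optical_bands=4, n_aux_bands=1, n_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(n_optical_bands + n_aux_bands, 32, 3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, n_classes),
        )

    def forward(self, optical, aux):
        # optical: (B, n_optical_bands, H, W); aux: (B, n_aux_bands, H, W)
        return self.backbone(torch.cat([optical, aux], dim=1))

class AuxTokenViT(nn.Module):
    """Strategy 2: embed the auxiliary modality as one extra token appended
    to the patch tokens of a (toy) Vision Transformer."""
    def __init__(self, n_optical_bands=4, aux_dim=8, patch=16,
                 dim=64, n_classes=10, img=64):
        super().__init__()
        self.patch_embed = nn.Conv2d(n_optical_bands, dim,
                                     kernel_size=patch, stride=patch)
        # Projects e.g. a vector of climate statistics into token space.
        self.aux_embed = nn.Linear(aux_dim, dim)
        n_patches = (img // patch) ** 2
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        # Positions for: class token + image patches + one auxiliary token.
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 2, dim))
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, optical, aux_vec):
        # optical: (B, C, H, W); aux_vec: (B, aux_dim), e.g. mean temperature
        tokens = self.patch_embed(optical).flatten(2).transpose(1, 2)  # (B, N, dim)
        aux_tok = self.aux_embed(aux_vec).unsqueeze(1)                 # (B, 1, dim)
        cls = self.cls.expand(tokens.size(0), -1, -1)
        x = torch.cat([cls, tokens, aux_tok], dim=1) + self.pos
        # Classify from the class token after the encoder.
        return self.head(self.encoder(x)[:, 0])

# Hypothetical usage with random tensors standing in for real data.
cnn = ChannelFusionCNN()
print(cnn(torch.randn(2, 4, 64, 64), torch.randn(2, 1, 64, 64)).shape)  # (2, 10)
vit = AuxTokenViT()
print(vit(torch.randn(2, 4, 64, 64), torch.randn(2, 8)).shape)          # (2, 10)
```

Both modules take the optical tensor and the auxiliary input as separate arguments, so either fusion point can be swapped in behind the same supervised training loop; for segmentation the classification head would be replaced with a dense decoder.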