ST-Align: Multi-Scale Image-Gene Foundation Modeling for Spatial Transcriptomics via Spot-Niche Alignment
Abstract
Spatial transcriptomics (ST) measures genome-wide gene expression together with tissue morphology at spatially indexed locations, enabling region-resolved molecular analysis that is not accessible to bulk sequencing or histology alone. Learning a robust multimodal representation from ST is challenging: spot images are low resolution, spot gene vectors reflect mixed-cell composition, and biologically meaningful signal often depends on local neighborhoods rather than isolated spots. We present ST-Align, a domain-adapted image-gene foundation model that injects an explicit spot-niche inductive bias for ST. ST-Align represents each spot together with a local neighborhood (niche) and aligns image and gene representations at three levels: spot-level image-gene alignment, niche-level alignment between neighborhood morphology and aggregated gene expression, and a cross-scale spot-niche objective that couples cellular- and tissue-scale information. We pretrain ST-Align on 1.3 million spot-level image-gene pairs from 573 curated human 10x Visium slides (STimage-1K4M) and evaluate (i) zero-shot transfer for spatial domain identification on six held-out human brain slices and (ii) image-to-gene prediction under patient-level splits. ST-Align improves spatial domain identification by 28.7% over the best multimodal baseline (ARI 0.340 vs. 0.256) and reduces gene prediction error by 16.5% (MSE 0.168 vs. 0.184), with particularly strong gains for non-laminar genes. Overall, these results support multi-scale spot-niche alignment as a practical principle for building foundation models in spatial biology.
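To make the three alignment levels concrete, the following is a minimal NumPy sketch of how such a multi-scale objective could be assembled. It is an illustration under stated assumptions, not the ST-Align implementation: the abstract does not specify the loss, so a symmetric InfoNCE (CLIP-style) contrastive loss is assumed here. The function names, the temperature, the equal loss weights, and the particular form of the cross-scale term (spot image against niche gene, and niche image against spot gene) are all hypothetical choices made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(a, b, temperature=0.07):
    """Assumed CLIP-style loss: row i of `a` is the positive for row i of `b`."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)


def multi_scale_alignment_loss(spot_img, spot_gene, niche_img, niche_gene,
                               weights=(1.0, 1.0, 1.0)):
    """Combine spot-level, niche-level, and cross-scale terms (hypothetical form).

    All inputs are (n_spots, dim) embedding matrices in a shared space;
    row i of every matrix refers to the same spot and its niche.
    """
    l_spot = symmetric_info_nce(spot_img, spot_gene)      # spot image vs. spot gene
    l_niche = symmetric_info_nce(niche_img, niche_gene)   # niche image vs. niche gene
    # Assumed cross-scale coupling: pair each spot with its own niche across modalities.
    l_cross = 0.5 * (symmetric_info_nce(spot_img, niche_gene)
                     + symmetric_info_nce(niche_img, spot_gene))
    w_spot, w_niche, w_cross = weights
    return w_spot * l_spot + w_niche * l_niche + w_cross * l_cross
```

In this sketch the four inputs would come from image and gene encoders applied to each spot and to its aggregated neighborhood; the encoders themselves are outside the scope of the example. Equal weights are used only as a placeholder, and the relative weighting of the three terms would need to be tuned.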