Learning Joint Morpho-Molecular Tissue Representations with a Multimodal Transformer
Abstract
Understanding how molecular programs are embedded within tissue morphology is a central challenge in spatial biology. While vision transformer (ViT) foundation models capture rich histological structure and spatial transcriptomics (ST) provides molecular context, existing multimodal approaches largely rely on contrastive alignment and do not directly learn joint morpho-molecular representations. We introduce an early-fusion multimodal transformer that integrates subcellular Xenium transcript readouts directly into the ViT token stream, enabling fine-grained cross-modal interaction without cell segmentation. We evaluate our approach on a gene prediction task: predicting held-out genes from a targeted Xenium panel, given histology and a core gene set. Across a comprehensive benchmark of unimodal baselines and vanilla late-fusion variants, early fusion achieves substantial improvements in gene expression prediction. We further show that these gains are driven primarily by spatially aligned, token-level transcript representations rather than by fusion timing alone: with appropriate transcript tokenization, late fusion can perform on par with early fusion, which explains the limitations observed in prior CLIP-style models. Our results highlight expressive, spatially grounded fusion as a key ingredient for multimodal representation learning in spatial biology.
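To make "spatially aligned, token-level transcript representations" concrete, the following is a minimal sketch of one way transcripts could be tokenized on the ViT patch grid and placed in a joint token stream. It is an illustration under stated assumptions, not the method of this work: the per-patch gene-count binning, the function names `bin_transcripts` and `early_fusion_sequence`, and the toy sizes are all hypothetical, and transcript coordinates are assumed to lie within the image. In a real model the count vectors would also be projected to the embedding width before entering the transformer.

```python
from typing import List, Sequence, Tuple


def bin_transcripts(
    transcripts: Sequence[Tuple[float, float, int]],
    n_genes: int,
    image_size: int,
    patch_size: int,
) -> List[List[int]]:
    """Count transcripts per gene inside each ViT patch (row-major patch order).

    Assumption: each transcript is (x, y, gene_index) in image pixel coordinates.
    No cell segmentation is involved; transcripts are assigned by position only.
    """
    grid = image_size // patch_size
    counts = [[0] * n_genes for _ in range(grid * grid)]
    for x, y, gene in transcripts:
        # Clamp so points on the far image edge fall in the last patch.
        col = min(int(x // patch_size), grid - 1)
        row = min(int(y // patch_size), grid - 1)
        counts[row * grid + col][gene] += 1
    return counts


def early_fusion_sequence(patch_tokens, transcript_tokens):
    """Build one joint sequence so self-attention can mix both modalities.

    Tokens are tagged with their modality here purely for readability.
    """
    return [("image", t) for t in patch_tokens] + [
        ("transcript", t) for t in transcript_tokens
    ]


# Toy example: a 32x32 image with 16x16 patches (2x2 grid) and a 2-gene panel.
transcripts = [(0.0, 0.0, 0), (17.0, 0.0, 1), (17.5, 1.0, 1), (31.9, 31.9, 0)]
counts = bin_transcripts(transcripts, n_genes=2, image_size=32, patch_size=16)
patch_tokens = [[0.0] * 4 for _ in range(4)]  # placeholder image embeddings
sequence = early_fusion_sequence(patch_tokens, counts)
```

Under this sketch, each transcript token shares a grid position with exactly one image patch, which is the sense of "spatially aligned" used here. A late-fusion variant would instead encode the two token sets separately and combine them only after their encoders.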