Genes Are Not Words: Dependency-Aware Masking for Single-Cell Foundation Models
Alon Hacohen ⋅ Joseph Bingham ⋅ Binyamin Perets ⋅ Dvir Aran
Abstract
Many single-cell foundation models (scFMs) learn representations of cellular identity by applying masked language modeling to gene expression data, yet the direct transfer from NLP imports an implicit independence assumption that conflicts with the modular, co-regulated structure of gene regulatory networks. In this work, we show that this domain mismatch enables shortcut learning: models reconstruct masked genes from locally correlated partners rather than encoding global cellular state. This disproportionately harms underrepresented cell types, namely the rare and transitional populations most relevant to perturbation biology and target identification. We introduce CorrMask, a genomics-native masking strategy that constructs a gene dependency graph from expression covariance and masks correlated gene groups jointly, forcing the model to rely on higher-order biological context. Evaluated on tissue-specific corpora, CorrMask substantially improves annotation of underrepresented cell populations, enhances gene-level generalization as measured by dosage sensitivity prediction, and matches standard baselines with up to $3\times$ less pre-training data, all without architectural changes or additional external data. Our results highlight gene co-regulation as a critical barrier to effective self-supervised learning in genomics, and demonstrate how scFMs can benefit from domain-aware masking.
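To make the idea of joint masking concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a cells-by-genes expression matrix, builds a dependency graph by thresholding absolute Pearson correlation, treats connected components as gene groups, and masks whole groups until a masking budget is reached. The function names `correlated_groups` and `group_mask`, the threshold of 0.6, and the 15% masking fraction are all hypothetical choices made for illustration; the abstract does not specify how CorrMask builds its graph or selects groups.

```python
import numpy as np


def correlated_groups(X, threshold=0.6):
    """Group genes into connected components of a thresholded correlation graph.

    X is a (cells x genes) expression matrix. The threshold is an assumed
    hyperparameter, not a value taken from the paper.
    """
    corr = np.corrcoef(X, rowvar=False)   # genes x genes Pearson correlation
    corr = np.nan_to_num(corr)            # constant genes yield NaN; treat as uncorrelated
    adj = np.abs(corr) >= threshold       # boolean adjacency of the dependency graph
    n_genes = adj.shape[0]
    labels = -np.ones(n_genes, dtype=int)
    current = 0
    for start in range(n_genes):
        if labels[start] != -1:
            continue
        # Depth-first search to label one connected component.
        labels[start] = current
        stack = [start]
        while stack:
            gene = stack.pop()
            for neighbor in np.flatnonzero(adj[gene]):
                if labels[neighbor] == -1:
                    labels[neighbor] = current
                    stack.append(neighbor)
        current += 1
    return [np.flatnonzero(labels == k) for k in range(current)]


def group_mask(groups, n_genes, mask_frac=0.15, seed=None):
    """Mask whole gene groups in random order until the budget is met.

    Because groups are masked atomically, the final masked fraction can
    overshoot mask_frac by up to the size of the last group chosen.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros(n_genes, dtype=bool)
    budget = int(round(mask_frac * n_genes))
    for idx in rng.permutation(len(groups)):
        if mask.sum() >= budget:
            break
        mask[groups[idx]] = True
    return mask
```

Under this sketch, a gene is never masked while a strongly correlated partner remains visible, which is the property the abstract attributes to CorrMask. Standard per-gene random masking corresponds to the special case where every group contains a single gene.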