Hiding in Plain Sight: Visible Gene Correlations Undermine Single-Cell Representations
Alon Hacohen ⋅ Joseph Bingham ⋅ Binyamin Perets ⋅ Dvir Aran
Abstract
Many single-cell foundation models (scFMs) learn representations of cellular identity through masked modeling of gene expression. Standard random masking, however, treats genes as independent tokens, a poor match for the modular, co-regulated structure of gene regulatory networks. In this work, we show that this mismatch enables shortcut learning: a model may reconstruct masked genes from locally correlated partners that remain visible rather than capturing global cellular state, yielding representations that perform poorly on underrepresented cell populations. We introduce CorrMask, a data-driven masking strategy that constructs a gene dependency graph from expression covariance and masks correlated gene groups jointly, forcing the model to rely on higher-order biological context. Evaluated on tissue-specific corpora, CorrMask produces representations that improve both cell type annotation, particularly for underrepresented populations, and gene-level generalization, while matching standard baselines with up to $3\times$ less pre-training data. Our results suggest that meaningful single-cell representations require pre-training objectives that respect the dependency structure of the transcriptome.
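To make the idea of jointly masking correlated genes concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say how the dependency graph is built or partitioned, so this sketch assumes Pearson correlation, a fixed absolute-correlation threshold, and connected components as gene groups. The names `correlated_groups`, `group_mask`, `threshold`, and `mask_ratio` are hypothetical, and the data are synthetic.

```python
import numpy as np

def correlated_groups(expr, threshold=0.5):
    """Assign each gene a group label (hypothetical helper).

    Assumption: groups are connected components of the graph whose edges
    link gene pairs with |Pearson correlation| >= threshold.
    expr has shape (cells, genes).
    """
    corr = np.corrcoef(expr, rowvar=False)   # genes x genes
    adj = np.abs(corr) >= threshold          # NaN (constant gene) compares False
    n_genes = adj.shape[0]
    labels = np.full(n_genes, -1, dtype=int)
    next_label = 0
    for start in range(n_genes):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:                         # depth-first traversal of one component
            gene = stack.pop()
            for nb in np.flatnonzero(adj[gene]):
                if labels[nb] == -1:
                    labels[nb] = next_label
                    stack.append(nb)
        next_label += 1
    return labels

def group_mask(labels, mask_ratio, rng):
    """Mask whole groups, in random order, until at least mask_ratio of genes are hidden."""
    n_genes = labels.size
    target = int(np.ceil(mask_ratio * n_genes))
    mask = np.zeros(n_genes, dtype=bool)
    for grp in rng.permutation(int(labels.max()) + 1):
        if mask.sum() >= target:
            break
        mask[labels == grp] = True           # a group is masked all at once, never partially
    return mask

# Synthetic example: two co-regulated modules of three genes each, plus one independent gene.
rng = np.random.default_rng(0)
a = rng.normal(size=(500, 1))                # latent driver of module A
b = rng.normal(size=(500, 1))                # latent driver of module B
expr = np.hstack([
    a + 0.1 * rng.normal(size=(500, 3)),
    b + 0.1 * rng.normal(size=(500, 3)),
    rng.normal(size=(500, 1)),
])
labels = correlated_groups(expr, threshold=0.5)
mask = group_mask(labels, mask_ratio=0.4, rng=rng)
```

Because each group is masked as a unit, a masked gene never has a strongly correlated partner left visible, which is the shortcut that random per-gene masking leaves open. Masking whole groups can overshoot the requested ratio, so a practical variant might cap group size or adjust the final group.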