Poster in Workshop: Machine Learning for Genomics Explorations (MLGenX)
sc-OTGM: Single-Cell Perturbation Modeling by Solving Optimal Mass Transport on the Manifold of Gaussian Mixtures
Andac Demir · Elizaveta Solovyeva · Jamie Boylan · Mei Xiao · Fabrizio Serluca · Sebastian Hoersch · Murthy Devarakonda · Bulent Kiziltan
Single-cell pretrained models are emerging, strongly influenced by breakthroughs in Large Language Models (LLMs). These models adapt transformers by treating genes as tokens, analogous to words in a sentence. Yet they overlook a crucial distinction: unlike sequential text, scRNA-seq data is represented as a bag of genes, not as RNA sequences, with no sequential relationship among the genes. Moreover, the quantity and quality of single-cell data are significantly lower than those of natural language data: single-cell data often suffers from technical artifacts and dropout events, as well as significant batch effects between sequencing platforms and experiments. Additionally, in zero-shot or limited-training-data settings, these models are outperformed on cell type annotation by simpler models such as logistic regression. To address these challenges, we present sc-OTGM, which has fewer than 500K parameters, making it approximately 100x more compact than state-of-the-art LLMs and a more efficient alternative for genomic data analysis. sc-OTGM is an unsupervised model grounded in the inductive bias that scRNA-seq data can be generated from a finite mixture of multivariate Gaussian distributions. Its core function is to create a probabilistic latent space with a Gaussian mixture model (GMM) as the prior distribution and to distinguish between cell populations by learning their respective marginal probability density functions (PDFs). It then employs a Hit-and-Run Markov chain sampler to learn the optimal transport (OT) plan across these PDFs within the GMM manifold. sc-OTGM offers dual advantages: it aids in differential gene expression analysis and cell type annotation, and it predicts the effects of single-gene perturbations at the transcript level.
Our experiments on the CROP-seq dataset, where individual genes are selectively up- or down-regulated, validate sc-OTGM's ability to identify perturbed genes and predict perturbation effects accurately. Due to the double-blind review process, the codebase will be available upon acceptance of the paper.
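The abstract does not spell out how the Hit-and-Run sampler is configured, so the following is only a minimal, generic sketch of the idea, not the sc-OTGM implementation. It assumes a hypothetical two-component, diagonal-covariance Gaussian mixture in two dimensions (standing in for two cell populations in a latent space) and approximates the one-dimensional conditional along each random line on a fixed grid. All names, weights, means, and grid settings are illustrative assumptions.

```python
import math
import random

# Hypothetical mixture: (weight, mean, per-axis standard deviation).
# These values are made up for illustration and do not come from the paper.
COMPONENTS = [
    (0.5, (-2.0, 0.0), (1.0, 1.0)),
    (0.5, (2.0, 1.0), (0.7, 1.2)),
]


def gmm_density(x):
    """Density of a diagonal-covariance Gaussian mixture at point x."""
    total = 0.0
    for weight, mean, std in COMPONENTS:
        log_p = 0.0
        for xi, mi, si in zip(x, mean, std):
            log_p += -0.5 * ((xi - mi) / si) ** 2
            log_p -= math.log(si * math.sqrt(2.0 * math.pi))
        total += weight * math.exp(log_p)
    return total


def random_direction(dim, rng):
    """Uniform direction on the unit sphere via normalized Gaussian draws."""
    d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in d))
    return [v / norm for v in d]


def hit_and_run(x0, n_steps, rng, half_width=8.0, n_grid=101):
    """Hit-and-Run sampling with a grid approximation along each line.

    At every step a random direction d is drawn, the target density is
    evaluated at x + t * d for t on a symmetric grid, and the next state
    is picked with probability proportional to that density. The grid and
    the finite window make this an approximation (an assumption of this
    sketch), not an exact sampler.
    """
    x = list(x0)
    step = 2.0 * half_width / (n_grid - 1)
    offsets = [-half_width + i * step for i in range(n_grid)]
    samples = []
    for _ in range(n_steps):
        d = random_direction(len(x), rng)
        weights = [
            gmm_density([xi + t * di for xi, di in zip(x, d)])
            for t in offsets
        ]
        t = rng.choices(offsets, weights=weights, k=1)[0]
        x = [xi + t * di for xi, di in zip(x, d)]
        samples.append(tuple(x))
    return samples


rng = random.Random(0)
samples = hit_and_run((0.0, 0.0), n_steps=1000, rng=rng)
mean_x = sum(s[0] for s in samples) / len(samples)
print(f"visited {len(samples)} states, mean of first coordinate: {mean_x:.2f}")
```

Because each move follows a whole line through the current state, the chain can jump between mixture components in a single step when the line crosses both. This is one reason such samplers are attractive for multimodal densities like a GMM.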