Poster in Workshop: Machine Learning for Genomics Explorations (MLGenX)
Joint Embedding of Transcriptomes and Text Enables Interactive Single-Cell RNA-seq Data Exploration via Natural Language
Moritz Schaefer · Peter Peneder · Daniel Malzl · Anna Hakobyan · Varun Sharma · Thomas Krausgruber · Jörg Menche · Eleni Tomazou · Christoph Bock
Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular states, but interpreting the vast amounts of data it generates remains challenging. Here, we introduce CellWhisperer, a multimodal machine learning model that bridges the gap between transcriptomics data and natural language, enabling intuitive interaction with scRNA-seq datasets. Trained on bulk RNA-seq data from over 650,000 samples and their textual annotations from the Gene Expression Omnibus (GEO), CellWhisperer employs contrastive learning to create a joint embedding space, enabling tasks such as cell retrieval based on free-text queries and zero-shot classification of cell types. We show that these abilities extend to scRNA-seq datasets covering a broad range of cell types. Integrated into the CELLxGENE browser, CellWhisperer allows biologists to explore and label single-cell transcriptomes using natural language queries. Our experiments show that CellWhisperer can accurately annotate cellular states, beyond standard cell types, without relying on reference datasets. This work paves the way for accessible and nuanced interpretation of scRNA-seq data, including datasets that are poorly covered by reference data, leveraging the power of natural language in transcriptomics research.
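To make the idea of zero-shot classification in a joint embedding space concrete, the following is a minimal sketch and not CellWhisperer's actual code or API. It assumes that transcriptome and text encoders have already produced vectors in a shared space; the toy embeddings, the function names `l2_normalize` and `zero_shot_labels`, and the `temperature` value are all hypothetical choices made for illustration. Each cell is assigned the label whose text embedding has the highest cosine similarity to the cell's embedding.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_labels(cell_emb, text_emb, labels, temperature=0.07):
    """Assign each cell the label whose text embedding is most similar.

    cell_emb: (n_cells, d) transcriptome embeddings (assumed precomputed).
    text_emb: (n_labels, d) embeddings of the candidate label texts.
    temperature: hypothetical softmax scaling; it affects the scores, not the ranking.
    """
    cells = l2_normalize(np.asarray(cell_emb, dtype=float))
    texts = l2_normalize(np.asarray(text_emb, dtype=float))
    sims = cells @ texts.T                       # cosine similarities, (n_cells, n_labels)
    logits = sims / temperature
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)    # softmax over candidate labels
    best = probs.argmax(axis=1)
    return [labels[i] for i in best], probs


# Toy, hand-made embeddings (not real model output).
labels = ["T cell", "B cell", "macrophage"]
text_emb = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
cell_emb = np.array([[0.9, 0.1, 0.0],
                     [0.1, 0.2, 0.8]])

predicted, probs = zero_shot_labels(cell_emb, text_emb, labels)
print(predicted)  # ['T cell', 'macrophage']
```

The same similarity matrix would also support free-text retrieval under these assumptions: embedding a single query and sorting cells by their similarity to it yields a ranked list of matching cells.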