Poster in Workshop: Integrating Generative and Experimental Platforms for Biomolecular Design

Structure-Aware Language Models Trained on Ultra-Mega-Scale Metagenomic Data Improve Protein Folding Stability Prediction

Yehlin Cho · Kotaro Tsuboyama · Gabriel Rocklin · Sergey Ovchinnikov


Abstract:

Predicting absolute protein stability remains challenging due to the limited availability of experimental datasets and the intricate interplay between sequence and structure contributions to protein stability. In this study, we experimentally measured the folding stability of 2 million high-quality, diverse metagenomic MGnify sequences using high-throughput cDNA display methods. The dataset comprises 814,000 wild-type (WT) proteins together with variants carrying point mutations and insertions/deletions. We fine-tuned the structure-aware language models SaProt and ESM3 on these stability measurements using LoRA (Low-Rank Adaptation), achieving a Spearman correlation of 0.87 on the MGnify test dataset. Our results demonstrate that these models can predict absolute folding stability for both insertion/deletion and mutational effects, even on non-cDNA-display datasets that cover a wide stability range and include large proteins.
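The overall recipe described in the abstract (LoRA fine-tuning of a protein language model on scalar stability labels, evaluated by Spearman correlation) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: it uses an ESM-2 checkpoint from HuggingFace as a stand-in backbone for SaProt/ESM3, and the model name, regression head, and hyperparameters are assumptions for illustration only.

```python
# Minimal sketch (not the authors' implementation): LoRA fine-tuning of a
# protein language model for per-sequence folding-stability regression,
# evaluated with Spearman correlation. Backbone, head, and hyperparameters
# are illustrative assumptions.
import torch
from torch import nn
from transformers import AutoTokenizer, EsmModel
from peft import LoraConfig, get_peft_model
from scipy.stats import spearmanr

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
backbone = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
hidden_size = backbone.config.hidden_size

# Attach low-rank adapters (LoRA) to the attention projections; only the
# adapter weights (and the regression head below) are trained.
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["query", "value"])
backbone = get_peft_model(backbone, lora_cfg)

class StabilityRegressor(nn.Module):
    """Mask-aware mean pooling of residue embeddings -> scalar stability."""
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        h = self.backbone(input_ids=input_ids,
                          attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (h * mask).sum(dim=1) / mask.sum(dim=1)
        return self.head(pooled).squeeze(-1)

model = StabilityRegressor(backbone, hidden_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(seqs, stability_targets):
    """One gradient step on a batch of sequences and stability labels."""
    model.train()
    batch = tokenizer(seqs, return_tensors="pt", padding=True)
    pred = model(**batch)
    target = torch.as_tensor(stability_targets, dtype=torch.float32)
    loss = nn.functional.mse_loss(pred, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

@torch.no_grad()
def evaluate(seqs, stability_targets):
    """Spearman rho between predictions and measured stabilities."""
    model.eval()
    batch = tokenizer(seqs, return_tensors="pt", padding=True)
    pred = model(**batch)
    rho, _ = spearmanr(pred.numpy(), stability_targets)
    return rho
```

In this sketch, only the LoRA adapter parameters and the small linear head receive gradient updates, which keeps fine-tuning cheap relative to full-model training on millions of measured sequences.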
