Poster in Workshop: Integrating Generative and Experimental Platforms for Biomolecular Design

Conditioning Protein Language Models Using High-Throughput Sequence-Fitness Data Collection

Sonia Yuan ⋅ Jason Yang ⋅ Jinbei Li ⋅ Bastian Vogeli ⋅ Simon Krarup ⋅ Emily Roberts ⋅ Bjarke Erichsen ⋅ Vanessa Hurtado Mujica ⋅ Kenan Jijakli ⋅ Søren Karst ⋅ Lei Yang ⋅ Alex Nielsen ⋅ Tyler Korman ⋅ Frances Arnold


Abstract

Current generative models of protein sequences, such as protein language models (pLMs), can generate novel functional sequences, but most strategies do not integrate labeled fitness data from real-world experiments. In this study, we explore fitness-conditioned generation from an autoregressive pLM that captures evolutionary information from a protein family, using direct preference optimization (DPO) with large amounts of real-world experimental data. Our method leverages MillionFull, a high-throughput method used to collect over 100,000 unique sequence-fitness pairs for O-methyltransferases (OMTs) that form isovanillic acid, a non-native reaction. Specifically, we pretrain ProGen2 on natural OMTs, then use the MillionFull-collected labeled dataset to align the pLM so that it generates sequences with higher fitness. The DPO-conditioned model generates sequences with significantly higher predicted fitness than the pretrained model, while maintaining high sequence diversity and mutational profiles consistent with top-performing experimental variants. Wet-lab validation confirms that the best-performing DPO variant shows a 16-fold fitness increase over the parent sequence and a 3-fold increase over the top variant in the training data. Overall, we demonstrate a robust "lab-in-the-loop" framework capable of generating diverse, high-fitness enzyme variants for non-native functional targets.
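
To make the alignment step concrete, here is a minimal sketch of the standard DPO objective applied to sequence-fitness data. It is not the authors' implementation. The abstract does not say how preference pairs are built from the fitness measurements, so the pairing rule, the `min_gap` threshold, the `beta` value, and the function names `make_preference_pairs` and `dpo_loss` are all assumptions made for illustration. The sequences are toy strings, and the log-probabilities stand in for values that a pretrained (reference) pLM and its fine-tuned copy (policy) would provide.

```python
import math
from itertools import combinations


def make_preference_pairs(seq_fitness, min_gap=0.0):
    """Turn (sequence, fitness) records into (preferred, rejected) pairs.

    Hypothetical pairing rule: within every pair of records, the sequence
    with the higher measured fitness is preferred. Pairs whose fitness
    difference is at most `min_gap` are skipped as uninformative.
    """
    pairs = []
    for (seq_a, fit_a), (seq_b, fit_b) in combinations(seq_fitness, 2):
        if abs(fit_a - fit_b) <= min_gap:
            continue
        pairs.append((seq_a, seq_b) if fit_a > fit_b else (seq_b, seq_a))
    return pairs


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred w, rejected l) pair.

    Each argument is the total log-probability of a full sequence under
    the policy model or the frozen reference model. `beta` (an assumed
    value here) scales how strongly the policy may move from the reference.
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy records: (sequence, measured fitness). Values are illustrative only.
records = [("AAA", 1.0), ("AAC", 3.0), ("AAG", 3.0)]
pairs = make_preference_pairs(records)

# Placeholder log-probabilities for one pair, not model outputs.
loss = dpo_loss(policy_logp_w=-5.0, policy_logp_l=-12.0,
                ref_logp_w=-8.0, ref_logp_l=-9.0)
```

When the policy and reference assign identical log-probabilities, the loss equals log 2. It falls below that value as the policy shifts probability toward the higher-fitness sequence relative to the reference, which is the behavior the alignment step relies on.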
