Conditioning Protein Language Models Using High-Throughput Sequence-Fitness Data Collection
Abstract
Generative models of protein sequences, such as protein language models (pLMs), can generate novel functional sequences, but most generation strategies do not incorporate labeled fitness data from experiments. In this study, we explore fitness-conditioned generation from an autoregressive pLM that captures evolutionary information from a protein family, using direct preference optimization (DPO) with a large experimental dataset. Our approach leverages MillionFull, a high-throughput method used to collect over 100,000 unique sequence-fitness pairs for O-methyltransferases (OMTs) catalyzing the formation of isovanillic acid, a non-native reaction. Specifically, we pretrain ProGen2 on natural OMTs and then use the MillionFull-collected labeled dataset to align the pLM toward generating higher-fitness sequences. The DPO-conditioned model generates sequences with significantly higher predicted fitness than the pretrained model while maintaining high sequence diversity and mutational profiles consistent with top-performing experimental variants. Wet-lab validation confirms that the best-performing DPO variant achieves a 16-fold fitness increase over the parent sequence and a 3-fold increase over the top variant in the training data. Overall, we demonstrate a robust "lab-in-the-loop" framework capable of generating diverse, high-fitness enzyme variants for non-native functional targets.
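To make the alignment step concrete, the following is a minimal sketch of how labeled sequence-fitness pairs could be turned into preference pairs and scored with the standard pairwise DPO objective. It is an illustration under stated assumptions, not the training code used in this work: the pairing rule, the `min_gap` threshold, the value of `beta`, and the function names `build_preference_pairs` and `dpo_loss` are all hypothetical, and the log-probabilities are assumed to be sequence-level log-likelihoods from the model being trained (policy) and the frozen pretrained model (reference).

```python
import math
from itertools import combinations


def build_preference_pairs(labeled, min_gap=0.0):
    """Turn (sequence, fitness) records into (preferred, dispreferred) pairs.

    Assumption: any two variants whose fitness differs by more than
    `min_gap` form one pair, with the higher-fitness variant preferred.
    """
    pairs = []
    for (seq_a, fit_a), (seq_b, fit_b) in combinations(labeled, 2):
        if abs(fit_a - fit_b) <= min_gap:
            continue  # gap too small to treat as a reliable preference
        pairs.append((seq_a, seq_b) if fit_a > fit_b else (seq_b, seq_a))
    return pairs


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard pairwise DPO loss for one (preferred, dispreferred) pair.

    loss = -log sigmoid(beta * [(log pi(w) - log ref(w))
                                - (log pi(l) - log ref(l))])
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Toy records; the sequences and fitness values are made up.
    toy = [("AAA", 1.0), ("AAC", 3.0), ("AAG", 3.0)]
    print(build_preference_pairs(toy))
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-11.0, -11.0, -11.0, -11.0))
```

The loss falls below log(2) once the policy assigns relatively more probability than the reference to the higher-fitness sequence, which is the signal that pushes generation toward higher fitness. With more than 100,000 labeled variants, enumerating all pairs as above would be impractical, so a real pipeline would presumably sample or rank pairs instead.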