Poster in Workshop: Integrating Generative and Experimental Platforms for Biomolecular Design
Steering Generative Models with Experimental Data for Protein Fitness Optimization
Jason Yang · Wenda Chu · Daniel Khalil · Raul Astudillo · Bruce Wittmann · Frances Arnold · Yisong Yue
Protein fitness optimization involves finding an ideal protein sequence that satisfies desired quantitative properties within an astronomically large design space of possible sequences, where real-world fitness can typically be measured for only a few hundred sequences. Existing machine learning approaches for efficiently navigating the protein design space broadly fall into two categories, discriminative (often supervised) modeling and generative modeling, each with its own strengths and weaknesses. Supervised models can be used to identify promising variants, but they require predicting fitness values for all possible sequences in a design space. Generative models, in contrast, are not hampered by the size of a design space, but historically it has been difficult to direct these models toward specific fitness goals. To address these limitations, we propose a framework for protein sequence optimization in which generative priors on natural sequences are steered with assay-labeled fitness data, taking advantage of both unlabeled and labeled data. Specifically, we evaluate discrete diffusion and language models in combination with various steering techniques, such as guidance and reinforcement learning. Our computational studies on the TrpB and CreiLOV protein fitness datasets show that several of these methods, particularly guidance with discrete diffusion models, are effective strategies for protein fitness optimization.
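To give a rough intuition for the kind of steering the abstract describes, the sketch below shows one hypothetical way a fitness surrogate trained on assay-labeled data could reweight samples drawn from a generative prior. This is not the authors' method or code: the prior, the surrogate (`toy_prior_logits`, `toy_fitness`), and the reweighting scheme (exponential tilting of candidate sequences by predicted fitness) are all illustrative assumptions standing in for a pretrained discrete diffusion or language model and a learned fitness predictor.

```python
# Hypothetical sketch: steering a generative sequence prior with a fitness
# surrogate via guidance-style reweighting. All names and the toy models are
# placeholders, not the paper's implementation.

import torch
import torch.nn.functional as F

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = len(AMINO_ACIDS)
SEQ_LEN = 8  # toy design space, e.g., a handful of mutated positions


def toy_prior_logits(seq_len: int, vocab: int) -> torch.Tensor:
    """Stand-in for per-position logits from a pretrained generative prior."""
    torch.manual_seed(0)
    return torch.randn(seq_len, vocab)


def toy_fitness(x: torch.Tensor) -> torch.Tensor:
    """Stand-in for a fitness surrogate trained on assay-labeled data:
    a fixed random linear score over one-hot encoded sequences."""
    torch.manual_seed(1)
    w = torch.randn(x.shape[-2], x.shape[-1])
    return (x * w).sum(dim=(-2, -1))


def guided_sample(num_steps: int = 10, num_candidates: int = 64,
                  guidance_temperature: float = 0.5) -> str:
    """Repeatedly draw candidates from the prior, score them with the
    surrogate, and tilt the proposal toward high-fitness sequences."""
    logits = toy_prior_logits(SEQ_LEN, VOCAB)
    best_seq, best_fit = None, -float("inf")
    for _ in range(num_steps):
        probs = F.softmax(logits, dim=-1)                        # prior proposal
        idx = torch.multinomial(probs, num_candidates, replacement=True).T
        onehot = F.one_hot(idx, VOCAB).float()                   # (cand, L, V)
        fit = toy_fitness(onehot)                                # surrogate scores
        weights = F.softmax(fit / guidance_temperature, dim=0)   # guidance weights
        # Update the proposal as a fitness-weighted mixture of the candidates.
        logits = torch.log(torch.einsum("c,clv->lv", weights, onehot) + 1e-6)
        top = fit.argmax()
        if fit[top] > best_fit:
            best_fit = fit[top].item()
            best_seq = "".join(AMINO_ACIDS[i] for i in idx[top])
    return best_seq


if __name__ == "__main__":
    print("Best sampled sequence:", guided_sample())
```

In the actual framework, the reweighting would instead be applied inside the reverse process of a discrete diffusion model (guidance) or used as a reward signal for fine-tuning (reinforcement learning); the sketch only conveys the shared idea of biasing a prior with labeled fitness feedback.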