Poster in Workshop: Generative and Experimental Perspectives for Biomolecular Design
Green fluorescent protein engineering with a biophysics-based protein language model
Sam Gelman · Bryce Johnson · Chase Freschlin · Sameer D'Costa · Anthony Gitter · Philip Romero
Deep neural networks and language models are revolutionizing protein modeling and design, but these models struggle in low-data settings and when generalizing beyond their training data. Although prior neural networks have proven capable of learning complex evolutionary or sequence-structure-function relationships from large datasets, they largely ignore the vast accumulated knowledge of protein biophysics, limiting their ability to achieve the strong generalization required for protein engineering. We introduce Mutational Effect Transfer Learning (METL), a specialized protein language model for predicting quantitative protein function that bridges the gap between traditional biophysics-based and machine learning approaches. METL incorporates synthetic data from molecular simulations to augment experimental data with biophysical information. Molecular modeling can generate large datasets that reveal mappings from amino acid sequence to protein structure and properties. Pretraining protein language models on these data can impart fundamental biophysical knowledge that can then be connected with experimental observations. To demonstrate METL's ability to guide protein engineering with limited training data, we applied it to design green fluorescent protein sequence variants in complex scenarios. Of the 20 designed sequences, 16 exhibited fluorescence, and 6 were more fluorescent than the wild type.