Poster in Workshop: Integrating Generative and Experimental Platforms for Biomolecular Design
Structure-based synthetic data augmentation for protein language models
Alex Lee · Ava Amini · Kevin K Yang · Sarah Alamdari · Chentong Wang · Reza Abbasi-Asl
Abstract:
The goal of $\textit{de novo}$ protein design is to leverage natural proteins to design new ones. Deep generative models of protein structure and sequence are the two dominant $\textit{de novo}$ design paradigms. Structure-based models can produce highly novel proteins, but are constrained by data to produce proteins with a narrow range of topologies. Sequence-based design models produce more natural samples over a wider range of topologies, but with reduced novelty. Here, we propose a structure-based synthetic data augmentation approach to combine the benefits of structure and sequence in generative models of proteins. We generated and characterized 240,830 $\textit{de novo}$ backbone structures and used these backbones to generate 45 million sequences for data augmentation. Models trained with structure-based synthetic data augmentation generate a shifted distribution of proteins that are more likely to express successfully in $\textit{E. coli}$ and are more thermostable. We release the trained models as well as our complete synthetic dataset, BackboneRef.