Poster in Workshop: Integrating Generative and Experimental Platforms for Biomolecular Design

Structure-based synthetic data augmentation for protein language models

Alex Lee ⋅ Ava Amini ⋅ Kevin K Yang ⋅ Sarah Alamdari ⋅ Chentong Wang ⋅ Reza Abbasi-Asl

Abstract

The goal of $\textit{de novo}$ protein design is to leverage natural proteins to design new ones. Deep generative models of protein structure and sequence are the two dominant $\textit{de novo}$ design paradigms. Structure-based models can produce highly novel proteins, but are constrained by data to produce proteins with a narrow range of topologies. Sequence-based design models produce more natural samples over a wider range of topologies, but with reduced novelty. Here, we propose a structure-based synthetic data augmentation approach to combine the benefits of structure and sequence in generative models of proteins. We generated and characterized 240,830 $\textit{de novo}$ backbone structures and used these backbones to generate 45 million sequences for data augmentation. Models trained with structure-based synthetic data augmentation generate a shifted distribution of proteins that are more likely to express successfully in $\textit{E. coli}$ and are more thermostable. We release the trained models as well as our complete synthetic dataset, BackboneRef.
