
Workshop: Machine Learning for Drug Discovery (MLDD)

Auto-regressive WaveNet Variational Autoencoders for Alignment-free Generative Protein Design and Fitness Prediction

Niksa Praljak · Andrew Ferguson

Keywords: [ Deep generative modeling ] [ representation learning ]


Recently, deep generative models (DGMs) have been highly successful in novel protein design and could enable an unprecedented level of control in therapeutic and industrial applications. One DGM approach is the variational autoencoder (VAE), which can infer higher-order amino acid dependencies that are useful for predicting the fitness effects of mutations. In addition, a VAE infers a latent space distribution that can learn biologically meaningful representations. Another DGM approach is the autoregressive model, commonly used in language and audio tasks and intensively explored for generating unaligned protein sequences. Combining these two distinct DGM approaches for protein design and fitness prediction has not been extensively studied because VAEs are prone to posterior collapse when paired with an expressive decoder. We explore and benchmark the use of VAEs with a WaveNet-based decoder. WaveNet-based generators are inexpensive to train and run relative to recurrent neural networks (RNNs), and they avoid vanishing gradients because they leverage dilated causal convolutions. To avoid posterior collapse, we replaced the standard ELBO training objective with an Information Maximizing VAE (InfoVAE) loss objective, which we adapted to a semi-supervised setting with an autoregressive reconstruction loss. We extend our model from an unsupervised to a semi-supervised learning paradigm for fitness prediction tasks and benchmark its performance on the FLIP and TAPE datasets for protein function prediction. To illustrate our model's performance for protein design, we trained our models on unaligned homologous sequence libraries of the SH3 domain and AroQ chorismate mutase enzymes. We then deployed the trained models to generate novel, variable-length sequences that are computationally predicted to fold into native structures and possess natural function.
Our results demonstrate that combining a semi-supervised InfoVAE model with a WaveNet-based generator provides a robust framework for functional prediction and generative protein design without requiring multiple sequence alignments.
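
For readers unfamiliar with the dilated causal convolutions mentioned above, the following is a minimal pure-Python sketch of the idea, not the authors' implementation. The kernel size of 2, the dilation schedule 1, 2, 4, 8, and the function names `dilated_causal_conv1d`, `wavenet_like_stack`, and `receptive_field` are illustrative assumptions; the abstract does not specify the architecture. The sketch shows that each output position depends only on current and earlier inputs, and that stacking layers with growing dilation widens the receptive field quickly without any recurrence.

```python
from typing import List


def dilated_causal_conv1d(x: List[float], weights: List[float], dilation: int) -> List[float]:
    """Causal 1-D convolution over a scalar sequence.

    weights[k] multiplies x[t - k * dilation]. Positions before the start of
    the sequence are treated as zeros, so output[t] never sees x[t + 1] or later.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def wavenet_like_stack(x: List[float], weights: List[float], dilations: List[int]) -> List[float]:
    """Apply one causal layer per dilation (hypothetical schedule, shared weights)."""
    h = x
    for d in dilations:
        h = dilated_causal_conv1d(h, weights, d)
    return h


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Number of input positions that can influence a single output position."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Assumed toy settings: kernel size 2, dilations doubling per layer.
dilations = [1, 2, 4, 8]
impulse = [1.0] + [0.0] * 19
response = wavenet_like_stack(impulse, [1.0, 1.0], dilations)
print(receptive_field(2, dilations), response)
```

With these assumed settings, four layers cover 16 input positions, whereas four undilated layers of the same kernel size would cover only 5. In an autoregressive decoder, this causal structure is what allows all positions to be trained in parallel while still predicting each residue from the preceding ones.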
