We are at a pivotal moment in healthcare, characterized by unprecedented scientific and technological progress in recent years together with the promise of personalized medicine to radically transform the way we provide care to patients. However, drug discovery has become an increasingly challenging endeavour: not only has the success rate of developing new therapeutics been historically low, but this rate has also been steadily declining. The average cost to bring a new drug to market is now estimated at $2.6 billion – 140% higher than a decade earlier. Machine learning-based approaches present a unique opportunity to address this challenge. While there has been growing interest and pioneering work in the machine learning (ML) community over the past decade, the specific challenges posed by drug discovery are largely unknown to the broader community. We would like to organize a workshop on ‘Machine Learning for Drug Discovery’ (MLDD) at ICLR 2022 with the ambition to federate the community interested in this application domain, where i) ML can have a significant positive impact for the benefit of all, and ii) the application domain can drive ML method development through novel problem settings, benchmarks and testing grounds at the intersection of many subfields, ranging from representation, active and reinforcement learning to causality and treatment effects.
Fri 6:00 a.m. - 6:10 a.m. | Opening remarks (Intro)
Fri 6:10 a.m. - 6:50 a.m. | Keynote - Regina Barzilay: Infusing Biology into Molecular Models
Fri 6:50 a.m. - 7:10 a.m. | Invited Talk - Miguel Hernandez-Lobato: Data-efficient Predictions of Molecular Properties using Meta-learning and Gaussian Processes
Fri 7:10 a.m. - 7:30 a.m. | Invited Talk - Ole Winther: Solving biological sequence analysis problems with deep language models and conditional random fields
Fri 7:30 a.m. - 7:50 a.m. | Spotlight Presentations (Part 1): Pre-recorded presentations for spotlight papers. There will be 4 spotlight papers presented per session (5 minutes each).
Fri 7:30 a.m. - 7:35 a.m. | Spotlight: Deep sharpening of topological features for de novo protein design
Computational de novo protein design allows the exploration of uncharted areas of the protein structure and sequence spaces. Classical approaches to de novo protein design involve an iterative process where the desired protein shape is outlined, then sampled for structural backbones and designed with low-energy amino acid sequences. Despite numerous successes, inaccuracies within energy functions and sampling methods often lead to physically unrealistic protein backbones yielding sequences that fail to fold experimentally. Recently, deep neural networks have successfully been used to design novel protein folds from scratch by iteratively predicting a structure and optimizing the sequence until a target protein structure is reached. These methods work well under circumstances where distributions of physically realistic target protein backbones can be readily defined, but lack the ability to de novo design loosely specified protein shapes. In fact, a major challenge for de novo protein design is to generate "designable" protein structures for defined folds, including native and artificial ("dark matter") folds, that can then be used to find low-energy sequences in a generic manner. Here, we automate the task of creating designable backbones using a variational autoencoder framework, termed Genesis, to denoise sketches of protein topological lattice models by sharpening their 2D representations in distance and angle feature maps. In conjunction with the trRosetta design framework, large pools of diverse sequences for different protein folds were generated for the maps. We found that the Genesis-trDesign framework generates native-like feature maps for known and dark matter protein folds. Ultimately, the Genesis framework addresses the protein backbone designability problem and could contribute to the de novo design of structurally defined artificial proteins that can be tailored for novel functionalities.
Zander Harteveld · Joshua Southern · Michaël Defferrard · Andreas Loukas · Pierre Vandergheynst · Michael Bronstein · Bruno Correia
Fri 7:35 a.m. - 7:40 a.m. | Spotlight: EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction
Predicting how a drug-like molecule binds to a specific protein target is a core problem in drug discovery. An extremely fast computational binding method would enable key applications such as fast virtual screening or drug engineering. Existing methods are computationally expensive as they rely on heavy candidate sampling coupled with scoring, ranking, and fine-tuning steps. We challenge this paradigm with EquiBind, an SE(3)-equivariant geometric deep learning model performing direct-shot prediction of both i) the receptor binding location (blind docking) and ii) the ligand's bound pose and orientation. EquiBind achieves significant speed-ups and better quality compared to traditional and recent baselines. Further, we show extra improvements when coupling it with existing fine-tuning techniques at the cost of increased running time. Finally, we propose a novel and fast fine-tuning model that adjusts torsion angles of a ligand's rotatable bonds based on closed-form global minima of the von Mises angular distance to a given input atomic point cloud, avoiding previous expensive differential evolution strategies for energy minimization.
Hannes Stärk · Octavian Ganea · Lagnajit Pattanaik · Regina Barzilay · Tommi Jaakkola
Fri 7:40 a.m. - 7:45 a.m. | Spotlight: Predicting single-cell perturbation responses for unseen drugs
Single-cell transcriptomics enabled the study of cellular heterogeneity in response to perturbations at the resolution of individual cells. However, scaling high-throughput screens (HTSs) to measure cellular responses for many drugs remains a challenge due to technical limitations and, more importantly, the cost of such multiplexed experiments. Thus, transferring information from routinely performed bulk RNA-seq HTS is required to enrich single-cell data meaningfully. We introduce a new encoder-decoder architecture to study the perturbational effects of unseen drugs. We combine the model with a transfer learning scheme and demonstrate how training on existing bulk RNA-seq HTS datasets can improve generalisation performance. Better generalisation reduces the need for extensive and costly screens at single-cell resolution. We envision that our proposed method will facilitate more efficient experiment designs through its ability to generate in-silico hypotheses, ultimately accelerating targeted drug discovery.
Leon Hetzel · Simon Boehm · Niki Kilbertus · Stephan Günnemann · Mohammad Lotfollahi · Fabian Theis
Fri 7:45 a.m. - 7:50 a.m. | Spotlight: GRPE: Relative Positional Encoding for Graph Transformer
Designing an efficient model to encode graphs is a key challenge of molecular representation learning. The Transformer, built upon efficient self-attention, is a natural choice for graph processing, but it requires explicit incorporation of positional information. Existing approaches either linearize a graph to encode absolute position in the sequence of nodes, or encode relative position with respect to another node using bias terms. The former loses the preciseness of relative position from linearization, while the latter loses a tight integration of node-edge and node-spatial information. In this work, we propose relative positional encoding for a graph to overcome the weaknesses of the previous approaches. Our method encodes a graph without linearization and considers both node-spatial and node-edge relations. We name our method Graph Relative Positional Encoding, dedicated to graph representation learning. Experiments conducted on various molecular property prediction datasets show that the proposed method outperforms previous approaches significantly. Our code is publicly available at https://github.com/lenscloth/GRPE.
Wonpyo Park · Woong-Gi Chang · Donggeon Lee · Juntae Kim · seung-won hwang
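To make the idea of injecting graph-relative positions into self-attention concrete, the sketch below adds learned bias terms indexed by shortest-path distance (node-spatial relation) and by edge type (node-edge relation) to the attention logits. This is a simplified stand-in: GRPE itself interacts queries and keys with learned relative-position embeddings rather than plain additive biases, and all names and shapes here are illustrative assumptions.

```python
import numpy as np

def attention_with_relative_bias(X, W_q, W_k, W_v, spd, edge_type,
                                 spatial_bias, edge_bias):
    """Self-attention over graph nodes with additive relative-position biases.

    X:            (n, d) node features
    spd:          (n, n) integer shortest-path distances between nodes
    edge_type:    (n, n) integer edge-type ids (0 = no edge)
    spatial_bias: (max_dist,) learned scalar bias per distance bucket
    edge_bias:    (n_edge_types,) learned scalar bias per edge type
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)                                   # content term
    logits += spatial_bias[np.clip(spd, 0, len(spatial_bias) - 1)]    # node-spatial relation
    logits += edge_bias[edge_type]                                    # node-edge relation
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)                          # row-wise softmax
    return attn @ V

# toy usage on a 4-node path graph (e.g. a tiny molecular chain)
rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
W_q, W_k, W_v = [rng.normal(size=(d, d)) for _ in range(3)]
spd = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])           # path-graph distances
edge_type = (spd == 1).astype(int)                                    # 1 = bonded, 0 = none
out = attention_with_relative_bias(X, W_q, W_k, W_v, spd, edge_type,
                                   rng.normal(size=4), rng.normal(size=2))
print(out.shape)  # (4, 8)
```

The key design point the paper argues for is visible even in this toy form: no linearization of the graph is needed, and both spatial and edge information enter every attention score.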
Fri 7:50 a.m. - 8:00 a.m. | Break
Fri 8:00 a.m. - 8:40 a.m. | Keynote - Aviv Regev: Design for inference and the power of random experiments in biology
Fri 8:40 a.m. - 9:00 a.m. | Invited Talk - Connor Coley: Molecular design and synthesis
Fri 9:00 a.m. - 9:45 a.m. | Poster Session (Part 1)
Fri 9:45 a.m. - 10:30 a.m. | Lunch Break
Fri 10:30 a.m. - 11:10 a.m. | Keynote - Daphne Koller: Transforming Drug Discovery using Digital Biology
Fri 11:10 a.m. - 11:30 a.m. | Invited Talk - John Chodera: Teaching molecular simulations to learn
Fri 11:30 a.m. - 11:50 a.m. | Invited Talk - Mohammed AlQuraishi: Protein structure prediction in a post-AlphaFold2 world
Fri 11:50 a.m. - 12:30 p.m. | Keynote - Yoshua Bengio: Towards AI-based scientific discovery with GFlowNets
Fri 12:30 p.m. - 12:40 p.m. | Break
Fri 12:40 p.m. - 1:00 p.m. | Spotlight Presentations (Part 2): Pre-recorded presentations for spotlight papers. There will be 4 spotlight papers presented per session (5 minutes each).
Fri 12:40 p.m. - 12:45 p.m. | Spotlight: SystemMatch: optimizing preclinical drug models to human clinical outcomes via generative latent-space matching
Translating the relevance of preclinical models (in vitro, animal models, or organoids) to their relevance in humans presents an important challenge during drug development. The rising abundance of single-cell genomic data from human tumors and tissue offers a new opportunity to optimize model systems by their similarity to targeted human cell types in disease. In this work, we introduce SystemMatch to assess the fit of preclinical model systems to an in sapiens target population and to recommend experimental changes to further optimize these systems. We demonstrate this through an application to developing in vitro systems to model human tumor-derived suppressive macrophages. We show with held-out in vivo controls that our pipeline successfully ranks macrophage subpopulations by their biological similarity to the target population, and apply this analysis to rank a series of 18 in vitro macrophage systems perturbed with a variety of cytokine stimulations. We extend this analysis to predict the behavior of 66 in silico model systems generated using a perturbational autoencoder and apply a k-medoids approach to recommend a subset of these model systems for further experimental development in order to fully explore the space of possible perturbations. Through this use case, we demonstrate a novel approach to model system development to generate a system more similar to human biology.
Scott Gigante · Varsha Raghavan · Amanda Robinson · Rob Barton · Adeeb Rahman · Drausin Wulsin · Jacques Banchereau · Noam Solomon · Luis Voloch · Fabian Theis
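The abstract above mentions a k-medoids step for recommending a subset of in silico model systems to take forward experimentally. Below is a minimal, generic k-medoids sketch on a precomputed pairwise-distance matrix (a naive PAM-style update, not the authors' pipeline); the embedding dimensions and the number of systems are illustrative.

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Naive k-medoids on a precomputed distance matrix D (n x n).
    Returns the indices of k medoids, i.e. representative items."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)            # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            costs = D[np.ix_(members, members)].sum(axis=1)   # total within-cluster distance
            new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids

# toy usage: latent embeddings of hypothetical in-silico model systems
rng = np.random.default_rng(1)
Z = rng.normal(size=(66, 16))                                 # e.g. 66 generated systems
D = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
print(k_medoids(D, k=5))                                      # 5 systems to develop further
```

Medoids, unlike k-means centroids, are always actual items in the candidate set, which is why this style of clustering is a natural fit for recommending concrete experiments.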
Fri 12:45 p.m. - 12:50 p.m. | Spotlight: Physics-informed deep neural network for rigid-body protein docking
Proteins are biological macromolecules that perform many essential roles within all living organisms. Many protein functions arise from physical interactions between them and also with other biomolecules (e.g. DNA, metabolites). Being able to predict whether and how two proteins interact is an important problem in fundamental biological research and translational drug discovery. In this work, we present an energy-based model for generating ensembles of rigid-body transformations to predict the configuration of protein complexes. The method incorporates strong, interpretable physical priors; it is by construction SE(3)-equivariant and fully differentiable back to the atomic structure. We rely on the observation that bound protein-protein complexes show high geometric and chemical complementarity at the interface of the two proteins. Our method efficiently makes use of this prior by generating on-the-fly point cloud representations of the solvent-excluded surfaces of the proteins. Through learned point descriptors, we can infer regions of high complementarity between the two proteins and compute a proxy for the binding energy. By sampling transformations expected to adopt low energy states, we generate ensembles of bound poses where the two protein surfaces are brought into contact. We expect that the strong physical priors enforced by the network construction will aid in generalization to other related tasks and lead to a richer human understanding of the process behind the generation and scoring of the docked poses. The fact that the method is also fully differentiable allows for gradient-based modifications of the atomic structure, which could be critical in tasks related to unbound docking or protein design, which remain outstanding problems in protein modelling.
Freyr Sverrisson · Jean Feydy · Joshua Southern · Michael Bronstein · Bruno Correia
Fri 12:50 p.m. - 12:55 p.m. | Spotlight: Multi-Segment Preserving Sampling for Deep Manifold Sampler
Deep generative modeling for biological sequences presents a unique challenge in reconciling the bias-variance trade-off between explicit biological insight and model flexibility. The deep manifold sampler was recently proposed as a means to iteratively sample variable-length protein sequences. Sampling was done by exploiting the gradients from a function predictor trained on top of the manifold sampler. In this work, we introduce an alternative approach to guided sampling that enables the direct inclusion of domain-specific knowledge by designating preserved and non-preserved segments along the input sequence, thereby restricting variation to only select regions. We call this method "multi-segment preserving sampling" and present its effectiveness in the context of antibody design. We train two models: a deep manifold sampler and a GPT-2 language model on nearly six million heavy chain sequences annotated with the IGHV1-18 gene. During sampling, we restrict variation to only the complementarity-determining region 3 (CDR3) of the input. We obtain log probability scores from a GPT-2 model for each sampled CDR3 and demonstrate that multi-segment preserving sampling generates reasonable designs while maintaining the desired, preserved regions.
Dan Berenberg · Jae Hyeon Lee · Simon Kelow · Ji Park · Andrew Watkins · Richard Bonneau · Vladimir Gligorijevic · Stephen Ra · Kyunghyun Cho
Fri 12:55 p.m. - 1:00 p.m. | Spotlight: Regression Transformer: Concurrent Conditional Generation and Regression by Blending Numerical and Textual Tokens
We report the Regression Transformer (RT), a method that abstracts regression as a conditional sequence modeling problem. The RT casts continuous properties as sequences of numerical tokens and encodes them jointly with conventional tokens. This yields a dichotomous model that concurrently excels at regression tasks and property-driven conditional generation; it can seamlessly transition between the two, governed solely by the mask location. We propose several extensions to the XLNet objective and adopt an alternating training scheme to concurrently optimize property prediction and conditional text generation with a self-consistency loss. Our experiments on both chemical and protein languages demonstrate that the performance of traditional regression models can be surpassed despite training with a cross-entropy loss. Importantly, priming the same model with continuous properties yields a highly competitive conditional generative model that outperforms specialized approaches in a constrained property optimization benchmark. In sum, the Regression Transformer opens the door for "swiss army knife" models that excel at both regression and conditional generation. This finds application particularly in property-driven, local exploration of the chemical or protein space. The code to reproduce all experiments of the paper is available at: https://anonymous.4open.science/r/regression-transformer/
Jannis Born · Matteo Manica
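The central trick above is representing a continuous property as a short sequence of numerical tokens that sit next to ordinary SMILES or amino-acid tokens. The sketch below shows one plausible digit-plus-decimal-place tokenization; it is an assumed scheme for illustration only and not necessarily the exact RT vocabulary.

```python
def to_numerical_tokens(value, precision=3):
    """Illustrative numeric tokenization: one token per digit, tagged with its
    decimal place, e.g. 0.724 -> ['_0_0', '_7_-1', '_2_-2', '_4_-3'].
    Assumed scheme for illustration, not necessarily the RT vocabulary."""
    s = f"{value:.{precision}f}"
    int_part, frac_part = s.split(".")
    tokens = []
    for i, d in enumerate(int_part):
        place = len(int_part) - 1 - i          # digit d contributes d * 10**place
        tokens.append(f"_{d}_{place}")
    for i, d in enumerate(frac_part):
        tokens.append(f"_{d}_{-(i + 1)}")
    return tokens

# a property-conditioned sequence might then look like (names hypothetical):
# ['<qed>'] + to_numerical_tokens(0.724) + ['|'] + list("CC(=O)Oc1ccccc1C(=O)O")
print(to_numerical_tokens(0.724))
```

Masking the numerical tokens turns the task into regression, while masking part of the molecular tokens turns the same model into a property-conditioned generator, which is the "governed solely by the mask location" behaviour described in the abstract.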
Fri 1:00 p.m. - 1:40 p.m. | GeneDisco Challenge (Discussion): Challenge recap & presentations from winning teams
Fri 1:40 p.m. - 2:25 p.m. | Poster Session (Part 2)
Fri 2:25 p.m. - 2:30 p.m. | Closing remarks (Intro)
Poster: Auto-regressive WaveNet Variational Autoencoders for Alignment-free Generative Protein Design and Fitness Prediction
Recently, deep generative models (DGMs) have been highly successful in novel protein design and could enable an unprecedented level of control in therapeutic and industrial applications. One DGM approach is variational autoencoders (VAEs), which can infer higher-order amino acid dependencies for useful prediction of the fitness effects of mutations. Additionally, the model infers a latent space distribution, which can learn biologically meaningful representations. Another DGM approach is autoregressive models, commonly used in language or audio tasks, which have been intensively explored for the generation of unaligned protein sequences. Combining these two distinct DGM approaches for protein design and fitness prediction has not been extensively studied because VAEs are prone to posterior collapse when implemented with an expressive decoder. We explore and benchmark the use of VAEs with a WaveNet-based decoder. The advantage of WaveNet-based generators is their inexpensive training time and computation cost relative to recurrent neural networks (RNNs); they also avoid vanishing gradients because WaveNets leverage dilated causal convolutions. To avoid posterior collapse, we implemented and adapted an Information Maximizing VAE (InfoVAE) loss objective, instead of a standard ELBO training objective, to a semi-supervised setting with an autoregressive reconstruction loss. We extend our model from an unsupervised to a semi-supervised learning paradigm for fitness prediction tasks and benchmark our model's performance on the FLIP and TAPE datasets for protein function prediction. To illustrate our model's performance for protein design, we trained our models on unaligned homologous sequence libraries of the SH3 domain and AroQ chorismate mutase enzymes, then deployed them to generate novel (variable-length) sequences that are computationally predicted to fold into native structures and possess natural function. Our results demonstrate that combining a semi-supervised InfoVAE model with a WaveNet-based generator provides a robust framework for functional prediction and generative protein design without requiring multiple sequence alignments.
Niksa Praljak · Andrew Ferguson
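The abstract contrasts WaveNet-style decoders with RNNs; the building block behind that contrast is the dilated causal convolution, whose receptive field grows exponentially when layers with increasing dilation are stacked. Below is a minimal sketch of that operation; the kernel weights, dilations, and sequence length are illustrative assumptions, not the authors' architecture.

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation):
    """1D causal convolution with dilation: output[t] depends only on
    x[t], x[t - dilation], x[t - 2*dilation], ... (kernel length len(w))."""
    n, k = len(x), len(w)
    y = np.zeros(n)
    for t in range(n):
        for i in range(k):
            j = t - i * dilation
            if j >= 0:
                y[t] += w[i] * x[j]
    return y

# Stacking layers with dilations 1, 2, 4, 8 gives a receptive field of 16 past
# positions with only 4 layers, which is why WaveNet-style decoders handle long
# sequences cheaply compared to step-by-step RNNs.
x = np.random.default_rng(0).normal(size=16)
h = x
for d in [1, 2, 4, 8]:
    h = np.tanh(causal_dilated_conv1d(h, np.array([0.5, 0.5]), d))
print(h.shape)
```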
Poster: Learning multi-scale functional representations of proteins from single-cell microscopy data
Protein function is inherently linked to its localization within the cell, and fluorescent microscopy data is an indispensable resource for learning representations of proteins. Despite major developments in molecular representation learning, extracting functional information from biological images remains a non-trivial computational task. Current state-of-the-art approaches use autoencoder models to learn high-quality features by reconstructing images. However, such methods are prone to capturing noise and imaging artifacts. In this work, we revisit deep learning models used for classifying major subcellular localizations, and evaluate representations extracted from their final layers. We show that simple convolutional networks trained on localization classification can learn protein representations that encapsulate diverse functional information, and significantly outperform autoencoder-based models. We also propose a robust evaluation strategy to assess quality of protein representations across different scales of biological function.
Anastasia Razdaibiedina · Alexander Brechalov
Poster: Variational Interpretable Deep Canonical Correlation Analysis
The main idea of canonical correlation analysis (CCA) is to map different views onto a common latent space with maximum correlation. We propose a deep interpretable variational canonical correlation analysis (DICCA) for multi-view learning. The developed model extends the existing latent variable model for linear CCA to nonlinear models through the use of deep generative networks. DICCA is designed to disentangle both the shared and view-specific variations for multi-view data. To further make the model more interpretable, we place a sparsity-inducing prior on the latent weight with a structured variational autoencoder that is comprised of view-specific generators. Empirical results on real-world datasets show that our method is competitive across domains.
Lin Qiu · Lynn Lin · Vernon Chinchilli
Poster: Graph Anisotropic Diffusion for Molecules
The capacity of existing graph neural networks (GNNs) is limited by the use of local or isotropic kernels. We present Graph Anisotropic Diffusion: a new GNN architecture that utilizes a diffusion PDE on graphs. The main ingredient in our model is a linear diffusion layer with two solving schemes that is applied independently to each feature channel of the vertices in the graph. This learned diffusion layer improves the message passing mechanism in GNNs by allowing continuous propagation of information among nodes with control over the diffusion step. This diffusion layer is combined with local anisotropic kernels to obtain a notion of direction. Empirically, we demonstrate the capacity of our model to improve the performance of competitive GNNs in two common molecular property prediction benchmarks.
Ahmed Elhag · Gabriele Corso · Hannes Stärk · Michael Bronstein
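As background for the "continuous propagation with control over the diffusion step" idea above, the sketch below integrates the graph heat equation dX/dt = -LX with explicit Euler steps, applied independently to each feature channel. The paper's layer is learned and offers two solving schemes; this plain, fixed-coefficient version is only an illustration, and the graph and step size are toy assumptions.

```python
import numpy as np

def graph_heat_diffusion(A, X, t=1.0, n_steps=10):
    """Explicit-Euler integration of dX/dt = -L X on a graph, applied
    independently to each feature channel (columns of X).
    A: (n, n) adjacency matrix, X: (n, c) node features."""
    deg = A.sum(axis=1)
    L = np.diag(deg) - A                     # combinatorial graph Laplacian
    tau = t / n_steps                        # diffusion step size
    for _ in range(n_steps):
        X = X - tau * (L @ X)                # one explicit Euler step
    return X

# toy usage: diffuse two feature channels over a 4-node cycle graph
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
X = np.eye(4)[:, :2]
print(graph_heat_diffusion(A, X, t=0.5))
```

Larger total diffusion time t spreads information further across the graph, which is the knob a learned diffusion layer can tune per channel.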
Poster: SystemMatch: optimizing preclinical drug models to human clinical outcomes via generative latent-space matching
Translating the relevance of preclinical models (in vitro, animal models, or organoids) to their relevance in humans presents an important challenge during drug development. The rising abundance of single-cell genomic data from human tumors and tissue offers a new opportunity to optimize model systems by their similarity to targeted human cell types in disease. In this work, we introduce SystemMatch to assess the fit of preclinical model systems to an in sapiens target population and to recommend experimental changes to further optimize these systems. We demonstrate this through an application to developing in vitro systems to model human tumor-derived suppressive macrophages. We show with held-out in vivo controls that our pipeline successfully ranks macrophage subpopulations by their biological similarity to the target population, and apply this analysis to rank a series of 18 in vitro macrophage systems perturbed with a variety of cytokine stimulations. We extend this analysis to predict the behavior of 66 in silico model systems generated using a perturbational autoencoder and apply a k-medoids approach to recommend a subset of these model systems for further experimental development in order to fully explore the space of possible perturbations. Through this use case, we demonstrate a novel approach to model system development to generate a system more similar to human biology.
Scott Gigante · Varsha Raghavan · Amanda Robinson · Rob Barton · Adeeb Rahman · Drausin Wulsin · Jacques Banchereau · Noam Solomon · Luis Voloch · Fabian Theis
Poster: Evaluating Generalization in GFlowNets for Molecule Design
Deep learning bears promise for drug discovery problems such as de novo molecular design. Generating data to train such models is a costly and time-consuming process, given the need for wet-lab experiments or expensive simulations. This problem is compounded by the notorious data-hungriness of machine learning algorithms. In small molecule generation, the recently proposed GFlowNet method has shown good performance in generating diverse high-scoring candidates and has the interesting advantage of being an off-policy offline method. Finding an appropriate generalization evaluation metric for such models, one predictive of the desired search performance (i.e. finding high-scoring diverse candidates), will help guide online data collection for such an algorithm. In this work, we develop techniques for evaluating GFlowNet performance on a test set, and identify the most promising metric for predicting generalization. We present empirical results on several small-molecule design tasks in drug discovery, for several GFlowNet training setups, and we find a metric strongly correlated with diverse high-scoring batch generation. This metric should be used to identify the best generative model from which to sample batches of molecules to be evaluated.
Andrei Nica · Moksh Jain · Emmanuel Bengio · Cheng-Hao Liu · Maksym Korablyov · Michael Bronstein · Yoshua Bengio
Poster: Data-Driven Optimization for Protein Design: Workflows, Algorithms and Metrics
Recent works have successfully demonstrated the ability of deep neural networks to predict important properties such as fitness and stability from protein sequences via supervised learning. However, the use of learned deep neural network models for the task of designing de novo proteins, with backbones built from scratch, that maximize a certain fitness value remains under-explored. In this paper, we study the problem of designing proteins where the optimization is carried out in a purely data-driven, "offline" manner, by utilizing databases of experimental data collected from wet lab evaluations. Synthesis of proteins proposed by the algorithm in an experimental setup in a wet lab, which incurs a big manual overhead for designers, is not allowed. Such an offline optimization problem requires that a practitioner make several design choices: a designer must decide what data distribution to train on and how their method will be evaluated, and must additionally devise workflows for tuning the optimization method they wish to use. In this paper, we perform a systematic study of various design choices that arise in protein design, grounded in the problem of optimizing for protein stability, and use these insights to propose workflows, protocols and metrics to assist practitioners in effectively applying such data-driven approaches to protein design problems.
Sathvik Kolli · Amy Lu · Xinyang Geng · Aviral Kumar · Sergey Levine
Poster: Fragment-based ligand generation guided by geometric deep learning on protein-ligand structures
Computationally-aided design of novel molecules has the potential to accelerate drug discovery. Several recent generative models aim to create new molecules for specific protein targets. However, a rate-limiting step in drug development is molecule optimization, which can take several years due to the challenge of optimizing multiple molecular properties at once. We developed a method to solve a specific molecular optimization problem in silico: expanding a small, fragment-like starting molecule bound to a protein pocket into a larger molecule that matches the physicochemical properties of known drugs. Using data-efficient E(3)-equivariant neural networks and a 3D atomic point cloud representation, our model learns how to attach new molecular fragments to a growing structure by recognizing realistic intermediates generated en route to a final ligand. This approach always generates chemically valid molecules and incorporates all relevant 3D spatial information from the protein pocket. This framework produces promising molecules as assessed by multiple properties that address binding affinity, ease of synthesis, and solubility. Overall, we demonstrate the feasibility of 3D molecular structure expansion conditioned on protein pockets while maintaining desirable drug-like physicochemical properties, and we developed a tool that could accelerate the work of medicinal chemists.
Alexander Powers · Helen Yu · Patricia Suriana · Ron Dror
Poster: Decoding Surface Fingerprints for Protein-Ligand Interactions
Small molecules have been the preferred modality for drug development and therapeutic interventions. This molecular format presents a number of advantages, e.g. long half-lives and cell permeability, making it possible to access a wide range of therapeutic targets. However, finding small molecules that engage “hard-to-drug” protein targets specifically and potently remains an arduous process, requiring experimental screening of extensive compound libraries to identify candidate leads. The search continues with further optimization of compound leads to meet the required potency and toxicity thresholds for clinical applications. Here, we propose a new computational workflow for high-throughput fragment-based screening and binding affinity prediction where we leverage the available protein-ligand complex structures using a state-of-the-art protein surface embedding framework (dMaSIF). We developed a tool capable of finding suitable ligands and fragments for a given protein pocket solely based on protein surface descriptors that capture chemical and geometric features of the target pocket. The identified fragments can be further combined into novel ligands. Using the structural data, our ligand discovery pipeline learns the signatures of interactions between surface patches and small pharmacophores. On a query target pocket, the algorithm matches known target pockets and returns either potential ligands or identifies multiple ligand fragments in the binding site. Our binding affinity predictor is capable of predicting the affinity of a given protein-ligand pair, requiring only limited information about the ligand pose. This enables screening without the costly step of first docking candidate molecules. Our framework will facilitate the design of ligands based on the target’s surface information. It may significantly reduce the experimental screening load and ultimately reveal novel chemical compounds for targeting challenging proteins.
Ilia Igashov · Arian Jamasb · Ahmed Sadek · Freyr Sverrisson · Arne Schneuing · Tom Blundell · Pietro Lio · Michael Bronstein · Bruno Correia
Poster: Torsional Diffusion for Molecular Conformer Generation
Diffusion-based generative models generate samples by mapping noise to data via the reversal of a diffusion process which typically consists of the addition of independent Gaussian noise to every data coordinate. This diffusion process is, however, not well suited to the fundamental task of molecular conformer generation, where the degrees of freedom differentiating conformers lie mostly in torsion angles. We, therefore, propose Torsional Diffusion, which generates conformers by leveraging the definition of a diffusion process over the space T^m, a high-dimensional torus representing torsion angles, and an SE(3)-equivariant model capable of accurately predicting the score over this process. Empirically, we demonstrate that our model outperforms state-of-the-art methods in terms of both diversity and accuracy of generated conformers, reducing the minimum RMSD by respectively 27% and 9%. When compared to Gaussian diffusion models, Torsional Diffusion enables significantly more accurate generation while performing two orders of magnitude fewer inference time-steps.
Bowen Jing · Gabriele Corso · Regina Barzilay · Tommi Jaakkola
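To make the "diffusion over a torus of torsion angles" idea above tangible, the sketch below shows only the forward noising step: perturb each torsion angle with Gaussian noise and wrap the result back onto the circle. The actual method learns the score of a properly defined diffusion process on T^m with an SE(3)-equivariant network; the noise levels and angles here are illustrative.

```python
import numpy as np

def wrap_angle(theta):
    """Map angles to the interval [-pi, pi)."""
    return (theta + np.pi) % (2 * np.pi) - np.pi

def diffuse_torsions(torsions, sigma, rng):
    """One forward noising step on the torus T^m: add Gaussian noise to each
    torsion angle and wrap. `torsions` has shape (m,) for m rotatable bonds."""
    return wrap_angle(torsions + sigma * rng.normal(size=torsions.shape))

rng = np.random.default_rng(0)
torsions = np.array([0.3, -2.9, 1.6])        # m = 3 rotatable bonds (radians)
for sigma in [0.1, 0.5, 1.5]:                # increasing noise levels
    print(sigma, diffuse_torsions(torsions, sigma, rng))
```

Because only the m torsions are noised while bond lengths and angles stay fixed, the sample space is far smaller than noising all 3N Cartesian coordinates, which is the intuition behind the reported speed-up.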
Poster: Machine Learning to Hunt for Phage Proteins to Catch Klebsiella
Antimicrobial resistance (AMR) has been declared a global threat by the World Health Organization. Development of novel and effective therapies against microbes is an active research area of ever-growing importance. One of the leading threats is Klebsiella species, which cause virulent AMR infections with high death rates, particularly in hospital settings. Klebsiella species are particularly problematic because they produce a thick sticky polysaccharide capsule that protects them from antimicrobials and allows them to build highly resistant biofilms - defensive layers of cells. A natural solution to eradicate Klebsiella capsules and biofilms are depolymerase proteins that can target and neutralize polysaccharide capsules of specific Klebsiella species, often found in bacteriophages. However, machine learning guided discovery of depolymerase proteins in such phages is an unexplored area. In this work, we use machine learning to help identify proteins in phage proteomes that can act as depolymerases against Klebsiella. Specifically, we utilize a dataset of phages, containing depolymerase proteins, that can target and neutralize polysaccharide capsules of specific Klebsiella species. We train a ranking model to rank proteins in an input phage proteome based on their predicted ability to act as a depolymerase. We use a non-redundant validation protocol to evaluate the predictive accuracy of the proposed model. Our analysis shows that for all test proteomes containing at least one depolymerase, the depolymerase protein was ranked within the top-scoring 5% of proteins. We expect that the proposed approach (called Depolymerase Ranker) will be useful in accelerating the discovery of such antibacterial proteins in the wet lab.
George Wright · Fayyaz ul Amir Minhas · Slawomir Michniewski · Eleanor Jameson
Poster: Physics-informed deep neural network for rigid-body protein docking
Proteins are biological macromolecules that perform many essential roles within all living organisms. Many protein functions arise from physical interactions between them and also with other biomolecules (e.g. DNA, metabolites). Being able to predict whether and how two proteins interact is an important problem in fundamental biological research and translational drug discovery. In this work, we present an energy-based model for generating ensembles of rigid-body transformations to predict the configuration of protein complexes. The method incorporates strong, interpretable physical priors; it is by construction SE(3)-equivariant and fully differentiable back to the atomic structure. We rely on the observation that bound protein-protein complexes show high geometric and chemical complementarity at the interface of the two proteins. Our method efficiently makes use of this prior by generating on-the-fly point cloud representations of the solvent-excluded surfaces of the proteins. Through learned point descriptors, we can infer regions of high complementarity between the two proteins and compute a proxy for the binding energy. By sampling transformations expected to adopt low energy states, we generate ensembles of bound poses where the two protein surfaces are brought into contact. We expect that the strong physical priors enforced by the network construction will aid in generalization to other related tasks and lead to a richer human understanding of the process behind the generation and scoring of the docked poses. The fact that the method is also fully differentiable allows for gradient-based modifications of the atomic structure, which could be critical in tasks related to unbound docking or protein design, which remain outstanding problems in protein modelling.
Freyr Sverrisson · Jean Feydy · Joshua Southern · Michael Bronstein · Bruno Correia
Poster: High-Content Similarity-Based Virtual Screening Using a Distance Aware Transformer Model
Molecular similarity search is an often-used method in drug discovery, especially in virtual screening studies. While simple one- or two-dimensional similarity metrics can be applied to search databases containing billions of molecules in a reasonable amount of time, this is not the case for complex three-dimensional methods. In this work, we trained a transformer model to autoencode tokenized SMILES strings using a custom loss function developed to conserve similarities in latent space. This allows the direct sampling of molecules in the generated latent space based on their Euclidean distance. Reducing the similarity between molecules to their Euclidean distance in latent space allows the model to perform independently of the similarity metric it was trained on, thus enabling high-content screening with time-consuming 3D similarity metrics. We show that the presence of a specific loss function for similarity conservation greatly improved the model’s ability to predict highly similar molecules. When applying the model to a database containing 1.5 billion molecules, our model managed to reduce the relevant search space by 5 orders of magnitude. We also show that our model was able to generalize adequately when trained on a relatively small dataset of representative structures. The herein presented method thereby provides new means of substantially reducing the relevant search space in virtual screening approaches, thus highly increasing their throughput. Additionally, the distance awareness of the model causes the performance of this method to be independent of the underlying similarity metric.
Manuel Sellner · Amr Mahmoud · Markus Lill
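The "similarity conservation" term described above can be pictured as a penalty that forces latent Euclidean distances to track molecular dissimilarities. The sketch below shows one plausible form of such a term; the exact custom loss used in the paper may differ, and the embeddings and similarity matrix here are illustrative.

```python
import numpy as np

def similarity_conservation_loss(Z, S):
    """Penalize disagreement between pairwise Euclidean distances in latent
    space (Z: n x d embeddings) and pairwise dissimilarities derived from a
    molecular similarity matrix (S: n x n, e.g. a 3D-shape or Tanimoto score
    in [0, 1]). Illustrative form only; the paper's custom loss may differ."""
    dist = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
    target = 1.0 - S                         # dissimilarity target
    mask = ~np.eye(len(Z), dtype=bool)       # ignore the diagonal
    return np.mean((dist[mask] - target[mask]) ** 2)

# toy usage
rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 4))
S = rng.uniform(size=(8, 8))
S = (S + S.T) / 2
np.fill_diagonal(S, 1.0)
print(similarity_conservation_loss(Z, S))
```

Once latent distance is a faithful proxy for the (expensive) 3D similarity, nearest-neighbour search in the latent space can prune billions of candidates before any 3D scoring is run.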
Poster: De novo design of protein target specific scaffold-based Inhibitors via Reinforcement Learning
Efficient design and discovery of target-driven molecules is a critical step in facilitating lead optimization in drug discovery. Current approaches to develop molecules for a target protein are intuition-driven, hampered by slow iterative design-test cycles due to computational challenges in utilizing 3D structural data, and ultimately limited by the expertise of the chemist – leading to bottlenecks in molecular design. In this contribution, we propose a novel framework, called 3D-MolGNN_RL, coupling reinforcement learning (RL) to a deep generative model based on 3D-Scaffold to generate target candidates specific to a protein, building up atom by atom from the starting core scaffold. 3D-MolGNN_RL provides an efficient way to optimize key features via a multi-objective reward function within a protein pocket using parallel graph neural network models. The agent learns to build molecules in 3D space while optimizing the activity, binding affinity, potency, and synthetic accessibility of the candidates generated for infectious disease protein targets. Our approach can serve as an interpretable artificial intelligence (AI) tool for lead optimization with optimized activity, potency, and biophysical properties.
Andrew McNaughton · Carter Knutson · Mridula Bontha · Jenna Pope · Neeraj Kumar
Poster: Partial Product Aware Machine Learning on DNA-Encoded Libraries
DNA encoded libraries (DELs) are used for rapid large-scale screening of small molecules against a protein target. These combinatorial libraries are built through several cycles of chemistry and DNA ligation, producing large sets of DNA-tagged molecules. Training machine learning models on DEL data has been shown to be effective at predicting molecules of interest dissimilar from those in the original DEL. Machine learning chemical property prediction approaches rely on the assumption that the property of interest is linked to a single chemical structure. In the context of DNA-encoded libraries, this is equivalent to assuming that every chemical reaction fully yields the desired product. However, in practice, multi-step chemical synthesis sometimes generates partial molecules. Each unique DNA tag in a DEL therefore corresponds to a set of possible molecules. Here, we leverage reaction yield data to enumerate the set of possible molecules corresponding to a given DNA tag. This paper demonstrates that training a custom GNN on this richer dataset improves accuracy and generalization performance.
Polina Binder · Meghan Lawler · LaShadric Grady · Neil Carlson · Svetlana Belyanskaya · Joe Franklin · Nicolas Tilmans · Henri Palacci
Poster: DebiasedDTA: Model Debiasing to Boost Drug-Target Affinity Prediction
Computational models that accurately identify high-affinity protein-chemical pairs can accelerate drug discovery pipelines. These models, trained on available protein-chemical interaction datasets, can be used to predict the binding affinity of an input protein-chemical pair. However, the training datasets may contain surface patterns, or dataset biases, such that the models memorize dataset-specific biomolecule properties instead of learning affinity prediction rules. As a result, the prediction performance of models drops for unseen biomolecules. Here, we present DebiasedDTA, a novel drug-target affinity (DTA) prediction model training framework that addresses dataset biases to improve affinity prediction for novel biomolecules. DebiasedDTA uses ensemble learning and sample weight adaptation to identify and avoid biases and is applicable to most DTA prediction models. The results show that DebiasedDTA can boost models while predicting the interactions between unseen biomolecules. In addition, prediction performance for seen biomolecules also improves when the surface patterns are debiased. The experiments also show that DebiasedDTA can avoid biases of different sources and augment DTA prediction models of different input and model structures. An open-source python package, pydta, is published to facilitate the adoption of DebiasedDTA by future DTA prediction studies. Out-of-the-box, pydta allows debiasing custom DTA prediction models with only two lines of code and eliminates two sources of bias. pydta is designed to be the go-to library for model debiasing in the field of computational drug discovery.
Rıza Özçelik · Alperen Bağ · Berk Atıl · Arzucan Özgür · Elif Ozkirimli
Poster: Improving the assessment of deep learning models in the context of drug-target interaction prediction
Machine Learning techniques have been widely adopted to predict drug-target interactions, a central area of research in early drug discovery. These techniques have shown promising results on various benchmarks, although they tend to suffer from poor generalization. This is typically related to the very sparse and nonuniform datasets available, which limits the applicability domain of Machine Learning techniques. Moreover, widespread approaches to split datasets (into training and test sets) treat drug-target interactions as independent entities, when in reality the drug and target involved may take part in other interactions, breaking apart the assumption of independence. We observe that this leads to overly optimistic test results and poor generalization to out-of-distribution samples for various state-of-the-art sequence-based Machine Learning models for drug-target prediction. We show that previous approaches to reduce bias in binding datasets focus on drug or target information only and, thus, lead to similar pitfalls. Finally, we propose a minimum viable solution to evaluate the generalization capability of a Machine Learning model based on the systematic separation of test samples with respect to drugs and targets in the training set, thus discerning the three out-of-distribution scenarios seen at test time: (1) only the drug present in the training set, (2) only the target present, or (3) neither.
Mirko Torrisi · Antonio De la Vega de Leon · Guillermo Climent · Remco Loos · Alejandro Panjkovich
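The evaluation protocol above hinges on sorting test interactions by whether their drug and/or target was seen during training. A minimal, generic sketch of that separation over (drug, target) pairs is shown below; the pair names are illustrative and this is not the authors' code.

```python
def split_by_ood_scenario(train_pairs, test_pairs):
    """Assign each test (drug, target) pair to an out-of-distribution scenario
    relative to the training set:
      'unseen_target' - drug seen in training, target not
      'unseen_drug'   - target seen in training, drug not
      'unseen_both'   - neither seen in training
    Pairs whose drug and target were both seen are reported separately."""
    train_drugs = {d for d, _ in train_pairs}
    train_targets = {t for _, t in train_pairs}
    buckets = {"seen_both": [], "unseen_target": [], "unseen_drug": [], "unseen_both": []}
    for d, t in test_pairs:
        if d in train_drugs and t in train_targets:
            buckets["seen_both"].append((d, t))
        elif d in train_drugs:
            buckets["unseen_target"].append((d, t))
        elif t in train_targets:
            buckets["unseen_drug"].append((d, t))
        else:
            buckets["unseen_both"].append((d, t))
    return buckets

train = [("aspirin", "COX1"), ("ibuprofen", "COX2")]
test = [("aspirin", "COX2"), ("celecoxib", "COX2"), ("gefitinib", "EGFR")]
print({k: len(v) for k, v in split_by_ood_scenario(train, test).items()})
```

Reporting metrics per bucket, rather than pooled, is what exposes the optimistic bias of random pairwise splits.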
Poster: Deep Learning Model for Flexible and Efficient Protein-Ligand Docking
Protein-ligand docking is an essential tool in structure-based drug design with applications ranging from virtual high-throughput screening to pose prediction for lead optimization. Most docking programs for pose prediction are optimized for re-docking to an existing co-crystallized protein structure, ignoring protein flexibility. In real-world drug design applications, however, protein flexibility is an essential feature of the ligand-binding process. Here we present a deep learning model for flexible protein-ligand docking based on the prediction of an intermolecular Euclidean distance matrix (EDM), making the typical use of search algorithms obsolete. Our method introduces a new approach for the reconstruction of ligand poses in Cartesian coordinates, utilizing EDM completion and restrained energy-based optimization. The model was trained on a large-scale dataset of protein-ligand complexes and evaluated on standardized test sets. Our model generates high-quality poses for a diverse set of protein and ligand structures and outperforms comparable docking methods.
Matthew Masters · Amr Mahmoud · Yao Wei · Markus Lill
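As background for the EDM-based reconstruction step above, the classical multidimensional-scaling (MDS) procedure below recovers 3D coordinates (up to a rigid motion and reflection) from a complete, exact Euclidean distance matrix. The paper additionally handles predicted, incomplete matrices via EDM completion and restrained optimization; this snippet only illustrates the textbook case.

```python
import numpy as np

def coords_from_edm(D, dim=3):
    """Classical MDS: recover point coordinates (up to rotation, translation
    and reflection) from a complete Euclidean distance matrix D (n x n)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    G = -0.5 * J @ (D ** 2) @ J                  # Gram matrix of centered points
    w, V = np.linalg.eigh(G)
    idx = np.argsort(w)[::-1][:dim]              # top `dim` eigenpairs
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0, None))

# sanity check: distances of 5 random 3D points survive the round trip
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
X_rec = coords_from_edm(D)
D_rec = np.linalg.norm(X_rec[:, None] - X_rec[None, :], axis=-1)
print(np.allclose(D, D_rec, atol=1e-6))          # True
```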
Poster: Predicting single-cell perturbation responses for unseen drugs
Single-cell transcriptomics enabled the study of cellular heterogeneity in response to perturbations at the resolution of individual cells. However, scaling high-throughput screens (HTSs) to measure cellular responses for many drugs remains a challenge due to technical limitations and, more importantly, the cost of such multiplexed experiments. Thus, transferring information from routinely performed bulk RNA-seq HTS is required to enrich single-cell data meaningfully. We introduce a new encoder-decoder architecture to study the perturbational effects of unseen drugs. We combine the model with a transfer learning scheme and demonstrate how training on existing bulk RNA-seq HTS datasets can improve generalisation performance. Better generalisation reduces the need for extensive and costly screens at single-cell resolution. We envision that our proposed method will facilitate more efficient experiment designs through its ability to generate in-silico hypotheses, ultimately accelerating targeted drug discovery.
Leon Hetzel · Simon Boehm · Niki Kilbertus · Stephan Günnemann · Mohammad Lotfollahi · Fabian Theis
Poster: Contrastive learning of image- and structure-based representations in drug discovery
Contrastive learning for self-supervised representation learning has brought strong improvements to many application areas, such as computer vision and natural language processing. With the availability of large collections of unlabeled data in vision and language, contrastive learning of language and image representations has brought impressive results. The contrastive learning methods CLIP and CLOOB have demonstrated that the learned representations are highly transferable to a large set of diverse tasks when trained on multi-modal data from two different domains. In drug discovery, similar, large, multi-modal datasets comprising both cell-based microscopy images and chemical structures of molecules are available. However, contrastive learning has not been used for this type of multi-modal data in drug discovery, although transferable representations could be a remedy for the time-consuming and costly label acquisition in this domain. In this work, we present a contrastive learning method for image-based and structure-based representations of small molecules for drug discovery. Our method, Contrastive Leave-One-Out boost for Molecule Encoders (CLOOME), comprises an encoder for microscopy data, an encoder for chemical structures, and a contrastive learning objective. On the benchmark dataset “Cell Painting”, we demonstrate the ability of our method to learn proficient representations by performing linear probing for activity prediction tasks.
Ana Sanchez-Fernandez · Elisabeth Rumetshofer · Sepp Hochreiter · Günter Klambauer
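For readers unfamiliar with the contrastive objective being adapted above, the sketch below shows the simpler CLIP-style symmetric InfoNCE loss between paired microscopy-image and chemical-structure embeddings. Note that CLOOB, and by extension CLOOME, use a modified objective (InfoLOOB with modern Hopfield retrieval); this plain version is only an illustration, and the batch and embedding sizes are toy assumptions.

```python
import numpy as np

def symmetric_infonce(img_emb, mol_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss between paired embeddings of
    microscopy images and chemical structures (both n x d). Matched pairs sit
    on the diagonal of the similarity matrix and are pulled together."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    mol = mol_emb / np.linalg.norm(mol_emb, axis=1, keepdims=True)
    logits = img @ mol.T / temperature               # (n, n) all-pairs similarity
    n = len(logits)
    log_sm_rows = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_cols = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    loss_img_to_mol = -log_sm_rows[np.arange(n), np.arange(n)].mean()
    loss_mol_to_img = -log_sm_cols[np.arange(n), np.arange(n)].mean()
    return (loss_img_to_mol + loss_mol_to_img) / 2

rng = np.random.default_rng(0)
print(symmetric_infonce(rng.normal(size=(16, 32)), rng.normal(size=(16, 32))))
```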
Poster: The Rosenbluth sampling Calculation of Hydrophobic-Polar Model
Lattice proteins are models resembling real proteins. They comprise an energy function and a set of conditions specifying the interaction between elements occupying adjacent lattice sites. In this paper, we present an approach for examining the behavior of chains of a large number of molecules. We investigate this by solving a restricted random walk problem on cubic and square lattices. More specifically, we apply the Hydrophobic-Polar model to examine the spatial characteristics of protein folds using the Monte Carlo method. This technique is the so-called Rosenbluth sampling method for solving restricted random walk problems. Specifically, by solving such walks we resolve folds. In addition, this method can be extended to solve the Hydrophobic-Polar model. In this paper, we describe this method as an algorithm that calculates the energy spectrum for the Hydrophobic-Polar model, and the related formula for estimating the number of folds. Moreover, we estimate the number of folds for each sequence using the Hydrophobic-Polar model energy estimation. On test sequences, the predicted protein folds were obtained with a mismatch of one energy unit. We also observe that the estimated number of folds depends only on the length and not on the type of sequence. This promising strategy can be extended to quantify other proteins in nature.
Marcin Wierzbinski · Alessandro Crimi
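Rosenbluth chain growth is a standard Monte Carlo technique, so a compact sketch can make the abstract concrete: grow the chain one monomer at a time among unoccupied neighbour sites on the square lattice, accumulate the Rosenbluth weight, and score non-bonded H-H contacts. The sequence, sample count, and energy convention (-1 per H-H contact) are illustrative; this is not the authors' implementation.

```python
import random

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def rosenbluth_grow(sequence, rng):
    """Grow one self-avoiding conformation of an HP sequence on the square
    lattice with Rosenbluth sampling. Returns (positions, weight, energy) or
    None if the walk gets trapped. Energy = -1 per non-bonded H-H contact."""
    positions = [(0, 0)]
    occupied = {(0, 0)}
    weight = 1.0
    for _ in sequence[1:]:
        x, y = positions[-1]
        free = [(x + dx, y + dy) for dx, dy in MOVES if (x + dx, y + dy) not in occupied]
        if not free:
            return None                      # trapped: discard this attempt
        weight *= len(free)                  # Rosenbluth weight accumulates choices
        nxt = rng.choice(free)
        positions.append(nxt)
        occupied.add(nxt)
    energy = 0
    index = {p: i for i, p in enumerate(positions)}
    for i, p in enumerate(positions):
        if sequence[i] != "H":
            continue
        for dx, dy in MOVES:
            j = index.get((p[0] + dx, p[1] + dy))
            if j is not None and j > i + 1 and sequence[j] == "H":
                energy -= 1                  # non-bonded H-H contact, counted once
    return positions, weight, energy

rng = random.Random(0)
seq = "HPHPPHHPHH"                           # illustrative 10-mer sequence
attempts = 2000
results = [rosenbluth_grow(seq, rng) for _ in range(attempts)]
weights = [r[1] if r else 0.0 for r in results]
n_folds_estimate = sum(weights) / attempts   # mean weight estimates number of self-avoiding folds
ok = [r for r in results if r]
avg_energy = sum(w * e for _, w, e in ok) / sum(w for _, w, e in ok)
print(round(n_folds_estimate), round(avg_energy, 2))
```

The weight bookkeeping is what lets the biased chain-growth samples be reweighted into unbiased estimates such as the fold count mentioned in the abstract.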
Poster: GRPE: Relative Positional Encoding for Graph Transformer
Designing an efficient model to encode graphs is a key challenge of molecular representation learning. The Transformer, built upon efficient self-attention, is a natural choice for graph processing, but it requires explicit incorporation of positional information. Existing approaches either linearize a graph to encode absolute position in the sequence of nodes, or encode relative position with respect to another node using bias terms. The former loses the preciseness of relative position from linearization, while the latter loses a tight integration of node-edge and node-spatial information. In this work, we propose relative positional encoding for a graph to overcome the weaknesses of the previous approaches. Our method encodes a graph without linearization and considers both node-spatial and node-edge relations. We name our method Graph Relative Positional Encoding, dedicated to graph representation learning. Experiments conducted on various molecular property prediction datasets show that the proposed method outperforms previous approaches significantly. Our code is publicly available at https://github.com/lenscloth/GRPE.
Wonpyo Park · Woong-Gi Chang · Donggeon Lee · Juntae Kim · seung-won hwang
Poster: Multi-Segment Preserving Sampling for Deep Manifold Sampler
Deep generative modeling for biological sequences presents a unique challenge in reconciling the bias-variance trade-off between explicit biological insight and model flexibility. The deep manifold sampler was recently proposed as a means to iteratively sample variable-length protein sequences. Sampling was done by exploiting the gradients from a function predictor trained on top of the manifold sampler. In this work, we introduce an alternative approach to guided sampling that enables the direct inclusion of domain-specific knowledge by designating preserved and non-preserved segments along the input sequence, thereby restricting variation to only select regions. We call this method "multi-segment preserving sampling" and present its effectiveness in the context of antibody design. We train two models: a deep manifold sampler and a GPT-2 language model on nearly six million heavy chain sequences annotated with the IGHV1-18 gene. During sampling, we restrict variation to only the complementarity-determining region 3 (CDR3) of the input. We obtain log probability scores from a GPT-2 model for each sampled CDR3 and demonstrate that multi-segment preserving sampling generates reasonable designs while maintaining the desired, preserved regions.
Dan Berenberg · Jae Hyeon Lee · Simon Kelow · Ji Park · Andrew Watkins · Richard Bonneau · Vladimir Gligorijevic · Stephen Ra · Kyunghyun Cho
Poster: MetaDTA: Meta-learning-based drug-target binding affinity prediction
We propose a meta-learning-based model for drug-target binding affinity prediction (MetaDTA), for which no information on protein structures or binding sites is available. We formulate our method based on Attentive Neural Processes (ANPs) (Kim et al., 2019), where the binding affinities for each target protein are modeled as a regression function of the compounds. Known drug-target binding affinity pairs are used as the support set to determine the regression function. We designed few-shot prediction experiments with a small number of support-set examples, which are similar to the typical situations in actual drug discovery processes. Experimental results showed that the proposed method outperforms the sequence-based baseline models with the same amount of limited data.
Eunjoo Lee · Jiho Yoo · Huisun Lee · Seunghoon Hong
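To illustrate how few-shot predictions can be conditioned on a small support set of known measurements for a target, the sketch below shows only the cross-attention readout idea underlying ANPs, i.e. an attention-weighted regression over (compound embedding, affinity) pairs. It is a simplified stand-in, not the latent-variable ANP model used by MetaDTA, and all sizes and values are illustrative.

```python
import numpy as np

def attentive_readout(query_x, support_x, support_y, temperature=1.0):
    """Predict affinities for query compounds by attending over a support set
    of measured (compound embedding, affinity) pairs for the same target."""
    scores = query_x @ support_x.T / (np.sqrt(query_x.shape[1]) * temperature)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)          # softmax over support points
    return attn @ support_y                          # weighted average of support affinities

rng = np.random.default_rng(0)
support_x = rng.normal(size=(10, 16))                # 10 measured compounds (few-shot support)
support_y = rng.uniform(4, 9, size=10)               # e.g. pKd values
query_x = rng.normal(size=(3, 16))
print(attentive_readout(query_x, support_x, support_y))
```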
Poster: An evaluation framework for the objective functions of de novo drug design benchmarks
De novo drug design has recently received increasing attention from the machine learning community. It is important that the field is aware of the actual goals and challenges of drug design and the roles that de novo molecule design algorithms could play in accelerating the process, so that algorithms can be evaluated in a way that reflects how they would be applied in real drug design scenarios. In this paper, we propose a framework for critically assessing the merits of benchmarks, and argue that most of the existing de novo drug design benchmark functions are either highly unrealistic or depend upon a surrogate model whose performance is not well characterized. In order for the field to achieve its long-term goals, we recommend that poor benchmarks (especially logP and QED) be deprecated in favour of better benchmarks. We hope that our proposed framework can play a part in developing new de novo drug design benchmarks that are more realistic and ideally incorporate the intrinsic goals of drug design.
Austin Tripp · Wenlin Chen · José Miguel Hernández Lobato
Poster: PREDICTION OF MOLECULAR FIELD POINTS USING SE(3)-TRANSFORMER MODEL
Due to their computational efficiency, 2D fingerprints are typically used in similarity-based high-content screening. The interaction of a ligand with its target protein, however, relies on its physicochemical interactions in 3D space. Thus, ligands with different 2D scaffolds can bind to the same protein if these ligands share similar interaction patterns. Molecular fields can represent those interaction profiles. For efficiency, the extrema of those molecular fields, named field points, are used to quantify ligand similarity in 3D. The calculation of field points involves the evaluation of the interaction energy between the ligand and a small probe shifted on a fine grid representing the molecular surface. These calculations are computationally prohibitive for large datasets of ligands, making field point representations of molecules intractable for high-content screening. Here, we overcome this roadblock by one-shot prediction of field points using generative neural networks based on the molecular structure alone. Field points are predicted by training an SE(3)-Transformer, an equivariant, attention-based graph neural network architecture, on a large set of ligands with field point data. Initial data demonstrate the feasibility of this approach to precisely generate negative, positive and hydrophobic field points within 1 Å of the ground truth for a diverse set of drug-like molecules.
Florian Hinz · Amr Mahmoud · Markus Lill
-
|
Convolutions are competitive with transformers for protein sequence pretraining
(
Poster
)
link »
Pretrained protein sequence language models largely rely on the transformer architecture. However, transformer run-time and memory requirements scale quadratically with sequence length. We investigate the potential of a convolution-based architecture for protein sequence masked language model pretraining and subsequent finetuning. CNNs are competitive on the pretraining task with transformers across several orders of magnitude in parameter size while scaling linearly with sequence length. More importantly, CNNs are competitive with and occasionally superior to transformers across an extensive set of downstream evaluations, including structure prediction, zero-shot mutation effect prediction, and out-of-domain generalization. We also demonstrate strong performance on sequences longer than the positional embeddings allowed in the current state-of-the-art transformer protein masked language models. Finally, we close with a call to disentangle the effects of pretraining task and model architecture when studying pretrained protein sequence models. |
Kevin K Yang · Alex Lu · Nicolo Fusi 🔗 |
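The core recipe the abstract describes (a convolutional masked language model over amino-acid sequences) can be sketched in a few lines. This is a hedged, generic stand-in rather than the authors' architecture; the vocabulary, kernel width, depth, and masking rate are illustrative choices.

```python
# Hedged sketch of the general idea: a small 1D CNN trained with a masked-token
# objective on protein sequences. Not the authors' model; hyperparameters are made up.
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"
MASK = len(AA)                # reserve one extra token id for the mask
VOCAB = len(AA) + 1

class ConvMLM(nn.Module):
    def __init__(self, dim=128, kernel=9, layers=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel, padding=kernel // 2) for _ in range(layers)
        )
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):                  # tokens: (batch, length)
        x = self.embed(tokens).transpose(1, 2)  # (batch, dim, length)
        for conv in self.convs:
            x = torch.relu(conv(x)) + x         # residual conv blocks, linear in length
        return self.head(x.transpose(1, 2))     # (batch, length, vocab)

# One illustrative masked-LM step on random "sequences".
tokens = torch.randint(0, len(AA), (8, 100))
mask = torch.rand(tokens.shape) < 0.15
inputs = tokens.masked_fill(mask, MASK)
logits = ConvMLM()(inputs)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()
```

The linear-in-length cost of the convolutions (versus quadratic attention) is what enables the long-sequence results mentioned in the abstract.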
-
|
Glolloc: Mixture of Global and Local Experts for Molecular Activity Prediction
(
Poster
)
link »
SlidesLive Video » Quantitative structure-activity relationships (QSAR) models have been used for decades to predict the activity of small molecules, using encodings of the molecular structure, for which simple 2D descriptors of the molecular graph are still most commonly used. One of the recurrent problems of QSAR models is that relationships observed for a specific scaffold (pruned molecular skeleton) are sometimes not translatable to another, due to the 3D flexibility of molecular objects. This is also true when building multitask networks predicting the activity against several proteins at the same time - sometimes single protein models work better, and adding dissimilar proteins into the model decreases performance. In this paper, mixtures of experts (MoE) are used to combine a global network and local structures of the dataset (e.g. molecular scaffold, single protein) in the single task or multitask framework. We show that structuring the learning process with protein or chemical series information can enhance model performance and provide a built-in model introspection tool. |
Héléna A. Gaspar · Matthew Seddon 🔗 |
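A rough sketch of the idea as described, combining one global predictor with several local experts weighted by a learned gate, is shown below. This is not the authors' implementation: the descriptor dimension, number of experts, and gating scheme are placeholders, and the real model keys local experts to dataset structure such as scaffold or protein.

```python
# Rough sketch (placeholder dimensions, not the paper's code): mixture of a global
# expert and local experts for molecular activity prediction.
import torch
import torch.nn as nn

class GlobalLocalMoE(nn.Module):
    def __init__(self, in_dim=2048, hidden=256, n_local=4):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.global_expert = mlp()
        self.local_experts = nn.ModuleList(mlp() for _ in range(n_local))
        self.gate = nn.Linear(in_dim, n_local + 1)   # weights over [global, local_1..n]

    def forward(self, x):
        preds = torch.cat(
            [self.global_expert(x)] + [expert(x) for expert in self.local_experts], dim=-1
        )                                             # (batch, n_local + 1)
        weights = torch.softmax(self.gate(x), dim=-1)
        return (weights * preds).sum(dim=-1)          # weighted activity prediction

fingerprints = torch.rand(16, 2048)                   # e.g. 2D fingerprint descriptors
activity = GlobalLocalMoE()(fingerprints)
```

Inspecting the gate weights per scaffold or protein is one simple way such a model provides the "built-in introspection" the abstract alludes to.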
-
|
Benchmarking Uncertainty Quantification for Protein Engineering
(
Poster
)
link »
Machine learning sequence-function models for proteins could enable significant advances in protein engineering, especially when paired with state-of-the-art methods to select new sequences for property optimization and/or model improvement. Such methods (Bayesian optimization and active learning) require calibrated estimations of model uncertainty. While studies have benchmarked a variety of deep learning uncertainty quantification (UQ) methods on standard and molecular machine-learning datasets, it is not clear how well these results extend to protein datasets. In this work, we implement a panel of deep learning UQ methods on the Fitness Landscape Inference for Proteins (FLIP) benchmark regression tasks. We compare results across different degrees of distributional shift using metrics that assess each UQ method's accuracy, calibration, coverage, width, and rank correlation to provide recommendations for the effective design of biological sequences. |
Kevin P Greenman · Ava Soleimany · Kevin K Yang 🔗 |
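As a concrete illustration of the kind of comparison described, here is a minimal sketch of one UQ method (a deep ensemble) evaluated with two of the metric families mentioned: rank correlation and interval coverage. The data and network are random placeholders, not the FLIP tasks, and the metrics are simplified.

```python
# Minimal sketch: deep-ensemble uncertainty with rank-correlation and coverage metrics.
# Placeholder data and model, not the paper's benchmark.
import numpy as np
import torch
import torch.nn as nn
from scipy.stats import spearmanr

def make_net():
    return nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))

x_train, y_train = torch.rand(256, 64), torch.rand(256, 1)
x_test, y_test = torch.rand(64, 64), torch.rand(64, 1)

ensemble = []
for seed in range(5):                                   # independently trained members
    torch.manual_seed(seed)
    net = make_net()
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(200):
        opt.zero_grad()
        nn.functional.mse_loss(net(x_train), y_train).backward()
        opt.step()
    ensemble.append(net)

with torch.no_grad():
    preds = torch.stack([net(x_test) for net in ensemble])   # (5, 64, 1)
mean = preds.mean(0).squeeze(-1).numpy()
std = preds.std(0).squeeze(-1).numpy()
y = y_test.squeeze(-1).numpy()

rank_corr = spearmanr(mean, y).correlation
coverage = np.mean(np.abs(y - mean) <= 1.96 * std)            # nominal 95% interval
print(f"Spearman rho = {rank_corr:.2f}, 95% coverage = {coverage:.2f}")
```

Coverage close to the nominal level and high rank correlation of the predictive standard deviation with error are the kinds of properties Bayesian optimization and active learning loops depend on.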
-
|
EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction
(
Poster
)
link »
Predicting how a drug-like molecule binds to a specific protein target is a core problem in drug discovery. An extremely fast computational binding method would enable key applications such as fast virtual screening or drug engineering. Existing methods are computationally expensive as they rely on heavy candidate sampling coupled with scoring, ranking, and fine-tuning steps. We challenge this paradigm with EquiBind, an SE(3)-equivariant geometric deep learning model performing direct-shot prediction of both i) the receptor binding location (blind docking) and ii) the ligand's bound pose and orientation. EquiBind achieves significant speed-ups and better quality compared to traditional and recent baselines. Further, we show extra improvements when coupling it with existing fine-tuning techniques at the cost of increased running time. Finally, we propose a novel and fast fine-tuning model that adjusts torsion angles of a ligand's rotatable bonds based on closed-form global minima of the von Mises angular distance to a given input atomic point cloud, avoiding previous expensive differential evolution strategies for energy minimization. |
Hannes Stärk · Octavian Ganea · Lagnajit Pattanaik · Regina Barzilay · Tommi Jaakkola 🔗 |
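The "closed-form global minima of the von Mises angular distance" step can be illustrated in isolation: the angle minimizing a weighted sum of terms 1 - cos(t - t_i) is the weighted circular mean. The sketch below shows only that step; the target angles and weights are invented, and the full EquiBind fine-tuning applies this per rotatable bond against an atomic point cloud.

```python
# Sketch of the closed-form step only (as I read the abstract), not the full model.
import numpy as np

def closed_form_torsion(target_angles, weights=None):
    """Weighted circular mean: global minimizer of sum_i w_i * (1 - cos(t - t_i))."""
    t = np.asarray(target_angles, dtype=float)
    w = np.ones_like(t) if weights is None else np.asarray(weights, dtype=float)
    return np.arctan2((w * np.sin(t)).sum(), (w * np.cos(t)).sum())

angles = np.deg2rad([10.0, 30.0, 350.0])   # hypothetical per-atom torsion targets for one bond
best = closed_form_torsion(angles)
print(f"optimal torsion ~= {np.rad2deg(best):.1f} degrees")
```

Because the minimizer is available analytically, no iterative search such as differential evolution is needed for this sub-problem, which is the source of the speed-up mentioned.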
-
|
ChemSpacE: Toward Steerable and Interpretable Chemical Space Exploration
(
Poster
)
link »
Discovering new structures in the chemical space is a long-standing challenge and has important applications to various fields such as chemistry, material science, and drug discovery. Deep generative models have been used in \textit{de novo} molecule design to embed molecules in a meaningful latent space and then sample new molecules from it. However, the steerability and interpretability of the learned latent space remain much less explored. In this paper, we introduce a new task named \textit{molecule manipulation}, which aims to align the properties of the generated molecule and its latent activation in order to achieve interactive molecule editing. Then we develop a method called \textbf{Chem}ical \textbf{Spac}e \textbf{E}xplorer (ChemSpacE), which identifies and traverses interpretable directions in the latent space that align with molecular structures and property changes. ChemSpacE is highly efficient in terms of training/inference time, data, and the number of oracle calls. Experiments show that ChemSpacE can efficiently steer the latent spaces of multiple state-of-the-art molecule generative models for interactive molecule design and discovery. |
Yuanqi Du · Xian Liu · Shengchao Liu · Jieyu Zhang · Bolei Zhou 🔗 |
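The latent traversal at the heart of such interactive editing can be sketched generically: move a molecule's latent code along a direction and decode at several step sizes. The `decoder` and `direction` below are hypothetical placeholders; identifying directions that actually align with property changes is the paper's contribution and is not shown here.

```python
# Conceptual sketch only: traversing a latent direction in a pretrained generative model.
# `decoder` and `direction` are placeholders, not ChemSpacE components.
import torch

def traverse(z, direction, decoder, steps=(-2, -1, 0, 1, 2)):
    """Decode along z + alpha * direction for several step sizes alpha."""
    direction = direction / direction.norm()
    return [decoder(z + alpha * direction) for alpha in steps]

latent_dim = 32
decoder = torch.nn.Linear(latent_dim, 64)   # stand-in for a molecule decoder
z = torch.randn(1, latent_dim)              # latent code of a seed molecule
direction = torch.randn(latent_dim)         # e.g. a direction correlated with some property
edits = traverse(z, direction, decoder)
```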
-
|
Deep sharpening of topological features for de novo protein design
(
Poster
)
link »
SlidesLive Video » Computational \emph{de novo} protein design allows the exploration of uncharted areas of the protein structure and sequence spaces. Classical approaches to \emph{de novo} protein design involve an iterative process where the desired protein shape is outlined, then sampled for structural backbones and designed with low energy amino acid sequences. Despite numerous successes, inaccuracies within energy functions and sampling methods often lead to physically unrealistic protein backbones yielding sequences that fail to fold experimentally. Recently, deep neural networks have successfully been used to design novel protein folds from scratch by iteratively predicting a structure and optimizing the sequence until a target protein structure is reached. These methods work well under circumstances where distributions of physically realistic target protein backbones can be readily defined, but lack the ability to \emph{de novo} design loosely specified protein shapes. In fact, a major challenge for \emph{de novo} protein design is to generate "designable" protein structures for defined folds, including native and artificial ("dark matter") folds that can then be used to find low energetic sequences in a generic manner. Here, we automate the task of creating designable backbones using a variational autoencoder framework, termed \textsc{Genesis}, to denoise sketches of protein topological lattice models by sharpening their 2D representations in distance and angle feature maps. In conjunction with the trRosetta design framework, large pools of diverse sequences for different protein folds were generated for the maps. We found that the \textsc{Genesis}-trDesign framework generates native-like feature maps for known and dark matter protein folds. Ultimately, the \textsc{Genesis} framework addresses the protein backbone designability problem and could contribute to the \emph{de novo} design of structurally defined artificial proteins that can be tailored for novel functionalities. |
Zander Harteveld · Joshua Southern · Michaël Defferrard · Andreas Loukas · Pierre Vandergheynst · Micheal Bronstein · Bruno Correia 🔗 |
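The general recipe of sharpening coarse 2D feature maps into native-like ones can be sketched with a small map-to-map network. This is a simplified, non-variational stand-in rather than the Genesis model; channel counts and map size are arbitrary.

```python
# Very rough sketch of the general recipe (a denoising network over 2D feature maps);
# not the Genesis architecture. Channels and sizes are arbitrary.
import torch
import torch.nn as nn

class MapDenoiser(nn.Module):
    """Sharpen coarse 2D distance/angle maps into refined feature maps."""
    def __init__(self, channels=2):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.decode = nn.Sequential(
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, sketch_maps):
        return self.decode(self.encode(sketch_maps))

coarse = torch.rand(4, 2, 64, 64)   # sketch maps: e.g. distance + angle channels
refined = MapDenoiser()(coarse)     # in training, compared against maps of real structures
```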
-
|
Isolating salient variations of interest in single-cell transcriptomic data with contrastiveVI
(
Poster
)
link »
Single-cell RNA sequencing (scRNA-seq) technologies enable a better understanding of previously unexplored biological diversity. Oftentimes, researchers are specifically interested in modeling the latent structures and variations enriched in one target scRNA-seq dataset as compared to another background dataset generated from sources of variation irrelevant to the task at hand. For example, we may wish to isolate factors of variation only present in measurements from patients with a given disease as opposed to those shared with data from healthy control subjects. Here we introduce Contrastive Variational Inference (contrastiveVI; https://github.com/suinleelab/contrastiveVI), a framework for end-to-end analysis of target scRNA-seq datasets that decomposes the variations into shared and target-specific factors of variation. On four target-background dataset pairs, we apply contrastiveVI to perform a number of standard analysis tasks, including visualization, clustering, and differential expression testing, and we consistently achieve results that agree with known biological ground truths. |
Ethan Weinberger · Chris Lin · Su-In Lee 🔗 |
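The shared-versus-salient factorization can be illustrated compactly: both datasets pass through a shared latent pathway, while only target cells contribute a salient latent, and background cells are reconstructed from the shared factors alone. The sketch below is a toy, deterministic stand-in; the real contrastiveVI uses count likelihoods and amortized variational inference, which are omitted.

```python
# Compact toy sketch of the shared/salient factorization (not contrastiveVI itself).
import torch
import torch.nn as nn

class ContrastiveFactorizer(nn.Module):
    def __init__(self, n_genes=2000, shared_dim=10, salient_dim=10):
        super().__init__()
        self.shared = nn.Linear(n_genes, shared_dim)     # variation present in both datasets
        self.salient = nn.Linear(n_genes, salient_dim)   # variation enriched in the target only
        self.decoder = nn.Linear(shared_dim + salient_dim, n_genes)

    def forward(self, x, is_background):
        z_shared = self.shared(x)
        z_salient = self.salient(x)
        # Key constraint: background cells are reconstructed from shared factors alone.
        z_salient = torch.where(is_background, torch.zeros_like(z_salient), z_salient)
        return self.decoder(torch.cat([z_shared, z_salient], dim=-1))

model = ContrastiveFactorizer()
target = torch.rand(8, 2000)       # e.g. diseased cells
background = torch.rand(8, 2000)   # e.g. healthy controls
recon_t = model(target, is_background=torch.zeros(8, 1, dtype=torch.bool))
recon_b = model(background, is_background=torch.ones(8, 1, dtype=torch.bool))
loss = ((recon_t - target) ** 2).mean() + ((recon_b - background) ** 2).mean()
```

Downstream analyses such as clustering or differential expression can then be run on the salient latent alone, isolating target-specific variation.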
-
|
Regression Transformer: Concurrent Conditional Generation and Regression by Blending Numerical and Textual Tokens
(
Poster
)
link »
We report the Regression Transformer (RT), a method that abstracts regression as a conditional sequence modeling problem. The RT casts continuous properties as sequences of numerical tokens and encodes them jointly with conventional tokens. This yields a dichotomous model that concurrently excels at regression tasks and property-driven conditional generation, and can seamlessly transition between the two, solely governed by the mask location. We propose several extensions to the XLNet objective and adopt an alternating training scheme to concurrently optimize property prediction and conditional text generation with a self-consistency loss. Our experiments on both chemical and protein languages demonstrate that the performance of traditional regression models can be surpassed despite training with a cross-entropy loss. Importantly, priming the same model with continuous properties yields a highly competitive conditional generative model that outperforms specialized approaches in a constrained property optimization benchmark. In sum, the Regression Transformer opens the door for "swiss army knife" models that excel at both regression and conditional generation. This finds application particularly in property-driven, local exploration of the chemical or protein space. The code to reproduce all experiments of the paper is available at: https://anonymous.4open.science/r/regression-transformer/ |
Jannis Born · Matteo Manica 🔗 |
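The token-level trick described can be sketched without any model: spell a continuous property out as numerical tokens, concatenate them with the molecule tokens, and choose which side to mask. The token format below is my own simplification, not the RT tokenizer.

```python
# Small sketch of the numerical-token idea (simplified format, not the RT tokenizer).
def encode_example(smiles: str, prop_name: str, value: float):
    prop_tokens = [f"<{prop_name}>"] + [f"_{c}_" for c in f"{value:.3f}"]
    mol_tokens = list(smiles)                  # character-level SMILES for simplicity
    return prop_tokens + ["|"] + mol_tokens

tokens = encode_example("CCO", "qed", 0.407)
# -> ['<qed>', '_0_', '_._', '_4_', '_0_', '_7_', '|', 'C', 'C', 'O']

# Masking the numerical tokens turns the task into regression; masking the molecule
# tokens turns the same sequence into property-conditioned generation.
regression_input = ["[MASK]" if t.startswith("_") else t for t in tokens]
generation_input = [t if t.startswith(("_", "<", "|")) else "[MASK]" for t in tokens]
print(regression_input)
print(generation_input)
```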