We are at a pivotal time in healthcare, characterized by unprecedented scientific and technological progress in recent years and by the promise of personalized medicine to radically transform the way we provide care to patients. However, drug discovery has become an increasingly challenging endeavor: not only has the success rate of developing new therapeutics been historically low, but this rate has been steadily declining. The average cost to bring a new drug to market (factoring in failures) is now estimated at $2.6 billion, 140% higher than a decade earlier. Machine learning (ML) presents a unique opportunity to address this challenge. While there has been growing interest and pioneering work in the ML community over the past decade, the specific challenges posed by drug discovery are largely unknown to the broader community. Last year, the first MLDD workshop at ICLR 2022 brought together hundreds of attendees and world-class experts in ML for drug discovery, received about 60 paper submissions from the community, and featured a two-month community challenge run in parallel with the workshop. Building on that success, we are organizing a second edition of the MLDD workshop at ICLR 2023, with the ambition to federate the community interested in this application domain, where i) ML can have a significant positive impact for the benefit of all, and ii) the application domain can drive ML method development through novel problem settings, benchmarks, and testing grounds at the intersection of many subfields, ranging from representation, active, and reinforcement learning to causality and treatment effects.
Fri 2:00 a.m. - 2:10 a.m. | Intro (Workshop introduction)
Pascal Notin
Fri 2:10 a.m. - 2:55 a.m. | Invited Talk - Fabian Theis (Talk)
Fri 2:55 a.m. - 3:40 a.m. | Invited Talk - Michael Bronstein (Talk)
Michael Bronstein
Fri 3:40 a.m. - 3:45 a.m. | Do deep learning models really outperform traditional approaches in molecular docking? (Oral)
Molecular docking, the task of predicting the binding mode of a protein-ligand complex given a ligand molecule and a ligand binding site (a "pocket") on the protein, is a widely used technique in drug design. Many deep learning models have been developed for molecular docking, but most of them perform docking on the whole protein rather than on a given pocket as traditional approaches do, which does not match common needs. Moreover, they claim to outperform traditional molecular docking, but the comparison is not fair, since traditional methods are not designed for docking on the whole protein without a given pocket. In this paper, we design a series of experiments to examine the actual performance of these deep learning models and traditional methods. For a fair comparison, we decompose docking on the whole protein into two steps, pocket searching and docking on a given pocket, and build pipelines to evaluate traditional and deep learning methods respectively. We find that deep learning models are actually good at pocket searching, but traditional methods are better than deep learning models at docking on given pockets. Overall, our work explicitly reveals some potential problems in current deep learning models for molecular docking and provides several suggestions for future work.
Yuejiang Yu · Shuqi Lu · Zhifeng Gao · Hang Zheng · Guolin Ke
Fri 3:45 a.m. - 3:50 a.m. | Differentiable Multi-Target Causal Bayesian Experimental Design (Oral)
We introduce a gradient-based approach for the problem of Bayesian optimal experimental design to learn causal models in a batch setting, a critical component for causal discovery from finite data where interventions can be costly or risky. Existing methods rely on greedy approximations to construct a batch of experiments while using black-box methods to optimize over a single target-state pair to intervene with. In this work, we completely dispose of the black-box optimization techniques and greedy heuristics and instead propose a conceptually simple end-to-end gradient-based optimization procedure to acquire a set of optimal intervention target-state pairs. Such a procedure enables parameterization of the design space to efficiently optimize over a batch of multi-target-state interventions, a setting which has hitherto not been explored due to its complexity. We demonstrate that our proposed method outperforms baselines and existing acquisition strategies in both single-target and multi-target settings across a number of synthetic datasets.
Panagiotis Tigas · Yashas Annadani · Desi Ivanova · Andrew Jesson · Yarin Gal · Adam Foster · Stefan Bauer
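The core idea of the abstract above, replacing greedy, black-box search over discrete intervention targets with gradient-based optimization, can be sketched with a softmax relaxation. This toy is not the paper's method: the utility function below is a made-up stand-in for an experimental-design objective, and a single target is selected rather than a batch.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes = 8
# hypothetical per-target information gain (stand-in for a design objective)
utility_per_node = rng.normal(size=num_nodes)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# relax the discrete choice "which node to intervene on" to softmax weights
logits = np.zeros(num_nodes)
lr = 0.5
for _ in range(200):
    w = softmax(logits)
    # analytic gradient of the expected utility w @ u w.r.t. the logits
    grad = w * (utility_per_node - w @ utility_per_node)
    logits += lr * grad  # gradient ascent instead of greedy enumeration

best = int(np.argmax(logits))
print("selected intervention target:", best)
```

With enough steps the relaxation concentrates on the highest-utility target, which is what a greedy enumeration would also pick here; the gradient formulation, however, scales to joint optimization over batches of target-state pairs.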
Fri 3:50 a.m. - 3:55 a.m. | Do Deep Learning Methods Really Perform Better in Molecular Conformation Generation? (Oral)
Molecular conformation generation (MCG) is a fundamental and important problem in drug discovery. Many traditional methods have been developed to solve the MCG problem, such as systematic searching, model-building, random searching, distance geometry, molecular dynamics, and Monte Carlo methods. However, they have limitations that depend on the molecular structure. Recently, plenty of deep learning based MCG methods have appeared, claiming to largely outperform the traditional methods. However, to our surprise, we design a simple, cheap, parameter-free algorithm based on the traditional methods and find it is comparable to or even outperforms deep learning based MCG methods on the widely used GEOM-QM9 and GEOM-Drugs benchmarks. In particular, our algorithm is simply a clustering of the RDKit-generated conformations. We hope our findings can help the community to revise the deep learning methods for MCG. The code of the proposed algorithm can be found at https://gist.github.com/ZhouGengmo/5b565f51adafcd911c0bc115b2ef027c.
Gengmo Zhou · Zhifeng Gao · Zhewei Wei · Hang Zheng · Guolin Ke
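The clustering baseline described above can be sketched with a library-free Butina-style procedure over a pairwise RMSD matrix. The coordinates here are random stand-ins; in practice they would come from an RDKit conformer-embedding call, and RMSD would be computed after alignment (both assumptions simplified away here).

```python
import numpy as np

def rmsd(a, b):
    # naive coordinate RMSD (no alignment) between two (atoms, 3) arrays
    return np.sqrt(((a - b) ** 2).sum(axis=1).mean())

def butina_cluster(confs, cutoff):
    n = len(confs)
    d = np.array([[rmsd(confs[i], confs[j]) for j in range(n)] for i in range(n)])
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # pick the conformer with the most unassigned neighbours within the cutoff
        counts = {i: sum(1 for j in unassigned if d[i, j] < cutoff) for i in unassigned}
        center = max(counts, key=counts.get)
        members = [j for j in unassigned if d[center, j] < cutoff]
        clusters.append((center, members))
        unassigned -= set(members)
    return clusters

rng = np.random.default_rng(0)
confs = [rng.normal(size=(5, 3)) for _ in range(20)]  # 20 fake 5-atom conformers
clusters = butina_cluster(confs, cutoff=2.0)
print(len(clusters), "clusters from", len(confs), "conformations")
```

Keeping one representative per cluster yields a cheap, parameter-light conformer ensemble of the kind the paper compares against learned generators.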
Fri 3:55 a.m. - 4:00 a.m. | Exploring Chemical Space with Score-based Out-of-distribution Generation (Oral)
A well-known limitation of existing molecular generative models is that the generated molecules highly resemble those in the training set. To generate truly novel molecules that may have even better properties for de novo drug discovery, more powerful exploration of the chemical space is necessary. To this end, we propose Molecular Out-Of-distribution Diffusion (MOOD), a novel score-based diffusion scheme that incorporates out-of-distribution (OOD) control in the generative stochastic differential equation (SDE) through a single hyperparameter, and thus requires no additional computational cost. Since some novel molecules may be chemically implausible or may not meet the basic requirements of real-world drugs, MOOD performs conditional generation by utilizing the gradients of a property predictor that guides the reverse-time diffusion process to high-scoring regions according to target properties such as protein-ligand interactions, drug-likeness, and synthesizability. This allows MOOD to search for novel and meaningful molecules rather than generating unseen yet trivial ones. We experimentally validate that MOOD is able to explore the chemical space beyond the training distribution, generating molecules that outscore those found with existing methods, and even the top 0.01% of the original training pool.
Seul Lee · Jaehyeong Jo · Sung Ju Hwang
Fri 4:00 a.m. - 4:45 a.m. | Invited Talk - Djork-Arné Clevert (Talk)
Fri 4:45 a.m. - 5:40 a.m. | IMPROVING PROTEIN-PEPTIDE INTERFACE PREDICTIONS IN THE LOW DATA REGIME (Poster)
We propose a novel approach for predicting protein-peptide interactions using a bi-modal transformer architecture that learns an interfacial joint distribution of residue contacts. The current data sets for crystallized protein-peptide complexes are limited, making it difficult to accurately predict interactions between proteins and peptides. To address this issue, we propose augmenting the existing data from PepBDB with pseudo protein-peptide complexes derived from the PDB. The augmented data set acts as a means of transferring physics-based, context-dependent intra-residue (within-domain) interactions to inter-residue (between-domain) interactions. We show that the distributions of interfacial residue-residue interactions overlap sufficiently with intra-domain residue-residue interactions to increase the predictive power of our bi-modal transformer architecture. In addition, this data augmentation allows us to leverage the vast amount of protein-only data available in the PDB to train neural networks, in contrast to template-based modeling that acts as a prior.
Justin Diamond · Markus Lill
Fri 4:45 a.m. - 5:40 a.m. | MiDi: Mixed Graph and 3D Denoising Diffusion for Molecule Generation (Poster)
This work introduces MiDi, a diffusion model for jointly generating molecular graphs and corresponding 3D conformers. In contrast to existing models which derive molecular bonds from the conformation using predefined rules, MiDi streamlines the molecule generation process with an end-to-end differentiable model. Preliminary results demonstrate the benefits of this approach: on the complex GEOM-DRUGS dataset, our model generates significantly better molecular graphs than 3D-based models, and even surpasses specialized algorithms that directly optimize the bond orders for validity.
Clément Vignac · Nagham Osman · Laura Toni · Pascal Frossard
Fri 4:45 a.m. - 5:40 a.m. | Uni-Fold MuSSe: De Novo Protein Complex Prediction with Protein Language Models (Poster)
Accurately solving the structures of protein complexes is crucial for understanding and further modifying biological activities. The recent success of AlphaFold and its variants shows that deep learning models are capable of accurately predicting protein complex structures, albeit at the cost of painstaking homology search and pairing. To bypass this need, we present Uni-Fold MuSSe (Multimer with Single Sequence inputs), which predicts protein complex structures from their primary sequences with the aid of pre-trained protein language models. Specifically, we built protein complex prediction models based on the protein sequence representations of ESM-2, a large protein language model with 3 billion parameters. In order to adapt the language model to inter-protein evolutionary patterns, we slightly modified it and further pre-trained it on groups of protein sequences with known interactions. Our results highlight the potential of protein language models for complex prediction and suggest room for improvement.
Jinhua Zhu · Zhenyu He · Ziyao Li · Guolin Ke · Linfeng Zhang
Fri 4:45 a.m. - 5:40 a.m. | Probing Graph Representations (Poster)
Today we have a good theoretical understanding of the representational power of Graph Neural Networks (GNNs). For example, their limitations have been characterized in relation to a hierarchy of Weisfeiler-Lehman (WL) isomorphism tests. However, we do not know what is encoded in the learned representations. This is our main question. We answer it using a probing framework to quantify the amount of meaningful information captured in graph representations. Our findings on molecular datasets show the potential of probing for understanding the inductive biases of graph-based models. We compare different families of models, and show that transformer-based models capture more chemically relevant information compared to models based on message passing. We also study the effect of different design choices such as skip connections and virtual nodes. We advocate for probing as a useful diagnostic tool for evaluating and developing graph-based models.
Mohammad Sadegh Akhondzadeh · Vijay Chandra Lingam · Aleksandar Bojchevski
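The probing framework described above amounts to freezing the learned representations and fitting a simple readout on top of them: if a cheap linear probe can predict a property from the embeddings, the representation encodes it. A minimal sketch under stated assumptions: the "frozen GNN embeddings" and the target property below are synthetic stand-ins, with the property linear in the embedding by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 500, 32
embeddings = rng.normal(size=(n, dim))      # stand-in for frozen GNN outputs
true_w = rng.normal(size=dim)
# a property that is (by construction) linearly encoded, plus noise
property_vals = embeddings @ true_w + 0.1 * rng.normal(size=n)

train, test = slice(0, 400), slice(400, 500)
# closed-form least-squares linear probe on the frozen features
w, *_ = np.linalg.lstsq(embeddings[train], property_vals[train], rcond=None)
pred = embeddings[test] @ w

ss_res = ((property_vals[test] - pred) ** 2).sum()
ss_tot = ((property_vals[test] - property_vals[test].mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
print(f"probe R^2 = {r2:.3f}")  # high R^2 => property is linearly decodable
```

Comparing probe scores across model families (e.g. transformers vs. message passing) and across properties is exactly the kind of diagnostic the paper advocates.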
Fri 4:45 a.m. - 5:40 a.m. | A Multi-Omics Visible Deep Network for Drug Activity Prediction (Poster)
Drug discovery is a challenging task, characterized by a significant amount of time between initial development and market release, with a high rate of attrition at each stage. Computational virtual screening, powered by machine learning algorithms, has emerged as a promising approach for predicting the therapeutic efficacy of drugs. However, the complex relationships between features learned by these algorithms can be challenging to decipher. We have devised a neural network model for the prediction of drug sensitivity, which employs a biologically informed visible neural network (VNN) leveraging multi-omics data and molecular descriptors. The trained model can be scrutinized to investigate the biological pathways that play a fundamental role in prediction, as well as the chemical properties of drugs that influence sensitivity. We have extended the model to predict drug synergy, obtaining favorable outcomes while retaining interpretability. Given the often unbalanced nature of publicly available drug screening datasets, our model demonstrates superior performance compared to state-of-the-art visible machine learning algorithms.
Luigi Ferraro · Giovanni Scala · Luigi Cerulo · Emanuele Carosati · Michele Ceccarelli
Fri 4:45 a.m. - 5:40 a.m. | Flexible Small-Molecule Design and Optimization with Equivariant Diffusion Models (Poster)
Recent advancements in generative models for Structure-based Drug Design (SBDD) have surpassed traditional methods, but their confined scope restricts the data available for training and their practical applications. To overcome these limitations, we introduce a flexible SBDD method based on an equivariant diffusion model, which was trained via a broadly applicable training objective and could therefore leverage the large and diverse sets of protein-ligand complexes available. Our approach excels in a wide range of SBDD subtasks, including scaffold hopping, fragment merging, and fragment growing, without requiring specialized training. Additionally, it not only generates hits but can also optimize desirable properties of existing hits, such as binding score and synthetic accessibility. Our optimization framework opens up new opportunities for negative design and increasing target specificity. It can be utilized in both a highly automated and manually controlled manner, offering drug discovery scientists fine-grained control. This versatile method has the potential to be valuable for a broad range of molecular design tasks, serving as a foundation for future advancements in the field.
Charles Harris · Kieran Didi · Arne Schneuing · Yuanqi Du · Arian Jamasb · Michael Bronstein · Bruno Correia · Pietro Lio · Tom Blundell
Fri 4:45 a.m. - 5:40 a.m. | Differentiable Multi-Target Causal Bayesian Experimental Design (Poster)
We introduce a gradient-based approach for the problem of Bayesian optimal experimental design to learn causal models in a batch setting, a critical component for causal discovery from finite data where interventions can be costly or risky. Existing methods rely on greedy approximations to construct a batch of experiments while using black-box methods to optimize over a single target-state pair to intervene with. In this work, we completely dispose of the black-box optimization techniques and greedy heuristics and instead propose a conceptually simple end-to-end gradient-based optimization procedure to acquire a set of optimal intervention target-state pairs. Such a procedure enables parameterization of the design space to efficiently optimize over a batch of multi-target-state interventions, a setting which has hitherto not been explored due to its complexity. We demonstrate that our proposed method outperforms baselines and existing acquisition strategies in both single-target and multi-target settings across a number of synthetic datasets.
Panagiotis Tigas · Yashas Annadani · Desi Ivanova · Andrew Jesson · Yarin Gal · Adam Foster · Stefan Bauer
Fri 4:45 a.m. - 5:40 a.m. | Towards antigenic peptide discovery with better MHC-I binding prediction and improved benchmark methodology (Poster)
The Major Histocompatibility Complex (MHC) is a crucial component of the cellular immune system in vertebrates, responsible for, among other functions, presenting peptides derived from intracellular proteins. MHC-I presentation is vital to the immune response and holds great promise for vaccine development and cancer immunotherapy. In this study, we analyze the limitations of existing methods and benchmarks for MHC-I presentation. We introduce a new benchmark to measure crucial generalization properties and models' reliability on unseen MHC molecules and peptides. Finally, we present ImmunoBert, a pre-trained language model that significantly surpasses prior methods on our benchmark and also sets a new state of the art on the old benchmarks.
Grzegorz Preibisch
Fri 4:45 a.m. - 5:40 a.m. | PF-ABGen: A Reliable and Efficient Antibody Generator via Poisson Flow (Poster)
An antibody is a special type of protein the immune system uses to recognize and neutralize pathogenic targets, including bacteria and viruses. Antibody design is therefore valuable for the development of new therapeutics, while experiment-based methods are generally inefficient and expensive. Despite fruitful progress in protein design with generative neural networks, including diffusion models, these models still suffer from high computational costs. In this paper, we propose the Poisson Flow based AntiBody Generator (PF-ABGen), a novel antibody structure and sequence designer. We adopt a protein structure representation based on torsion and bond angles, which allows us to represent conformations more elegantly, and take advantage of the efficient sampling procedure of the Poisson Flow Generative Model. Our computational experiments demonstrate that PF-ABGen can generate natural and realistic antibodies in an efficient and reliable way. Notably, PF-ABGen can also be applied to antibody design with variable lengths.
Chutian HUANG · Zijing Liu · Shengyuan Bai · Linwei Zhang · Chencheng Xu · ZHE WANG · Yang Xiang · Yuanpeng Xiong
Fri 4:45 a.m. - 5:40 a.m. | DiffDock-PP: Rigid Protein-Protein Docking with Diffusion Models (Poster)
Understanding how proteins structurally interact is crucial to modern biology, with applications in drug discovery and protein design. Recent machine learning methods have formulated protein-small molecule docking as a generative problem with significant performance boosts over both traditional and deep learning baselines. In this work, we propose a similar approach for rigid protein-protein docking: DiffDock-PP is a diffusion generative model that learns to translate and rotate unbound protein structures into their bound conformations. We achieve state-of-the-art performance on DIPS with a median C-RMSD of 4.85, outperforming all considered baselines. Additionally, DiffDock-PP is faster than all search-based methods and generates reliable confidence estimates for its predictions.
Mohamed Amine Ketata · Cedrik Laue · Ruslan Mammadov · Hannes Stärk · Rachel (Menghua) Wu · Gabriele Corso · Céline Marquet · Regina Barzilay · Tommi Jaakkola
Fri 4:45 a.m. - 5:40 a.m. | PocketNet: Ligand-Guided Pocket Prediction for Blind Docking (Poster)
We introduce PocketNet, a novel method for identifying ligand binding sites (LBS). Unlike current methods, PocketNet is tailored to identify the binding site specifically associated with the target ligand. With most protein targets having multiple binding sites, the selection process becomes ambiguous without specific ligand information. This limitation negatively impacts downstream applications such as docking and virtual screening. PocketNet addresses this challenge by combining the output of multiple LBS prediction tools and utilizing a deep neural network that incorporates ligand information to re-rank the sites. Our results demonstrate that PocketNet outperforms the latest methods on both pocket prediction and blind docking tasks.
Matthew Masters · Amr Mahmoud · Markus Lill
Fri 4:45 a.m. - 5:40 a.m. | Accurate Free Energy Estimations of Molecular Systems Via Flow-based Targeted Free Energy Perturbation (Poster)
The Targeted Free Energy Perturbation (TFEP) method aims to overcome the time-consuming and computer-intensive stratification process of standard methods for estimating the free energy difference between two states. To achieve this, TFEP uses a mapping function between the high-dimensional probability densities of these states. The bijectivity and invertibility of normalizing flow neural networks fulfill the requirements for serving as such a mapping function. Despite its theoretical potential for free energy calculations, TFEP has not yet been adopted in practice due to challenges in entropy correction, limitations in energy-based training, and mode collapse when learning density functions of larger systems with a high number of degrees of freedom. In this study, we expand flow-based TFEP to systems with a variable number of atoms in the two states of consideration by exploring the theoretical basis of the entropic contributions of dummy atoms, and validate our reasoning with analytical derivations for a model system containing coupled particles. We also extend the TFEP framework to handle systems of hybrid topology, propose auxiliary additions to improve the TFEP architecture, and demonstrate accurate predictions of relative free energy differences for large molecular systems. Our results provide the first practical application of the fast and accurate deep learning-based TFEP method for biomolecules and introduce it as a viable free energy estimation method within the context of drug design.
Soo Jung Lee · Amr Mahmoud · Markus Lill
Fri 4:45 a.m. - 5:40 a.m. | Domain-aware representation of small molecules for explainable property prediction models (Poster)
The advances in deep learning algorithms have impacted the drug discovery pipeline in many ways. Specifically, generative artificial intelligence (AI) algorithms can now explore the large chemical space and design novel, diverse molecules. While there has been significant progress in generative AI models, it is equally important to develop predictive models for various properties, which can help characterize novel drug-like molecules. Further, a predictive model acts as a critic for designing multi-property-optimized molecules, which can potentially reduce late-stage attrition of drug candidates. Understanding the reasons behind model predictions can guide the medicinal chemist to modify substructures that make molecules undesirable during the lead-optimization stage of drug discovery. However, current explainability approaches are mostly atom-based, where often only a fraction of a fragment is shown to be significant. To address these challenges, we have developed a novel domain-aware, fragment-based graph input representation built on a molecular fragmentation approach termed pBRICS, which fragments small molecules into their functional groups. Both single-task and multi-task models were developed to predict various properties, including ADMET properties. Fragment-level explanations were obtained using the Grad-CAM approach. The method was further validated on available Matched Molecular Pairs (MMPs) for blood-brain barrier permeability (BBBP) and Ames mutagenicity.
Sarveswara Rao Vangala · Sowmya Ramaswamy Krishnan · Navneet Bung · Rajgopal Srinivasan · Arijit Roy
Fri 4:45 a.m. - 5:40 a.m. | Exploring Chemical Space with Score-based Out-of-distribution Generation (Poster)
A well-known limitation of existing molecular generative models is that the generated molecules highly resemble those in the training set. To generate truly novel molecules that may have even better properties for de novo drug discovery, more powerful exploration of the chemical space is necessary. To this end, we propose Molecular Out-Of-distribution Diffusion (MOOD), a novel score-based diffusion scheme that incorporates out-of-distribution (OOD) control in the generative stochastic differential equation (SDE) through a single hyperparameter, and thus requires no additional computational cost. Since some novel molecules may be chemically implausible or may not meet the basic requirements of real-world drugs, MOOD performs conditional generation by utilizing the gradients of a property predictor that guides the reverse-time diffusion process to high-scoring regions according to target properties such as protein-ligand interactions, drug-likeness, and synthesizability. This allows MOOD to search for novel and meaningful molecules rather than generating unseen yet trivial ones. We experimentally validate that MOOD is able to explore the chemical space beyond the training distribution, generating molecules that outscore those found with existing methods, and even the top 0.01% of the original training pool.
Seul Lee · Jaehyeong Jo · Sung Ju Hwang
Fri 4:45 a.m. - 5:40 a.m. | LEA: Latent Eigenvalue Analysis in application to high-throughput phenotypic drug screening (Poster)
Understanding the phenotypic characteristics of cells in culture and detecting perturbations introduced by drug stimulation is of great importance for biomedical research. However, a thorough and comprehensive analysis of phenotypic heterogeneity is challenged by the complex nature of cell-level data. Here, we propose a novel Latent Eigenvalue Analysis (LEA) framework and apply it to high-throughput phenotypic profiling with single-cell and single-organelle granularity. Using the publicly available SARS-CoV-2 datasets stained with the multiplexed fluorescent cell-painting protocol, we demonstrate the power of the LEA approach in the investigation of phenotypic changes induced by more than 1800 drug compounds. As a result, LEA achieves a robust quantification of phenotypic changes introduced by drug treatment. Moreover, this quantification can be biologically supported by simulating clearly observable phenotypic transitions in a broad spectrum of use cases. In conclusion, LEA represents a new and broadly applicable approach for quantitative and interpretable analysis in routine drug screening practice.
Jiqing Wu · Viktor Koelzer
Fri 4:45 a.m. - 5:40 a.m. | Graph Generation with Destination-Driven Diffusion Mixture (Poster)
Generation of graphs is a major challenge for real-world tasks that require understanding the complex nature of their non-Euclidean structures. Although diffusion models have achieved notable success in graph generation recently, they are ill-suited for modeling the structural information of graphs since learning to denoise the noisy samples does not explicitly capture the graph topology. To tackle this limitation, we propose a novel generative process that models the topology of graphs by predicting the destination of the process. Specifically, we design the generative process as a mixture of diffusion processes conditioned on the endpoint in the data distribution, which drives the process toward the probable destination. Further, we introduce new training objectives for learning to predict the destination, and discuss the advantages of our generative framework that can explicitly model the graph topology and exploit the inductive bias of the data. Through extensive experimental validation on general graph and 2D/3D molecular graph generation tasks, we show that our method outperforms previous generative models, generating graphs with correct topology with both continuous and discrete features.
Jaehyeong Jo · Dongki Kim · Sung Ju Hwang
Fri 4:45 a.m. - 5:40 a.m. | SmilesFormer: Language Model for Molecular Design (Poster)
The objective of drug discovery is to find novel compounds with desirable chemical properties. Generative models have been utilized to sample molecules at the intersection of multiple property constraints. In this paper, we pose molecular design as a language modeling problem in which the model implicitly learns the vocabulary and composition of valid molecules, and hence is able to generate new molecules of interest. We present SmilesFormer, a Transformer-based model which is able to encode molecules, molecule fragments, and fragment compositions as latent variables, which are in turn decoded to stochastically generate novel molecules. This is achieved by fragmenting the molecules into smaller combinatorial groups, then learning the mapping between the input fragments and valid SMILES sequences. The model is able to optimize molecular properties through a stochastic latent space traversal technique, which systematically searches the encoded latent space for latent vectors that produce molecules meeting the multi-property objective. The model was validated on various de novo molecular design tasks, achieving state-of-the-art performance compared to previous methods. Furthermore, we used the proposed method to demonstrate a drug-rediscovery pipeline for Donepezil, a known acetylcholinesterase inhibitor.
Joshua Owoyemi · Nazim Medzhidov
Fri 4:45 a.m. - 5:40 a.m. | RetroG: Retrosynthetic Planning with Tree Search and Graph Learning (Poster)
Retrosynthesis Planning (RP) is one of the challenging problems in organic chemistry. It involves designing routes to target molecules from compounds which are commercially available or easy to synthesize, by following a series of backward steps. While earlier RP methods were largely expert-based, strong computer-aided RP methods have emerged in the recent past. The success of computer-aided RP is critical to the development of new drugs and the synthesis of target compounds in material science and agrochemicals. In this paper, we present an RP model called RetroG. Its design is based on tree search with a Graph Neural Network (GNN) as a value function. The model adapts successful reaction templates and product molecules to the route length. The evaluation of RetroG on the benchmark test datasets records new results while also pointing to interesting future research areas.
Stephen Obonyo · Nicolas Jouandeau · Dickson Owuor
Fri 4:45 a.m. - 5:40 a.m. | An Exploration of Conditioning Methods in Graph Neural Networks (Poster)
The flexibility and effectiveness of message-passing graph neural networks (GNNs) have induced considerable advances in deep learning on graph-structured data. In such approaches, GNNs recursively update node representations based on their neighbors, and they gain expressivity through the use of node and edge attribute vectors. For example, in computational tasks in physics and chemistry, the use of edge attributes such as relative position or distance has proved essential. In this work, we address not what kind of attributes to use, but how to condition on this information to improve model performance. We consider three types of conditioning: weak, strong, and pure, which respectively relate to concatenation-based conditioning, gating, and transformations that are causally dependent on the attributes. This categorization provides a unifying viewpoint on different classes of GNNs, from separable convolutions to various forms of message passing networks. We provide an empirical study of the effect of conditioning methods on several tasks in computational chemistry.
Yeskendir Koishekenov · Erik Bekkers
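The three conditioning styles named above (weak, strong, pure) can be illustrated on a single message computation. This is a toy sketch, not the paper's architecture: the weight matrices are random stand-ins for learned parameters, and "pure" conditioning is reduced to attribute-generated weights in the simplest possible form.

```python
import numpy as np

rng = np.random.default_rng(0)
d, de = 4, 2
h_src = rng.normal(size=d)            # sending node's features
e = rng.normal(size=de)               # edge attributes (e.g. relative distance)
W_cat = rng.normal(size=(d, d + de))  # stand-ins for learned weights
W = rng.normal(size=(d, d))
W_gate = rng.normal(size=(d, de))

# weak conditioning: concatenate edge attributes with node features
msg_weak = W_cat @ np.concatenate([h_src, e])

# strong conditioning: gate the transformed features with an
# attribute-dependent sigmoid gate
gate = 1 / (1 + np.exp(-(W_gate @ e)))
msg_strong = gate * (W @ h_src)

# pure conditioning: the transformation itself is generated from the
# attributes (here a per-row rescaling of W, the crudest such scheme)
W_pure = (W_gate @ e)[:, None] * W
msg_pure = W_pure @ h_src

print(msg_weak.shape, msg_strong.shape, msg_pure.shape)
```

The point of the categorization is that all three produce a message of the same shape, but differ in how strongly the edge attributes can reshape the computation, from additive input (weak) to multiplicative gating (strong) to generating the transformation itself (pure).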
Fri 4:45 a.m. - 5:40 a.m. | GCI: A (G)raph (C)oncept (I)nterpretation Framework (Poster)
Explainable AI (XAI) underwent a recent surge in research on concept extraction, focusing on extracting human-interpretable concepts from Deep Neural Networks. An important challenge facing concept extraction approaches is the difficulty of interpreting and evaluating discovered concepts, especially for complex tasks such as molecular property prediction. We address this challenge by presenting GCI: a (G)raph (C)oncept (I)nterpretation framework, used for quantitatively measuring alignment between concepts discovered from Graph Neural Networks (GNNs) and their corresponding human interpretations. GCI encodes concept interpretations as functions, which can be used to quantitatively measure the alignment between a given interpretation and concept definition. We demonstrate four applications of GCI: (i) quantitatively evaluating concept extractors, (ii) measuring alignment between concept extractors and human interpretations, (iii) measuring the completeness of interpretations with respect to an end task and (iv) a practical application of GCI to molecular property prediction, in which we demonstrate how to use chemical \textit{functional groups} to explain GNNs trained on molecular property prediction tasks, and implement interpretations with a $0.76$ AUCROC completeness score.
Dmitry Kazhdan · Botty Dimanov · Lucie Charlotte Magister · Pietro Barbiero · Mateja Jamnik · Pietro Lio
Fri 4:45 a.m. - 5:40 a.m.
|
Structure-Based Drug Design via Semi-Equivariant Conditional Normalizing Flows
(
Poster
)
link »
We propose an algorithm for learning a conditional generative model of a molecule given a target. Specifically, given a receptor molecule that one wishes to bind to, the conditional model generates candidate ligand molecules that may bind to it. Our problem is formulated mathematically as learning conditional distributions between two 3D graphs. The distribution should be invariant to rigid body transformations that act jointly on the ligand and the receptor; it should also be invariant to permutations of either the ligand or receptor atoms. Our learning algorithm is based on a continuous normalizing flow. We establish semi-equivariance conditions on the flow which guarantee the aforementioned invariance conditions on the conditional distribution. We propose a graph neural network architecture which implements this flow, and which is designed to learn effectively despite the vast differences in size between the ligand and receptor. We evaluate our method on the CrossDocked2020 dataset, displaying high quality performance in the key $\Delta$Binding metric. We also demonstrate how the learned density may be usefully employed to define a scoring function.
|
Eyal Rozenberg · Ehud Rivlin · Daniel Freedman 🔗 |
Fri 4:45 a.m. - 5:40 a.m.
|
Multi-scale Sinusoidal Embeddings Enable Learning on High Resolution Mass Spectrometry Data
(
Poster
)
link »
Small molecules in biological samples are studied to provide information about disease states, environmental toxins, natural product drug discovery, and many other applications. The primary window into the composition of small molecule mixtures is tandem mass spectrometry (MS2), which produces data with high sensitivity and part-per-million resolution. We adopt multi-scale sinusoidal embeddings of the mass data in MS2, designed to meet the challenge of learning from the full resolution of MS2 data. Using these embeddings, we provide a new state-of-the-art model for spectral library search, the standard task for initial evaluation of MS2 data. We also investigate the task of chemical property prediction from MS2 data, which has natural applications in high-throughput MS2 experiments, and show that an average $R^2$ of 80% can be achieved for novel compounds across 10 chemical properties prioritized by medicinal chemists. We vary the resolution of the input spectra directly by using different floating point representations of the MS2 data, and show that the resulting sinusoidal embeddings are able to learn from the high-resolution portion of the input MS2 data. We apply dimensionality reduction to the embeddings that result from different resolution input masses to show the essential role multi-scale sinusoidal embeddings play in learning from MS2 data.
|
Gennady Voronov · Rose Lightheart · Joe Davison · Christoph Krettler · David Healey · Thomas Butler 🔗 |
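The multi-scale sinusoidal embedding idea above follows the familiar transformer positional-encoding recipe, applied to continuous mass values. A minimal sketch (the embedding dimension and wavelength range are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def sinusoidal_embedding(mass, dim=8, min_wave=1e-4, max_wave=1e4):
    # Geometrically spaced wavelengths span many scales, from tiny
    # ppm-level mass differences up to the full m/z range.
    waves = np.geomspace(min_wave, max_wave, dim // 2)
    angles = 2 * np.pi * mass / waves
    return np.concatenate([np.sin(angles), np.cos(angles)])

emb = sinusoidal_embedding(500.2771)  # a hypothetical fragment m/z value
print(emb.shape)                      # (8,)
```

Because each wavelength resolves a different scale, nearby masses get similar embeddings at coarse wavelengths while still differing at fine ones, which is what lets a model exploit the full instrument resolution.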
Fri 4:45 a.m. - 5:40 a.m.
|
The Power of Motifs as Inductive Bias for Learning Molecular Distributions
(
Poster
)
link »
Machine learning for molecules holds great potential for efficiently exploring the vast chemical space and thus streamlining the drug discovery process by facilitating the design of new therapeutic molecules. Deep generative models have shown promising results for molecule generation, but the benefits of specific inductive biases for learning distributions over small graphs are unclear. Our study aims to investigate the impact of subgraph structures and vocabulary design on distribution learning, using small drug molecules as a case study. To this end, we introduce Subcover, a new subgraph-based fragmentation scheme, and evaluate it through a two-step variational auto-encoder. Our results show that Subcover’s improved identification of chemically meaningful subgraphs leads to a relative improvement of the FCD score by 30%, outperforming previous methods. Our findings highlight the potential of Subcover to enhance the performance and scalability of existing methods, contributing to the advancement of drug discovery. |
Johanna Sommer · Leon Hetzel · David Lüdke · Fabian Theis · Stephan Günnemann 🔗 |
Fri 4:45 a.m. - 5:40 a.m.
|
Predicting protein stability changes under multiple amino acid substitutions using equivariant graph neural networks
(
Poster
)
link »
The accurate prediction of changes in protein stability under multiple amino acid substitutions is essential for realising true in-silico protein re-design. To this end, we propose improvements to state-of-the-art deep learning (DL) protein stability prediction models, enabling first-of-their-kind predictions for variable numbers of amino acid substitutions. We decouple the atomic and residue scales of protein representations, using E(3)-equivariant graph neural networks (EGNNs) for both atomic environment (AE) embedding and residue-level scoring. Our AE embedder was used to featurise a residue-level graph, which was then trained to score mutant stability (∆∆G). To achieve effective training of this predictive EGNN, we leveraged the unprecedented scale of a new high-throughput protein stability experimental dataset, Mega-scale. Finally, we demonstrate the immediately promising results of this procedure, discuss the current shortcomings, and highlight potential future strategies. |
Sebastien Boyer · Sam Money-Kyrle · Oliver Bent 🔗 |
Fri 4:45 a.m. - 5:40 a.m.
|
DOG: Discriminator-only Generation
(
Poster
)
link »
As an alternative to generative modeling approaches such as denoising diffusion, energy-based models (EBMs), and generative adversarial networks (GANs), we explore discriminator-only generation (DOG). DOG obtains samples by direct gradient descent on the input of a discriminator. DOG is conceptually simple, generally applicable to many domains, and even trains faster than GANs on the QM9 molecule dataset. While DOG does not (yet?) reach state-of-the-art quality on image generation tasks, it outperforms recent GAN approaches on several graph generation benchmarks, using only their discriminators. |
Franz Rieger · Joergen Kornfeld 🔗 |
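The DOG sampling procedure described above (gradient ascent on the input of a frozen discriminator) can be illustrated with a toy differentiable "discriminator": a quadratic score with a known optimum and an analytic gradient. The real method uses a trained network and obtains gradients via autodiff; this is only a sketch.

```python
import numpy as np

# Hypothetical frozen "discriminator": higher score = more realistic.
# Here a toy quadratic whose maximum sits at the point (1, 2).
target = np.array([1.0, 2.0])

def discriminator(x):
    return -np.sum((x - target) ** 2)

def discriminator_grad(x):
    # Analytic gradient of the toy score (autodiff in practice).
    return -2.0 * (x - target)

# DOG-style sampling: gradient ascent on the discriminator's *input*.
x = np.zeros(2)  # start from an arbitrary initial sample
for _ in range(200):
    x += 0.05 * discriminator_grad(x)

print(np.round(x, 3))  # converges to the high-score region: [1. 2.]
```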
Fri 4:45 a.m. - 5:40 a.m.
|
Epigenomic Language Models Powered By Cerebras
(
Poster
)
link »
Large scale self-supervised pre-training of Transformer language models has advanced the field of Natural Language Processing and shown promise in cross-application to the biological ‘languages’ of proteins and DNA. Learning effective representations of DNA sequences using large genomic sequence corpuses may accelerate the development of models of gene regulation and function through transfer learning. However, to accurately model cell type-specific gene regulation and function, it is necessary to consider not only the information contained in DNA nucleotide sequences, which is mostly invariant between cell types, but also how the local chemical and structural ‘epigenetic state’ of chromosomes varies between cell types. Here, we introduce a Bidirectional Encoder Representations from Transformers (BERT) model that learns representations based on both DNA sequence and paired epigenetic state inputs, which we call Epigenomic BERT (or EBERT). We pre-train EBERT with a masked language model objective across the entire human genome and across 127 cell types. Training this complex model with a previously prohibitively large dataset was made possible for the first time by a partnership with Cerebras Systems, whose CS-1 system powered all pre-training experiments. We show EBERT’s transfer learning potential by demonstrating strong performance on a cell type-specific transcription factor binding prediction task. Our fine-tuned model exceeds state of the art performance on 4 of 13 evaluation datasets from ENCODE-DREAM benchmarks and earns an overall rank of 3rd on the challenge leaderboard. We explore how the inclusion of epigenetic data and task-specific feature augmentation impact transfer learning performance. |
Meredith Trotter · Cuong Nguyen · Stephen Young · Rob Woodruff · Kim Branson 🔗 |
Fri 5:40 a.m. - 6:30 a.m.
|
Lunch Break
|
🔗 |
Fri 6:30 a.m. - 7:15 a.m.
|
Invited Talk - Rafael Gomez-Bombarelli
(
Talk
)
SlidesLive Video » |
Rafael Gomez-Bombarelli 🔗 |
Fri 7:15 a.m. - 8:00 a.m.
|
Invited Talk - Liz Wood
(
Talk
)
SlidesLive Video » |
🔗 |
Fri 8:00 a.m. - 8:45 a.m.
|
Invited Talk - Caroline Uhler
(
Talk
)
SlidesLive Video » |
Caroline Uhler 🔗 |
Fri 8:45 a.m. - 9:00 a.m.
|
Afternoon Break
|
🔗 |
Fri 9:00 a.m. - 9:05 a.m.
|
Holographic-(V)AE: an end-to-end SO(3)-Equivariant (Variational) Autoencoder in Fourier Space
(
Oral
)
link »
SlidesLive Video » Group-equivariant neural networks have emerged as a data-efficient approach to solve classification and regression tasks, while respecting the relevant symmetries of the data. However, little work has been done to extend this paradigm to the unsupervised and generative domains. Here, we present Holographic-(V)AE (H-(V)AE), a fully end-to-end SO(3)-equivariant (variational) autoencoder in Fourier space, suitable for unsupervised learning and generation of data distributed around a specified origin. H-(V)AE is trained to reconstruct the spherical Fourier encoding of data, learning in the process a latent space with a maximally informative invariant embedding alongside an equivariant frame describing the orientation of the data. We show the potential utility of H-(V)AE on structural biology tasks. Specifically, we train H-(V)AE on protein structure microenvironments, and show that its latent space can be used to extract compact embeddings of local structural features which, paired with a Random Forest Regressor, enable state-of-the-art predictions of protein-ligand binding affinity. |
Gian Marco Visani · Michael Pun · Arman Angaji · Armita Nourmohammad 🔗 |
Fri 9:05 a.m. - 9:10 a.m.
|
Graph Generation with Destination-Driven Diffusion Mixture
(
Oral
)
link »
SlidesLive Video » Generation of graphs is a major challenge for real-world tasks that require understanding the complex nature of their non-Euclidean structures. Although diffusion models have achieved notable success in graph generation recently, they are ill-suited for modeling the structural information of graphs since learning to denoise the noisy samples does not explicitly capture the graph topology. To tackle this limitation, we propose a novel generative process that models the topology of graphs by predicting the destination of the process. Specifically, we design the generative process as a mixture of diffusion processes conditioned on the endpoint in the data distribution, which drives the process toward the probable destination. Further, we introduce new training objectives for learning to predict the destination, and discuss the advantages of our generative framework that can explicitly model the graph topology and exploit the inductive bias of the data. Through extensive experimental validation on general graph and 2D/3D molecular graph generation tasks, we show that our method outperforms previous generative models, generating graphs with correct topology with both continuous and discrete features. |
Jaehyeong Jo · Dongki Kim · Sung Ju Hwang 🔗 |
Fri 9:10 a.m. - 9:15 a.m.
|
Multi-scale Sinusoidal Embeddings Enable Learning on High Resolution Mass Spectrometry Data
(
Oral
)
link »
SlidesLive Video »
Small molecules in biological samples are studied to provide information about disease states, environmental toxins, natural product drug discovery, and many other applications. The primary window into the composition of small molecule mixtures is tandem mass spectrometry (MS2), which produces data with high sensitivity and part-per-million resolution. We adopt multi-scale sinusoidal embeddings of the mass data in MS2, designed to meet the challenge of learning from the full resolution of MS2 data. Using these embeddings, we provide a new state-of-the-art model for spectral library search, the standard task for initial evaluation of MS2 data. We also investigate the task of chemical property prediction from MS2 data, which has natural applications in high-throughput MS2 experiments, and show that an average $R^2$ of 80% can be achieved for novel compounds across 10 chemical properties prioritized by medicinal chemists. We vary the resolution of the input spectra directly by using different floating point representations of the MS2 data, and show that the resulting sinusoidal embeddings are able to learn from the high-resolution portion of the input MS2 data. We apply dimensionality reduction to the embeddings that result from different resolution input masses to show the essential role multi-scale sinusoidal embeddings play in learning from MS2 data.
|
Gennady Voronov · Rose Lightheart · Joe Davison · Christoph Krettler · David Healey · Thomas Butler 🔗 |
Fri 9:15 a.m. - 9:20 a.m.
|
DOG: Discriminator-only Generation
(
Oral
)
link »
SlidesLive Video » As an alternative to generative modeling approaches such as denoising diffusion, energy-based models (EBMs), and generative adversarial networks (GANs), we explore discriminator-only generation (DOG). DOG obtains samples by direct gradient descent on the input of a discriminator. DOG is conceptually simple, generally applicable to many domains, and even trains faster than GANs on the QM9 molecule dataset. While DOG does not (yet?) reach state-of-the-art quality on image generation tasks, it outperforms recent GAN approaches on several graph generation benchmarks, using only their discriminators. |
Franz Rieger · Joergen Kornfeld 🔗 |
Fri 9:20 a.m. - 10:00 a.m.
|
Challenge recap & presentations from winning teams
(
Casual Bench Challenge
)
|
🔗 |
Fri 10:00 a.m. - 10:55 a.m.
|
MoDTI: Modular Framework For Evaluating Inductive Biases in DTI Modeling
(
Poster
)
link »
Drug-Target Interaction (DTI) prediction remains a critical problem in drug discovery. Machine learning (ML) has shown great promise in feature-based DTI prediction. However, the vast number of ML architectures and biomolecular representations available makes selecting an appropriate model architecture a challenge. In this work, we propose a modular framework, MoDTI, that facilitates the exploration of three key inductive biases in DTI prediction: protein representation, multi-view learning, and modularity. We evaluate the impact of each of these inductive biases on DTI prediction performance and compare the performance of MoDTI against existing state-of-the-art models on multiple benchmarks. Our findings provide valuable insights into the importance of each component of the MoDTI model for improving DTI prediction, and we present general guidelines for the rapid development of more accurate DTI models. Through extensive empirical evaluation, we demonstrate the effectiveness of our proposed approach and its potential for further understanding key inductive biases for DTI prediction. |
Roy Pavel Samuel Henha Eyono · Prudencio Tossou · Cas Wognum · Emmanuel Noutahi 🔗 |
Fri 10:00 a.m. - 10:55 a.m.
|
Evaluating Prompt Tuning for Conditional Protein Sequence Generation
(
Poster
)
link »
Text generation models originally developed for natural language processing have proven to be successful in generating protein sequences. These models are often finetuned for improved performance on more specific tasks, such as generation of proteins from families unseen in training. Considering the high computational cost of finetuning separate models for each downstream task, prompt tuning has been proposed as an alternative. However, no openly available implementation of this approach compatible with protein language models has been previously published. Thus, we adapt an open-source codebase designed for NLP models to build a pipeline for prompt tuning on protein sequence data, supporting the protein language models ProtGPT2 and RITA. We evaluate our implementation by learning prompts for conditional sampling of sequences belonging to a specific protein family. This results in improved performance compared to the base model. However, in the presented use case, we observe discrepancies between text-based evaluation and predicted biological properties of the generated sequences, identifying open problems for principled assessment of protein sequence generation quality. |
Andrea Nathansen · Kevin Klein · Bernhard Renard · Melania Nowicka · Jakub Bartoszewicz 🔗 |
Fri 10:00 a.m. - 10:55 a.m.
|
HiGeN: Hierarchical Multi-Resolution Graph Generative Network
(
Poster
)
link »
In real-world domains, most graphs naturally exhibit a hierarchical structure. However, data-driven graph generation has yet to effectively capture such structures. To address this, we propose a novel approach that recursively generates community structures at multiple resolutions, with the generated structures conforming to the training data distribution at each level of the hierarchy. Graph generation is designed as a sequence of coarse-to-fine generative models, allowing for parallel generation of all sub-structures and resulting in a high degree of scalability. Furthermore, we model the output distribution of edges with a more expressive multinomial distribution and derive a recursive factorization for this distribution, making it a suitable choice for graph generative models. This allows for the generation of graphs with integer-valued edge weights. Our method achieves state-of-the-art performance in both accuracy and efficiency on multiple graph datasets. |
Mahdi Karami · Jun Luo 🔗 |
Fri 10:00 a.m. - 10:55 a.m.
|
Generating Multi-Step Chemical Reaction Pathways with Black-Box Optimization
(
Poster
)
link »
The practical usability of de novo small molecule generation depends heavily on the synthesizability of generated molecules. We propose BBO-SYN, a generative framework based on black-box optimization (BBO), which predicts diverse molecules with desired properties together with corresponding synthesis pathways. Given an input molecule A, BBO-SYN employs a state-of-the-art BBO method operating on a latent space of molecules to find a reaction partner B, which maximizes the property score of the reaction product C, as determined by a pre-trained template-free reaction predictor. This single-step reaction (A+B→C) forms the basis for an optimization loop, resulting in a synthesis tree yielding products with high property scores. Empirically, the sampling and search strategy of BBO-SYN outperforms comparable baselines on four synthesis-aware optimization tasks (QED, DRD2, GSK3$\beta$, and JNK3), increasing product diversity by 37% and mean property score by 25% on our hardest JNK3 task.
|
Danny Reidenbach · Connor Coley · Kevin Yang 🔗 |
Fri 10:00 a.m. - 10:55 a.m.
|
Molecular Fragment-based Diffusion Model for Drug Discovery
(
Poster
)
link »
Due to the recent successes of generative models, much attention has been paid to de novo generation of drug-like molecules using machine learning. A particular class of generative models, diffusion probabilistic models, has recently been shown to work extraordinarily well across a diverse set of generative tasks, and a growing body of literature has applied diffusion probabilistic models directly to the molecule discovery problem. However, existing methods work with atom-based molecule representations, whereas work in the fragment-based drug design community indicates that a molecular fragment-based approach can provide a much better inductive bias for the generative model. To this end, in our work we use diffusion probabilistic models to de novo generate drug-like molecules with a fragment-based representation, yielding more valid and drug-like molecules than existing approaches. |
Daniel Levy · Jarrid Rector-Brooks 🔗 |
Fri 10:00 a.m. - 10:55 a.m.
|
LEP-AD: Language Embeddings of Proteins and Attention to Drugs predicts drug target interactions
(
Poster
)
link »
Predicting drug-target interactions is an outstanding challenge relevant to drug development and lead optimization. Recent advances include training algorithms to learn drug-target interactions from data and molecular simulations. Here we utilize Evolutionary Scale Modeling (ESM-2) models to establish a Transformer protein language model for drug-target interaction prediction. Our architecture, LEP-AD, combines pre-trained ESM-2 and Transformer-GCN models to predict binding affinity values. We report new state-of-the-art results compared to competing methods such as SimBoost, DeepCPI, Attention-DTA, and GraphDTA, among others, using multiple datasets, including Davis, KIBA, DTC, Metz, ToxCast, and STITCH. Finally, we find that a pre-trained model with embeddings of proteins, as in our LEP-AD, outperforms a model using an explicit AlphaFold 3D representation of proteins. The LEP-AD model scales favourably in performance with the size of the training data. Code is available at https://github.com/adaga06/LEP-AD |
Anuj Daga · Sumeer Khan · David Cabrero · Robert Hoehndorf · Narsis Kiani · Jesper Tegnér 🔗 |
Fri 10:00 a.m. - 10:55 a.m.
|
EigenFold: Generative Protein Structure Prediction with Diffusion Models
(
Poster
)
link »
Protein structure prediction has reached revolutionary levels of accuracy on single structures, yet distributional modeling paradigms are needed to capture the conformational ensembles and flexibility that underlie biological function. Towards this goal, we develop EigenFold, a diffusion generative modeling framework for sampling a distribution of structures from a given protein sequence. We define a novel diffusion process that models the structure as a system of harmonic oscillators and which naturally induces a cascading-resolution generative process along the eigenmodes of the system. On recent CAMEO targets, EigenFold achieves a median TMScore of 0.85, while providing a more comprehensive picture of model uncertainty via the ensemble of sampled structures relative to existing methods. We then assess EigenFold's ability to model and predict conformational heterogeneity for fold-switching proteins and ligand-induced conformational change. |
Bowen Jing · Ezra Erives · Peter Pao-Huang · Gabriele Corso · Bonnie Berger · Tommi Jaakkola 🔗 |
Fri 10:00 a.m. - 10:55 a.m.
|
Holographic-(V)AE: an end-to-end SO(3)-Equivariant (Variational) Autoencoder in Fourier Space
(
Poster
)
link »
Group-equivariant neural networks have emerged as a data-efficient approach to solve classification and regression tasks, while respecting the relevant symmetries of the data. However, little work has been done to extend this paradigm to the unsupervised and generative domains. Here, we present Holographic-(V)AE (H-(V)AE), a fully end-to-end SO(3)-equivariant (variational) autoencoder in Fourier space, suitable for unsupervised learning and generation of data distributed around a specified origin. H-(V)AE is trained to reconstruct the spherical Fourier encoding of data, learning in the process a latent space with a maximally informative invariant embedding alongside an equivariant frame describing the orientation of the data. We show the potential utility of H-(V)AE on structural biology tasks. Specifically, we train H-(V)AE on protein structure microenvironments, and show that its latent space can be used to extract compact embeddings of local structural features which, paired with a Random Forest Regressor, enable state-of-the-art predictions of protein-ligand binding affinity. |
Gian Marco Visani · Michael Pun · Arman Angaji · Armita Nourmohammad 🔗 |
Fri 10:00 a.m. - 10:55 a.m.
|
Accelerating Antimicrobial Peptide Discovery with Latent Sequence-Structure Model
(
Poster
)
link »
Antimicrobial peptides (AMPs) are a promising therapy for drug-resistant infections and an alternative to broad-spectrum antibiotics. Recently, an increasing number of researchers have introduced deep generative models to accelerate AMP discovery. However, current studies mainly focus on sequence attributes and ignore structure information, which is important to AMP biological function. In this paper, we propose a latent sequence-structure model for AMPs (LSSAMP) with a multi-scale VQ-VAE to incorporate secondary structures. By sampling in the latent space, LSSAMP can simultaneously generate peptides with ideal sequence attributes and secondary structures. Experimental results show that the peptides generated by LSSAMP have a high probability of being AMPs, and two of the 21 candidates have been verified to have good antimicrobial activity. Our model will be released to help create high-quality AMP candidates for follow-up biological experiments and to accelerate the AMP discovery process. |
Danqing Wang · Zeyu Wen · Fei YE · Lei Li · Hao Zhou 🔗 |
Fri 10:00 a.m. - 10:55 a.m.
|
FlexVDW: A machine learning approach to account for protein flexibility in ligand docking
(
Poster
)
link »
Most widely used ligand docking methods assume a rigid protein structure. This leads to problems when the structure of the target protein deforms upon ligand binding. In particular, the ligand’s true binding pose is often scored very unfavorably due to apparent clashes between ligand and protein atoms, which lead to extremely high values of the calculated van der Waals energy term. Traditionally, this problem has been addressed by explicitly searching for receptor conformations to account for the flexibility of the receptor in ligand binding. Here we present a deep learning model trained to take receptor flexibility into account implicitly when predicting van der Waals energy. We show that incorporating this machine-learned energy term into a state-of-the-art physics-based scoring function improves small molecule ligand pose prediction results in cases with substantial protein deformation, without degrading performance in cases with minimal protein deformation. This work demonstrates the feasibility of learning effects of protein flexibility on ligand binding without explicitly modeling changes in protein structure. |
Patricia Suriana · Joseph Paggi · Ron Dror 🔗 |
Fri 10:00 a.m. - 10:55 a.m.
|
Improving Graph Generation by Restricting Graph Bandwidth
(
Poster
)
link »
Deep graph generative modeling has proven capable of learning the distribution of the complex, multi-scale structures characterizing real-world graphs. However, one of the main limitations of existing methods is their large output space, which limits generation scalability and hinders accurate modeling of the underlying distribution. To overcome these limitations, we propose a novel approach that significantly reduces the output space of existing graph generative models. Specifically, starting from the observation that many real-world graphs have low graph bandwidth, we restrict graph bandwidth during training and generation. Our strategy improves both generation scalability and quality without increasing architectural complexity or reducing expressiveness. Our approach is compatible with existing graph generative methods, and we describe its application to both autoregressive and one-shot models. We extensively validate our strategy on synthetic and real datasets, including molecular graphs. Our experiments show that, in addition to improving generation efficiency, our approach consistently improves generation quality and reconstruction accuracy. The implementation will be publicly released upon unblinding. |
Nathaniel Diamant · Alex M Tseng · Kangway Chuang · Tommaso Biancalani · Gabriele Scalia 🔗 |
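Graph bandwidth, the quantity restricted in the abstract above, is easy to compute for a fixed node ordering: it is the largest index distance |i - j| over all edges, i.e. how far edges stray from the adjacency matrix's diagonal. A minimal sketch (illustrative only, not the paper's implementation):

```python
import numpy as np

def bandwidth(adj):
    # Graph bandwidth under a given node ordering: the largest |i - j|
    # over all edges (i, j). Low bandwidth means all edges live inside
    # a narrow band around the adjacency matrix's diagonal, shrinking
    # the effective output space a generative model must cover.
    rows, cols = np.nonzero(np.triu(adj))
    return int(np.max(cols - rows)) if rows.size else 0

# A path graph on 5 nodes: every edge connects consecutive nodes.
path = np.eye(5, k=1) + np.eye(5, k=-1)
print(bandwidth(path))  # 1
```

Note that bandwidth depends on the node ordering; heuristics such as Cuthill-McKee reordering are commonly used to find low-bandwidth orderings.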
Fri 10:00 a.m. - 10:55 a.m.
|
Task-Agnostic Graph Neural Network Evaluation via Adversarial Collaboration
(
Poster
)
link »
It has become increasingly important to develop reliable methods to evaluate the progress of Graph Neural Network (GNN) research for molecular representation learning. Existing GNN benchmarking methods for molecular representation learning focus on comparing GNNs' performance on node/graph classification or regression tasks on certain datasets. However, there is no principled, task-agnostic method to directly compare two GNNs. Additionally, most existing self-supervised learning works incorporate handcrafted augmentations of the data, which are difficult to apply to graphs due to their unique characteristics. To address these issues, we propose GraphAC (Graph Adversarial Collaboration) – a conceptually novel, principled, task-agnostic, and stable framework for evaluating GNNs through contrastive self-supervision. We introduce a novel objective function, the Competitive Barlow Twins, that allows two GNNs to jointly update themselves through direct competition against each other. GraphAC succeeds in distinguishing GNNs of different expressiveness across various aspects, and has been demonstrated to be a principled and reliable GNN evaluation method, without necessitating any augmentations. |
Xiangyu Zhao · Hannes Stärk · Dominique Beaini · Yiren Zhao · Pietro Lio 🔗 |
Fri 10:00 a.m. - 10:55 a.m.
|
SupSiam: Non-contrastive Auxiliary Loss for Learning from Molecular Conformers
(
Poster
)
link »
We investigate Siamese networks for learning related embeddings for augmented samples of molecular conformers. We find that a non-contrastive (positive-pair only) auxiliary task aids in supervised training of Euclidean neural networks (E3NNs) and increases manifold smoothness (MS) around point-cloud geometries. We demonstrate this property for multiple drug-activity prediction tasks while maintaining relevant performance metrics, and propose an extension of MS to probabilistic and regression settings. We provide an analysis of representation collapse, finding substantial effects of task-weighting, latent dimension, and regularization. We expect the presented protocol to aid in the development of reliable E3NNs from molecular conformers, even for small-data drug discovery programs. |
Michael Maser · Joshua Yao-Yu Lin · Ji Won Park · Jae Hyeon Lee · Nathan Frey · Andrew Watkins 🔗 |
Fri 10:00 a.m. - 10:55 a.m.
|
Improving Small Molecule Generation using Mutual Information Machine
(
Poster
)
link »
We address the task of controlled generation of small molecules, which entails finding novel molecules with desired properties under certain constraints. Here we introduce MolMIM, a probabilistic auto-encoder for small molecule drug discovery that learns an informative and clustered latent space. MolMIM is trained with Mutual Information Machine (MIM) learning and provides a fixed-size representation of variable-length SMILES strings. Since encoder-decoder models can learn representations with "holes" of invalid samples, here we propose a novel extension to the MIM training procedure which promotes a dense latent space and allows the model to sample valid molecules from random perturbations of latent codes. We provide a thorough comparison of MolMIM to several variable-size and fixed-size encoder-decoder models, demonstrating MolMIM's superior generation as measured in terms of validity, uniqueness, and novelty. We then utilize CMA-ES, a naive black-box, gradient-free search algorithm, over MolMIM's latent space for the task of property-guided molecule optimization. We achieve state-of-the-art results in several constrained single-property optimization tasks and show competitive results in the challenging task of multi-objective optimization. We attribute the strong results to the structure of MolMIM's learned representation, which promotes the clustering of similar molecules in the latent space, whereas CMA-ES is often used as a baseline optimization method. We also demonstrate MolMIM to be favorable in a compute-limited regime. |
Danny Reidenbach · Micha Livne · Rajesh Ilango · Michelle Gill · Johnny Israeli 🔗 |
Fri 10:00 a.m. - 10:55 a.m.
|
Learning Protein Family Manifolds with Smoothed Energy-Based Models
(
Poster
)
link »
We resolve difficulties in training and sampling from discrete energy-based models (EBMs) by learning a smoothed energy landscape, sampling the smoothed data manifold with Langevin Markov chain Monte Carlo, and projecting back to the true data manifold with one-step denoising. Our formalism combines the attractive properties of EBMs and improved sample quality of score-based models, while simplifying training and sampling by requiring only a single noise scale. We demonstrate the robustness of our approach on generative modeling of antibody proteins. |
Nathan Frey · Dan Berenberg · Joseph Kleinhenz · Stephen Ra · Isidro Hotzel · Julien Lafrance-Vanasse · Ryan Kelly · Yan Wu · Arvind Rajpal · Richard Bonneau · Kyunghyun Cho · Andreas Loukas · Vladimir Gligorijevic · Saeed Saremi 🔗 |
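The single-noise-scale recipe above (Langevin MCMC on the smoothed landscape, then one-step denoising) can be sketched on a toy problem. This is an assumption-laden illustration, not the paper's model: the learned neural energy is replaced by the analytic energy of a 1D Gaussian "data manifold" smoothed with noise scale sigma, and the denoising step is the Tweedie estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5            # the single smoothing noise scale
mu, s_data = 2.0, 0.3  # toy 1D "data manifold": N(mu, s_data^2)
var_smooth = s_data**2 + sigma**2

def grad_energy(y):
    # Analytic energy gradient of the smoothed density N(mu, var_smooth);
    # in the paper this gradient would come from a learned neural energy.
    return (y - mu) / var_smooth

# Langevin MCMC on the smoothed landscape
y = rng.standard_normal(5000)          # chains start far from the data
eps = 0.05
for _ in range(500):
    y = y - eps * grad_energy(y) + np.sqrt(2 * eps) * rng.standard_normal(y.shape)

# One-step denoising (Tweedie): project back toward the data manifold,
# using grad log p = -grad E
x_hat = y - sigma**2 * grad_energy(y)
```

The denoised samples concentrate much more tightly around the data than the smoothed Langevin samples do, which is the point of the projection step.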
Fri 10:00 a.m. - 10:55 a.m.
|
Multiparameter Persistent Homology for Molecular Property Prediction
(
Poster
)
link »
In this study, we present a novel molecular fingerprint generation method based on multiparameter persistent homology. The approach reveals latent structures and relationships within molecular geometry by detecting topological features that persist across multiple scales along multiple parameters, such as atomic mass, partial charge, and bond type, and it can be further enhanced by incorporating additional parameters such as ionization energy, electron affinity, chirality, and orbital hybridization. The proposed fingerprinting method provides fresh perspectives on molecular structure that are not easily discernible from single-parameter or single-scale analysis. Moreover, compared with traditional graph neural networks, multiparameter persistent homology provides a more comprehensive and interpretable characterization of the topology of molecular data. We establish theoretical stability guarantees for multiparameter persistent homology and conduct extensive experiments on the Lipophilicity, FreeSolv, and ESOL datasets to demonstrate its effectiveness in predicting molecular properties. |
Andac Demir · Bulent Kiziltan 🔗 |
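The simplest multiparameter invariant is the dimension of H0 (number of connected components) at each point of a parameter grid. The sketch below is only a minimal illustration of that idea, with made-up atom values rather than real chemistry: atoms enter a bifiltration once both their mass and charge magnitude fall below the grid thresholds, bonds enter once both endpoints are present, and components are counted with union-find:

```python
import numpy as np

def betti0(active, edges):
    """Count connected components among the active vertices (union-find)."""
    parent = list(range(len(active)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    comps = int(active.sum())
    for u, v in edges:
        if active[u] and active[v]:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                comps -= 1
    return comps

# Toy "molecule": 4 atoms with (mass, |partial charge|) and a star of 3 bonds.
# Values are illustrative only.
mass   = np.array([12.0, 1.0, 16.0, 14.0])
charge = np.array([0.4, 0.05, 0.1, 0.3])
bonds  = [(0, 1), (0, 2), (0, 3)]

# Bifiltration fingerprint: Betti-0 on a grid of (mass, charge) thresholds.
mass_grid   = [1.0, 12.0, 14.0, 16.0]
charge_grid = [0.05, 0.1, 0.3, 0.4]
fingerprint = np.array([
    [betti0((mass <= m) & (charge <= c), bonds) for c in charge_grid]
    for m in mass_grid
])
```

Each row/column of `fingerprint` is a single-parameter profile; the 2D grid records how components merge only when *both* thresholds grow, which is the structure invisible to single-parameter analysis.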
Fri 10:00 a.m. - 10:55 a.m.
|
Geometry-Complete Diffusion for 3D Molecule Generation
(
Poster
)
link »
Denoising diffusion probabilistic models (DDPMs) have recently taken the field of generative modeling by storm, pioneering new state-of-the-art results in disciplines such as computer vision and computational biology for diverse tasks ranging from text-guided image generation to structure-guided protein design. Along this latter line of research, methods such as those of Hoogeboom et al. 2022 have been proposed for unconditionally generating 3D molecules using equivariant graph neural networks (GNNs) within a DDPM framework. Toward this end, we propose GCDM, a geometry-complete diffusion model that achieves new state-of-the-art results for 3D molecule diffusion generation by leveraging the representation learning strengths offered by GNNs that perform geometry-complete message-passing. Our results with GCDM also offer preliminary insights into how physical inductive biases impact the generative dynamics of molecular DDPMs. The source code, data, and instructions to train new models or reproduce our results are freely available at https://github.com/BioinfoMachineLearning/bio-diffusion. |
Alex Morehead · Jianlin Cheng 🔗 |
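The DDPM scaffolding underlying this family of models can be shown in a few lines. The sketch below is not GCDM (there is no equivariant GNN denoiser here); it only shows the closed-form forward process on toy 3D coordinates, zero-centred as is standard for translation-invariant molecular diffusion:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear variance schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # cumulative signal retention

def q_sample(x0, t, eps):
    # Closed-form forward diffusion:
    # x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# Toy "molecule": 5 atoms' 3D coordinates, centred at the origin because the
# global translation is unidentifiable for a molecule.
x0 = rng.standard_normal((5, 3))
x0 -= x0.mean(axis=0)

eps = rng.standard_normal(x0.shape)
x_T = q_sample(x0, T - 1, eps)   # by t = T the signal is essentially destroyed
```

A trained model would learn to invert this process step by step; the geometry-complete message passing in GCDM concerns how that denoiser is parameterized, not the schedule itself.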
Fri 10:00 a.m. - 10:55 a.m.
|
GraphGUIDE: interpretable and controllable conditional graph generation with discrete Bernoulli diffusion
(
Poster
)
link »
Diffusion models achieve state-of-the-art performance in generating realistic objects and have been successfully applied to images, text, and videos. Recent work has shown that diffusion can also be defined on graphs, including graph representations of drug-like molecules. Unfortunately, it remains difficult to perform conditional generation on graphs in a way which is interpretable and controllable. In this work, we propose GraphGUIDE, a novel framework for graph generation using diffusion models, where edges in the graph are flipped or set at each discrete time step. We demonstrate GraphGUIDE on several graph datasets, and show that it enables full control over the conditional generation of arbitrary structural properties without relying on predefined labels. Our framework for graph diffusion can have a large impact on the interpretable conditional generation of graphs, including the generation of drug-like molecules with desired properties in a way which is informed by experimental evidence. |
Alex M Tseng · Nathaniel Diamant · Tommaso Biancalani · Gabriele Scalia 🔗 |
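The discrete forward process described above (edge bits flipped at each time step) is easy to simulate. The sketch below is only an illustration of that forward corruption on a toy graph, with an assumed constant flip probability; it does not include the learned reverse process that makes GraphGUIDE controllable:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 12, 200
flip_p = np.full(T, 0.02)   # per-step Bernoulli flip probability per edge bit

# Start from a structured graph: a 12-node ring.
A = np.zeros((n, n), dtype=int)
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1

# Diffuse the upper-triangular edge bits: each bit flips independently.
iu = np.triu_indices(n, k=1)
bits = A[iu].copy()
for t in range(T):
    flips = rng.random(bits.shape) < flip_p[t]
    bits = np.where(flips, 1 - bits, bits)
```

After enough steps each edge bit is close to an independent fair coin, i.e. the terminal distribution is a uniform random graph, from which the reverse process would rebuild structure edge by edge.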
Fri 10:00 a.m. - 10:55 a.m.
|
Enhancing Protein Language Model with Structure-based Encoder and Pre-training
(
Poster
)
link »
Protein language models (PLMs) pre-trained on large-scale protein sequence corpora have achieved impressive performance on various downstream protein understanding tasks. Although they can implicitly capture inter-residue contact information, transformer-based PLMs cannot encode protein structures explicitly to obtain better structure-aware protein representations. Moreover, the power of pre-training on available protein structures has not been explored for improving these PLMs, even though structure is important in determining function. To tackle these limitations, we enhance PLMs with a structure-based encoder and structure-based pre-training. We first explore feasible model architectures that combine the advantages of a state-of-the-art PLM (ESM-1b) and a state-of-the-art protein structure encoder (GearNet). We empirically verify that ESM-GearNet, which connects the two encoders in series, is the most effective combination. To further improve ESM-GearNet, we pre-train it on massive unlabeled protein structures with contrastive learning, aligning the representations of co-occurring subsequences so as to capture their biological correlation. Extensive experiments on EC and GO protein function prediction benchmarks demonstrate the superiority of ESM-GearNet over previous PLMs and structure encoders, and structure-based pre-training yields further clear performance gains. The source code will be made public upon acceptance. |
Zuobai Zhang · Minghao Xu · Aurelie Lozano · Vijil Chenthamarakshan · Payel Das · Jian Tang 🔗 |
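Contrastive alignment of paired views, as used for the structure-based pre-training above, is commonly implemented with an InfoNCE objective. The sketch below is a generic numpy version on random vectors, not the authors' pipeline: `z1[i]` and `z2[i]` stand in for embeddings of two views of the same protein region, with the rest of the batch as negatives:

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE loss aligning paired views.

    Positives are the matched pairs on the diagonal of the similarity
    matrix; every other pair in the batch acts as a negative.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                       # cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
anchor = rng.standard_normal((16, 32))
aligned = anchor + 0.05 * rng.standard_normal((16, 32))  # matched views
random_ = rng.standard_normal((16, 32))                  # unrelated views
```

Minimizing this loss pulls matched views together and pushes mismatched ones apart, which is what lets co-occurring subsequences share representation structure.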
Fri 10:00 a.m. - 10:55 a.m.
|
EurNet: Efficient Multi-Range Relational Modeling of Protein Structure
(
Poster
)
link »
Modeling the 3D structures of proteins is critical for obtaining effective protein structure representations, which in turn boost protein function understanding. Existing protein structure encoders mainly model short-range interactions within protein structures and neglect the interactions at multiple length scales that together form the complete interaction patterns of a protein. To attain complete interaction modeling with efficient computation, we introduce EurNet for efficient multi-range relational modeling. EurNet represents a protein structure as a multi-relational residue-level graph with different edge types for short-range, medium-range, and long-range interactions. To process these different relations efficiently, we propose a novel modeling layer, Gated Relational Message Passing (GRMP), as the basic building block of EurNet. GRMP captures multiple interactive relations in protein structures with little extra computational cost. We verify the state-of-the-art performance of EurNet on EC and GO protein function prediction benchmarks, and the proposed GRMP layer achieves a better efficiency-performance trade-off than the widely used relational graph convolution. |
Minghao Xu · Yuanfan Guo · Yi Xu · Jian Tang · Xinlei Chen · Yuandong Tian 🔗 |
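The multi-relational setup above can be sketched generically. The code below is one plausible reading of a gated relational message-passing step, not the actual GRMP layer: each relation (short-, medium-, long-range) carries its own message weights, and a sigmoid gate computed at the receiver modulates each message before it is added to a residual update. All weight shapes and the gating form are assumptions for illustration:

```python
import numpy as np

def grmp_layer(h, rel_edges, W_rel, W_gate):
    """One gated relational message-passing step (illustrative sketch).

    h: (n, d) residue features.
    rel_edges: relation name -> list of directed (src, dst) edges.
    W_rel / W_gate: per-relation (d, d) message and gate weights.
    """
    out = h.copy()                                   # residual connection
    for r, edges in rel_edges.items():
        for src, dst in edges:
            msg = h[src] @ W_rel[r]                  # relation-specific message
            gate = 1.0 / (1.0 + np.exp(-(h[dst] @ W_gate[r])))  # sigmoid gate
            out[dst] += gate * msg                   # gated aggregation
    return out

rng = np.random.default_rng(0)
n, d = 6, 4
h = rng.standard_normal((n, d))
# Short-, medium-, and long-range interactions as separate edge types.
rel_edges = {"short": [(0, 1), (1, 2)], "medium": [(0, 3)], "long": [(0, 5)]}
W_rel  = {r: 0.1 * rng.standard_normal((d, d)) for r in rel_edges}
W_gate = {r: 0.1 * rng.standard_normal((d, d)) for r in rel_edges}
h_next = grmp_layer(h, rel_edges, W_rel, W_gate)
```

The efficiency argument in the abstract is that sharing one layer across edge types with cheap gating avoids the per-relation cost of a full relational graph convolution.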
Fri 10:55 a.m. - 11:00 a.m.
|
Closing Remarks
SlidesLive Video » |
🔗 |