Oral Session
Oral Session 3B
Moderator: Wenhan Gao
Learning to Discover Regulatory Elements for Gene Expression Prediction
Xingyu Su · Haiyang Yu · Degui Zhi · Shuiwang Ji
We consider the problem of predicting gene expression from DNA sequences. A key challenge of this task is finding the regulatory elements that control gene expression. Here, we introduce Seq2Exp, a Sequence-to-Expression network explicitly designed to discover and extract the regulatory elements that drive target gene expression, enhancing the accuracy of gene expression prediction. Our approach captures the causal relationships among epigenomic signals, DNA sequences, and their associated regulatory elements. Specifically, we propose to decompose the epigenomic signals and the DNA sequence conditioned on the causal active regulatory elements, and apply an information bottleneck with the Beta distribution to combine their effects while filtering out non-causal components. Our experiments demonstrate that Seq2Exp outperforms existing baselines on gene expression prediction tasks and discovers more influential regions than commonly used statistical peak-detection methods such as MACS3. The source code is released as part of the AIRS library (https://github.com/divelab/AIRS/).
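The core idea of a Beta-distributed information bottleneck can be illustrated with a toy sketch: sample a per-position soft mask from a Beta distribution and use it to gate combined sequence and epigenomic features. The function name, feature shapes, and the way the two signals are combined are all hypothetical stand-ins for the learned components described in the abstract, not the actual Seq2Exp implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def beta_mask_bottleneck(seq_feats, epi_signal, alpha, beta):
    """Sample a per-position soft mask from Beta(alpha, beta) and gate features.

    Positions whose mask value is near 1 are treated as putative regulatory
    elements; positions near 0 are filtered out as non-causal background.
    (Hypothetical stand-in for the learned bottleneck in Seq2Exp.)
    """
    mask = rng.beta(alpha, beta, size=seq_feats.shape[0])      # shape (L,)
    gated = mask[:, None] * (seq_feats + epi_signal[:, None])  # combine + filter
    return gated, mask

L = 8
seq_feats = rng.normal(size=(L, 4))   # toy per-position sequence embeddings
epi_signal = rng.random(L)            # toy epigenomic track (e.g. accessibility)
gated, mask = beta_mask_bottleneck(seq_feats, epi_signal, alpha=2.0, beta=5.0)
```

With alpha < beta the Beta prior is skewed toward 0, so most positions are suppressed by default and only a few survive the bottleneck, matching the intuition that regulatory elements are sparse along the sequence.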
Steering Protein Family Design through Profile Bayesian Flow
Jingjing Gong · Yu Pei · Siyu Long · Yuxuan Song · Zhe Zhang · Wenhao Huang · Ziyao Cao · Shuyi Zhang · Hao Zhou · Wei-Ying Ma
Protein family design is emerging as a promising alternative that combines the advantages of de novo protein design and mutation-based directed evolution. In this paper, we propose ProfileBFN, the Profile Bayesian Flow Network, specifically for generative modeling of protein families. ProfileBFN extends the discrete Bayesian Flow Network from an MSA-profile perspective and can be trained on single protein sequences by regarding each as a degenerate profile, thereby achieving efficient protein family design without large-scale MSA data construction and training. Empirical results show that ProfileBFN has a profound understanding of proteins: when generating diverse and novel family proteins, it accurately captures the structural characteristics of the family. Enzymes produced by this method are more likely than those from previous approaches to have the corresponding function, offering better odds of generating diverse proteins with the desired functionality.
Proteina: Scaling Flow-based Protein Structure Generative Models
Tomas Geffner · Kieran Didi · Zuobai Zhang · Danny Reidenbach · Zhonglin Cao · Jason Yim · Mario Geiger · Christian Dallago · Emine Kucukbenli · Arash Vahdat · Karsten Kreis
Recently, diffusion- and flow-based generative models of protein structures have emerged as powerful tools for de novo protein design. Here, we develop *Proteina*, a new large-scale flow-based protein backbone generator that utilizes hierarchical fold class labels for conditioning and relies on a tailored scalable transformer architecture with up to $5\times$ as many parameters as previous models. To meaningfully quantify performance, we introduce a new set of metrics that directly measure the distributional similarity of generated proteins with reference sets, complementing existing metrics. We further scale training data to millions of synthetic protein structures and explore improved training and sampling recipes adapted to protein backbone generation. This includes fine-tuning strategies like LoRA for protein backbones, new guidance methods like classifier-free guidance and autoguidance for protein backbones, and new adjusted training objectives. Proteina achieves state-of-the-art performance on de novo protein backbone design and produces diverse and designable proteins at unprecedented length, up to 800 residues. The hierarchical conditioning offers novel control, enabling high-level secondary-structure guidance as well as low-level fold-specific generation.
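Classifier-free guidance, one of the sampling recipes mentioned above, has a simple general form: extrapolate from the unconditional vector field toward the conditional one. The sketch below illustrates this combination rule in isolation; the function name and the toy velocity vectors are illustrative and do not reflect Proteina's actual networks or conditioning interface.

```python
import numpy as np

def guided_velocity(v_uncond, v_cond, w):
    """Classifier-free guidance for a flow model: extrapolate from the
    unconditional vector field toward the conditional one with weight w.
    w = 0 recovers unconditional sampling, w = 1 plain conditional
    sampling, and w > 1 amplifies the conditioning signal (here, the
    conditioning would be a fold-class label)."""
    return v_uncond + w * (v_cond - v_uncond)

v_u = np.array([0.1, -0.3])   # toy unconditional velocity at one coordinate
v_c = np.array([0.5,  0.2])   # toy fold-conditional velocity
print(guided_velocity(v_u, v_c, w=2.0))  # pushed past v_c, away from v_u
```

In practice the two vector fields come from the same network evaluated with and without the condition (the unconditional branch trained by randomly dropping labels), so guidance costs one extra forward pass per sampling step.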
Latent Bayesian Optimization via Autoregressive Normalizing Flows
Seunghun Lee · Jinyoung Park · Jaewon Chu · Minseo Yoon · Hyunwoo Kim
Bayesian Optimization (BO) has been recognized for its effectiveness in optimizing expensive and complex objective functions. Recent advances in Latent Bayesian Optimization (LBO) have shown promise by integrating generative models such as variational autoencoders (VAEs) to manage the complexity of high-dimensional and structured data spaces. However, existing LBO approaches often suffer from the value discrepancy problem, which arises from the reconstruction gap between input and latent spaces. This discrepancy propagates errors throughout the optimization process, leading to suboptimal outcomes. To address this issue, we propose Normalizing Flow-based Bayesian Optimization (NF-BO), which uses a normalizing flow as the generative model to establish a one-to-one mapping between input and latent spaces, eliminating the reconstruction gap. Specifically, we introduce SeqFlow, an autoregressive normalizing flow for sequence data. In addition, we develop a new candidate sampling strategy that dynamically adjusts the exploration probability of each token based on its importance. Through extensive experiments, NF-BO demonstrates superior performance on molecule generation tasks, significantly outperforming both traditional and recent LBO approaches.
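The key property claimed here, that a normalizing flow has no reconstruction gap, follows from its invertibility, and can be demonstrated with a minimal autoregressive affine flow. The conditioner below is a hypothetical closed-form stand-in for a learned network (SeqFlow's actual architecture is not shown in the abstract); the point is only that the forward and inverse passes compose to the exact identity, unlike a VAE encoder/decoder pair.

```python
import numpy as np

def conditioner(x_prev):
    """Toy autoregressive conditioner: (shift, log_scale) computed from the
    previously seen dimensions. A stand-in for a learned network."""
    if len(x_prev) == 0:
        return 0.0, 0.0
    m = float(np.mean(x_prev))
    return m, float(np.tanh(m))

def flow_forward(x):
    """Map data x to latent z one dimension at a time (MAF-style)."""
    z = np.empty_like(x)
    for i in range(len(x)):
        shift, log_s = conditioner(x[:i])
        z[i] = (x[i] - shift) * np.exp(-log_s)
    return z

def flow_inverse(z):
    """Exact inverse: reconstruct x sequentially from z."""
    x = np.empty_like(z)
    for i in range(len(z)):
        shift, log_s = conditioner(x[:i])   # conditions on reconstructed x_<i
        x[i] = z[i] * np.exp(log_s) + shift
    return x

x = np.array([0.3, -1.2, 2.0, 0.5])
z = flow_forward(x)
x_rec = flow_inverse(z)
# round trip is exact up to floating point: no reconstruction gap
```

Because decoding any latent point returns exactly the input that produced it, the surrogate's objective values in latent space stay consistent with the true objective in input space, which is precisely the value discrepancy the abstract says VAE-based LBO suffers from.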
Composing Unbalanced Flows for Flexible Docking and Relaxation
Gabriele Corso · Vignesh Ram Somnath · Noah Getz · Regina Barzilay · Tommi Jaakkola · Andreas Krause
Diffusion models have emerged as a successful approach for molecular docking, but they often cannot model protein flexibility and frequently generate nonphysical poses. We argue that both of these challenges can be tackled by framing the problem as a transport between distributions. Still, existing paradigms lack the flexibility to define effective maps between such complex distributions. To address this limitation, we propose Unbalanced Flow Matching, a generalization of Flow Matching (FM) that allows trading off sample efficiency against approximation accuracy and enables more accurate transport. Empirically, we apply Unbalanced FM to flexible docking and structure relaxation, demonstrating the ability to model protein flexibility and generate energetically favorable poses. On the PDBBind docking benchmark, our method FlexDock improves docking performance while increasing the proportion of energetically favorable poses from 30% to 73%.