Towards A Generative Protein Evolution Machine with DPLM-Evo
Abstract
Proteins are shaped by evolution under biophysical and functional constraints. Protein language models (PLMs) can learn rich evolutionary constraints, and discrete diffusion-based PLMs (DPLMs) are promising for both understanding and generation. However, existing DPLMs rely on masking-based diffusion, which is only a loose proxy for evolution and cannot readily model the edit operations that drive sequence change in nature: substitutions and insertions/deletions (indels). In this paper, we present DPLM-Evo, an evolutionary discrete diffusion protein language model that explicitly predicts substitution, insertion, and deletion actions during denoising. To make indel-aware generation tractable, we introduce a latent alignment formulation that supports variable-length sequences. To make substitution corruption informative, we propose a contextual evolutionary noising kernel that generates biologically plausible mutations. On ProteinGym, DPLM-Evo achieves state-of-the-art mutation effect prediction in the single-sequence setting, and it enables variable-length generation and post-editing via explicit edit trajectories.
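To make the notion of an explicit edit trajectory concrete, the following is a minimal illustrative sketch of how substitution, insertion, and deletion actions change a sequence and its length. It is not the DPLM-Evo implementation: the names `Edit`, `apply_edit`, and `apply_trajectory` are hypothetical, the edits are hand-written rather than predicted by a model, and each position is assumed to be a 0-based index into the sequence as it stands before that edit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edit:
    """One hypothetical edit action (illustrative, not the paper's data structure)."""
    op: str                        # "sub", "ins", or "del"
    pos: int                       # 0-based index into the current sequence
    residue: Optional[str] = None  # amino acid used by "sub" and "ins"


def apply_edit(seq: str, edit: Edit) -> str:
    """Return a new sequence with a single edit applied."""
    if edit.op == "ins":
        # Insertion may target any gap, including the end of the sequence.
        if not 0 <= edit.pos <= len(seq) or not edit.residue:
            raise ValueError(f"invalid insertion: {edit}")
        return seq[:edit.pos] + edit.residue + seq[edit.pos:]
    if not 0 <= edit.pos < len(seq):
        raise ValueError(f"position out of range: {edit}")
    if edit.op == "sub":
        if not edit.residue:
            raise ValueError(f"substitution needs a residue: {edit}")
        return seq[:edit.pos] + edit.residue + seq[edit.pos + 1:]
    if edit.op == "del":
        return seq[:edit.pos] + seq[edit.pos + 1:]
    raise ValueError(f"unknown operation: {edit.op}")


def apply_trajectory(seq: str, edits: List[Edit]) -> List[str]:
    """Apply edits in order and return every intermediate sequence."""
    states = [seq]
    for edit in edits:
        states.append(apply_edit(states[-1], edit))
    return states


if __name__ == "__main__":
    trajectory = apply_trajectory(
        "MKTAY",
        [Edit("sub", 2, "S"), Edit("ins", 5, "L"), Edit("del", 0)],
    )
    for state in trajectory:
        print(len(state), state)
```

In this toy trajectory the substitution keeps the length fixed, while the insertion and the deletion change it, which is the variable-length behavior that masking-only corruption does not express directly.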