MutaGen: Implicitly Guided Protein Evolution from Ranked Feedback via Pair-Based Discrete Flow Matching
Abstract
Machine learning-directed evolution (MLDE) aims to democratize protein engineering by enabling the optimization of any protein with any assay at an accessible cost, drastically reducing the need to screen thousands of protein sequences. In this work, we introduce MutaGen, a novel discrete flow-matching (DFM) method trained to iteratively mutate protein sequences towards high-fitness regions of the protein fitness landscape without relying on noisy in silico fitness predictions. Training minimizes a token-level cross-entropy flow-matching loss, learning a vector field of improvement from ranked sequence pairs alone. Across realistic screening budgets, MutaGen enables multi-mutational protein optimization with minimal data (as few as 20 sequences per round of evolution) while bypassing the need for an explicit fitness predictor. We validate our approach on standard in silico benchmarks (GFP and AAV) and experimentally in a four-round campaign on NanoLuc, achieving a >80-fold increase in luminescence over the wild type.
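
To make the training objective concrete, the following is a minimal sketch of a pair-based, token-level cross-entropy flow-matching loss. It is an illustration under stated assumptions, not the MutaGen implementation: the abstract does not specify the probability path or the model parameterization, so the sketch assumes equal-length sequences, a per-position mixture path between the lower-ranked and higher-ranked sequence, and a model that predicts the tokens of the higher-ranked sequence. The names `interpolate_pair`, `pair_flow_matching_loss`, and `uniform_model`, as well as the example sequences, are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(seq):
    """Map an amino-acid string to an array of integer token ids."""
    return np.array([TOKEN_ID[aa] for aa in seq], dtype=np.int64)


def interpolate_pair(x_low, x_high, t, rng):
    """Assumed mixture path: each position independently takes the
    higher-ranked token with probability t, else the lower-ranked token."""
    take_high = rng.random(x_low.shape) < t
    return np.where(take_high, x_high, x_low)


def token_cross_entropy(logits, targets):
    """Mean per-token cross-entropy; logits has shape (L, V), targets (L,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(targets.size), targets].mean())


def pair_flow_matching_loss(model, x_low, x_high, rng):
    """Sample a time t, build the intermediate sequence x_t, and score the
    model's per-token prediction of the higher-ranked sequence."""
    t = rng.uniform()
    x_t = interpolate_pair(x_low, x_high, t, rng)
    logits = model(x_t, t)  # hypothetical model interface: (x_t, t) -> (L, V)
    return token_cross_entropy(logits, x_high)


def uniform_model(x_t, t):
    """Placeholder for a trained network: uniform logits over all tokens."""
    return np.zeros((x_t.size, len(AMINO_ACIDS)))


rng = np.random.default_rng(0)
x_low = encode("MKTAYIAK")   # hypothetical lower-ranked sequence
x_high = encode("MKTWYIAR")  # hypothetical higher-ranked sequence
loss = pair_flow_matching_loss(uniform_model, x_low, x_high, rng)
print(f"loss for the uniform placeholder model: {loss:.4f}")
```

Only the ranking within each pair enters this loss, which is why no explicit fitness predictor is needed in the sketch. With the uniform placeholder the loss equals the log of the vocabulary size; a trained network would replace `uniform_model`.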