Reinforcement Learning for Black-Box Objectives in Logit-Based Protein Hallucination
Jacob Pettit ⋅ Steven Magana-Zook ⋅ Mikel Landajuela Larma
Abstract
Gradient-based ``hallucination'' methods such as Germinal and BindCraft enable efficient protein binder design by optimizing a continuous relaxation of the sequence (a logit matrix) using gradients from differentiable structure predictors and protein language models. However, many practically useful objectives for ranking or filtering designs are non-differentiable with respect to sequence (e.g., external confidence metrics from AlphaFold3 or Chai, experimental readouts, or arbitrary black-box scores), preventing direct backpropagation through the objective. We extend the Germinal pipeline with an \emph{optional} policy-gradient update on the sequence logits, enabling direct optimization of black-box rewards while preserving Germinal's differentiable optimization backbone. Our implementation reuses Germinal's existing filter metrics as modular reward components and supports both Chai-1 and AlphaFold3 backends for reward evaluation. On a nanobody design task against PD-L1, adding a small policy-gradient term during the \emph{high-softness} portion of Germinal's optimization yields a 2.3$\times$ improvement in acceptance rate among processed trajectories, from $0.181\pm0.107$ to $0.342\pm0.097$ (mean$\pm$std over six seeds), while maintaining comparable confidence metrics for accepted designs. This suggests that combining gradient-based hallucination with policy-gradient-based black-box optimization can improve design success rates, potentially improving downstream wet-lab metrics.
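As an illustration of what a policy-gradient update on sequence logits can look like, the following is a minimal sketch of a score-function (REINFORCE) estimator with a batch-mean baseline. It is not Germinal's implementation: the function names, the learning rate, the sample count, and the toy reward (number of positions matching an arbitrary target sequence) are all assumptions made for this example. In the setting described above, the reward would instead come from a black-box evaluator such as Chai-1 or AlphaFold3 confidence metrics, and the resulting gradient would be added as a small term alongside the differentiable update rather than used on its own.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues


def softmax(logits):
    """Row-wise softmax over the amino-acid axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)


def sample_sequences(probs, n_samples, rng):
    """Draw discrete sequences position by position via inverse-CDF sampling."""
    cdf = probs.cumsum(axis=-1)                       # (L, A)
    u = rng.random((n_samples, probs.shape[0], 1))    # (N, L, 1)
    idx = (u > cdf[None]).sum(axis=-1)                # (N, L)
    return np.minimum(idx, probs.shape[1] - 1)        # guard against rounding


def reinforce_gradient(logits, reward_fn, n_samples, rng):
    """Score-function estimate of d E[reward] / d logits.

    reward_fn is treated as a black box: it is only called on sampled
    sequences and is never differentiated.
    """
    probs = softmax(logits)
    seqs = sample_sequences(probs, n_samples, rng)
    rewards = np.array([reward_fn(s) for s in seqs], dtype=float)
    advantages = rewards - rewards.mean()             # batch-mean baseline
    one_hot = np.eye(logits.shape[1])[seqs]           # (N, L, A)
    # For a per-position categorical policy,
    # d log p(seq) / d logits = one_hot(seq) - probs.
    grad = (advantages[:, None, None] * (one_hot - probs[None])).mean(axis=0)
    return grad, rewards.mean()


def optimize_logits(steps=300, n_samples=64, lr=0.5, seed=0):
    """Gradient ascent on a toy black-box reward (hypothetical stand-in)."""
    rng = np.random.default_rng(seed)
    target = np.array([AMINO_ACIDS.index(c) for c in "ACDEFGHI"])

    def toy_reward(seq):
        # Hypothetical reward: count of positions matching the target.
        return float((seq == target).sum())

    logits = np.zeros((len(target), len(AMINO_ACIDS)))
    history = []
    for _ in range(steps):
        grad, mean_reward = reinforce_gradient(logits, toy_reward, n_samples, rng)
        logits += lr * grad                           # ascent on expected reward
        history.append(mean_reward)
    return history


if __name__ == "__main__":
    hist = optimize_logits()
    print(f"mean reward: first step {hist[0]:.2f}, last step {hist[-1]:.2f}")
```

Subtracting the batch mean from the rewards is one common variance-reduction choice; it leaves each row of the gradient summing to zero, so the update shifts probability mass between residues without changing the overall logit scale at a position.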