Investigation of Scaling Laws for Encoder-Decoder Protein Language Models
Abstract
We explore the optimal scaling of encoder-decoder protein language models, a critical yet underexplored architecture for deciphering biological systems. While dense transformer models have achieved remarkable success, guidance on efficiently scaling sparse and asymmetric architectures in the protein domain remains limited. Our investigation rests on a systematic study of over 90 model variants, spanning model scales from 100M to 5B total parameters and training budgets from 8B to 64B tokens. We derive unified scaling laws for both dense and sparse Mixture-of-Experts (MoE) models, providing the first comprehensive roadmap for this design space. First, we demonstrate that MoE models consistently scale more efficiently than their dense counterparts, offering a more favorable compute-performance frontier. Second, we dissect the behavior of asymmetric encoder-decoder configurations (e.g., 5-of-6 ratios). While these architectures can accelerate convergence, we uncover a critical stability-efficiency trade-off and identify specific regimes where training instability may offset the efficiency gains. Finally, we validate our scaling laws across eight diverse downstream tasks, confirming that our compute-optimal findings translate directly to biological predictive power. Our work provides empirical evidence and practical guidelines for developing next-generation, compute-efficient protein language models.