MAGO: Beyond Fixed Hyperparameters with Multi-Objective Pareto Optimization for Hybrid LLM Reasoning
Xuanze Zhao · Hongcheng Ding · Ruiting Deng · Liu Qingyu · Deshinta Dewi · Shamsul Abdullah
Abstract
Large language models (LLMs) with advanced step-by-step reasoning capabilities have achieved remarkable performance in complex problem-solving through chain-of-thought (CoT) reasoning. However, uniformly applying elaborate reasoning to all queries creates substantial computational inefficiency, as many problems can be solved directly without extended reasoning chains. Current hybrid reasoning approaches rely on static hyperparameters and heuristic single-objective optimization, leading to suboptimal trade-offs and poor adaptation to varying task complexities. To address these limitations, we propose a multi-objective adaptive generation optimization (MAGO) framework, which integrates multi-objective optimization with dynamic adaptive weighting into hybrid reasoning. MAGO optimizes three competing objectives simultaneously: accuracy (maintaining solution correctness), efficiency (minimizing computational costs through appropriate mode selection), and calibration (ensuring mode selection aligns with model capabilities). The framework employs Pareto frontier maintenance with correlation-aware optimization to automatically explore the full trade-off space, avoiding the spatial constraints that limit fixed-weight approaches to narrow cone-shaped regions of the objective space. Unlike existing methods requiring manual hyperparameter tuning, MAGO's Pareto optimization dynamically adapts weights based on task complexity and training progress, achieving principled and adaptive decision-making across varying problem complexities. Comprehensive evaluation on mathematical reasoning benchmarks including AIME, Minerva Algebra, MATH-500, and GSM-8K shows $2.2\times$ to $3\times$ token-efficiency gains and relative accuracy improvements of $0.6\%$ to $9.4\%$ over heuristic baselines, while remaining competitive with the strongest task-specific models. 
Additional experiments on CommonsenseQA and MedQA further confirm the framework's generalizability beyond mathematics, achieving $1\%$ to $2\%$ higher accuracy and approximately $2\times$ efficiency improvement without additional fine-tuning.
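To make the abstract's contrast concrete: a fixed-weight scalarization collapses the three objectives into one score and can only ever select points inside a cone of the objective space, whereas maintaining the full Pareto frontier keeps every non-dominated trade-off available. The sketch below is illustrative only, not the authors' implementation; the candidate policies and their (accuracy, efficiency, calibration) scores are hypothetical.

```python
# Illustrative sketch (not the MAGO implementation): Pareto frontier
# maintenance over three objectives -- accuracy, efficiency, calibration.
# All candidate tuples are hypothetical values for imagined policies.

def dominates(a, b):
    """a dominates b if a >= b on every objective and > on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(candidates):
    """Keep only the candidates not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o is not c)]

# Hypothetical (accuracy, efficiency, calibration) scores per policy.
candidates = [
    (0.90, 0.40, 0.80),  # accurate but computationally costly
    (0.70, 0.95, 0.75),  # cheap but less accurate
    (0.85, 0.70, 0.85),  # balanced
    (0.60, 0.60, 0.60),  # dominated by the balanced policy
]

front = pareto_frontier(candidates)

# A fixed-weight scalarization, by contrast, commits to one point on the
# frontier and discards the rest of the trade-off space.
w = (0.5, 0.3, 0.2)  # arbitrary fixed weights
best_fixed = max(candidates, key=lambda c: sum(wi * ci for wi, ci in zip(w, c)))
```

Dynamic adaptive weighting, as described in the abstract, would instead adjust `w` over training rather than fixing it up front.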