Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning
Yonghyeon Jo · Sunwoo Lee · Seungyul Han
Abstract
Value decomposition has been extensively studied as a core approach for cooperative multi-agent reinforcement learning (MARL) under the centralized training with decentralized execution (CTDE) paradigm. Despite this progress, existing methods still rely on tracking a single optimal action and struggle to adapt when the underlying value function shifts during training, often converging to suboptimal policies. To address this limitation, we propose Successive Sub-value Q-learning (S2Q), a framework that successively learns multiple sub-value functions to retain information about alternative high-value actions. By incorporating these sub-value functions into a Softmax-based behavior policy, S2Q encourages persistent exploration and enables the joint value function $Q^{\text{tot}}$ to adjust quickly when the optimal action changes. Extensive experiments on challenging MARL benchmarks confirm that S2Q consistently outperforms existing MARL algorithms, demonstrating improved adaptability and overall performance.
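To make the mechanism described above concrete, the following is a minimal sketch of a Softmax-based behavior policy built from multiple sub-value functions. It assumes NumPy, a fixed number of sub-value estimates per action, and an elementwise-max aggregation rule with a temperature parameter; these choices are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def softmax(x, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def behavior_policy(sub_q_values, temperature=1.0):
    """Softmax behavior policy over aggregated sub-value functions.

    sub_q_values: array of shape (K, n_actions) holding K sub-value
    estimates per action, so that high-value alternatives to the
    current argmax are retained rather than discarded.
    Returns a probability distribution over actions.
    """
    # Aggregate the K sub-value functions; an elementwise max keeps any
    # action that is highly valued by at least one sub-value function.
    # (This aggregation rule is an assumption for illustration only.)
    aggregated = sub_q_values.max(axis=0)
    return softmax(aggregated, temperature)

# Example: two sub-value functions disagree on the best of three actions,
# so both candidate actions retain non-trivial sampling probability.
sub_q = np.array([[1.0, 0.2, 0.5],
                  [0.3, 0.9, 0.4]])
probs = behavior_policy(sub_q, temperature=0.5)
action = np.random.choice(len(probs), p=probs)
print(probs, action)
```

Sampling from this distribution, rather than always taking the single argmax, is what lets the behavior policy keep visiting alternative high-value actions and track a shifting optimum.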