GASP: Guided Asymmetric Self-Play For Coding LLMs
Abstract
Asymmetric self-play has emerged as a promising paradigm for post-training large language models: a teacher continually generates questions at the edge of the student's learnability for the student to solve. Although these methods promise open-ended data generation without any human data, they suffer from one major problem: not every problem that is hard to solve is interesting or informative for improving the model's overall capabilities. Current asymmetric self-play methods are goal-agnostic and have no real grounding. We propose Guided Asymmetric Self-Play (GASP), in which grounding is provided by real-data goalpost questions that are identified as posing a hard exploration challenge to the model. During self-play, the teacher generates first an easier and then a harder variant of a hard question, with the goal of pushing these variants progressively closer to the goalpost over the course of training. With this guidance, we observe better performance than unguided asymmetric self-play on LiveCodeBench (LCB), and we manage to solve some of the goalpost questions through the curriculum provided by the teacher's questions.
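
As a rough illustration of the guidance idea only, the toy sketch below reduces each question to a single scalar difficulty. This is not the method used in GASP: the function names, the pass-rate threshold, the scalar difficulty, and the interpolation schedule are all hypothetical assumptions introduced for illustration.

```python
# Toy sketch only: questions are reduced to one scalar "difficulty".
# Every name and constant here is a hypothetical assumption, not the paper's method.

def select_goalposts(pass_rates, max_pass_rate=0.0):
    """Pick real-data questions the student (almost) never solves.

    pass_rates maps a question id to the student's empirical pass rate.
    """
    return sorted(q for q, p in pass_rates.items() if p <= max_pass_rate)


def propose_variants(student_level, goalpost_level, progress):
    """Return (easier, harder) difficulties for one goalpost question.

    progress in [0, 1] stands in for how far training has advanced; both
    variants move toward the goalpost difficulty as progress grows.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    gap = goalpost_level - student_level
    easier = student_level + 0.5 * progress * gap
    harder = student_level + (0.5 + 0.5 * progress) * gap
    return easier, harder


if __name__ == "__main__":
    rates = {"q1": 0.6, "q2": 0.0, "q3": 0.0, "q4": 0.1}
    print(select_goalposts(rates))
    for step in (0.0, 0.5, 1.0):
        print(step, propose_variants(0.0, 1.0, step))
```

In this toy setting, the harder variant reaches the goalpost difficulty only at the end of the schedule, while the easier variant always stays below it, mirroring the easier-then-harder curriculum described above.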