Spinning Straw into Gold: Relabeling LLM Agent Trajectories in Hindsight for Successful Demonstrations
Abstract
Large language model agents operate in partially observable, long-horizon settings where obtaining supervision remains a major bottleneck. We address this by leveraging a source of supervision overlooked in existing post-training methods: ``unintended yet successful'' goals embedded within agent rollouts. We introduce Hindsight Supervised Learning (HSL), in which an auxiliary LLM reviews each completed trajectory and relabels it with natural-language goals the agent actually achieved. HSL then pairs the trajectory with its relabeled goals and uses these pairs for additional fine-tuning. To mitigate suboptimality in the relabeled data, HSL incorporates irrelevant-action masking and sample reweighting. We show that HSL is flexible and compatible with existing post-training pipelines: it improves both supervised fine-tuning (SFT) and direct preference optimization (DPO), with larger gains on long-horizon embodied and web agent tasks such as ALFWorld and WebShop. Moreover, HSL is sample-efficient: on ALFWorld, it surpasses baselines trained on the full dataset while using only one quarter of the ground-truth demonstrations.
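To make the relabeling loop concrete, below is a minimal Python sketch of how a completed rollout could be turned into a hindsight demonstration. The data structures, the `relabeler` callable standing in for the auxiliary LLM, and the `scorer` callable standing in for the reweighting heuristic are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical trajectory format; the paper does not prescribe one.

@dataclass
class Step:
    observation: str
    action: str
    relevant: bool = True  # set False for actions judged irrelevant to the relabeled goal

@dataclass
class HindsightExample:
    goal: str          # natural-language goal the trajectory actually achieved
    steps: List[Step]  # trajectory paired with the relabeled goal
    weight: float      # sample weight used to down-weight suboptimal relabels


def relabel_trajectory(
    steps: List[Step],
    relabeler: Callable[[List[Step]], str],
    scorer: Callable[[str, List[Step]], float],
) -> HindsightExample:
    """Convert a completed rollout into a hindsight demonstration.

    `relabeler` stands in for the auxiliary LLM that reads the rollout and
    returns a goal the agent actually achieved; `scorer` stands in for the
    quality estimate that drives sample reweighting.
    """
    goal = relabeler(steps)
    # Irrelevant-action masking: steps marked irrelevant are excluded from the
    # supervised targets (here, simply dropped from the example).
    kept = [s for s in steps if s.relevant]
    weight = scorer(goal, kept)
    return HindsightExample(goal=goal, steps=kept, weight=weight)


if __name__ == "__main__":
    # Toy usage with stand-in relabeler and scorer.
    rollout = [
        Step("You are in the kitchen.", "open fridge"),
        Step("The fridge is open.", "look at ceiling", relevant=False),
        Step("You see an apple.", "take apple"),
    ]
    example = relabel_trajectory(
        rollout,
        relabeler=lambda s: "Take an apple from the fridge.",
        scorer=lambda g, s: 1.0 / max(len(s), 1),
    )
    print(example.goal, len(example.steps), example.weight)
```

In practice the resulting (relabeled goal, trajectory) pairs would be formatted as ordinary instruction-following examples and mixed into the SFT or DPO training data with their sample weights.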