Embedding Distance as a Reward Signal Can Replace Verifiers for LLM Reasoning
Abstract
Reinforcement Learning (RL) has emerged as a powerful paradigm for adapting Large Language Models (LLMs), offering advantages over Supervised Fine-Tuning (SFT) that include reduced catastrophic forgetting and improved generalization. However, these benefits require explicit reward signals, typically obtained from human preferences or verifiable outcomes, which are unavailable in many settings. We address this gap by introducing a framework that derives reward functions directly from supervised data, enabling RL-based training without additional annotation. Our approach formulates the reward function as a weighted distance between the embeddings of labels and of generated answers. Experiments on fine-tuning LLMs for a reasoning task demonstrate that our learned rewards match the performance of oracle RL that has access to ground-truth rewards.
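To make the reward formulation concrete, the following is a minimal sketch of a reward computed as a negative weighted distance between a label embedding and an answer embedding. The abstract does not specify the distance, how the weights are obtained, or the embedding model, so the squared-difference form, the per-dimension weights, and the name `weighted_distance_reward` are all assumptions made for illustration, and the toy vectors stand in for real embeddings.

```python
from typing import Sequence


def weighted_distance_reward(
    label_emb: Sequence[float],
    answer_emb: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Return the negative weighted squared distance between two embeddings.

    Hypothetical helper: the per-dimension weighted squared difference is an
    assumed form, not necessarily the one used in the paper.
    """
    if not (len(label_emb) == len(answer_emb) == len(weights)):
        raise ValueError("embeddings and weights must have the same length")
    distance = sum(
        w * (a - b) ** 2 for w, a, b in zip(weights, label_emb, answer_emb)
    )
    # A smaller distance to the label embedding yields a higher reward.
    return -distance


# Toy 3-dimensional embeddings and weights, made up for illustration only.
label = [1.0, 0.0, 2.0]
close_answer = [1.0, 0.5, 2.0]
far_answer = [-1.0, 0.0, 0.0]
weights = [0.5, 1.0, 0.25]

reward_close = weighted_distance_reward(label, close_answer, weights)
reward_far = weighted_distance_reward(label, far_answer, weights)
```

Under these assumptions, an answer whose embedding lies closer to the label embedding receives the higher reward, which is the property an RL objective would exploit in place of a verifier.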