Poster in Workshop: Workshop on Logical Reasoning of Large Language Models

Embedding Distance as a Reward Signal can replace Verifiers for LLM Reasoning

Abdelhakim Benechehab ⋅ Youssef Attia El Hili ⋅ Albert Thomas ⋅ Giuseppe Paolo ⋅ Maurizio Filippone

Abstract

Reinforcement Learning (RL) has emerged as a powerful paradigm for adapting Large Language Models (LLMs), offering advantages over Supervised Fine-Tuning (SFT) that include reduced catastrophic forgetting and improved generalization. However, these benefits require explicit reward signals, usually obtained from human preferences or verifiable outcomes, which are unavailable in many cases. We address this gap by introducing a framework that derives reward functions directly from supervised data, enabling RL-based training without additional annotation. Our approach formulates the reward as a weighted distance between the embedding of the label and the embedding of the generated answer. Experiments fine-tuning LLMs on a reasoning task show that training with our learned rewards matches the performance of oracle RL with access to ground-truth rewards.
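As a rough illustration of a reward built from a weighted distance between embeddings, the sketch below scores a generated answer against its label. Everything in it is an assumption made for illustration: `char_embed` is a toy character-count embedding standing in for a real encoder, the squared-difference form of the distance is one possible choice, and the uniform `weights` are placeholders, since the abstract does not specify the encoder, the distance, or how the weights are obtained.

```python
import math
import string
from typing import List, Sequence

# Toy vocabulary: lowercase letters and digits (an assumption for this sketch).
ALPHABET = string.ascii_lowercase + string.digits
DIM = len(ALPHABET)


def char_embed(text: str) -> List[float]:
    """Hypothetical stand-in for a real sentence encoder.

    Counts characters from ALPHABET and L2-normalises the count vector.
    """
    vec = [0.0] * DIM
    for ch in text.lower():
        idx = ALPHABET.find(ch)
        if idx >= 0:  # characters outside ALPHABET are ignored
            vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def embedding_reward(label: str, answer: str, weights: Sequence[float]) -> float:
    """Negative weighted squared distance between the two embeddings.

    The reward is at most 0 for non-negative weights, and equals 0 when
    the two embeddings coincide.
    """
    e_label = char_embed(label)
    e_answer = char_embed(answer)
    return -sum(w * (a - b) ** 2 for w, a, b in zip(weights, e_label, e_answer))


# Uniform weights are a placeholder; in practice they would be fitted to data.
weights = [1.0] * DIM
r_exact = embedding_reward("the answer is 42", "the answer is 42", weights)
r_wrong = embedding_reward("the answer is 42", "i do not know", weights)
```

In this sketch an answer identical to the label receives the maximum reward of zero, and answers whose embeddings drift away from the label's are penalised in proportion to the weighted distance, which is the kind of scalar signal an RL fine-tuning loop can consume in place of a verifier.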
