Learning from Synthetic Data Improves Multi-hop Reasoning
Abstract
Reinforcement Learning (RL) has been shown to significantly boost the reasoning capabilities of large language models (LLMs) on math, coding, and multi-hop reasoning tasks. However, RL fine-tuning requires abundant high-quality verifiable data, typically obtained through human-annotated datasets or LLM-as-verifier loops. Both sources have considerable limitations: human-annotated datasets are small and expensive to curate, while LLM verifiers have high scoring latency and are costly to operate. In this work, we investigate the use of synthetic datasets for RL fine-tuning on multi-hop reasoning tasks. We find that LLMs fine-tuned on synthetic data perform significantly better on popular real-world question-answering benchmarks, even though the synthetic data contain only fictional knowledge. Stratifying model performance by question difficulty, we find that synthetic data teaches LLMs to compose knowledge, which we believe to be a fundamental and generalizable reasoning skill. Our work thus highlights the utility of synthetic reasoning datasets for improving the reasoning capabilities of LLMs.