Open Data Synthesis for Deep Research
Abstract
Deep research is becoming increasingly important as people seek to solve complex problems that require gathering and synthesizing information from diverse sources. A key capability in this process is agentic search, where an LLM agent iteratively retrieves relevant information across multiple sources while performing multi-step reasoning. However, developing effective agentic search systems is challenging because high-quality training data that reflects the complexity of real-world research tasks is scarce. To address this gap, we introduce InfoSeek, a novel data synthesis framework that formulates agentic search as a Hierarchical Constraint Satisfaction Problem (HCSP), where solving a task requires satisfying layered constraints across multiple levels of sub-problems. InfoSeek employs a Diffusion–Retrospection process: in the diffusion phase, the framework expands outward from a seed webpage, generating constraints that connect to neighboring pages and forming an exploration tree; in the retrospection phase, a subtree is sampled and backtracking constraints are introduced, which are then blurred and integrated into an HCSP instance. As a generic framework, InfoSeek can be easily extended to domains beyond the web, facilitating ad-hoc optimization of deep research. To our knowledge, InfoSeek is the first publicly released framework in this area, complete with open-source code and well-curated datasets. Extensive experiments on diverse information-seeking benchmarks show that training on InfoSeek-generated data substantially improves agentic search performance, delivering significantly larger gains than traditional datasets across diverse model backends and training strategies, thereby validating the effectiveness of our approach.
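To make the Diffusion–Retrospection idea concrete, the following is a minimal toy sketch in Python, not InfoSeek's actual implementation. Everything in it is an assumption made for illustration: the `LINKS` graph stands in for webpages and their hyperlinks, the relation phrases are invented, and the names `Node`, `diffuse`, and `retrospect` are hypothetical. For brevity it uses the whole exploration tree where the framework would sample a subtree, and it approximates blurring by simply hiding the names of all non-leaf pages.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Toy link graph standing in for webpages (assumption: all pages and
# relation phrases here are made up for illustration).
LINKS = {
    "A": [("cites", "B"), ("is authored by", "C")],
    "B": [("is published in", "D")],
    "C": [],
    "D": [],
}


@dataclass
class Node:
    page: str
    relation: Optional[str] = None  # relation from the parent to this page
    children: List["Node"] = field(default_factory=list)


def diffuse(seed: str, max_depth: int) -> Node:
    """Diffusion: expand outward from the seed into an exploration tree."""
    seen = {seed}

    def expand(page: str, relation: Optional[str], depth: int) -> Node:
        node = Node(page, relation)
        if depth < max_depth:
            for rel, neighbor in LINKS.get(page, []):
                if neighbor not in seen:  # keep the structure a tree
                    seen.add(neighbor)
                    node.children.append(expand(neighbor, rel, depth + 1))
        return node

    return expand(seed, None, 0)


def retrospect(node: Node) -> str:
    """Retrospection: describe a page only through layered constraints.

    Non-leaf pages are never named (a crude stand-in for blurring);
    leaf pages remain as known facts that anchor the constraints.
    """
    if not node.children:
        return f"'{node.page}'"
    parts = []
    for child in node.children:
        target = retrospect(child)
        if child.children:
            target = f"({target})"  # nested sub-problem
        parts.append(f"{child.relation} {target}")
    return "the entity that " + " and ".join(parts)


tree = diffuse("A", max_depth=2)
question = "What is " + retrospect(tree) + "?"
answer = tree.page  # the hidden root page is the ground-truth answer
print(question, "->", answer)
```

In this sketch the hierarchy shows up as nesting: the root can only be identified after the inner sub-problem (the unnamed page linked to 'D') has been resolved, which mirrors the layered constraints that define an HCSP.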