Skills, Benchmarks, and Verification Are What AI-Assisted Research Needs
Abstract
AI adoption in research has outpaced our understanding of its limitations. Researchers have embraced these tools for everything from literature review to code generation, yet the jagged frontier (the uneven boundary where AI excels at some tasks and fails unpredictably at others) remains unmapped. The consequences are already visible: papers with fabricated citations passed peer review at NeurIPS 2025, and a rising tide of "AI slop" (low-quality, machine-generated submissions) is straining review systems across venues. A preliminary survey of AI researchers points to the same gap: while many actively use vanilla AI coding pipelines, there is relatively little adoption or development of robust, verified workflows for other parts of the scientific process, such as ideation, literature research, and experiment design. This suggests a path forward: build infrastructure for discovery, comparison, and verification. To close this gap, we propose the Research Agora, which combines a marketplace for discovering reusable AI workflows (skills), benchmarks for comparing their effectiveness, and test-driven research for verifying outputs before they propagate. We have built working examples, from reference checking that catches hallucinated citations to structured writing with layered quality checks, that demonstrate how different verification levels apply to different tasks. We release these as a starting point and call on the research community to extend, improve, and benchmark them.