Poster
LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs
Yuhao Wu · Ming Shan Hee · Zhiqiang Hu · Roy Ka-Wei Lee
Hall 3 + Hall 2B #217
Abstract:
Current benchmarks such as "Needle-in-a-Haystack" (NIAH), Ruler, and Needlebench focus on models' ability to understand long-context input sequences but fail to capture a critical dimension: the generation of high-quality long-form text. Applications such as design proposals, technical documentation, and creative writing rely on coherent, instruction-following outputs over extended sequences—a challenge that existing benchmarks do not adequately address. To fill this gap, we introduce LongGenBench, a novel benchmark designed to rigorously evaluate large language models' (LLMs) ability to generate long text while adhering to complex instructions. Through tasks that require specific events or constraints to appear within the generated text, LongGenBench evaluates model performance across four distinct scenarios, three instruction types, and two generation lengths (16K and 32K tokens). Our evaluation of ten state-of-the-art LLMs reveals that, despite strong results on Ruler, all models struggle with long text generation on LongGenBench, particularly as text length increases. This suggests that current LLMs are not yet equipped to meet the demands of real-world, long-form text generation. We open-source LongGenBench to promote comprehensive evaluation and improvement in this critical area, with code and data available at anonymousurl.
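As a rough illustration of the kind of constraint-based checking the abstract describes (specific events required at specific points in a long generated text), the sketch below is hypothetical and not the authors' released code; the Constraint class, segment-based scoring, and all example values are assumptions for illustration only.

```python
# Hypothetical sketch (not LongGenBench's actual implementation) of checking
# whether a long generated text satisfies position-specific event constraints.

from dataclasses import dataclass


@dataclass
class Constraint:
    segment_index: int   # which segment of the output the event should appear in
    required_text: str   # event/keyword the instructions asked to include


def score_generation(generated_text: str, constraints: list[Constraint],
                     num_segments: int) -> float:
    """Return the fraction of constraints satisfied in their target segment."""
    seg_len = max(1, len(generated_text) // num_segments)
    segments = [generated_text[i * seg_len:(i + 1) * seg_len]
                for i in range(num_segments)]
    satisfied = sum(
        1 for c in constraints
        if c.segment_index < len(segments)
        and c.required_text.lower() in segments[c.segment_index].lower()
    )
    return satisfied / len(constraints) if constraints else 0.0


# Example usage with made-up text and a single made-up constraint:
text = "... week 12: the team ships the beta release ..." * 100
constraints = [Constraint(segment_index=0, required_text="beta release")]
print(score_generation(text, constraints, num_segments=4))  # 1.0
```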