I Can't Believe LLMs Still Can't Write Drama: Multi-Dimensional Failures in Script Continuation
Abstract
Despite remarkable advances in creative text generation, we find that state-of-the-art large language models exhibit systematic and surprising failures in drama script continuation. Through DramaBench, a six-dimensional evaluation framework applied to 8,824 model-script evaluations across 8 frontier models, we uncover three "I Can't Believe It's Not Better" findings: (1) No universal winner: No model excels across all quality dimensions—GPT-5.2 leads in narrative efficiency but ranks 4th in emotional depth, while Qwen3-Max excels at emotion but ranks 6th in narrative; (2) LLM-as-Judge fails for creative writing: Human-LLM evaluator agreement is non-significant for 2 of 5 dimensions (Narrative Efficiency: r=0.07, Character Consistency: r=-0.04), challenging the reliability of automated creative writing evaluation; (3) Persistent logic failures: Even the best models show 2–5% logic contradiction rates, with 17.6% of scripts containing at least one factual violation. These findings suggest that drama script continuation remains an unsolved challenge requiring targeted architectural or training innovations beyond scale.