One Step Forward, Two Steps Back: Regression Errors and Cost Inefficiencies in LLM Iterative Refinement for Code Generation
Abstract
Self-Refine, a self-reflection framework, has proven effective for improving Large Language Model performance in stylistic tasks, such as code readability. However, its utility for ensuring functional correctness in code generation is often undermined by significant reliability and cost trade-offs. In this work, we characterize the specific failure modes of Self-Refine in this domain and analyze the impact of "Asymmetric Self-Refine" (utilizing smaller, cost-efficient models as critics) to mitigate high inference costs of iterative refinement. Evaluating on the LiveCodeBench dataset, we uncover a critical negative result: the framework fails to reliably improve Pass@1 scores and often degrades them due to hallucination-induced regression errors. Critics hallucinated flaws in functionally correct code, prompting the actor to introduce bugs into valid solutions. This instability, present even in symmetric configurations, is exacerbated by smaller critics, which drive total inference costs higher due to excessive refinement loops, contradicting the expectation of cost savings.