Poster
in
Workshop: I Can't Believe It's Not Better: Where Large Language Models need to improve

QuanBench Plus: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

Ali Slim ⋅ Haydar Hamieh ⋅ Jawad Kotaich ⋅ Yehya Ghosn ⋅ Chehimi ⋅ Hasan Hammoud ⋅ Ammar Mohanna ⋅ Bernard Ghanem

Project Page [ Poster] [ OpenReview]

Abstract

Large language models (LLMs) are increasingly used for code generation and task automation. However, their effectiveness in quantum code generation across multiple major frameworks remains underexplored. This work introduces \textit{QuanBench Plus}, a unified multi-framework benchmark spanning Qiskit, Pennylane, and Cirq. Specifically, 42 tasks are adapted across three foundational categories (quantum algorithms, gate decomposition, and state preparation) and framework-aligned canonical solutions are provided for automated grading. Following the functional-evaluation paradigm popularized by functional code-generation benchmarks such as HumanEval, correctness assessment is standardized using Pass@k-based functional evaluation and KL-divergence-based acceptance is added for probabilistic outputs. The Pass@1 results are reported using greedy decoding and Pass@5 results using $k=5$ samples per task. Pass@1 after feedback (FB) is additionally reported when feedback to the model is triggered by an incorrect answer or a compilation error. Fidelity is excluded from primary scoring because circuit similarity may not reflect prompt-specific functional correctness. The best-performing models achieve Pass@1 results up to 42.9\% in Pennylane, 54.8\% in Cirq, and 59.5\% in Qiskit, illustrating both progress and remaining gaps in using LLMs for reliable quantum code generation.

Chat is not available.