ICLR Poster CraftRTL: High-quality Synthetic Data Generation for Verilog Code Models with Correct-by-Construction Non-Textual Representations and Targeted Code Repair

Poster

CraftRTL: High-quality Synthetic Data Generation for Verilog Code Models with Correct-by-Construction Non-Textual Representations and Targeted Code Repair

Mingjie Liu · Yun-Da Tsai · Wenfei Zhou · Haoxing Ren

Hall 3 + Hall 2B #219

[ Abstract ] [ Project Page ]

Fri 25 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

Despite the significant progress made in code generation with large language models, challenges persist, especially with hardware description languages such as Verilog. This paper first presents an analysis of fine-tuned LLMs on Verilog coding, with synthetic data from prior methods. We identify two main issues: difficulties in handling non-textual representations (Karnaugh maps, state-transition diagrams and waveforms) and significant variability during training with models randomly making ''minor'' mistakes. To address these limitations, we enhance data curation by creating correct-by-construction data targeting non-textual representations. Additionally, we introduce an automated framework that generates error reports from various model checkpoints and injects these errors into open-source code to create targeted code repair data. Our fine-tuned Starcoder2-15B outperforms prior state-of-the-art results by 3.8\%, 10.9\%, 6.6\% for pass@1 on VerilogEval-Machine, VerilogEval-Human, and RTLLM.

Live content is unavailable. Log in and register to view live content