

Poster in Workshop: 2nd Workshop on Navigating and Addressing Data Problems for Foundation Models (DATA-FM)

D$^3$: A Large Dataset for Training Code Language Models to Act Diff-by-Diff

Ulyana Piterbarg · Kanishk Gandhi · Lerrel Pinto · Noah Goodman · Rob Fergus


Abstract: We introduce D$^3$, a dataset for training LMs to iteratively synthesize general-purpose Python source code by generating file diffs. D$^3$ frames code synthesis as a goal-conditioned sequential decision-making problem, where goals, states, and actions are represented by token sequences corresponding to the description of a functionality to add, the current contents of a file, and a file diff, respectively. To construct the dataset, we filter, augment, and annotate code from a pretraining corpus of permissively licensed source code (The Stack) using Llama 3.1 70B Instruct and the LintSeq algorithm for sampling synthetic file diffs. D$^3$ contains 8 billion tokens of instruction + file-state + file-diff-sequence examples generated from 850,000 human-written programs. In a preliminary set of experiments, we show that finetuning LMs like Llama 3.2 1B on examples from D$^3$ improves model performance on code synthesis, debugging, and repository-level editing tasks.
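To make the goal/state/action framing concrete, below is a minimal sketch of how one such training triple might be constructed from two snapshots of a file. This assumes a plain unified-diff serialization; the actual tokenization, diff format, and LintSeq sampling procedure used to build D$^3$ are not specified in this abstract, and all names here (`DiffStep`, `make_step`, the example goals) are illustrative.

```python
# Illustrative sketch only: one (goal, state, action) triple per edit step,
# using Python's standard difflib to render the action as a unified diff.
import difflib
from dataclasses import dataclass


@dataclass
class DiffStep:
    goal: str    # natural-language description of the functionality to add
    state: str   # current contents of the file
    action: str  # file diff that transforms `state` toward the goal


def make_step(goal: str, before: str, after: str) -> DiffStep:
    """Build one (goal, state, action) triple from two file snapshots."""
    diff = "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile="a/main.py",
            tofile="b/main.py",
        )
    )
    return DiffStep(goal=goal, state=before, action=diff)


# Example edit trajectory: a file grown diff-by-diff in two steps.
v0 = ""
v1 = "def greet(name):\n    return f'Hello, {name}!'\n"
v2 = v1 + "\nif __name__ == '__main__':\n    print(greet('world'))\n"

steps = [
    make_step("Add a greet function.", v0, v1),
    make_step("Add a command-line entry point.", v1, v2),
]
for step in steps:
    print(step.action)
```

Under this framing, a model conditioned on the goal and the current file state is trained to emit the diff (action), so a full program is produced by applying a sequence of predicted diffs rather than generating the file in one pass.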
