Poster in Workshop: Navigating and Addressing Data Problems for Foundation Models (DPFM)
Cookbook: A framework for improving LLM generative abilities via programmatic data generating templates
Avanika Narayan · Mayee Chen · Kush Bhatia · Christopher Ré
Keywords: [ weak supervision ] [ programmatic data ] [ large language models ] [ instruction-tuning ] [ LLMs ]
The prevailing approach to improving the generative capabilities of large language models (LLMs) is fine-tuning on instruction datasets (e.g., OpenHermes, OpenOrca). These datasets have several limitations: they are costly to curate, they may violate user privacy agreements or the terms of service of LLM providers, and it is unclear which task skills (i.e., "rules") a model learns from their instruction samples. In this work, we introduce Cookbook, a framework that uses data generating templates (simple Python functions over random tokens) to produce programmatic training data that improves LLM task performance. By defining simple data generating functions that encode the underlying "rules" of generative tasks, our framework can efficiently produce data without privacy concerns or the need for extensive curation. In the single-task setting, we show that Cookbook is effective across a wide range of tasks, from document QA to entity disambiguation, yielding gains of up to 60.1 accuracy points. We then extend Cookbook to the multi-task setting, proposing an algorithm that optimally mixes data from multiple templates and thereby addresses the challenge of improving LLMs across a broad range of tasks. On the standard multi-task GPT4ALL evaluation suite, Mistral-7B fine-tuned with Cookbook is, averaged across tasks, the best-performing 7B parameter instruction-tuned model, and it achieves the best performance on 3 of 8 individual tasks. Finally, to analyze Cookbook, we introduce the template alignment statistic, a novel metric for understanding how training on template-generated data improves model performance by encouraging adherence to the task-specific rules encoded in the templates.
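As an illustration of what a data generating template might look like, below is a minimal Python sketch of one hypothetical template that encodes a simple span-retrieval rule over random tokens. The function name (copy_rule_template), the vocabulary, and the specific rule are assumptions for illustration only, not Cookbook's actual templates.

    import random
    import string

    def copy_rule_template(vocab=string.ascii_lowercase, context_len=20, key_len=3):
        """Hypothetical data generating template (illustrative, not from the
        paper): builds a (prompt, completion) pair from random tokens that
        encodes one simple generative rule, namely retrieving the token that
        follows a key phrase in the context."""
        # Sample a context of random tokens and a random key phrase / answer.
        context = [random.choice(vocab) for _ in range(context_len)]
        key = [random.choice(vocab) for _ in range(key_len)]
        value = random.choice(vocab)
        # Plant the key phrase and its answer token at a random position.
        pos = random.randrange(context_len)
        context[pos:pos] = key + [value]
        prompt = " ".join(context) + "\nWhich token follows '" + " ".join(key) + "'?"
        return {"prompt": prompt, "completion": value}

    # Programmatic training data: sample the template many times.
    dataset = [copy_rule_template() for _ in range(1000)]

A real template would additionally guard against the key phrase occurring elsewhere in the random context, so that the encoded rule always has a unique answer.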