

Poster in Workshop: The 3rd DL4C Workshop: Emergent Possibilities and Challenges in Deep Learning for Code

Does Instruction Tuning Reduce Diversity? A Case Study Using Code Generation

Alexander Shypula · Shuo Li · Botong Zhang · Vishakh Padmakumar · Kayo Yin · Osbert Bastani


Abstract:

Large Language Models (LLMs) should ideally generate diverse content for open-ended prompts. Preliminary evidence has suggested that preference-tuned language models struggle to generate diverse content, which would have important implications for how we align models. However, research on this question has been limited by the difficulty of measuring diversity, which naïvely would require costly human evaluation. We propose to leverage code as a means to study semantic diversity since code has executable semantics. To this end, we create an open-ended program synthesis task, enabling us to cheaply evaluate the diversity of hundreds of thousands of generations. Using our methodology, we find that while preference-tuning reduces syntactic and lexical diversity, it can increase semantic diversity. We also study the effect of model size and prompting technique on diversity. Finally, we find that neural diversity metrics correlate poorly with our semantic diversity metrics, highlighting the need for more rigorous methodologies for evaluating diversity.
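The abstract does not include the authors' implementation. As a rough illustration of the executable-semantics idea it describes, the sketch below treats two generated programs as semantically equivalent when they produce identical outputs on a shared set of test inputs, and reports the fraction of distinct behavior classes as a semantic-diversity score. The entry-point name solve, the test inputs, and the exact ratio used as the score are illustrative assumptions, not the paper's metric.

    # A minimal sketch (not the authors' code): estimate semantic diversity
    # of model generations by grouping programs that behave identically on
    # a shared battery of test inputs.

    from collections import Counter

    def behavior_signature(program_src, test_inputs, fn_name="solve"):
        """Run one generated program on every test input and return a
        hashable tuple of its outputs; failures get sentinel markers."""
        namespace = {}
        try:
            # NOTE: exec on untrusted model output is unsafe; a real
            # evaluation would run this inside a sandbox.
            exec(program_src, namespace)
            fn = namespace[fn_name]
        except Exception:
            return ("<does not compile>",)
        outputs = []
        for x in test_inputs:
            try:
                outputs.append(repr(fn(x)))
            except Exception:
                outputs.append("<runtime error>")
        return tuple(outputs)

    def semantic_diversity(generations, test_inputs):
        """Fraction of distinct behavior classes among N generations:
        1.0 means every sample behaves differently, 1/N means all agree."""
        signatures = Counter(
            behavior_signature(g, test_inputs) for g in generations
        )
        return len(signatures) / len(generations)

    if __name__ == "__main__":
        # Two generations differ lexically but agree semantically;
        # the third behaves differently, so the score is 2/3.
        gens = [
            "def solve(n):\n    return n * n",
            "def solve(n):\n    return n ** 2",
            "def solve(n):\n    return n + n",
        ]
        print(semantic_diversity(gens, test_inputs=[0, 1, 2, 5]))

A scheme like this is what makes execution-based diversity cheap to compute at scale: the signature collapses syntactic and lexical variation (n * n vs. n ** 2) while still separating genuinely different behaviors, which is the distinction the abstract draws between the two kinds of diversity.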
