Poster in Workshop: 2nd Workshop on Mathematical and Empirical Understanding of Foundation Models
Attributing Mode Collapse in the Fine-Tuning of Large Language Models
Laura O'Mahony · Leo Grinsztajn · Hailey Schoelkopf · Stella R Biderman
Large language models (LLMs) are typically trained in two stages: pre-training on a large, diverse dataset to acquire general-purpose language modeling capabilities, followed by a fine-tuning stage (often called "instruction tuning" or "alignment") on smaller, more curated datasets that adapts them to a specific task or downstream application such as chat or general instruction-following. It is a well-known anecdotal observation that instruction-tuned models exhibit reduced output diversity, i.e., a diminished ability to generate varied outputs, which can be a limitation for many use cases. In this manuscript, we quantify how each step in a typical RLHF or instruction-tuning pipeline changes a model's diversity, for a series of models trained in a controlled fine-tuning setup. We distinguish between two categories of diversity in LLMs: token-level prediction diversity and model output generation diversity. We find that the supervised fine-tuning and reward-based fine-tuning steps have different effects on these distinct diversity types. Our results have implications for better understanding the effects of instruction tuning on language models.
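As a rough illustration of the distinction between the two diversity categories (not the metrics used in the paper), token-level prediction diversity can be probed via the entropy of a model's next-token distribution, while output generation diversity can be probed with a distinct-n style statistic over repeated samples for the same prompt. The sketch below assumes direct access to next-token probabilities and a list of sampled completions; the function names and inputs are hypothetical.

```python
# Illustrative sketch, not the paper's exact measures: two ways one might
# quantify the diversity types contrasted in the abstract.
import math
from collections import Counter

def token_prediction_entropy(next_token_probs):
    """Token-level prediction diversity: Shannon entropy (in nats) of the
    model's next-token distribution. Lower entropy = a more peaked,
    less diverse prediction."""
    return -sum(p * math.log(p) for p in next_token_probs if p > 0)

def distinct_n(generations, n=2):
    """Output generation diversity: fraction of unique n-grams across a set
    of sampled completions for one prompt (a distinct-n style measure).
    Values near 0 indicate the samples collapse onto similar outputs."""
    ngrams = Counter()
    for text in generations:
        tokens = text.split()
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

# Example: a peaked next-token distribution and near-identical samples
# both signal mode collapse, but along different axes.
print(token_prediction_entropy([0.97, 0.01, 0.01, 0.01]))   # low entropy
print(distinct_n(["the cat sat on the mat"] * 4 + ["a dog ran home"]))
```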