Poster
Towards Auto-Regressive Next-Token Prediction: In-context Learning Emerges from Generalization
Zixuan Gong · Xiaolin Hu · Huayi Tang · Yong Liu
Hall 3 + Hall 2B #443
Large language models (LLMs) have demonstrated remarkable in-context learning (ICL) abilities. However, existing theoretical analyses of ICL primarily exhibit two limitations: \textbf{(a) Limited \textit{i.i.d.} Setting.} Most studies focus on supervised function learning tasks where prompts are constructed from \textit{i.i.d.} input-label pairs. This \textit{i.i.d.} assumption diverges significantly from real language learning scenarios, where prompt tokens are interdependent. \textbf{(b) Lack of Emergence Explanation.} Most literature answers \textbf{\textit{what}} ICL does from an implicit optimization perspective but falls short of elucidating \textbf{\textit{how}} ICL emerges and how the pre-training phase affects ICL. In our paper, to address (a), we adopt a more practical paradigm, \textbf{\textit{auto-regressive next-token prediction (AR-NTP)}}, which closely aligns with the actual training of language models. Specifically, within AR-NTP, we emphasize prompt token-dependency, predicting each subsequent token based on the preceding sequence. To address (b), we formalize a systematic pre-training and ICL framework, highlighting the layer-wise structure of sequences and topics, alongside a two-level expectation. Finally, we present data-dependent, topic-dependent and optimization-dependent PAC-Bayesian generalization bounds for pre-trained LLMs, showing that \textbf{\textit{ICL emerges from the generalization of sequences and topics}}. Our theory is supported by experiments on numerical linear dynamical systems, the synthetic GINC dataset, and real-world language datasets.
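As a rough illustration of the AR-NTP paradigm and the two-level expectation described above, a standard auto-regressive population risk can be sketched as follows; the notation (parameters $\theta$, topic prior $\mathcal{T}$, per-topic sequence distribution $\mathcal{D}_\beta$) is assumed for exposition and is not necessarily the paper's own.

% Illustrative sketch only: standard AR-NTP population risk with a
% two-level expectation over topics (outer) and sequences (inner);
% symbols are assumed, not taken from the paper.
\begin{align*}
  \mathcal{L}(\theta)
  = \mathbb{E}_{\beta \sim \mathcal{T}}\,
    \mathbb{E}_{x_{1:T} \sim \mathcal{D}_\beta}
    \left[ -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right) \right],
\end{align*}

where each token $x_t$ is predicted from its prefix $x_{<t}$ (the prompt token-dependency emphasized above), and generalization is assessed over both levels of the expectation, i.e., over sequences within a topic and over topics.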