Thinking into the Future: Latent Lookahead Training for Transformers
Lorenzo Noci ⋅ Gregor Bachmann ⋅ Seyed-Mohsen Moosavi-Dezfooli ⋅ Moin Nabi
Abstract
Autoregressive language models trained with next-token prediction generate text by sampling one discrete token at a time. Although very scalable, this objective forces the model at every step to commit to a singular choice, preventing it from exploring or reflecting upon multiple plausible continuations. Furthermore, the compute allocation across tokens is uniform; every token is formed based on a single forward-pass, potentially limiting the model's expressiveness in cases where difficult tokens require inherently more compute. Towards addressing these limitations, we introduce *latent lookahead*, a training strategy that enables models to ``think'' before generating: at selected positions in the sequence, before committing to the next token, the model performs a multi-step lookahead in latent space. More precisely, instead of sampling future tokens, we leverage the network's latent space by recursively feeding its hidden states back into the context for $\tau$ steps, investing more compute on predicting that token. This produces $\tau$ latent predictions that are supervised against the next $\tau$ ground-truth tokens, encouraging the model to "lookahead" and refine its prediction. We show that latent lookahead substantially outperforms both autoregressive and non-autoregressive baselines on planning tasks such as maze solving, Sudoku, and ProsQA, where foresight is essential.
Chat is not available.
Successful Page Load