ICLR 2026 Workshop on Multimodal Intelligence: Next Token Prediction and Beyond
Abstract
Foundation models have transformed multimodal intelligence, enabling open-ended reasoning, dialogue, and generation across vision, language, and audio. A growing body of work now frames this progress under the unifying paradigm of next-X prediction, where X may denote tokens, frames, or scales across discrete or continuous spaces. Discrete autoregressive models, such as Chameleon, extend next-token prediction beyond text, while continuous formulations like VAR, MAR, TransFusion, BAGEL, and Fluid capture next-frame or next-scale dynamics in latent space. Meanwhile, predictive encoders—exemplified by V-JEPA 2—eschew token emission to forecast future representations, focusing on salient, structured aspects of perception and behavior. Complementary to both, diffusion-based models such as Diffusion-LM, LLaDA, and LaViDa redefine generation as iterative denoising, offering parallel decoding and improved global consistency. This workshop provides a timely venue to connect these emerging paradigms—next-token generation, predictive encoding, and diffusion-based modeling—and to explore how they can be integrated into unified multimodal systems. Key questions include: Which learning paradigm scales most effectively? How do these paradigms differ in representation quality, efficiency, and controllability? And can hybrid models combine their strengths? By bringing together researchers from these diverse communities, the workshop aims to chart a coherent roadmap for the next generation of multimodal foundation models—beyond token prediction alone.
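As a rough illustration (a schematic of our own, not drawn from any of the cited papers, with notation chosen for exposition), the three paradigms can be contrasted through the training signals they optimize: autoregressive models maximize the likelihood of the next token given its prefix, predictive encoders regress target representations from context in latent space, and masked diffusion models learn to reconstruct a clean sequence from a corrupted one.

\[
\mathcal{L}_{\mathrm{AR}} = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right),
\qquad
\mathcal{L}_{\mathrm{pred}} = \bigl\| g_\phi\!\bigl(f_\theta(x_{\mathrm{ctx}})\bigr) - \bar{f}_{\bar\theta}(x_{\mathrm{tgt}}) \bigr\|_2^2,
\qquad
\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{t,\,\tilde{x}_t}\!\left[ -\sum_{i \in \mathcal{M}_t} \log p_\theta\!\left(x_i \mid \tilde{x}_t\right) \right],
\]

where \(x_{<t}\) is the token prefix, \(f_\theta\) and \(g_\phi\) are a context encoder and predictor with \(\bar{f}_{\bar\theta}\) a target encoder (typically a stop-gradient or EMA copy), and \(\tilde{x}_t\) is a partially masked sequence whose masked positions \(\mathcal{M}_t\) must be recovered. Hybrid multimodal systems of the kind the workshop invites can be read as combining or interpolating between these objectives across modalities.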