ICLR 2026 Blog Track Posters

Skip to yearly menu bar Skip to main content

Poster

Evolution of Flash Attention

Harshwardhan Fartale ⋅ Akshata Kishore Moharir ⋅ Ashish Kattamuri

Standard attention is a bottleneck. As Large Language Models scale, the memory-bound nature of attention operations and GPU memory hierarchy constraints create performance walls that raw compute power cannot solve alone. This post delivers a deep mathematical and technical breakdown of FlashAttention's evolution from V1 to V4, revealing how IO-aware algorithm design has become critical to unlocking scale. We trace the architectural journey: from V1's tiled exact attention and online softmax, through V2's parallelism refinements and improved work partitioning, to V3's asynchronous execution on Hopper architectures, and finally V4's advanced pipelining designed for Blackwell GPUs. Each version fundamentally reshaped attention computation by reducing data movement, maximizing parallelism, and aligning more closely with evolving GPU hardware capabilities. We analyze the design choices that drive these improvements—examining IO complexity, arithmetic intensity, forward and backward recomputation trade-offs, and their impact on throughput and utilization across A100, H100, and Blackwell-class hardware. These optimizations translate to measurable gains: we present concrete throughput improvements for long-context transformer workloads that demonstrate why memory hierarchy optimization now matters as much as algorithmic complexity. We conclude that modern LLM system performance depends critically on memory hierarchy optimization and execution efficiency, challenging the traditional focus on asymptotic complexity alone.

View full details

Poster

In-context learning of representations can be explained by induction circuits

Andy Arditi

Park et al., 2025 demonstrate that large language models can learn to trace random walks on graphs presented in context, and observe that token representations reorganize to reflect the underlying graph structure. This has been interpreted as evidence that models 'flexibly manipulate their representations' to reflect in-context semantics, and that this reorganization enables task performance. We offer a simpler mechanistic explanation. We first observe that task performance can be fully explained by induction circuits (Olsson et al., 2022), and show that ablating the attention heads that comprise these circuits substantially degrades performance. As for the geometric structure, we propose that it could result from previous token heads effectively mixing the representations of graph neighbors together. We show that a single round of such 'neighbor mixing' on random embeddings recreates the observed graph correspondence in PCA visualizations. These results suggest that apparent 'representation reorganization' may be a byproduct of the model's induction circuits, rather than a critical strategy useful for in-context learning.

View full details

Poster

From U-Nets to DiTs: The Architectural Evolution of Text-to-Image Diffusion Models (2021–2025)

Zhenyuan Chen ⋅ Zechuan Zhang ⋅ Feng Zhang

A comprehensive analysis of how diffusion model architectures evolved from U-Net backbones to Diffusion Transformers, transforming text-to-image generation capabilities. .

View full details

Poster

An Overview of Subliminal Learning

Samuel Spillard ⋅ Daniel Martin

In this blog post we survey the current state of subliminal learning research. We conclude by discussing the gaps in the literature which would take these techniques from research interests to potential real world concerns.

View full details

Poster

From REINFORCE to Dr. GRPO: A Unified Perspective on LLM Post-Training

Qingfeng Lan

Recently, many reinforcement learning (RL) algorithms have been applied to improve the post-training of large language models (LLMs). In this article, we aim to provide a unified perspective on the objectives of these RL algorithms, exploring how they relate to each other through the Policy Gradient Theorem — the fundamental theorem of policy gradient methods.

View full details

Poster

From Trajectories to Operators — A Unified Flow Map Perspective on Generative Modeling

Anbu Huang

In this post, we reframe continuous-time generative modeling from integrating trajectories to learning two-time operators (flow maps). This operator view unifies diffusion, flow matching, and consistency models, and suggests a practical diagnostic — semigroup-consistent jumps yield both step-robust generation and low compositional drift. We derive Eulerian/Lagrangian distillation objectives and use inpainting experiments to show why semigroup-consistent jumps can be both step-robust and composition-stable.

View full details

Poster

Language as a Window Into the Mind: How NLP and LLMs Advance Human Sciences

Lotem Peled-Cohen ⋅ Nitay Calderon ⋅ Roi Reichart

Can NLP predict heroin-addiction outcomes, uncover suicide risk, or simulate (and even influence) brain activity? Could LLMs one day contribute to research worthy of a Nobel Prize for advancing our understanding of human behavior? And what role do NLP scientists play in shaping that possibility? This post explores these questions, arguing that language technologies are not just tools that support scientific work (like literature search agents, writing tools, or coding assistants), but that by treating language as a window into the human mind, NLP and LLMs can actively help researchers uncover mechanisms of human behavior, cognition, and brain function.

View full details

Poster

ChunkTabPFN: Training-free Long Context

Renat Sergazinov ⋅ Shao-An Yin

Tabular foundation models struggle with large datasets due to the quadratic attention. While methods like FlashAttention promise scalability, practical challenges persist in their application to tabular foundation models. Our work resolves these hurdles, enabling efficient attention, and reveals that contrary to the eariler reports, TabPFN's performance improves with larger contexts, highlighting its inherent robustness and minimal fine-tuning needs when scaling to complex, long datasets from the TabArena benchmark.

View full details

Poster

Navigating the Manifold — A Geometric Perspective on Diffusion-Based Inverse Problems

Anbu Huang

This blogpost develops a geometric and probabilistic lens on diffusion priors for inverse problems. We show that a wide range of methods mostly instantiate two operator-splitting paradigms, i.e., posterior-guided sampling and clean-space local-MAP optimization. Through manifold diagrams, Tweedie-based animations, and step-by-step derivations, we explain how these paradigms decouple a pretrained diffusion prior from measurement physics, clarify when they approximate full posterior sampling versus MAP estimation, and distill practical design rules for building robust diffusion-based inverse solvers.

View full details

Poster

How To Open the Black Box: Modern Models for Mechanistic Interpretability

Juntai Cao ⋅ Xiang Zhang ⋅ Raymond Li ⋅ Jiarui Ding

Understanding how transformers represent and transform internal features is a core challenge in mechanistic interpretability. Traditional tools like attention maps and probing reveal only partial structure, often blurred by polysemanticity and superposition. New model-based methods offer more principled insight: Sparse Autoencoders extract sparse, interpretable features from dense activations; Semi-Nonnegative Matrix Factorization uncovers how neuron groups themselves encode concepts; Cross-Layer Transcoders track how these representations evolve across depth; and Weight-Sparse Transformers encourage inherently modular computation through architectural sparsity. Together, these approaches provide complementary pathways for opening the black box and understanding the circuits that underpin transformer behavior.

View full details

Poster

Is the evidence in 'Language Models Learn to Mislead Humans via RLHF' valid?

Aaryan Chandna ⋅ Lukas Fluri ⋅ Micah Carroll

*Language Models Learn to Mislead Humans Via RLHF* (published at ICLR 2025) argues that RLHF can unintentionally train models to mislead humans – a phenomenon termed Unintentional-SOPHISTRY. However, our review of the paper's code and experiments suggests that a significant portion of their empirical findings may be due largely to major bugs that make the RLHF setup both unrealistic and highly prone to reward hacking. In addition to high-level claims, we correct these issues for one of their experiments, and fail to find evidence that supports the original paper's claims.

View full details

Poster

Extracting Model Precision from 20 Logprobs

Yiming Zhang ⋅ Javier Rando ⋅ Florian Tramer ⋅ Daphne Ippolito ⋅ Nicholas Carlini

We demonstrate that the internal floating-point precision of language models can be inferred from API-exposed logprobs. Our key insight is that log-softmax shifts all logits by a shared constant, and we can search for shift values that map logprobs back to values representable in a given precision. Using just 20 logprobs from a single API call, we can reliably distinguish FP32, BF16, FP16, and FP8 formats. Applying our method to production APIs, we find that older OpenAI models (GPT-3.5, GPT-4) use FP32 logits while newer models (GPT-4o, GPT-4.1) use BF16, and Gemini 2.0 uses FP32.

View full details

Poster

Effect of Parallel Environments and Rollout Steps in PPO

Teerthaa Parakh

The blog post explores batch size in PPO - what happens when we increase the number of parallel environments versus the number of rollout steps, while keeping the total samples per update fixed. We discuss how this affects bias and variance in gradient estimation.

View full details

Poster

Wait, Do We Need to Wait? Revisiting Budget Forcing for Sequential Test-Time Scaling

Pittawat Taveekitworachai ⋅ Kunat Pipatanakul

In this blog post, we revisit the technique of ***budget forcing*** — a sequential test-time scaling technique that controls reasoning budget in reasoning models by appending a "Wait" keyword (or equivalently forcing a stop when the budget is exceeded), thereby determining whether the model continues thinking or directly outputs an answer. We explore three main questions: 1. To what extent does budget-forcing generalize across different model families and settings? 2. Does it work with non-reasoning models? 3. Can other keywords serve the same function as "Wait"? We present experimental results, including cases where budget forcing does and does *not* help and offer practical guidance for applying budget-forcing in test-time scaling.

View full details

Poster

Visualizing LLM Latent Space Geometry Through Dimensionality Reduction

Alex Ning ⋅ Vainateya Rangaraju ⋅ Yen-Ling Kuo

In this blog post, we extract, process, and visualize latent state geometries in Transformer-based language models through dimensionality reduction to build a better intuition of their internal dynamics. We demonstrate experiments with GPT-2 and LLaMa models, uncovering interesting geometric patterns in their latent spaces. Notably, we identify a clear separation between attention and MLP component outputs across intermediate layers, a pattern not documented in prior work to our knowledge.

View full details

Poster

Using Graph Neural Networks in Reinforcement Learning: A Practical Guide

Alex Schutz ⋅ Victor-Alexandru Darvariu

Graph Neural Networks (GNNs) have achieved excellent results for modelling relational data in many supervised learning domains. However, much fewer works have explored their potential in Reinforcement Learning (RL) despite the ubiquity of practical problems defined over graphs. In this blog post, we discuss how GNNs can be effectively integrated in Deep RL frameworks, covering crucial design decisions and practical implementation concerns. In doing so, we hope to facilitate unlocking new capabilities for RL agents to reason in graph-structured environments with dynamic action spaces and varying input sizes.

View full details

Poster

Trade-offs in LLM Compute for Reasoning-Intensive Information Retrieval

Sreeja Apparaju ⋅ Nilesh Gupta

The BRIGHT benchmark (ICLR 2025 Spotlight) revealed that reasoning-intensive information retrieval requires LLM-augmented pipelines, but this raises a critical resource allocation question: where should computational budget be invested for maximum effectiveness? We conduct a systematic study on BRIGHT using the Gemini 2.5 model family, evaluating trade-offs across model strength, inference-time thinking depth, and reranking depth. Our controlled experiments quantify the marginal gains of allocating compute to query expansion versus reranking, providing practical guidance for optimizing LLM-based retrieval systems on reasoning-intensive tasks.

View full details

Poster

Artistic Style and the Play of Neural Style Representations

Abhishek Dangeti ⋅ Pavan Gajula ⋅ Vikram Jamwal ⋅ Vivek Srivastava

Apr 24, 3:15 PM - 5:45 PM Pavilion 3

How do neural networks perceive the complex human construct of artistic style? We explore the dynamic interplay between diverse machine representations of style and style definitions. We reveal a profound divergence where models often reject established historical narratives in favor of their own perceptual truths.

View full details

Poster

The effect of feature resolution on embedding dimension

Louise Beyers ⋅ Ruan van der Merwe

Apr 25, 10:30 AM - 1:00 PM Pavilion 3

High-dimensional data can be compressed into lower-dimensional embeddings while retaining a relatively large amount of relevant information, a phenomenon which, despite its widespread use, we struggle to fully explain. In this post, we use a common property of datasets - a limit on the number of features per data point - to show how a slight uniform dependence between features can be exploited to reduce the required dimensions by at least a third, while sacrificing no information about the features. To do so, we introduce the concepts of dataset resolution and feature composition of a dataset, and analyse how a set of orderings of the dataset affects the types of partitions we can create of the dataset.

View full details

Poster

Divide, Conquer, and Standardize — A Recursive Architecture for Multi-Agent Systems (MAS)

Ronaldinho Vega Centeno Olivera ⋅ Allan M. de Souza ⋅ JULIO DOS REIS ⋅ Mateus da Silveira ⋅ Alejandro Núñez Arroyo

Apr 24, 3:15 PM - 5:45 PM Pavilion 3

The scalability and robustness of current Multi-Agent Systems (MAS) are severely constrained by the heterogeneity of communication interfaces and a reliance on fragile ad-hoc integrations. We introduce FRACTAL-MAS, a recursive architecture that standardizes orchestration through the convergence of MCP and A2A protocols, integrating a unified control loop with procedural memory grounded in Case-Based Reasoning (CBR). This design allows for continuous adaptation without fine-tuning and enables a seamless transition from rigid hierarchical structures to decentralized networks, providing a reference architecture for the robust and scalable construction of MAS.

View full details

Poster

What (and What Not) are Calibrated Probabilities Actually Useful for?

Guoxuan Xia

Apr 24, 10:30 AM - 1:00 PM Pavilion 3

This blogpost clarifies the practical usefulness of having a model with calibrated probabilities, something that is not often clearly stated in the calibration literature. We show that a calibrated model can be relied on to estimate average loss/reward, however, good calibration does not mean that a model is useful for per-sample decision making.

View full details

Poster

The Layered Ontology of Models, Resolving the Epistemological Crisis of AI

Zhun Sun

Apr 24, 10:30 AM - 1:00 PM Pavilion 3

With the rapid development of modern Artificial Intelligence, especially the emergence of Large Language Models (LLMs), we face a growing epistemological crisis: our engineering capabilities have far surpassed our philosophical vocabulary. We have built systems that demonstrate emergent reasoning abilities, yet we struggle to articulate exactly what we have built. The traditional naming convention, _e.g._, lumping code, parameters, and behaviors together as a "Model", is no longer sufficient. It fails to capture the widening gap between human design intent and the resulting behavioral artifacts. Current discussions often oscillate between two extremes: a reductionist view that dismisses these systems as merely "stochastic parrots," and an anthropomorphic view that prematurely attributes consciousness to them. Both views stem from a lack of structural granularity when defining the ontological status of AI agents. This paper proposes to solve this problem through a "Five-Layer Model Hierarchy Ontology." Inspired by systems theory and cognitive science, we deconstruct the concept of a "Model" into five distinct layers: the Noumenal Model ($\mathcal{M}_N$), the Conceptual Model ($\mathcal{M}_C$), the Instantiated Model ($\mathcal{M}_I$), the Reachable Model ($\mathcal{M}_R$), and the Observable Model ($\mathcal{M}_O$). By tracing the evolution of these layers from classical machine learning to foundation models, we reveal how the transition from "Tabula Rasa" (blank slate) to "Artifact" has fundamentally changed. Furthermore, we apply this framework to reconstruct two classic philosophical problems, namely the nature of meaning (via the "Stochastic Chinese Room") and the nature of truth (via the "Paradox of the Two Poetics"), demonstrating that the essence of synthetic intelligence lies not in biological mimicry, but in the topological structure of statistical manifolds.

View full details

Poster

The 99% Success Paradox: When Near-Perfect Retrieval Equals Random Selection

Vyzantinos Repantis ⋅ Harshvardhan Singh ⋅ Tony Joseph ⋅ Cien Zhang ⋅ Akash Vishwakarma ⋅ Svetlana Karslioglu ⋅ Michael Thot ⋅ Ameya Gawde

Apr 24, 3:15 PM - 5:45 PM Pavilion 3

For most of information retrieval's history, search results were designed for human consumers who could scan, filter, and discard irrelevant content. This shaped retrieval systems to optimize for finding and ranking relevant documents, but not for minimizing noise, because humans served as the final filter. Retrieval-augmented generation (RAG) and tool-using agents flip these assumptions. Now the consumer is often an LLM, not a person, and the model does not skim. In practice, introducing excessive or irrelevant context into the input can dilute the model's ability to identify and focus on the most critical information. We define selectivity as the ability of a retrieval system to surface relevant items while excluding irrelevant ones. It is measured relative to random chance. We introduce Bits-over-Random (BoR), a measure of retrieval selectivity that reveals when high success rates mask random-level performance. A system with high selectivity finds needles without bringing along the haystack items. BoR uses a logarithmic scale where each bit represents a doubling in selectivity. This framework is grounded in information theory: $\text{BoR} = \log_2(P_{\text{obs}}/P_{\text{rand}})$, where $P_{\text{obs}}$ is the observed success rate (we use Success@K). $P_{\text{rand}}$ represents the expected success rate of random selection. BoR is measured in bits. By studying reported system performance in the literature for the MS MARCO dataset and by testing two datasets (BIER SciFact and 20 Newsgroups classification), we demonstrate how to measure selectivity in retrieval and LLM-based systems. On MS MARCO at $K=1000$, we analyzed reported performance of 41 different retrieval systems spanning three decades of retrieval technology. BM25 baseline (85.7% recall) achieves 12.89 bits, while state-of-the-art SimLM (98.7% recall) achieves 13.09 bits. This is a difference of only 0.20 bits despite a 13-point recall gap. All 41 systems clustered close to the theoretical ceiling of 13.11 bits, suggesting diminishing returns from retriever improvements alone for this dataset, and at this scale and depth. We see similar results on BIER SciFact. In our 20 Newsgroups retrieval task, each query has over 500 relevant items on average. We perform this stress test because there are many similarities to agentic tool selection setups. Increasing retrieval depth from $K=10$ to $K=100$ raises Success@K to 100%, indicating near-perfect retrieval. However, LLM classification accuracy drops by 10-16%, and token costs increase tenfold. Traditional metrics fail to detect this failure, which resembles random chance retrieval. BoR clearly reveals the issue by dropping to nearly zero at this task and depth. The "collapse zone" is where meaningful selectivity becomes mathematically impossible regardless of system quality. This occurs when $\lambda = \frac{K \cdot \bar{R}_q}{N}$ reaches 3-5, where $K$ is retrieval depth, $\bar{R}_q$ is average relevant items per query, and $N$ is corpus size. When $\lambda$ exceeds this threshold, even perfect systems achieve near-zero BoR because random selection already succeeds most of the time. The collapse boundary reveals critical implications for LLM agent tool selection. Industry-reported case studies outline examples where systems present their full suite of tools to an LLM (for example, $N=58$, $K=58$, $R_q \approx 4$). This means that such a system operates at high $\lambda$ ($\lambda \approx 4.0$), deep into the collapse zone. Through the lens of BoR, we analyze such cases and conclude that even a perfect tool selector achieves low selectivity over random chance (in one case, only ~0.02 bits). This explains why "wrong tool selection" is the most common failure mode for tool agentic systems. This pattern also affects any selection problem with a small $N$ and relatively high $K$ and $R_q$, including API endpoints, agent skills, or multi-hop retrieval chains. We also establish the "doubling rule": when retrieval depth plateaus in success rate, doubling $K$ loses approximately 1 bit of selectivity, while $10\times$ increase loses ~3.3 bits. This quantifies the hidden cost of "just retrieve more", a common but potentially harmful strategy in LLM systems. BoR can work with various success conditions (Success@K, Recall@K, and rules requiring multiple relevant items). Our work reveals three critical insights: 1. Performance ceilings exist even for perfect systems, determined entirely by the random baseline. 2. The collapse zone makes selectivity impossible when $\lambda$ reaches 3-5. 3. Depth-selectivity trade-offs become explicit through measuring differences ($\Delta\text{BoR}$) between different depths. For practitioners, BoR offers operational guidance: monitor $\lambda$ to avoid the collapse zone, stop increasing $K$ when $\text{BoR}_{\text{max}}$ drops below ~0.1 bits, and use aggressive filtering for tool-based agents where small $N$ makes collapse inevitable.

View full details

Poster

Faster SVD via Accelerated Newton-Schulz Iteration

Askar Tsyganov ⋅ Uliana Parkina ⋅ Ekaterina Grishina ⋅ Sergey Samsonov ⋅ Maxim Rakhuba

Apr 23, 10:30 AM - 1:00 PM Pavilion 3

Traditional SVD algorithms rely heavily on QR factorizations, which scale poorly on GPUs. We show how the recently proposed Chebyshev-Accelerated Newton-Schulz (CANS) iteration can replace them and produce an SVD routine that is faster across a range of matrix types and precisions.

View full details

Poster

Where’s the Chicken? Unpacking Spatial Awareness in Vision-Language Models

Jiyoon Pyo ⋅ Yao-Yi Chiang

Apr 25, 10:30 AM - 1:00 PM Pavilion 4

Modern vision-language models (VLMs) have achieved impressive success in recognizing and describing visual content, yet they continue to struggle with understanding spatial relationships. The limitation persists despite massive data and model scaling, suggesting that the root of the problem lies in the architecture and training objective rather than data alone. This post examines the underlying causes and discusses why recent proposed fixes, while promising, remain insufficient to achieve robust spatial reasoning.

View full details

Poster

dLLM - Rethinking Generation Beyond Autoregressive Models

Suhas Pai ⋅ Xiaojun Ren

Apr 25, 10:30 AM - 1:00 PM Pavilion 3

Diffusion large language models (dLLMs) have emerged as a promising alternative to standard autoregressive (AR) Transformers, offering parallel token generation and flexible infilling instead of strict left-to-right decoding. This post walks through how masked discrete diffusion works at a high level: a forward process that randomly masks tokens, a reverse process that iteratively denoises them, and training setups that either start from scratch or adapt existing AR models. We then discuss how diffusion decoding differs from AR decoding, including its strengths for infilling, structured generation, and long-horizon planning, as well as practical challenges around length control, number of denoising steps, and blockwise generation. Finally, we examine both sides of the current evidence: where dLLMs shine in data-constrained regimes and where theoretical and empirical work suggests they can underperform or even collapse back toward autoregressive behavior, ending with some possible hybrid futures that combine diffusion for reasoning and autoregression for generation.

View full details

Poster

Rethinking the Diffusion Model from a Langevin Perspective

Candi Zheng ⋅ Yuan Lan

Apr 25, 3:15 PM - 5:45 PM Pavilion 3

Diffusion models are often introduced from multiple perspectives—such as VAEs, score matching, or flow matching—accompanied by dense and technically demanding mathematics that can be difficult for beginners to grasp. This article offers a fresh Langevin perspective on diffusion models to lower the technical barrier, aiming to present diffusion models in a simpler, clearer, and more intuitive way while addressing the following questions: 1. How does the reverse process invert the forward process to generate data from pure noise? 2. How can ODE-based and SDE-based diffusion models be unified under a single framework? 3. Why are diffusion models theoretically superior to ordinary VAEs? 4. How can Denoising, Score Matching, and Flow Matching training objectives be unified and derived from first principles? We demonstrate that the Langevin perspective offers clear and straightforward answers to these questions, providing pedagogical value for both learners and experienced researchers seeking deeper intuition.

View full details

Poster

Dissecting Non-Determinism in Large Language Models

Mateus da Silveira ⋅ Ronaldinho Vega Centeno Olivera ⋅ Alejandro Núñez Arroyo ⋅ Allan M. de Souza ⋅ JULIO DOS REIS

Apr 23, 10:30 AM - 1:00 PM Pavilion 4

The Large Language Models (LLMs) evolve into the backbone of complex decision-making systems, their inherent non-deterministic nature poses a significant threat to the validity of experimental results. This blog explores the impact of stochasticity, prompt brittleness, and LLM-as-a-Judge during both response generation and evaluation. We conclude that understanding these dynamics is essential to prevent misleading conclusions, advocating for consistency oriented practices that treat non-determinism as a critical variable in rigorous experimentation.

View full details

Poster

The human knowledge loophole in the 'bitter lesson' for LLMs

Anna Rogers

Apr 23, 10:30 AM - 1:00 PM Pavilion 3

Are LLMs a proof that the 'bitter lesson' holds for NLP? Perhaps the opposite is true: they work due to the scale of human data, and not just computation.

View full details

Poster

Tracing the Principles Behind Modern Diffusion Models

Chieh-Hsin Lai ⋅ Yang Song ⋅ Dongjun Kim ⋅ Yuki Mitsufuji ⋅ Stefano Ermon

Apr 23, 3:15 PM - 5:45 PM Pavilion 3

Diffusion models can feel like a jungle of acronyms, but the core idea is simple: start from noise and gradually move a cloud of samples until it looks like real data. This post gives an intuition-first tour showing that DDPMs, score-based models, and flow matching are the same recipe with different prediction targets, all rooted in the change-of-variable rule from calculus and powered by one shared “conditional trick” that turns learning into supervised regression. Finally, we zoom out to the speed problem and show how flow map models aim to replace many tiny denoising steps with a few big, accurate jumps toward real-time generation.

View full details

Poster

Diffusion as Infinite HVAEs: Do Diffusion Models Generalize Better than Deep VAEs?

François Bertholom ⋅ Khalid Oublal

Apr 23, 3:15 PM - 5:45 PM Pavilion 3

Denoising Diffusion Probabilistic Models (DDPMs) and Hierarchical Variational Autoencoders (HVAEs) are typically studied as distinct paradigms for high-dimensional generative modeling. In this work, we bridge this gap by establishing a formal equivalence between DDPMs and HVAEs in the limit of infinite depth with a fixed, Markovian inference process. We argue that this architectural isomorphism is not merely a mathematical curiosity but the structural key to understanding the superior generalization capabilities of diffusion models. By viewing the forward diffusion process as a fixed encoder, we elucidate how DDPMs circumvent the posterior collapse often observed in deep VAEs, effectively balancing the trade-off between structural guidance and texture synthesis. We support this theoretical unification with empirical analysis of the semantic phase transitions in latent space and demonstrate the invariance of the Variational Lower Bound under noise schedule reparameterizations, confirming the interpretation of diffusion as a continuous-time hierarchical variational framework.

View full details

Poster

The Adversarial Conditioning Paradox: Why Attacked Inputs Are More Stable, Not Less

Khazretgali Sapenov ⋅ Aidos Sapenov

Apr 24, 3:15 PM - 5:45 PM Pavilion 3

Adversarial attacks on NLP systems are designed to find inputs that fool models while minimizing perceptible changes, making them difficult to detect using similarity-based methods. We investigate whether Jacobian conditioning analysis can provide an orthogonal detection signal. Surprisingly, we find that adversarial inputs exhibit systematically lower condition numbers at early transformer layers—the opposite of our initial hypothesis that attacks exploit unstable, ill-conditioned regions. This “adversarial conditioning paradox” replicates across multiple attack types: TextFooler (AUC = 0.72, p = 0.001), DeepWordBug (AUC = 0.75, p = 0.001), and directionally for PWWS (AUC = 0.59, p = 0.29). The effect holds for both word-level and character-level perturbations, while embedding cosine distance fails completely (AUC ≈ 0.25). We propose that adversarial attacks succeed by finding wellconditioned directions that cross decision boundaries—smooth paths to misclassification rather than chaotic exploitation of instability. Our findings open new directions for adversarial detection using internal geometric properties invisible to embedding-based methods.

View full details

Poster

Performative Prediction made practical

Javier Sanguino Bautiste ⋅ Thomas Kehrenberg ⋅ Carlos Rosety ⋅ Jose A. Lozano ⋅ Novi Quadrianto

Apr 25, 10:30 AM - 1:00 PM Pavilion 4

Performative Prediction studies settings where deploying a model induces a distribution shift in the data with the aim of building robust and good-peforming models under these post-deployment effects. Most existing work in this area is theoretical and relies on strict assumptions to converge to those models, which makes the resulting techniques difficult to apply in practice and limits their accessibility to the broader Machine Learning (ML) community. In this blog post, we use visualization techniques 1) to provide an intuitive explanation of Performative Prediction and 2) to extract practical insights for studying convergence when theoretical assumptions do not hold.

View full details

Poster

Evaluating Machine Learned Inter-Atomic Potentials for a Practical Simulation Workflow

Richard Strunk ⋅ Karnik Ram ⋅ Daniel Cremers

Apr 25, 10:30 AM - 1:00 PM Pavilion 3

MLIPs are a promising paradigm in atomistic simulation, potentially offering the accuracy of ab-initio methods at the speed of empirical potentials. In this blog post, we give an overview of recent MLIP architectures, followed by an evaluation on a practical CO2 adsorption simulation. We find that as of today these models, though promising, are far from plug-and-play, requiring significant engineering effort to operate within established simulation frameworks, while also failing to produce physically consistent results.

View full details

Poster

Dynamic Parameter Reuse Augments Reasoning via Latent Chain of Thought

Kaitlin Maile ⋅ Joao Sacramento

Apr 24, 10:30 AM - 1:00 PM Pavilion 3

Standard language models often rely on massive parameter counts for their performance, utilizing each parameter only once per inference pass. This prompts consideration of recurrent structures, where models reuse parameters across sequential time, depth, or training progression to achieve improved performance and reduced training cost. We draw connections in the landscape of parameter reuse, from growing models via stacking to recurrent looping, and postulate that these architectural priors act as a form of Latent Chain of Thought (LCoT), allowing models to reason in a continuous state space. By shifting towards deeper and dynamic computation, grown and recurrent architectures offer a path toward improved reasoning in compact networks, ascending beyond scaling laws of standard architectures.

View full details

Poster

Destruction is a General Strategy to Learn Generation; Diffusion's Strength is to Take it Seriously; Exploration is the Future

Pierre-André Noël

Apr 24, 10:30 AM - 1:00 PM Pavilion 4

I present diffusion models as part of a family of machine learning techniques that withhold information from a model’s input and train it to guess the withheld information. I argue that diffusion's destroying approach to withholding is more flexible than typical hand-crafted information withholding techniques, providing a rich training playground that could be advantageous in some settings, notably data-scarce ones. I then address subtle issues that may arise when porting reinforcement learning techniques to the diffusion context, and wonder how such exploration problems could be addressed in more diffusion-native ways. I do not have definitive answers, but I do point my fingers in directions I deem interesting. A tutorial follows this thesis, expanding on the destroy-then-generate perspective. A novel kind of probabilistic graphical models is introduced to facilitate the tutorial's exposition.

View full details

Poster

Defining and quantifying compositional structure

Eric Elmoznino ⋅ Guillaume Lajoie

Apr 25, 3:15 PM - 5:45 PM Pavilion 4

Compositionality is thought to be crucial in human cognition and AI, but we lack a scientific understanding of what it is. What kind of data is compositionally structured? Can we mathematically quantify the amount and character of compositional structure? This blog post introduces a novel approach for doing so, building off of existing tools from algorithmic information theory that formalize notions of complexity and structure. The mathematical definition of compositionality that we'll come to is rigorous, precise, and general, and the hope is that it can inspire novel research directions in AI for uncovering compositional structure in natural data.

View full details

Poster

Model Misspecification in Simulation-Based Inference - Recent Advances and Open Challenges

Jan Boelts

Apr 24, 10:30 AM - 1:00 PM Pavilion 4

Model misspecification is a critical challenge in simulation-based inference (SBI), particularly in neural SBI methods that use simulated data to train flexible neural density estimators. These methods typically assume that simulators faithfully represent the true data-generating process, an assumption that is often violated in practice. Resulting discrepancies can make observed data effectively out-of-distribution relative to the simulations, leading to biased posterior distributions and misleading uncertainty quantification. This post reviews recent work on model misspecification in neural SBI, covering formal definitions, methods for detection and mitigation, and their underlying assumptions. It also discusses practical implications for SBI workflows and outlines open challenges for developing robust SBI methods that remain reliable in realistic, imperfectly specified applications.

View full details

Poster

Learning to Maximize Rewards via Reaching Goals

Chongyi Zheng ⋅ Mahsa Bastankhah ⋅ Grace Liu ⋅ Benjamin Eysenbach

Apr 23, 10:30 AM - 1:00 PM Pavilion 4

Goal-conditioned reinforcement learning learns to reach goals instead of optimizing hand-crafted rewards. Despite its popularity, the community often categorizes goal-conditioned reinforcement learning as a special case of reinforcement learning. In this post, we aim to build a direct conversion from any reward-maximization reinforcement learning problem to a goal-conditioned reinforcement learning problem, and to draw connections with the stochastic shortest path framework. Our conversion provides a new perspective on the reinforcement learning problem: *maximizing rewards is equivalent to reaching some goals*.

View full details

Poster

Discretisation invariance

Vladimir Fanaskov ⋅ Ivan Oseledets

Apr 24, 10:30 AM - 1:00 PM Pavilion 3

Discretisation invariance, a recent innovation in scientific machine learning, is a requirement that ensures an architecture can process inputs of different resolutions. In this post, we formally define this property, provide examples, generate datasets, train architectures, and discuss whether discretisation invariance is living up to its promise.

View full details

Poster

JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

Bingxiang He ⋅ Zekai Qu ⋅ Zeyuan Liu ⋅ Yinghao Chen ⋅ Yuxin Zuo ⋅ Cheng Qian ⋅ Kaiyan Zhang ⋅ Weize Chen ⋅ Chaojun Xiao ⋅ Ganqu Cui ⋅ Ning Ding ⋅ Zhiyuan Liu

Apr 25, 3:15 PM - 5:45 PM Pavilion 4

Training small reasoning models with RL has become a race toward complexity, using multi-stage pipelines, dynamic schedules, and curriculum learning. We ask whether this complexity necessary? We show that JustRL, a simple recipe with fixed hyperparameters, achieves state-of-the-art performance on two different 1.5B base models (54.5% and 64.3% across 9 math benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training remains stable over thousands of steps without intervention. This suggests the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline.

View full details

Poster

Budget Alignment: Making Models Reason in the User's Language

Shan Chen ⋅ Jirui Qi ⋅ Zidi Xiong ⋅ Timothy Miller ⋅ Arianna Bisazza ⋅ Raquel Fernández ⋅ Danielle Bitterman

Apr 23, 3:15 PM - 5:45 PM Pavilion 3

LLMs often reason internally in English even for non-English queries, limiting faithfulness and weakening human oversight in multilingual settings. We study budget alignment: lightweight methods to align a model’s reasoning language with the user’s language under modest data and compute. Using a 7B model, we evaluate multilingual SFT, RL for accuracy recovery, and model merging. Across Japanese, French, and Spanish tasks, these approaches markedly increase language-consistent reasoning while preserving strong accuracy, showing that faithful and interpretable multilingual reasoning can be achieved with low-cost alignment.

View full details

Poster

UnigramLM: An Attempt at Writing The Missing Manual

Clara Meister

Apr 23, 3:15 PM - 5:45 PM Pavilion 3

This post is my attempt to write down the UnigramLM tokenization algorithm cleanly and explicitly because, well, I still haven't found such a derivation and I think understanding the theory behind the method could help us make it better. I'll formalize the generative model around which the algorithm is based, derive the EM updates, explain why pruning is needed (and how it's done), and point out the spots where the practical implementation defined by the SentencePiece library diverges from the pretty mathematical models.

View full details

Poster

Heuristic-Based Ideation for Guiding LLMs Toward Structured Creativity

Xiao Liu ⋅ Haokun Liu ⋅ Chenhao Tan

Apr 25, 10:30 AM - 1:00 PM Pavilion 3

Large Language Models (LLMs) hold immense promise for accelerating scientific discovery, yet current LLM-based ideation methods often rely on ad-hoc strategies rather than systematic frameworks. This blog introduces Ideation Heuristics, a systematic approach that formalizes 20 cognitive heuristics that structure how researchers generate new ideas. We show that researchers across disciplines find these heuristics highly useful, and we demonstrate how they can be operationalized through Claude skills.

View full details

Poster

Revisiting the NetHack Learning Environment

Michael Matthews ⋅ Pierluca D'Oro ⋅ Anssi Kanervisto ⋅ Scott Fujimoto ⋅ Jakob Foerster ⋅ Mikael Henaff

Apr 25, 3:15 PM - 5:45 PM Pavilion 4

The NetHack Learning Environment (NLE) was proposed as a challenging benchmark to test an agents abilities to perform complex reasoning over long time horizons in a stochastic, partially-observed, procedurally generated setting. To date, no approach, including those based on reinforcement learning, using large pretrained models, using handcoded symbolic agents, imitating expert trajectories or any hybrid method has achieved significant progress towards completing the game. We take a deeper look into the mechanics and interface of the NLE and show that much of the complexity of NetHack is inaccessible due to constraints on the observation and action spaces. We propose a series of modifications and show that they meaningfully improve performance on the NLE.

View full details

Poster

Content Promotion as a Strategic Game: How to Design Agentic Publishers for the Evolving Search Ecosystem in the GenAI Era?

Tommy Mordo ⋅ Sagie Dekel ⋅ Tomer Kordonsky ⋅ Omer Madmon ⋅ Moshe Tennenholtz ⋅ Oren Kurland

Apr 24, 10:30 AM - 1:00 PM Pavilion 3

With the rise of LLMs, publishers now operate in a dual world where traditional search and chat-like systems coexist. We propose a unified, game-theoretic view of this environment and highlight different tools, such as Multi-Agent Reinforcement Learning, that support the development of competitive content-optimization agents.

View full details

Poster

Understanding and Fixing Bottlenecks in State Space Models: What Recency and Over-Smoothing Tell Us

Adrita Das ⋅ Dantong Zhu

Apr 23, 10:30 AM - 1:00 PM Pavilion 4

State Space Models (SSMs), including Mamba, commonly suffer from two failure modes: recency bias, where the model biases strongly toward recent inputs, and over-smoothing, where hidden states become indistinguishable with depth. The paper argues that these issues originate from the learned state-transition matrix $A_t$, whose memory-decay values collapse into a narrow range, limiting the diversity of timescales the model can represent. To mitigate this, the authors introduce polarization, where one dimension of $A_t$ is fixed to serve as a perfect long-term memory channel and another as a pure short-term memory channel, while all other dimensions remain learnable. This enforces both a stable non-decaying pathway and a rapidly resetting pathway, preventing the system from collapsing into uniformly slow or fast decay behavior. Through associative-recall experiments, the polarized Mamba variants demonstrate substantially improved long-context retrieval. Overall, the findings indicate that standard parameterizations of $A_t$ fail to preserve sufficient memory diversity, whereas polarization offers a simple and effective mechanism for stabilizing long-range information flow in SSMs.

View full details

Poster

Square Peg, Round Hole: Plugging Non-Sequential Data into Sequential Language Models

Julia Balla ⋅ Hannah Lawrence

Apr 25, 10:30 AM - 1:00 PM Pavilion 4

Autoregressive (AR) models are central to modern generative AI systems, yet their sequential inductive bias clashes with modalities that lack an obvious ordering, such as images, graphs, and point clouds. Despite this mismatch, AR models are widely used beyond language, owing to their scalability and controllability. This post highlights the growing set of techniques that make non-sequential data amenable to autoregressive modeling. There are two broad directions: approaches that choose or optimize a generation order for a fixed tokenization, and approaches that redesign the tokenization itself to simplify each next-token prediction step. We emphasize the tradeoffs these methods face, particularly between compression and autoregressive ``modelability”. By drawing these connections, we aim to motivate future work on tokenizations tailored to the needs of autoregressive models for arbitrary datatypes.

View full details

Poster

Ready For General Agents? Let's test it.

Elron Bandel ⋅ Asaf Yehudai ⋅ Michal Shmueli-Scheuer

Apr 23, 3:15 PM - 5:45 PM Pavilion 4

Recent progress in LLMs has pushed the field from domain-specific systems toward increasingly general-purpose models. A similar shift is now emerging for AI agents: domain agents share reusable components and already operate across multiple domains with minimal adaptation. This capacity to integrate into new environments and tackle entirely new classes of tasks gives general agents the potential for effectively unbounded real-world value. However, current evaluation frameworks cannot measure this core capability. We organize prior evaluation efforts into a five-level taxonomy and highlight the currently missing fifth level: a unified evaluation of general agents. Unlike existing work—which typically assesses a single agent across many environments—this missing level must support comparison across agents with different architectures and determine how well they operate in unfamiliar settings. We outline the challenges that prevent such evaluation today and analyze why common benchmarks and protocols fall short. We then propose requirements for a protocol-agnostic evaluation framework capable of reliably measuring agent performance and adaptability across diverse environments.

View full details

Poster

Flow Where You Want

Scott Hawley

Apr 23, 3:15 PM - 5:45 PM Pavilion 4

This tutorial serves as an intuitive introduction to adding inference-time controls to pretrained flow matching and rectified flow generative models, to make them perform tasks they weren't trained to do. We take an unconditional flow model trained on MNIST digits and apply two types of guidance: classifier guidance to generate specific digits, and inpainting to fill in missing pixels. Both approaches work by adding velocity corrections during the sampling process to steer the model toward desired outcomes. Since modern generative models operate in compressed latent spaces, we examine guidance methods that work directly in latent space as well as those that decode to pixel space. We also explore PnP-Flow, which satisfies constraints by iteratively projecting samples backward and forward in time rather than correcting flow velocities. The approaches demonstrated here work with other flow models and control tasks, so you can guide flows where you want them to go.

View full details

Poster

Don't Look Up (Every Token): Escaping Quadratic Complexity via Geometric Patterns and Algorithms

Aryan Sood ⋅ Tanvi Sharma ⋅ Vansh Agrawal

Apr 24, 3:15 PM - 5:45 PM Pavilion 4

Large Language Models (LLMs) have brought about a significant change in the field of artificial intelligence, where they have transitioned in scope from being specialized research tools to common resources that drive the next generation of software. With increasing model parameters and training data, LLMs demonstrate new abilities in reasoning, code generation, and solving complex problems that were once considered unattainable. However, scaling these models effectively for long-context applications uniquely poses a challenge. This is primarily due to the inherent limitations of the self-attention mechanism, which has quadratic time complexity. This quadratic bottleneck hinders applications for long documents, high-resolution images, and large codebases, among others. However, what is interesting to observe is that effectively only a few parameters are used in token computation, and most calculations are sparse. Hence, sparsity emerges as an effective solution to this problem. Rather than relying on the entire attention matrix, one can utilize an approximate or sparse version of attention to achieve almost the same results much faster. The backbone of this approach is the idea that tokens do not require the entire context; they only need local context, and thus, most of the computation carried out is wasteful. In this blog, we analyze the types of attention patterns that emerge and how to use them to our advantage for faster and efficient LLMs.

View full details

Poster

Misalignments and RL Failure Modes in the Early Stage of Superintelligence

Shu Yang ⋅ Hanqi Yan ⋅ Di Wang

Apr 25, 3:15 PM - 5:45 PM Pavilion 3

With the rapid ability grokking of frontier Large Models (LMs), there is growing attention and research focus on aligning them with human values and intent via large scale reinforcement learning and other techniques. However, as LMs are getting stronger and more agentic, their misalignment and deceptive behaviors are also emerging and becoming increasingly difficult for humans to pre-detect and keep track of. This blog post discusses current misalignment patterns, deceptive behaviors, RL failure modes, and emergent traits in modern large models to further AI safety discussions and advance the development of mitigation strategies for LM misbehaviors.

View full details

Poster

Computer Use Survey - A Visual Survey of Computer Use Agents

Kenneth Marino ⋅ Farhan Ishmam ⋅ Ana Marasovic

Apr 25, 3:15 PM - 5:45 PM Pavilion 3

In recent years, AI systems operating on the web and in computer environments have become a major topic of interest for both academia and industry. The goal of this blog is to provide an interesting and interactive survey of historical and recent works on computer use agents. We define key terms used in the literature, catalogue the expansive list of environments and datasets, discuss the evolution of the methodologies, and assess both today’s landscape and possible paths forward.

View full details

Poster

Probabilistic Circuits for Uncertainty Quantification

Maternus Herold ⋅ Konstantin von Gaisberg-Helfenberg

Apr 23, 10:30 AM - 1:00 PM Pavilion 3

Deep learning models struggle with epistemic uncertainty quantification, often exhibiting blind confidence on out-of-distribution data. This work reviews on Probabilistic Circuits (PCs) as a versatile framework for rigorous, tractable reasoning. PCs model the joint probability distribution and by enforcing structural constraints, specifically smoothness, decomposability, and determinism, they allow for the exact computation of marginals, conditionals, and moments in polynomial time without retraining. We discuss on the suitability of PCs for Uncertainty Quantification, describing their advantages and highlighting their PCs for tractable UQ in high-dimensional problems.

View full details

Poster

Generative AI Archaeology

Desmond Elliott

Apr 25, 10:30 AM - 1:00 PM Pavilion 3

We document the rise of the Generative AI Archaeologist, whose tools include linear algebra and probability theory, jailbreaking, and debuggers, compared to the metal detectors, pickaxes, and radar surveys of traditional archaeology. GenAI Archaeologists have reported findings both through luck by observing unexpected behaviour in publicly accessible models, and by exploiting the mathematical properties of models. In this blog, we survey five types of findings unearthed by GenAI Archaeologists and discuss the status of those findings.

View full details

Poster

Loneliness as a Case Study for Social Reward Misalignment

Samantha Adorno ⋅ Akshata Kishore Moharir ⋅ Ratna Kandala

Apr 23, 3:15 PM - 5:45 PM Pavilion 3

The goal of this work is to use loneliness as a clear case study of proxy-reward misalignment in RL. We introduce a simulation where loneliness drifts over time and repeated short-term comfort increases an accumulated harm variable, then compare agents trained on engagement versus long-term well-being. We show that optimizing engagement leads to policies that prioritize immediate relief without improving the underlying state, motivating reward inference or well-being objectives over engagement proxies.

View full details

Poster

Why AI Evaluations Need Error Bars

Zairah Mustahsan

Apr 24, 3:15 PM - 5:45 PM Pavilion 3

As large language models (LLMs) and agentic systems advance, the field increasingly depends on fine-grained evaluation to compare models, guide research directions, and make deployment decisions. Yet evaluation pipelines often treat LLMs as deterministic functions, even though they are fundamentally stochastic systems with variability arising from sampling methods, hardware nondeterminism, environmental randomness, and evaluation procedures. This mismatch leads to unstable benchmarks, unreliable model comparisons, inconsistent agent outcomes, and significant uncertainty when using LLMs as judges. Recent research has begun to quantify this instability and propose statistical techniques, from frequentist error bars to Bayesian latent-state models, reliability metrics, and large-scale variance audits. But adoption is uneven, and the field lacks a cohesive statistical framework for evaluating stochastic intelligence. This post synthesizes existing research into a unified perspective and outlines practical recommendations for improving evaluation practice. The goal is not to introduce new methods, but to demonstrate that the tools already exist and that incorporating statistical thinking is both feasible and urgently needed.

View full details

Poster

AI Fundamentals: Valuing AI Agents & Data Assets

Qingyun Sun ⋅ Zhenheng Tang ⋅ Huacan Wang

Apr 24, 10:30 AM - 1:00 PM Pavilion 3

Large Language Model (LLM) agents now read the world through managed-context pipelines, write to it via tool-calling APIs, and continuously re-wire themselves with fresh experience. Stakeholders therefore need a Generally Accepted Accounting Principles (GAAP) compatible method to price both (i) the agent's labour-like output and (ii) the data traces that fuel learning. We formalise a single unifying metric - agent Economic Value (AEV)- and demonstrate that these metrics are measurable today. We then extend the template to reinforcement-learning regimes in which grounded rewards equal cash flows. Lastly, we propose a financial settlement layer, which transforms the agent from a passive software user into an active economic participant.

View full details