

Poster Session

Poster Session 3 Pavilion 3

Fri 24 Apr 6:30 a.m. PDT — 9 a.m. PDT


Poster
P3-#101
Adjusting Prediction Model Through Wasserstein Geodesic for Causal Inference

Yuguang Yan ⋅ Haolin Yang ⋅ Zecong Chen ⋅ Weilin Chen ⋅ Ruichu Cai ⋅ Zhifeng Hao

Causal inference estimates the treatment effect by comparing the potential outcomes of the treated and control groups. Due to the existence of confounders, the distributions of treated and control groups are imbalanced, resulting in limited generalization ability of the outcome prediction model, i.e., the prediction model trained on one group cannot perform well on the other group. To tackle this, existing methods usually adjust confounders to learn balanced representations for aligning the distributions. However, these methods could suffer from the over-balancing issue, in which predictive information about outcomes is removed during adjustment. In this paper, we propose to adjust the outcome prediction model to improve its generalization ability on both groups simultaneously, so that the over-balancing issue caused by confounder adjustment can be avoided. To address the challenge of large distribution discrepancy between groups during model adjustment, we propose to generate intermediate groups through the Wasserstein geodesic, which smoothly connects the control and treated groups. Based on this, we gradually adjust the outcome prediction model between consecutive groups by a self-training paradigm. To further enhance the performance of the model, we filter the generated samples to select high-quality samples for learning. We provide a theoretical analysis of our method, and demonstrate its effectiveness on several benchmark datasets in terms of multiple evaluation metrics.
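To make the idea of intermediate groups along a Wasserstein geodesic concrete, here is a minimal sketch for the one-dimensional case only. It assumes equal-size samples and uses the fact that, in 1D, optimal transport for a convex cost pairs the sorted samples. The function name `wasserstein_geodesic_1d` and the toy data are hypothetical, and this is not the authors' implementation, which operates on covariates in general dimension.

```python
def wasserstein_geodesic_1d(control, treated, t):
    """Displacement interpolation between two equal-size 1D samples.

    In 1D the optimal coupling pairs the i-th smallest control value with
    the i-th smallest treated value, so the group at time t in [0, 1] is
    (1 - t) * x_(i) + t * y_(i) for each matched pair.
    """
    if len(control) != len(treated):
        raise ValueError("this sketch assumes equal sample sizes")
    xs, ys = sorted(control), sorted(treated)
    return [(1.0 - t) * x + t * y for x, y in zip(xs, ys)]


# Toy covariate samples (illustrative values, not from the paper).
control = [0.0, 1.0, 2.0]
treated = [10.0, 12.0, 11.0]

# Five groups: t = 0 is the control group, t = 1 is the treated group.
steps = [wasserstein_geodesic_1d(control, treated, k / 4) for k in range(5)]
for k, group in enumerate(steps):
    print(f"t = {k / 4:.2f}: {group}")
```

Each consecutive pair of groups differs by a small shift, which is the property the abstract relies on when it adjusts the model gradually between consecutive groups.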


Poster
P3-#103
Structure Learning from Time-Series Data with Lag-Agnostic Structural Prior

Taiyu Ban ⋅ Changxin Rong ⋅ Xiangyu Wang ⋅ Lyuzhou Chen ⋅ Yanze Gao ⋅ Xin Wang ⋅ Huanhuan Chen

Learning instantaneous and time-lagged causal relationships from time-series data is essential for uncovering fine-grained, temporally-aware interactions. Although this problem has been formulated as a continuous optimization task amenable to modern machine learning methods, existing approaches largely neglect the use of coarse-grained, lag-agnostic causal priors, an important form of prior knowledge that is often available in practice. To address this gap, we propose a novel framework for structure learning from time series to integrate lag-agnostic priors, enabling the discovery of lag-specific causal links without requiring precise temporal annotations. We introduce formulations to precisely characterize the lag-agnostic prior, and demonstrate their consequential and process-equivalence to priors, maintaining consistency with the intended semantics of the priors throughout optimization. We further analyze the challenge for optimization due to the increased non-convexity by lag-agnostic prior constraints, and introduce a data-driven initialization to mitigate this issue. Experiments on both synthetic and real-world datasets show that our method effectively incorporates lag-agnostic prior knowledge to enhance the recovery of fine-grained, lag-aware structures.


Poster
P3-#104
Statistical and structural identifiability in representation learning

Walter Nelson ⋅ Marco Fumero ⋅ Theofanis Karaletsos ⋅ Francesco Locatello

Representation learning models exhibit a surprising stability in their internal representations. Whereas most prior work treats this stability as a single property, we formalize it as two distinct concepts: **statistical identifiability** (consistency of representations across runs) and **structural identifiability** (alignment of representations with some unobserved ground truth). Recognizing that perfect pointwise identifiability is generally unrealistic for modern representation learning models, we propose new model-agnostic definitions of statistical and structural near-identifiability of representations up to some error tolerance $\epsilon$. Leveraging these definitions, we prove a statistical $\epsilon$-**near-identifiability** result for the representations of models with nonlinear decoders, generalizing existing identifiability theory beyond last-layer representations in e.g. generative pre-trained transformers (GPTs) to near-identifiability of the intermediate representations of a broad class of models including (masked) autoencoders (MAEs) and supervised learners. Although these weaker assumptions confer weaker identifiability, we show that independent components analysis (ICA) can resolve much of the remaining linear ambiguity for this class of models, and validate and measure our near-identifiability claims empirically. With additional assumptions on the data-generating process, statistical identifiability extends to structural identifiability, yielding a simple and practical recipe for disentanglement: ICA post-processing of latent representations. On synthetic benchmarks, this approach achieves state-of-the-art disentanglement using a vanilla autoencoder. With a foundation model-scale MAE for cell microscopy, it disentangles biological variation from technical batch effects, substantially improving downstream generalization.


Poster
P3-#105
On Measuring Influence in Avoiding Undesired Future

Lue Tao ⋅ Tian-Zuo Wang ⋅ Yuan Jiang ⋅ Zhi-Hua Zhou

When a predictive model anticipates an undesired future event, a question arises: What can we do to avoid it? Resolving this forward-looking challenge requires determining the variables that positively influence the future, moving beyond the statistical association typically exploited for prediction. In this paper, we introduce a novel measure for evaluating the influence of actionable variables in successfully avoiding the undesired future. We quantify influence as the degree to which the success probability can be increased by altering variables under the principle of maximum expected utility. Our analysis demonstrates a counterintuitive insight: while related to causality, influential variables may not necessarily be those with strong intrinsic causal effects on the target event. In fact, it can be highly beneficial to alter a weak causal factor, or even a variable that is not an intrinsic factor at all. We provide a practical implementation for estimating the proposed measure and validate its utility through experiments on synthetic and real-world tasks.


Poster
P3-#106
Designing Time Series Experiments in A/B Testing with Transformer Reinforcement Learning

Xiangkun Wu ⋅ Qianglin Wen ⋅ Yingying Zhang ⋅ Hongtu Zhu ⋅ Ting Li ⋅ Chengchun Shi

A/B testing has become a gold standard for modern technological companies to conduct policy evaluation. Yet, its application to time series experiments, where treatments are sequentially assigned over time, remains challenging. Existing designs suffer from two limitations: (i) they do not fully leverage the entire history for treatment allocation; (ii) they rely on strong assumptions to approximate the objective function (e.g., the mean squared error of the estimated treatment effect) for optimizing the design. We first establish an impossibility theorem showing that failure to condition on the full history leads to suboptimal designs, due to the dynamic dependencies in time series experiments. To address both limitations simultaneously, we next propose a transformer reinforcement learning (RL) approach which leverages transformers to condition treatment allocation on the entire history and employs RL to directly optimize the MSE without relying on restrictive assumptions. Empirical evaluations on synthetic data, a publicly available dispatch simulator, and a real-world ridesharing dataset demonstrate that our proposal consistently outperforms existing designs.


Poster
P3-#107
An Orthogonal Learner for Individualized Outcomes in Markov Decision Processes

Emil Javurek ⋅ Valentyn Melnychuk ⋅ Jonas Schweisthal ⋅ Konstantin Hess ⋅ Dennis Frauen ⋅ Stefan Feuerriegel

Predicting individualized potential outcomes in sequential decision-making is central for optimizing therapeutic decisions in personalized medicine (e.g., which dosing sequence to give to a cancer patient). However, predicting potential outcomes over long horizons is notoriously difficult. Existing methods that break the curse of the horizon typically lack strong theoretical guarantees such as orthogonality and quasi-oracle efficiency. In this paper, we revisit the problem of predicting individualized potential outcomes in sequential decision-making (i.e., estimating Q-functions in Markov decision processes with observational data) through a causal inference lens. In particular, we develop a comprehensive theoretical foundation for meta-learners in this setting with a focus on beneficial theoretical properties. As a result, we obtain a novel meta-learner called DRQ-learner and establish that it is: (1) doubly robust (i.e., valid inference under model misspecification), (2) Neyman-orthogonal (i.e., insensitive to first-order estimation errors in the nuisance functions), and (3) quasi-oracle efficient (i.e., behaves asymptotically as if the ground-truth nuisance functions were known). Our DRQ-learner is applicable to settings with both discrete and continuous state spaces. Further, our DRQ-learner is flexible and can be used together with arbitrary machine learning models (e.g., neural networks). We validate our theoretical results through numerical experiments, thereby showing that our meta-learner outperforms state-of-the-art baselines.
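The double robustness property can be illustrated with the classic single-step augmented inverse-propensity-weighted (AIPW) estimator of a mean potential outcome. This is a simplified analogue chosen for illustration, not the DRQ-learner itself, which targets Q-functions in Markov decision processes. The function name, the noise-free toy data, and the candidate nuisance models below are all assumptions.

```python
def aipw_mean_outcome(data, outcome_model, propensity_model):
    """Estimate E[Y(1)] from (x, a, y) triples with binary treatment a.

    Each unit contributes the outcome-model prediction plus an
    inverse-propensity-weighted residual for treated units.
    """
    total = 0.0
    for x, a, y in data:
        mu = outcome_model(x)      # predicted outcome under treatment
        e = propensity_model(x)    # predicted probability of treatment
        total += mu + a * (y - mu) / e
    return total / len(data)


# Toy data: Y(1) = 2 * x without noise, half of each stratum is treated.
# For untreated units the observed y is Y(0), set to 0.0 here.
data = [(0.0, 1, 0.0), (0.0, 0, 0.0), (1.0, 1, 2.0), (1.0, 0, 0.0)]
true_value = 1.0  # mean of 2 * x over the four units


def right_outcome(x):
    return 2.0 * x


def wrong_outcome(x):
    return 0.0


def right_propensity(x):
    return 0.5


def wrong_propensity(x):
    return 0.9


est_outcome_ok = aipw_mean_outcome(data, right_outcome, wrong_propensity)
est_propensity_ok = aipw_mean_outcome(data, wrong_outcome, right_propensity)
est_both_wrong = aipw_mean_outcome(data, wrong_outcome, wrong_propensity)
print(est_outcome_ok, est_propensity_ok, est_both_wrong)
```

In this toy setup the estimate matches the true value when either nuisance model is correct, and it is off only when both are wrong.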


Poster
P3-#108
TCD-Arena: Assessing Robustness of Time Series Causal Discovery Methods Against Assumption Violations

Gideon Stein ⋅ Niklas Penzel ⋅ Tristan Piater ⋅ Joachim Denzler

Causal Discovery (CD) is a powerful framework for scientific inquiry. Yet, its practical adoption is hindered by a reliance on strong, often unverifiable assumptions and a lack of robust performance assessment. To address these limitations and advance empirical CD evaluation, we present TCD-Arena, a modularized, highly customizable, and extendable testing kit to assess the robustness of time series CD algorithms against stepwise more severe assumption violations. For demonstration, we conduct an extensive empirical study comprising around 30 million individual CD attempts and reveal nuanced robustness profiles for 33 distinct assumption violations. Further, we investigate CD ensembles and find that they have the potential to improve general robustness, which has implications for real-world applications. With this, we strive to ultimately facilitate the development of CD methods that are reliable for a diverse range of synthetic and potentially real-world data conditions.


Poster
P3-#109
Direct Doubly Robust Estimation of Conditional Quantile Contrasts

Josh Givens ⋅ Song Liu ⋅ Henry W J Reeve ⋅ Katarzyna Reluga

Within heterogeneous treatment effect (HTE) analysis, various estimands have been proposed to capture the effect of a treatment conditional on covariates. Recently, the conditional quantile comparator (CQC) has emerged as a promising estimand, offering quantile-level summaries akin to the conditional quantile treatment effect (CQTE) while preserving some interpretability of the conditional average treatment effect (CATE). It achieves this by summarising the treated response conditional on both the covariates and the untreated response. Despite these desirable properties, the CQC's current estimation is limited by the need to first estimate the difference in conditional cumulative distribution functions and then invert it. This inversion obscures the CQC estimate, hampering our ability to both model and interpret it. To address this, we propose the first direct estimator of the CQC, allowing for explicit modelling and parameterisation. This explicit parameterisation enables better interpretation of our estimate while also providing a means to constrain and inform the model. We show, both theoretically and empirically, that our estimation error depends directly on the complexity of the CQC itself, improving upon the existing estimation procedure. Furthermore, it retains the desirable double robustness property with respect to nuisance parameter estimation. We further show our method to outperform existing procedures in estimation accuracy across multiple data scenarios while varying sample size and nuisance error. Finally, we apply it to real-world data from an employment scheme, uncovering a reduced range of potential earnings improvement as participant age increases.


Poster
P3-#110
Overlap-weighted orthogonal meta-learner for treatment effect estimation over time

Konstantin Hess ⋅ Dennis Frauen ⋅ Mihaela van der Schaar ⋅ Stefan Feuerriegel

Estimating heterogeneous treatment effects (HTEs) in time-varying settings is particularly challenging, as the probability of observing certain treatment sequences decreases exponentially with longer prediction horizons. Thus, the observed data contain little support for many plausible treatment sequences, which creates severe overlap problems. Existing meta-learners for the time-varying setting typically assume adequate treatment overlap, and thus suffer from exploding estimation variance when the overlap is low. To address this problem, we introduce a novel overlap-weighted orthogonal (WO) meta-learner for estimating HTEs that targets regions in the observed data with high probability of receiving the interventional treatment sequences. This offers a fully data-driven approach through which our WO-learner can counteract instabilities as in existing meta-learners and thus obtain more reliable HTE estimates. Methodologically, we develop a novel Neyman-orthogonal population risk function that minimizes the overlap-weighted oracle risk. We show that our WO-learner has the favorable property of Neyman-orthogonality, meaning that it is robust against misspecification in the nuisance functions. Further, our WO-learner is fully model-agnostic and can be applied to any machine learning model. Through extensive experiments with both transformer and LSTM backbones, we demonstrate the benefits of our novel WO-learner.
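The overlap problem described above can be seen with a few lines of arithmetic. The sketch below assumes, purely for illustration, a constant per-step treatment probability of 0.5. It shows how the probability of a full treatment sequence shrinks with the horizon and how the matching inverse-probability weight grows. It does not implement the WO-learner's weighting, and the helper name `sequence_propensity` is hypothetical.

```python
def sequence_propensity(step_propensities):
    """Probability of observing a whole treatment sequence, computed as
    the product of the per-step treatment probabilities."""
    p = 1.0
    for e in step_propensities:
        p *= e
    return p


# Assumed per-step propensity of 0.5 at every time step.
for horizon in (1, 5, 10):
    p = sequence_propensity([0.5] * horizon)
    print(f"horizon={horizon:2d}  P(sequence)={p:.6f}  inverse weight={1.0 / p:.0f}")
```

With ten steps the inverse weight is already above one thousand, which is the kind of variance blow-up the abstract attributes to methods that assume adequate overlap.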


Poster
P3-#111
Token-Efficient Item Representation via Images for LLM Recommender Systems

Kibum Kim ⋅ Sein Kim ⋅ HongSeok Kang ⋅ Jiwan Kim ⋅ Heewoong Noh ⋅ Yeonjun In ⋅ Kanghoon Yoon ⋅ Jinoh Oh ⋅ Julian McAuley ⋅ Chanyoung Park

Large Language Models (LLMs) have recently emerged as a powerful backbone for recommender systems. Existing LLM-based recommender systems take two different approaches for representing items in natural language, i.e., Attribute-based Representation and Description-based Representation. In this work, we aim to address the trade-off between efficiency and effectiveness that these two approaches encounter, when representing items consumed by users. Based on our observation that there is a significant information overlap between images and descriptions associated with items, we propose a novel method, Item representation for LLM-based Recommender system (I-LLMRec). Our main idea is to leverage images as an alternative to lengthy textual descriptions for representing items, aiming at reducing token usage while preserving the rich semantic information of item descriptions. Through extensive experiments on real-world Amazon datasets, we demonstrate that I-LLMRec outperforms existing methods that leverage textual descriptions for representing items in both efficiency and effectiveness by leveraging images. Moreover, a further appeal of I-LLMRec is its ability to reduce sensitivity to noise in descriptions, leading to more robust recommendations.


Poster
P3-#112
Lossy Common Information in a Learnable Gray-Wyner Network

Anderson de Andrade ⋅ Alon Harell ⋅ Ivan Bajic

Many computer vision tasks share substantial overlapping information, yet conventional codecs tend to ignore this, leading to redundant and inefficient representations. The Gray-Wyner network, a classical concept from information theory, offers a principled framework for separating common and task-specific information. Inspired by this idea, we develop a learnable three-channel codec that disentangles shared information from task-specific details across multiple vision tasks. We characterize the limits of this approach through the notion of lossy common information, and propose an optimization objective that balances inherent tradeoffs in learning such representations. Through comparisons of three codec architectures on two-task scenarios spanning six vision benchmarks, we demonstrate that our approach substantially reduces redundancy and consistently outperforms independent coding. These results highlight the practical value of revisiting Gray-Wyner theory in modern machine learning contexts, bridging classic information theory with task-driven representation learning.


Poster
P3-#113
Learning is Forgetting; LLM Training As Lossy Compression

Henry Conklin ⋅ Tom Hosking ⋅ Yi-Chern Tan ⋅ Jonathan Cohen ⋅ Sarah-Jane Leslie ⋅ Thomas L. Griffiths ⋅ Max Bartolo ⋅ Seraphina Goldfarb-Tarrant

Despite the increasing prevalence of large language models (LLMs), we still have a limited understanding of how their representational spaces are structured. This limits our ability to interpret how and what they learn or relate them to learning in humans. We argue LLMs are best seen as an instance of lossy compression, where over training they learn by retaining only information in their training data relevant to their objective(s). We show pre-training results in models that are optimally compressed for next-sequence prediction, approaching the Information Bottleneck bound on compression. Across an array of open weights models, each compresses differently, likely due to differences in the data and training recipes used. However, even across different families of LLMs, the optimality of a model's compression, and the information present in it, can predict downstream performance across a wide array of benchmarks, letting us directly link representational structure to actionable insights about model performance. In the general case, the work presented here offers a unified Information-Theoretic framing for how these models learn that is deployable at scale.


Poster
P3-#114
DiffSDA: Unsupervised Diffusion Sequential Disentanglement Across Modalities

Hedi Zisling ⋅ Ilan Naiman ⋅ Nimrod Berman ⋅ Supasorn Suwajanakorn ⋅ Omri Azencot

Unsupervised representation learning, particularly sequential disentanglement, aims to separate static and dynamic factors of variation in data without relying on labels. This remains a challenging problem, as existing approaches based on variational autoencoders and generative adversarial networks often rely on multiple loss terms, complicating the optimization process. Furthermore, sequential disentanglement methods face challenges when applied to real-world data, and there is currently no established evaluation protocol for assessing their performance in such settings. Recently, diffusion models have emerged as state-of-the-art generative models, but no theoretical formalization exists for their application to sequential disentanglement. In this work, we introduce the Diffusion Sequential Disentanglement Autoencoder (DiffSDA), a novel, modal-agnostic framework effective across diverse real-world data modalities, including time series, video, and audio. DiffSDA leverages a new probabilistic modeling, latent diffusion, and efficient samplers, while incorporating a challenging evaluation protocol for rigorous testing. Our experiments on diverse real-world benchmarks demonstrate that DiffSDA outperforms recent state-of-the-art methods in sequential disentanglement.


Poster
P3-#115
Building Massively Multimodal Foundation Models with Interaction-aware Mixture-of-Experts

Xing Han ⋅ Hsing-Huan Chung ⋅ Joydeep Ghosh ⋅ Paul Liang ⋅ Suchi Saria

Modern applications increasingly involve many heterogeneous input streams, such as clinical sensors, wearable device data, imaging, and text, each with distinct measurement models, sampling rates, and noise characteristics. We define this as the massively multimodal setting, where each sensor constitutes a separate modality. As modality counts grow, capturing their complex, time-varying interactions, such as delayed physiological cascades between sensors, becomes essential yet challenging. Mixture-of-Experts (MoE) architectures are naturally suited for this setting since their sparse routing mechanism enables efficient scaling across many modalities. However, existing MoE architectures route tokens based on similarity alone, overlooking the rich temporal dependencies across modalities: this prevents the model from capturing delayed cross-modal effects, leading to suboptimal expert specialization and reduced accuracy. We propose a framework that explicitly quantifies temporal dependencies between modality pairs across multiple discrete time intervals, defined as delays between an event in one input stream and its manifested effect in another, and uses these to guide MoE routing. An interaction-aware router dispatches tokens to specialized experts based on interaction type. This principled routing enables experts to learn generalizable interaction-processing skills. Experiments across healthcare, activity recognition, and affective computing benchmarks demonstrate substantial performance gains and interpretable routing patterns aligned with domain knowledge.
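One simple way to picture a delay between an event in one stream and its effect in another is a lagged dot product over a small set of candidate delays. The sketch below uses that stand-in measure on two toy binary streams. The abstract does not specify how the framework quantifies dependencies, so the measure, the function names, and the data are all assumptions.

```python
def lagged_dot(source, target, lag):
    """Sum of source[t] * target[t + lag] over all valid t.

    Assumes both streams have the same length and lag >= 0.
    """
    return sum(source[t] * target[t + lag] for t in range(len(source) - lag))


def best_lag(source, target, max_lag):
    """Candidate delay in 0..max_lag with the largest lagged dot product."""
    return max(range(max_lag + 1), key=lambda lag: lagged_dot(source, target, lag))


# Toy streams: events in `target` follow events in `source` two steps later.
source = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
target = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

scores = [lagged_dot(source, target, lag) for lag in range(4)]
delay = best_lag(source, target, max_lag=3)
print(scores, delay)
```

A router could in principle use such per-pair delay scores as features, but how the proposed router consumes them is not described in the abstract.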


Poster
P3-#116
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training

Junlin Han ⋅ Shengbang Tong ⋅ David Fan ⋅ Yufan Ren ⋅ Koustuv Sinha ⋅ Philip Torr ⋅ Filippos Kokkinos

Large Language Models (LLMs), despite being trained on text alone, surprisingly develop rich visual priors. These priors allow latent visual capabilities to be unlocked for vision tasks with a relatively small amount of multimodal data, and to perform symbolic visual generation tasks without ever having seen an image. Through systematic analysis, we reveal that visual priors (the implicit, emergent knowledge about the visual world acquired during language pre-training) are composed of separable perception and reasoning priors with unique scaling trends and origins. We show that an LLM's latent visual reasoning ability is predominantly developed by pre-training on reasoning-centric data (e.g., code, math, academia) and scales progressively. This reasoning prior acquired from language pre-training is transferable and universally applicable to visual reasoning. In contrast, the perception prior emerges more diffusely from broad corpora, and perception ability is more sensitive to the vision encoder and visual instruction tuning data. In parallel, text describing the visual world proves crucial, though its performance impact saturates rapidly. Leveraging these insights, we propose a data-centric recipe for pre-training vision-aware LLMs and verify it in 1T token scale pre-training. Our findings are grounded in over 100 controlled experiments consuming 500,000 GPU-hours, spanning the full MLLM construction pipeline (from LLM pre-training to visual alignment and supervised multimodal fine-tuning) across five model scales, a wide range of data categories and mixtures, and multiple adaptation setups. Along with our main findings, we also propose and investigate several hypotheses, and introduce a Multi-Level Existence Bench (MLE-Bench) to facilitate future research. Together, this work provides a new way of deliberately cultivating visual priors from language pre-training, paving the way for the next generation of multimodal LLMs.
We recommend a visit to our project page (https://junlinhan.github.io/projects/lsbs/) for an interactive reading.


Poster
P3-#117
$\ell_1$ Latent Distance based Continuous-time Graph Representation

Zhao-Rong Lai ⋅ Zheng-Sen Zhou ⋅ Liangda Fang ⋅ Yongsen Zheng ⋅ Ziliang Chen

Continuous-time graph representation (CTGR) is a widely-used methodology in machine learning, physics, bioinformatics, and social networks. The sequential survival process in a latent space with the squared $\ell_2$ distance is an important ultra-low-dimensional embedding for CTGR. However, the squared $\ell_2$ distance violates the triangle inequality, which may distort the relative node positions in the latent space and thus degrade performance on social, contact, and collaboration networks. Reverting to the $\ell_2$ distance is infeasible because the corresponding integral computation is intractable. To solve these problems, we propose a theoretically-sound $\ell_1$ latent distance based continuous-time graph representation ($\ell_1$LD-CTGR). It facilitates a true latent metric space for the sequential survival process. Moreover, the integral of the hazard function is found to be a closed-form piece-wise exponential integral, which well fits the ultra-low-dimensional embedding. To handle the non-differentiable $\ell_1$ norm, we successfully find a descent direction of the hazard function to replace the gradient, enabling mainstream learning architectures to learn the parameters. Extensive experiments using both synthetic and real-world data show the competitive performance of $\ell_1$LD-CTGR.
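The claim that the squared $\ell_2$ distance violates the triangle inequality can be checked on three collinear points. The sketch below uses hypothetical one-dimensional embeddings at 0, 1, and 2 and compares the squared $\ell_2$ distance with the $\ell_1$ distance on the same points.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def l1(u, v):
    """Manhattan (l1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


# Three hypothetical node embeddings on a line.
a, b, c = (0.0,), (1.0,), (2.0,)

# Squared l2: the direct distance exceeds the detour through b.
direct_sq = squared_l2(a, c)                    # 4.0
detour_sq = squared_l2(a, b) + squared_l2(b, c)  # 1.0 + 1.0 = 2.0

# l1: the direct distance does not exceed the detour, as a metric requires.
direct_l1 = l1(a, c)                            # 2.0
detour_l1 = l1(a, b) + l1(b, c)                 # 1.0 + 1.0 = 2.0

print(direct_sq > detour_sq, direct_l1 <= detour_l1)
```

Under the squared distance, node `a` is "farther" from `c` than the two-hop path suggests, which is the kind of distortion of relative positions the abstract refers to.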


Poster
P3-#119
Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling

Tal Daniel ⋅ Carl Qi ⋅ Dan Haramati ⋅ Amir Zadeh ⋅ Chuan Li ⋅ Aviv Tamar ⋅ Deepak Pathak ⋅ David Held

We introduce Latent Particle World Model (LPWM), a self-supervised object-centric world model scaled to real-world multi-object datasets and applicable in decision-making. LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data, enabling it to learn rich scene decompositions without supervision. Our architecture is trained end-to-end purely from videos and supports flexible conditioning on actions, language, and image goals. LPWM models stochastic particle dynamics via a novel latent action module and achieves state-of-the-art results on diverse real-world and synthetic datasets. Beyond stochastic video modeling, LPWM is readily applicable to decision-making, including goal-conditioned imitation learning, as we demonstrate in the paper. Code, data, pre-trained models and video rollouts are available: https://taldatech.github.io/lpwm-web


Poster
P3-#120
Decoupling Primitive with Experts: Dynamic Feature Alignment for Compositional Zero-Shot Learning

Xiao Zhang ⋅ Haodong Jing ⋅ Yongqiang Ma ⋅ Nanning Zheng

Compositional Zero-Shot Learning (CZSL) investigates compositional generalization capacity to recognize unknown state-object pairs based on learned primitive concepts. Existing CZSL methods typically derive primitive features through a simple composition-prototype mapping, which is suboptimal for a set of individuals that can be divided into distinct semantic subsets. Moreover, the one-to-all cross-modal primitive matching neglects compositional divergence within identical states or objects, limiting fine-grained image-composition alignment. In this study, we propose EVA, a Mixture-of-Experts Framework for Semantic Variant Alignment. Specifically, we introduce domain-expert adaption, leveraging multiple experts to achieve token-aware learning and model high-quality primitive representations. To enable accurate compositional generalization, we further present semantic variant alignment to select semantically relevant representations for image-primitive matching. Our method significantly outperforms other state-of-the-art CZSL methods on three popular benchmarks in both closed- and open-world settings, demonstrating the efficacy of the proposed insight.


Poster
P3-#121
Why Prototypes Collapse: Diagnosing and Preventing Partial Collapse in Prototypical Self-Supervised Learning

Gabriel Y. Arteaga ⋅ Marius Aasan ⋅ Rwiddhi Chakraborty ⋅ Martine Hjelkrem-Tan ⋅ Thalles Santos Silva ⋅ Michael Kampffmeyer ⋅ Adín Ramírez Rivera

Prototypical self-supervised learning methods consistently suffer from partial prototype collapse, where multiple prototypes converge to nearly identical representations. This undermines their central purpose—providing diverse and informative targets to guide encoders toward rich representations—and has led practitioners to over-parameterize prototype sets or add ad-hoc regularizers, which mitigate symptoms rather than address the root cause. We empirically trace the collapse to the joint optimization of encoders and prototypes, which encourages a type of shortcut learning: early in training prototypes drift toward redundant representations that minimize loss without necessarily enhancing representation diversity. To break the joint optimization, we introduce a fully decoupled training strategy that learns prototypes and encoders under separate objectives. Concretely, we model prototypes as a Gaussian mixture updated with an online EM-style procedure, independent of the encoder's loss. This simple yet principled decoupling eliminates prototype collapse without explicit regularization and yields consistently diverse prototypes, which in several settings translate to improved downstream performance.
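As a rough picture of prototypes updated by an online EM-style procedure that is independent of any encoder loss, the sketch below treats prototypes as the means of a one-dimensional Gaussian mixture. It assumes fixed unit variance, equal mixing weights, and a constant step size. These simplifications, the function names, and the toy stream are assumptions and do not reproduce the authors' procedure.

```python
import math


def responsibilities(x, means, var=1.0):
    """E-step: posterior probability of each component for one sample,
    assuming equal mixing weights and a shared fixed variance."""
    logits = [-((x - m) ** 2) / (2.0 * var) for m in means]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def online_em_step(x, means, lr=0.1):
    """M-step (stochastic): move each mean toward the sample in
    proportion to its responsibility for that sample."""
    resp = responsibilities(x, means)
    return [m + lr * r * (x - m) for m, r in zip(means, resp)]


# Toy feature stream alternating between two clusters at -2 and +2.
means = [-0.5, 0.5]  # two prototypes that start close together
for step in range(200):
    x = -2.0 if step % 2 == 0 else 2.0
    means = online_em_step(x, means)

print(means)
```

In this toy run the two prototypes separate and settle near the two clusters instead of converging to the same point, because each update depends only on the features and the mixture, not on a jointly optimized loss.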


Poster
P3-#122
SiNGER: A Clearer Voice Distills Vision Transformers Further

Geunhyeok Yu ⋅ Sunjae Jeong ⋅ Yoonyoung Choi ⋅ Jaeseung Kim ⋅ Hyoseok Hwang

Vision Transformers are widely adopted as the backbone of vision foundation models, but they are known to produce high-norm artifacts that degrade representation quality. When knowledge distillation transfers these features to students, high-norm artifacts dominate the objective, so students overfit to artifacts and underweight informative signals, diminishing the gains from larger models. Prior work attempted to remove artifacts but encountered an inherent trade-off between artifact suppression and preserving informative signals from teachers. To address this, we introduce Singular Nullspace-Guided Energy Reallocation (SiNGER), a novel distillation framework that suppresses artifacts while preserving informative signals. The key idea is principled teacher feature refinement: during refinement, we leverage the nullspace-guided perturbation to preserve information while suppressing artifacts. Then, the refined teacher's features are distilled to a student. We implement this perturbation efficiently with a LoRA-based adapter that requires minimal structural modification. Extensive experiments show that SiNGER consistently improves student models, achieving state-of-the-art performance in multiple downstream tasks and producing clearer and more interpretable representations.


Poster
P3-#123
Entropy-Based Block Pruning for Efficient Large Language Models

Liangwei Yang ⋅ Yuhui Xu ⋅ Juntao Tan ⋅ Doyen Sahoo ⋅ Silvio Savarese ⋅ Caiming Xiong ⋅ Huan Wang ⋅ Shelby Heinecke

As large language models continue to scale, their growing computational and storage demands pose significant challenges for real-world deployment. In this work, we investigate redundancy within Transformer-based models and propose an entropy-based pruning strategy to enhance efficiency while maintaining performance. Empirical analysis reveals that the entropy of hidden representations decreases in the early blocks but progressively increases across most subsequent blocks. This trend suggests that entropy serves as a more effective measure of information richness within computation blocks. Unlike cosine similarity, which primarily captures geometric relationships, entropy directly quantifies uncertainty and information content, making it a more reliable criterion for pruning. Extensive experiments demonstrate that our entropy-based pruning approach surpasses cosine similarity-based methods in reducing model size while preserving accuracy, offering a promising direction for efficient model deployment.
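To illustrate what it means for entropy to quantify the information content of a block's hidden representation, the sketch below computes the Shannon entropy of a histogram over activation values. The abstract does not state which entropy estimator the authors use, so the histogram estimator, the bin range, and the toy activations are assumptions.

```python
import math
from collections import Counter


def histogram_entropy(values, num_bins=8, lo=-1.0, hi=1.0):
    """Shannon entropy in bits of a fixed-range histogram of the values.

    Values outside [lo, hi) are clipped into the first or last bin.
    """
    width = (hi - lo) / num_bins
    bins = Counter(
        min(num_bins - 1, max(0, int((v - lo) / width))) for v in values
    )
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in bins.values())


# Hypothetical activations from two blocks.
peaked = [0.1] * 8                                        # all in one bin
spread = [-0.9, -0.6, -0.4, -0.1, 0.1, 0.4, 0.6, 0.9]     # one per bin

entropy_peaked = histogram_entropy(peaked)
entropy_spread = histogram_entropy(spread)
print(entropy_peaked, entropy_spread)
```

Under an entropy criterion of this kind, a block whose output looks like `peaked` carries little information and would be a more natural pruning candidate than one whose output looks like `spread`.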


Poster
P3-#124
Domain Expansion: A Latent Space Construction Framework for Multi-Task Learning

Chi-Yao Huang ⋅ Khoa Vo ⋅ Aayush Verma ⋅ Duo Lu ⋅ Yezhou Yang

Training a single network with multiple objectives often leads to conflicting gradients that degrade shared representations, forcing them into a compromised state that is suboptimal for any single task—a problem we term latent representation collapse. We introduce Domain Expansion, a framework that prevents these conflicts by restructuring the latent space itself. Our framework uses a novel orthogonal pooling to construct a latent space where each objective is assigned to a mutually orthogonal subspace. We validate our approach on the ShapeNet benchmark, simultaneously training a model for object classification and pose estimation. Our experiments demonstrate that this structure not only prevents collapse but also yields an explicit, interpretable, and compositional latent space where concepts can be directly manipulated.
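The idea of giving each objective its own mutually orthogonal subspace can be illustrated with a minimal sketch. The abstract does not describe how the orthogonal pooling is built, so the code below assumes the simplest possible case: axis-aligned subspaces formed by disjoint coordinate blocks of the latent vector. The function names and the block sizes are hypothetical.

```python
def split_into_subspaces(latent, sizes):
    """Project `latent` onto disjoint coordinate blocks, one per objective.

    Assumption: axis-aligned blocks are only the simplest example of
    mutually orthogonal subspaces, not the paper's orthogonal pooling.
    """
    if sum(sizes) != len(latent):
        raise ValueError("block sizes must add up to the latent dimension")
    projections = []
    start = 0
    for size in sizes:
        proj = [0.0] * len(latent)
        proj[start:start + size] = latent[start:start + size]
        projections.append(proj)
        start += size
    return projections


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical 5-d latent: 3 dims for classification, 2 dims for pose.
latent = [0.5, -1.0, 2.0, 3.0, -0.5]
class_part, pose_part = split_into_subspaces(latent, [3, 2])
print(dot(class_part, pose_part))  # inner product of the two projections
```

Because the two projections share no non-zero coordinate, changing one leaves the other untouched, which is the property that makes per-objective components separately manipulable in this toy setting.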


Poster
P3-#224
Why We Need New Benchmarks for Local Intrinsic Dimension Estimation

Piotr Tempczyk ⋅ Dominik Filipiak ⋅ Łukasz Garncarek ⋅ Ksawery Smoczynski ⋅ Adam Kurpisz

Neural Local Intrinsic Dimension (LID) estimators are typically bound to domain-specific architectures whose inductive biases can yield inconsistent estimates for the same underlying manifold. Existing evaluations either use overly simple synthetic data (with known LID) or real datasets (with unknown LID), obscuring true performance. We introduce a principled benchmarking framework that (i) maps the same manifold into multiple domain representations while preserving its structure, enabling like-for-like cross-architecture tests; (ii) designs harder variants of popular datasets that target key manifold properties; and (iii) applies controlled transformations with known LID shifts to stress-test methods even when absolute LID is unknown. Across this suite, including non-trivial synthetic datasets, we show that accuracy on simple manifolds does not transfer across domains and that state-of-the-art methods fail under targeted stressors, revealing clear failure modes and areas for improvement. Data and code are available: https://github.com/DominikFilipiak/LID-Benchmarks.


Poster
P3-#223
Polynomial, trigonometric, and tropical activations

Ismail Khalfaoui Hassani ⋅ Stefan Kesselheim

Which functions can be used as activations in deep neural networks? This article explores families of functions based on orthonormal bases, including the Hermite polynomial basis and the Fourier trigonometric basis, as well as a basis resulting from the tropicalization of a polynomial basis. Our study shows that, through a simple variance-preserving initialization and without additional clamping mechanisms, these activations can successfully be used to train deep models, such as GPT-2 for next-token prediction on OpenWebText and ConvNeXt for image classification on ImageNet. Our work addresses the issue of exploding and vanishing activations and gradients, particularly prevalent with polynomial activations, and opens the door for improving the efficiency of large-scale learning tasks. Furthermore, our approach provides insight into the structure of neural networks, revealing that networks with polynomial activations can be interpreted as multivariate polynomial mappings. Finally, using Hermite interpolation, we show that our activations can closely approximate classical ones in pre-trained models by matching both the function and its derivative, making them especially useful for fine-tuning tasks. These activations are available in the torchortho library.


Poster
P3-#222
LORE: Jointly Learning The Intrinsic Dimensionality and Relative Similarity Structure from Ordinal Data

Vivek Anand ⋅ Alec Helbling ⋅ Mark Davenport ⋅ Gordon Berman ⋅ Sankaraleengam Alagapan ⋅ Christopher Rozell

Learning the intrinsic dimensionality of subjective perceptual spaces such as taste, smell, or aesthetics from ordinal data is a challenging problem. We introduce LORE (Low Rank Ordinal Embedding), a scalable framework that jointly learns both the intrinsic dimensionality and an ordinal embedding from noisy triplet comparisons of the form "Is A more similar to B than C?". Unlike existing methods that require the embedding dimension to be set a priori, LORE regularizes the solution using the nonconvex Schatten-$p$ quasi-norm, enabling automatic joint recovery of both the ordinal embedding and its dimensionality. We optimize this joint objective via an iteratively reweighted algorithm and establish convergence guarantees. Extensive experiments on synthetic datasets, simulated perceptual spaces, and real-world crowdsourced ordinal judgements show that LORE learns compact, interpretable, and highly accurate low-dimensional embeddings that recover the latent geometry of subjective percepts. By simultaneously inferring both the intrinsic dimensionality and ordinal embeddings, LORE enables more interpretable and data-efficient perceptual modeling in psychophysics and opens new directions for scalable discovery of low-dimensional structure from ordinal data in machine learning.
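For reference, the standard definition of the Schatten-$p$ quasi-norm is given below. The symbols $X$, $m$, $n$, and $\sigma_i$ are introduced here for illustration; the abstract does not state which matrix LORE applies the regularizer to.

```latex
\[
  \|X\|_{S_p}^{p} \;=\; \sum_{i=1}^{\min(m,n)} \sigma_i(X)^{p},
  \qquad X \in \mathbb{R}^{m \times n}, \quad 0 < p < 1,
\]
% \sigma_i(X) are the singular values of X.
% p = 1 recovers the (convex) nuclear norm.
% As p -> 0^+, the sum tends to the number of non-zero singular values, i.e. rank(X).
```

The limit $p \to 0^{+}$ shows why a small $p$ acts as a closer surrogate for rank than the nuclear norm does, at the cost of convexity.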


Poster
P3-#221
WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM

Changli Tang ⋅ Qinfan Xiao ⋅ Ke Mei ⋅ Tianyi Wang ⋅ Fengyun Rao ⋅ Chao Zhang

While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified & versatile audio-visual embeddings), the first LLM-based embedding that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate our joint training strategy, demonstrating improved performance across all modalities. With a newly introduced benchmark for versatile audio-visual learning, WAVE opens up broad possibilities for cross-modal, any-to-any applications. Our code and checkpoints are released at https://github.com/TCL606/WAVE.


Poster
P3-#1910
Compositional-ARC: Assessing Systematic Generalization in Abstract Spatial Reasoning

Philipp Mondorf ⋅ Shijia Zhou ⋅ Monica Riedler ⋅ Barbara Plank

Systematic generalization refers to the capacity to understand and generate novel combinations from known components. Despite recent progress by large language models (LLMs) across various domains, these models often fail to extend their knowledge to novel compositional scenarios, revealing notable limitations in systematic generalization. There has been an ongoing debate about whether neural networks possess the capacity for systematic generalization, with recent studies suggesting that meta-learning approaches designed for compositionality can significantly enhance this ability. However, these insights have largely been confined to linguistic problems, leaving their applicability to other tasks an open question. In this study, we extend meta-learning for compositionality to the domain of abstract spatial reasoning. To this end, we introduce Compositional-ARC, a dataset designed to evaluate the capacity of models to systematically generalize from known geometric transformations (e.g., translation, rotation) of abstract two-dimensional objects to novel combinations of these transformations (e.g., translation+rotation). Our results show that a small transformer-based encoder-decoder model, trained via meta-learning for compositionality, can systematically generalize to previously unseen transformation compositions. Notably, despite having only 5.7M parameters, this model significantly outperforms state-of-the-art LLMs (including o3-mini, GPT-4o, and Gemini 2.0 Flash, which fail to exhibit similar systematic behavior) and performs on par with the winning model of the ARC prize 2024, an 8B-parameter LLM trained via test-time training. Our findings highlight the effectiveness of meta-learning in promoting systematicity beyond linguistic tasks, suggesting a promising direction toward more robust and generalizable models.

We propose a novel unsupervised framework for Invariant Risk Minimization (IRM), extending the concept of invariance to settings where labels are unavailable. Traditional IRM methods rely on labeled data to learn representations that are robust to distributional shifts across environments. In contrast, our approach redefines invariance through feature distribution alignment, enabling robust representation learning from unlabeled data. We introduce two methods within this framework: Principal Invariant Component Analysis (PICA), a linear method that extracts invariant directions under Gaussian assumptions, and Variational Invariant Autoencoder (VIAE), a deep generative model that separates environment-invariant and environment-dependent latent factors. Our approach is based on a novel "unsupervised" structural causal model and supports environment-conditioned sample generation and intervention. Empirical evaluations on synthetic data, modified versions of MNIST, and CelebA demonstrate the effectiveness of our methods in capturing invariant structure, preserving relevant information, and generalizing across environments without access to labels.


Poster
P3-#219
Locality-Attending Vision Transformer

Sina Hajimiri ⋅ Farzad Beizaee ⋅ Fereshteh Shakeri ⋅ Christian Desrosiers ⋅ Ismail Ayed ⋅ Jose Dolz

Vision transformers have demonstrated remarkable success in classification by leveraging global self-attention to capture long-range dependencies. However, this same mechanism can obscure fine-grained spatial details crucial for tasks such as segmentation. In this work, we seek to enhance segmentation performance of vision transformers after standard image-level classification training. More specifically, we present a simple yet effective add-on that improves performance on segmentation tasks while retaining vision transformers' image-level recognition capabilities. In our approach, we modulate the self-attention with a learnable Gaussian kernel that biases the attention toward neighboring patches. We further refine the patch representations to learn better embeddings at patch positions. These modifications encourage tokens to focus on local surroundings and ensure meaningful representations at spatial positions, while still preserving the model's ability to incorporate global information. Experiments demonstrate the effectiveness of our modifications, evidenced by substantial segmentation gains on three benchmarks (e.g., over 6% and 4% on ADE20K for ViT Tiny and Base), without changing the training regime or sacrificing classification performance. The code is available at https://github.com/sinahmr/LocAtViT/.
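A minimal sketch of how a Gaussian kernel can bias attention toward neighboring patches follows. It assumes one common formulation: adding the log of a Gaussian in patch-grid distance to the attention logits before the softmax, with a fixed `sigma` standing in for the learnable kernel width. The abstract does not give the exact parameterization, so the function names and this additive form are assumptions, not the authors' implementation.

```python
import math


def gaussian_locality_bias(height, width, sigma):
    """Additive logit bias for an N x N attention map, N = height * width.

    bias[i][j] = -||pos_i - pos_j||^2 / (2 * sigma^2), i.e. the log of an
    unnormalized Gaussian in patch-grid distance (assumed formulation).
    """
    coords = [(r, c) for r in range(height) for c in range(width)]
    return [
        [-((ri - rj) ** 2 + (ci - cj) ** 2) / (2.0 * sigma ** 2)
         for (rj, cj) in coords]
        for (ri, ci) in coords
    ]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def local_attention_weights(logits, bias):
    """Apply the locality bias to raw attention logits, then normalize rows."""
    return [
        softmax([l + b for l, b in zip(logit_row, bias_row)])
        for logit_row, bias_row in zip(logits, bias)
    ]


# 3 x 3 patch grid with uniform content logits: only locality matters here.
bias = gaussian_locality_bias(3, 3, sigma=1.0)
uniform_logits = [[0.0] * 9 for _ in range(9)]
weights = local_attention_weights(uniform_logits, bias)
```

With uniform logits, the center patch attends most to itself, less to its edge neighbors, and least to the corners. A larger `sigma` flattens the bias and recovers ordinary global attention in the limit.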

Multi-view clustering aims to partition view-specific data into their corresponding clusters, and a large number of methods have been proposed in recent years. Among them, graph-based methods are representative: they learn a view-consistent and discriminative graph and then apply graph partitioning to obtain the final clustering results. Despite their significant success, these methods usually construct full graphs, so their efficiency is not guaranteed on large-scale multi-view datasets. To handle large-scale data, anchor-based multi-view clustering methods have been developed, which learn a smaller anchor graph. However, existing works neglect the interpretability of anchor-based multi-view clustering from a probabilistic perspective. They also do not analyze, in a unified manner, the relationship between the input data and the final clustering results in terms of meaningful probability associations. In this work, we propose a novel method termed Unified and Efficient Multi-view Clustering from Probabilistic perspective (UEMCP). It aims to improve the explainability of anchor-based multi-view clustering from a probabilistic perspective in an end-to-end manner. It ensures consistent inherent structures across views by learning the common transition probability from data points to categories in one step. Guided by the common transition probability matrix from data points to categories, the soft labels of data points are obtained from the common transition probability matrix from anchor points to categories within the learning framework. Experiments on several challenging multi-view datasets confirm the superiority of UEMCP over representative methods.


Poster
P3-#217
Contrastive Predictive Coding Done Right for Mutual Information Estimation

Jongha Ryu ⋅ Pavan Yeddanapudi ⋅ Xiangxiang Xu ⋅ Gregory Wornell

The InfoNCE objective, originally introduced for contrastive representation learning, has become a popular choice for mutual information (MI) estimation, despite its indirect connection to MI. In this paper, we demonstrate why InfoNCE should not be regarded as a valid MI estimator, and we introduce a simple modification, which we refer to as InfoNCE-anchor, for accurate MI estimation. Our modification introduces an auxiliary anchor class, enabling consistent density ratio estimation and yielding a plug-in MI estimator with significantly reduced bias. Beyond this, we generalize our framework using proper scoring rules, which recover InfoNCE-anchor as a special case when the log score is employed. This formulation unifies a broad spectrum of contrastive objectives, including NCE, InfoNCE, and $f$-divergence variants, under a single principled framework. Empirically, we find that InfoNCE-anchor with the log score achieves the most accurate MI estimates; however, in self-supervised representation learning experiments, we find that the anchor does not improve the downstream task performance. These findings corroborate that contrastive representation learning benefits not from accurate MI estimation per se, but from the learning of structured density ratios.
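One well-known limitation of the standard InfoNCE estimate is that it can never exceed log K for a batch of K pairs, however large the true MI is. The sketch below computes the standard estimate from a K x K matrix of critic scores and shows that cap. It illustrates plain InfoNCE only and does not implement the InfoNCE-anchor modification; the abstract does not say which argument the authors rely on. The function names and the toy score matrices are assumptions.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def infonce_estimate(scores):
    """Standard InfoNCE value from a K x K critic score matrix.

    scores[i][j] is the critic score of x_i paired with y_j; the diagonal
    holds the positive pairs. Each term scores[i][i] - logsumexp(scores[i])
    is at most 0, so the result is at most log(K).
    """
    k = len(scores)
    avg = sum(scores[i][i] - logsumexp(scores[i]) for i in range(k)) / k
    return avg + math.log(k)


def diagonal_scores(k, strength):
    """Toy critic: `strength` on positive pairs, 0 on negatives (assumed)."""
    return [[strength if i == j else 0.0 for j in range(k)] for i in range(k)]


k = 4
uninformative = infonce_estimate(diagonal_scores(k, 0.0))   # critic sees nothing
saturated = infonce_estimate(diagonal_scores(k, 50.0))      # near-perfect critic
print(uninformative, saturated, math.log(k))
```

Even a near-perfect critic saturates at log K here, so the estimate says little once the true MI exceeds that value.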

Domain-specific embedding models have shown promise for applications that require specialized semantic understanding, such as coding agents and financial retrieval systems, often achieving higher performance gains than general models. However, state-of-the-art embedding models are typically based on LLMs, which contain billions of parameters, making deployment challenging in resource-constrained environments. Model compression through pruning offers a promising solution, but existing pruning methods treat all parameters uniformly, failing to distinguish between general semantic representations and domain-specific patterns, leading to suboptimal pruning decisions. Thus, we propose GAPrune, a pruning framework that addresses this challenge by considering both domain importance and preserving general linguistic foundation. Our method uses Fisher Information to measure importance and general-domain gradient alignment to assess parameter behavior, then combines these signals using our Domain Alignment Importance (DAI) scoring. Lower DAI scores indicate that the parameter is either less important for the domain task or creates conflicts between domain and general objectives. Experiments on two domain benchmarks, FinMTEB and ChemTEB, show that GAPrune maintains performance within 2.5% of dense models in one-shot pruning at 50% sparsity, while outperforming all baselines. With retraining in 100 steps, GAPrune achieves +4.51% improvement on FinMTEB and +1.73% on ChemTEB, demonstrating that our pruning strategy not only preserves but enhances domain-specific capabilities. Our findings demonstrate that principled pruning strategies can achieve model compression and enhanced domain specialization, providing the research community with a new approach for development.


Poster
P3-#215
StyliTruth: Unlocking Stylized yet Truthful LLM Generation via Disentangled Steering

Chenglei Shen ⋅ Zhongxiang Sun ⋅ Teng Shi ⋅ Xiao Zhang ⋅ Jun Xu

Generating stylized large language model (LLM) responses via representation editing is a promising approach to fine-grained output control. However, there exists an inherent trade-off: imposing a distinctive style often degrades truthfulness. Existing representation editing methods, by naively injecting style signals, overlook this collateral impact and frequently contaminate the model’s core truthfulness representations, resulting in reduced answer correctness. We term this phenomenon stylization-induced truthfulness collapse. We attribute this issue to latent coupling between style and truth directions in certain key attention heads, and propose StyliTruth, a mechanism that preserves stylization while keeping truthfulness intact. StyliTruth separates the style-relevant and truth-relevant subspaces in the model’s representation space via an orthogonal deflation process. This decomposition enables independent control of style and truth in their own subspaces, minimizing interference. By designing adaptive, token-level steering vectors within each subspace, we dynamically and precisely control the generation process to maintain both stylistic fidelity and truthfulness. We validate our method on multiple styles and languages. Extensive experiments and analyses show that StyliTruth significantly reduces stylization-induced truthfulness collapse and outperforms existing inference-time intervention methods in balancing style adherence with truthfulness.


Poster
P3-#214
Learning Explicit Single-Cell Dynamics Using ODE Representations

Jan-Philipp von Bassewitz ⋅ Adeel Pervez ⋅ Marco Fumero ⋅ Matthew Robinson ⋅ Theofanis Karaletsos ⋅ Francesco Locatello

Modeling the dynamics of cellular differentiation is fundamental to advancing the understanding and treatment of diseases associated with this process, such as cancer. With the rapid growth of single-cell datasets, this has also become a particularly promising and active domain for machine learning. Current state-of-the-art models, however, rely on computationally expensive optimal transport preprocessing and multi-stage training, while also not discovering explicit gene interactions. To address these challenges we propose Cell-Mechanistic Neural Networks (Cell-MNN), an encoder-decoder architecture whose latent representation is a locally linearized ODE governing the dynamics of cellular evolution from stem to tissue cells. Cell-MNN is fully end-to-end (besides a standard PCA pre-processing) and its ODE representation learns interpretable gene interactions. Empirically, we show that Cell-MNN achieves competitive performance on single-cell benchmarks, surpasses state-of-the-art baselines in scaling to larger datasets and joint training across multiple datasets, while also learning interpretable gene interactions that we validate against the TRRUST database of gene interactions.


Poster
P3-#213
A Single Architecture for Representing Invariance Under Any Space Group

Cindy Zhang ⋅ Elif Ertekin ⋅ Peter Orbanz ⋅ Ryan P Adams

Incorporating known symmetries in data into machine learning models has consistently improved predictive accuracy, robustness, and generalization. However, achieving exact invariance to specific symmetries typically requires designing bespoke architectures for each group of symmetries, limiting scalability and preventing knowledge transfer across related symmetries. In the case of the space groups—symmetries critical to modeling crystalline solids in materials science and condensed matter physics—this challenge is particularly salient as there are 230 such groups in three dimensions. In this work we present a new approach to crystallographic symmetries by developing a single machine learning architecture that is capable of adapting its weights automatically to enforce invariance to any input space group. Our approach is based on constructing symmetry-adapted Fourier bases through an explicit characterization of constraints that group operations impose on Fourier coefficients. Encoding these constraints into a neural network layer enables weight sharing across different space groups, allowing the model to leverage structural similarities between groups and overcome data sparsity when limited measurements are available for specific groups. We demonstrate the effectiveness of this approach in achieving competitive performance on material property prediction tasks and performing zero-shot learning to generalize to unseen groups.


Poster
P3-#212
Multi-View Encoders for Performance Prediction in LLM-Based Agentic Workflows

Patara Trirat ⋅ Wonyong Jeong ⋅ Sung Ju Hwang

Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but optimizing LLM-based agentic systems remains challenging due to the vast search space of agent configurations, prompting strategies, and communication patterns. Existing approaches often rely on heuristic-based tuning or exhaustive evaluation, which can be computationally expensive and suboptimal. This paper proposes Agentic Predictor, a lightweight predictor for efficient agentic workflow evaluation. Agentic Predictor is equipped with a multi-view workflow encoding technique that leverages multi-view representation learning of agentic systems by incorporating code architecture, textual prompts, and interaction graph features. To achieve high predictive accuracy while significantly reducing the number of required workflow evaluations for training a predictor, Agentic Predictor employs cross-domain unsupervised pretraining. By learning to approximate task success rates, Agentic Predictor enables fast and accurate selection of optimal agentic workflow configurations for a given task, significantly reducing the need for expensive trial-and-error evaluations. Experiments on a carefully curated benchmark spanning three domains show that our predictor outperforms several strong graph-based baselines in both predictive accuracy and workflow utility, highlighting the potential of performance predictors in streamlining the design of LLM-based agentic workflows.


Poster
P3-#211
DeepFRC: An End-to-End Deep Learning Model for Functional Registration and Classification

Siyuan Jiang ⋅ Yihan Hu ⋅ Wenjie Li ⋅ Pengcheng Zeng

Functional data, representing curves or trajectories, are ubiquitous in fields like biomedicine and motion analysis. A fundamental challenge is phase variability—temporal misalignments that obscure underlying patterns and degrade model performance. Current methods often address registration (alignment) and classification as separate, sequential tasks. This paper introduces DeepFRC, an end-to-end deep learning framework that jointly learns diffeomorphic warping functions and a classifier within a unified architecture. DeepFRC combines a neural deformation operator for elastic alignment, a spectral representation using Fourier basis for smooth functional embedding, and a class-aware contrastive loss that promotes both intra-class coherence and inter-class separation. We provide the first theoretical guarantees for such a joint model, proving its ability to approximate optimal warpings and establishing a data-dependent generalization bound that formally links registration fidelity to classification performance. Extensive experiments on synthetic and real-world datasets demonstrate that DeepFRC consistently outperforms state-of-the-art methods in both alignment quality and classification accuracy, while ablation studies validate the synergy of its components. DeepFRC also shows notable robustness to noise, missing data, and varying dataset scales. Code is available at https://github.com/Drivergo-93589/DeepFRC.


Poster
P3-#210
Learning a distance measure from the information-estimation geometry of data

Guy Ohayon ⋅ Pierre-Etienne Fiquet ⋅ Florentin Guth ⋅ Jona Ballé ⋅ Eero Simoncelli

We introduce the Information-Estimation Metric (IEM), a novel form of distance function derived from an underlying continuous probability density over a domain of signals. The IEM is rooted in a fundamental relationship between information theory and estimation theory, which links the log-probability of a signal with the errors of an optimal denoiser, applied to noisy observations of the signal. In particular, the IEM between a pair of signals is obtained by comparing their denoising error vectors over a range of noise amplitudes. Geometrically, this amounts to comparing the score vector fields of the blurred density around the signals over a range of blur levels. We prove that the IEM is a valid global distance metric and derive a closed-form expression for its local second-order approximation, which yields a Riemannian metric. For Gaussian-distributed signals, the IEM coincides with the Mahalanobis distance. But for more complex distributions, it adapts, both locally and globally, to the geometry of the distribution. In practice, the IEM can be computed using a learned denoiser (analogous to generative diffusion models) and solving a one-dimensional integral. To demonstrate the value of our framework, we learn an IEM on the ImageNet database. Experiments show that this IEM is competitive with or outperforms state-of-the-art supervised image quality metrics in predicting human perceptual judgments.


Poster
P3-#208
The Lie of the Average: How Class Incremental Learning Evaluation Deceives You?

Guannan Lai ⋅ Da-Wei Zhou ⋅ Xin Yang ⋅ Han-Jia Ye

Class Incremental Learning (CIL) requires models to continuously learn new classes without forgetting previously learned ones, while maintaining stable performance across all possible class sequences. In real-world settings, the order in which classes arrive is diverse and unpredictable, and model performance can vary substantially across different sequences. Yet mainstream evaluation protocols calculate mean and variance from only a small set of randomly sampled sequences. Our theoretical analysis and empirical results demonstrate that this sampling strategy fails to capture the full performance range, resulting in biased mean estimates and a severe underestimation of the true variance in the performance distribution. We therefore contend that a robust CIL evaluation protocol should accurately characterize and estimate the entire performance distribution. To this end, we introduce the concept of extreme sequences and provide theoretical justification for their crucial role in the reliable evaluation of CIL. Moreover, we observe a consistent positive correlation between inter-task similarity and model performance, a relation that can be leveraged to guide the search for extreme sequences. Building on these insights, we propose EDGE (Extreme case–based Distribution & Generalization Evaluation), an evaluation protocol that adaptively identifies and samples extreme class sequences using inter-task similarity, offering a closer approximation of the ground-truth performance distribution. Extensive experiments demonstrate that EDGE effectively captures performance extremes and yields more accurate estimates of distributional boundaries, providing actionable insights for model selection and robustness checking. Our code is available at https://github.com/AIGNLAI/EDGE.


Poster
P3-#207
Quantized Gradient Projection for Memory-Efficient Continual Learning

Dongjun Kim ⋅ Seohyeon Cha ⋅ Huancheng Chen ⋅ Chaining Wang ⋅ Haris Vikalo

Real-world deployment of machine learning models requires the ability to continually learn from non-stationary data while preserving prior knowledge and user privacy. Therefore, storing knowledge acquired from past data in a resource- and privacy-friendly manner is a crucial consideration in determining their viability. We introduce Quantized Gradient Projection Memory (QGPM), a systematic framework for continual learning that compresses and preserves the previous gradient subspace. QGPM integrates three key components: (i) distribution-aware, basis-wise quantization to minimize storage overhead, (ii) a Quantization Error-Aware (QEA) gradient projection that selectively relaxes orthogonality to mitigate gradient drift caused by accumulated quantization noise, and (iii) an on-the-fly sparse sketching strategy that improves runtime memory and computational efficiency. Experiments across multiple benchmarks demonstrate that QGPM achieves state-of-the-art performance under fixed memory budgets, highlighting its effectiveness in scalable, privacy-preserving continual learning.
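The sketch below illustrates the basic mechanics behind components (i) and (ii): projecting a new gradient onto the orthogonal complement of a stored basis, and how quantizing that basis lets part of the gradient leak back into the protected subspace. It assumes an orthonormal basis and a plain uniform quantizer. The paper's distribution-aware quantization, QEA relaxation, and sparse sketching are not reproduced, and all names and values here are hypothetical.

```python
def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(grad, basis):
    """Subtract from `grad` its coefficient times each stored basis vector.

    With an exactly orthonormal basis this is the orthogonal projection
    onto the complement of the protected subspace.
    """
    out = list(grad)
    for b in basis:
        coeff = dot(grad, b)
        out = [o - coeff * bi for o, bi in zip(out, b)]
    return out


def quantize(vec, step):
    """Uniform quantizer (assumed stand-in for the paper's scheme)."""
    return [round(x / step) * step for x in vec]


true_basis = [[0.6, 0.8, 0.0]]   # one protected direction, unit norm
grad = [1.0, 2.0, 3.0]

exact = project_out(grad, true_basis)
coarse_basis = [quantize(b, step=0.5) for b in true_basis]
leaky = project_out(grad, coarse_basis)

print(dot(exact, true_basis[0]))  # essentially zero: no interference
print(dot(leaky, true_basis[0]))  # non-zero: quantization error leaks through
```

The second value shows the drift that a quantization-error-aware projection is meant to control: a coarsely stored basis no longer removes the protected component exactly.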


Poster
P3-#206
Knowledge Fusion of Large Language Models via Modular SkillPacks

Guodong Du ⋅ Zhuo Li ⋅ Xuanning Zhou ⋅ Junlin Li ⋅ Zesheng Shi ⋅ Wanyu Lin ⋅ Ho-Kin Tang ⋅ Xiucheng Li ⋅ Fangming Liu ⋅ Wenya Wang ⋅ Min Zhang ⋅ Jing Li

Cross-capability transfer represents a key challenge in large language model (LLM) research, particularly in multi-task integration, model compression, and knowledge fusion. Recent works such as FuseLLM and FuseChat have shown the potential of transferring multiple model capabilities to lightweight models, thereby enhancing adaptability and efficiency. This motivates our investigation into more efficient methods for cross-capability transfer. However, existing merging approaches primarily focus on small, homogeneous models, limiting their applicability. For large, heterogeneous models, knowledge distillation with full-parameter fine-tuning often overlooks the student model’s inherent capability and risks catastrophic forgetting, while PEFT methods struggle to effectively absorb knowledge from source LLMs. To address these issues, we introduce GraftLLM, a novel grafting-based method that stores source model capabilities in a target model + SkillPack format. This approach preserves general capabilities, reduces parameter conflicts, and supports forget-free continual learning and model fusion. We employ a module-aware adaptive compression strategy for parameter updates, ensuring efficient storage while preserving task-specific knowledge. The resulting SkillPack serves as a compact and transferable knowledge carrier, ideal for heterogeneous LLM fusion. Experiments across various scenarios demonstrate that GraftLLM outperforms existing techniques in knowledge transfer, knowledge fusion, and forget-free learning, providing a scalable and efficient solution for cross-capability transfer.


Poster
P3-#205
NEO — No-Optimization Test-Time Adaptation through Latent Re-Centering

Alexander Murphy ⋅ Michal Danilowski ⋅ Soumyajit Chatterjee ⋅ Abhirup Ghosh

Test-Time Adaptation (TTA) methods are often computationally expensive, require a large amount of data for effective adaptation, or are brittle to hyperparameters. Based on a theoretical foundation of the geometry of the latent space, we are able to significantly improve the alignment between source and distribution-shifted samples by re-centering target data embeddings at the origin. This insight motivates NEO, a hyperparameter-free, fully TTA method that adds no significant compute compared to vanilla inference. NEO is able to improve the classification accuracy of ViT-Base on ImageNet-C from 55.6% to 59.2% after adapting on just one batch of 64 samples. When adapting on 512 samples, NEO beats all 7 TTA methods we compare against on ImageNet-C, ImageNet-R, and ImageNet-S and beats 6/7 on CIFAR-10-C, while using the least amount of compute. NEO performs well on model calibration metrics and additionally is able to adapt from 1 class to improve accuracy on 999 other classes in ImageNet-C. On Raspberry Pi and Jetson Orin Nano devices, NEO reduces inference time by 63% and memory usage by 9% compared to baselines. Our results based on 3 ViT architectures and 4 datasets show that NEO can be used efficiently and effectively for TTA.
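Re-centering embeddings at the origin can be sketched in a few lines. The code assumes the center is estimated as the mean embedding of the current target batch and that the shifted embeddings are then passed unchanged to the existing classifier head. The abstract does not spell out how NEO estimates or updates the center, so this is an illustration of the general idea rather than the method itself; the function name and toy values are hypothetical.

```python
def recenter(embeddings):
    """Subtract the batch-mean embedding so the batch is centered at the origin.

    Assumption: the center is the plain mean of the given target batch.
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(e[j] for e in embeddings) / n for j in range(dim)]
    return [[e[j] - mean[j] for j in range(dim)] for e in embeddings]


# Toy batch of distribution-shifted 2-d embeddings, offset from the origin.
target_batch = [[4.0, 1.0], [6.0, 3.0], [5.0, 2.0]]
centered = recenter(target_batch)
```

The operation needs no gradients and no tunable parameters, which is consistent with the abstract's description of the method as hyperparameter-free and cheap compared to vanilla inference.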


Poster
P3-#204
Minimax-Optimal Aggregation for Density Ratio Estimation

Lukas Gruber ⋅ Markus Holzleitner ⋅ Sepp Hochreiter ⋅ Werner Zellinger

Density ratio estimation (DRE) is fundamental in machine learning and statistics, with applications in domain adaptation and two-sample testing. However, DRE methods are highly sensitive to hyperparameter selection, with suboptimal choices often resulting in poor convergence rates and empirical performance. To address this issue, we propose a novel model aggregation algorithm for DRE that trains multiple models with different hyperparameter settings and aggregates them. Our aggregation provably achieves minimax-optimal error convergence without requiring prior knowledge of the smoothness of the unknown density ratio. Our method surpasses cross-validation-based model selection and model averaging baselines for DRE on standard benchmarks for DRE and large-scale domain adaptation tasks, setting a new state of the art on image and text data.


Poster
P3-#203
Navigating the Accuracy-Size Trade-Off with Flexible Model Merging

Akash Dhasade ⋅ Divyansh Jhunjhunwala ⋅ Milos Vujasinovic ⋅ Gauri Joshi ⋅ Anne-Marie Kermarrec

Model merging has emerged as an efficient method to combine multiple single-task fine-tuned models. The merged model can enjoy multi-task capabilities without expensive training. While promising, merging into a single model often suffers from an accuracy gap with respect to individual fine-tuned models. On the other hand, deploying all individual fine-tuned models incurs high storage costs. We propose FlexMerge, a novel data-free model merging framework that: (a) flexibly generates merged models of varying sizes, spanning the full spectrum from a single merged model to retaining all individual fine-tuned models; and (b) supports multiple merging algorithms in a unified framework. Using FlexMerge, we systematically characterize the accuracy–size trade-off of different algorithms. Our study reveals two key findings: first, even modestly larger merged models can yield steep accuracy gains (up to 13.5% when just doubling the size); second, algorithm rankings are not consistent as size increases, with some methods overtaking others beyond the one-model regime. These results uncover a new design dimension for model merging: developing and comparing algorithms across the full spectrum of sizes rather than only at the single-model limit. Extensive experiments on vision and NLP benchmarks, with up to 30 tasks, confirm the generality and practicality of FlexMerge.


Poster
P3-#202
How NOT to benchmark your SITE metric: Beyond Static Leaderboards and Towards Realistic Evaluation.

Prabhant Singh ⋅ Sibylle Hess ⋅ Joaquin Vanschoren

Transferability estimation metrics are used to find a high-performing pre-trained model for a given target task without fine-tuning models and without access to the source dataset. Despite the growing interest in developing such metrics, the benchmarks used to measure their progress have gone largely unexamined. In this work, we empirically show the shortcomings of widely used benchmark setups to evaluate transferability estimation metrics. We argue that the benchmarks on which these metrics are evaluated are fundamentally flawed. We empirically demonstrate that their unrealistic model spaces and static performance hierarchies artificially inflate the perceived performance of existing metrics, to the point where simple, dataset-agnostic heuristics can outperform sophisticated methods. Our analysis reveals a critical disconnect between current evaluation protocols and the complexities of real-world model selection. To address this, we provide concrete recommendations for constructing more robust and realistic benchmarks to guide future research in a more meaningful direction.

Vision–Language Models (VLMs) show promise as zero-shot goal-conditioned value functions, but their frozen pre-trained representations limit generalization and temporal reasoning. We introduce VITA, a zero-shot value function learning method that enhances both capabilities via test-time adaptation. At inference, a lightweight adaptation module is updated via a gradient step on a meta-learned self-supervised loss, such that each test-time update improves value estimation. By updating sequentially over a trajectory, VITA encodes history into its parameters, addressing the temporal reasoning limitations. To mitigate shortcut learning, we propose a dissimilarity-based sampling strategy that selects semantically diverse segments of the trajectory during training. In real-world robotic manipulation tasks, VITA generalizes from a single training environment to diverse out-of-distribution tasks, environments, and embodiments, outperforming the state-of-the-art zero-shot method using autoregressive VLMs. Furthermore, we demonstrate that VITA’s zero-shot value estimates can be utilized for reward shaping in offline reinforcement learning, resulting in multi-task policies on the Meta-World benchmark that exceed the performance of those trained with the simulation’s fuzzy-logic dense rewards. Project website: https://chziakas.github.io/vita/.


Poster
P3-#301
Following the Navigation: Enhancing Small Language Models Contextual Reasoning with LLM Guidance

Xiaoqi Ni ⋅ Jie Wang ⋅ Lin Yang ⋅ Yiyang Lu ⋅ Hanzhu Chen ⋅ Rui Liu ⋅ Jianye Hao

Large language models (LLMs), such as OpenAI o1 and DeepSeek-R1, excel in contextual reasoning by leveraging extensive world knowledge and deep contextual understanding. However, their high computational costs limit deployment in resource-constrained settings. Conversely, small language models (SLMs) are more computationally efficient but often struggle with contextual reasoning due to limited parameter capacity and challenges like catastrophic forgetting. Existing enhancement methods for SLMs, such as knowledge distillation and data synthesis, still depend on additional training and face inherent limitations. To address this, we propose Navigation, a novel training-free framework that improves SLMs’ contextual reasoning by distilling LLM-derived contextual processing expertise into generalizable navigation templates. These templates, stored in a scalable Navigation database, guide SLMs through a three-stage process (Generation, Utilization, and Update) to locate and process critical information within complex contexts. Experiments demonstrate that our approach yields an average 10.7% accuracy gain with a template count equivalent to no more than 2.1% of the dataset size, enabling models such as Qwen2.5-3B-Instruct and Llama-3.2-3B-Instruct to outperform GPT-3.5-Turbo on diverse contextual reasoning tasks.


Poster
P3-#302
Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching

Zhengyan Wan ⋅ Yidong Ouyang ⋅ Liyan Xie ⋅ Fang Fang ⋅ Hongyuan Zha ⋅ Guang Cheng

Guidance provides a simple and effective framework for posterior sampling by steering the generation process towards the desired distribution. When modeling discrete data, existing approaches mostly focus on guidance with a first-order approximation to improve the sampling efficiency. However, such an approximation is inappropriate in discrete state spaces since the approximation error could be large. A novel guidance framework for discrete data is proposed to address this problem: we derive the exact transition rate for the desired distribution given a learned discrete flow matching model, leading to guidance that only requires a single forward pass in each sampling step, significantly improving efficiency. This unified framework is general: it encompasses existing guidance methods as special cases and can also be seamlessly applied to the masked diffusion model. We demonstrate the effectiveness of our proposed guidance on energy-guided simulations and preference alignment on text-to-image generation and multimodal understanding tasks.


Poster
P3-#303
TRACED: Transition-aware Regret Approximation with Co-learnability for Environment Design

Geonwoo Cho ⋅ Jaegyun Im ⋅ Jihwan Lee ⋅ Hojun Yi ⋅ Sejin Kim ⋅ Sundong Kim

Generalizing deep reinforcement learning agents to unseen environments remains a significant challenge. One promising solution is Unsupervised Environment Design (UED), a co‑evolutionary framework in which a teacher adaptively generates tasks with high learning potential, while a student learns a robust policy from this evolving curriculum. Existing UED methods typically measure learning potential via regret, the gap between optimal and current performance, approximated solely by value‑function loss. Building on these approaches, we introduce the transition-prediction error as an additional term in our regret approximation. To capture how training on one task affects performance on others, we further propose a lightweight metric called Co-Learnability. By combining these two measures, we present Transition‑aware Regret Approximation with Co‑learnability for Environment Design (TRACED). Empirical evaluations show that TRACED produces curricula that improve zero-shot generalization over strong baselines across multiple benchmarks. Ablation studies confirm that the transition-prediction error drives rapid complexity ramp‑up and that Co‑Learnability delivers additional gains when paired with the transition-prediction error. These results demonstrate how refined regret approximation and explicit modeling of task relationships can be leveraged for sample-efficient curriculum design in UED. https://geonwoo.me/traced


Poster
P3-#304
One-Prompt Strikes Back: Sparse Mixture of Experts for Prompt-based Continual Learning

Minh Le ⋅ Bao-Ngoc Dao ⋅ Huy Nguyen ⋅ Quyen Tran ⋅ Anh Nguyen ⋅ Nhat Ho

Prompt-based methods have recently gained prominence in Continual Learning (CL) due to their strong performance and memory efficiency. A prevalent strategy in this paradigm assigns a dedicated subset of prompts to each task, which, while effective, incurs substantial computational overhead and causes memory requirements to scale linearly with the number of tasks. Conversely, approaches employing a single shared prompt across tasks offer greater efficiency but often suffer from degraded performance due to knowledge interference. To reconcile this trade-off, we propose SMoPE, a novel framework that integrates the benefits of both task-specific and shared prompt strategies. Inspired by recent findings on the relationship between Prefix Tuning and Mixture of Experts (MoE), SMoPE organizes a shared prompt into multiple "prompt experts" within a sparse MoE architecture. For each input, only a select subset of relevant experts is activated, effectively mitigating interference. To facilitate expert selection, we introduce a prompt-attention score aggregation mechanism that computes a unified proxy score for each expert, enabling dynamic and sparse activation. Additionally, we propose an adaptive noise mechanism to encourage balanced expert utilization while preserving knowledge from prior tasks. To further enhance expert specialization, we design a prototype-based loss function that leverages prefix keys as implicit memory representations. Extensive experiments across multiple CL benchmarks demonstrate that SMoPE consistently outperforms task-specific prompt methods and achieves performance competitive with state-of-the-art approaches, all while significantly reducing parameter counts and computational costs.


Poster
P3-#305
DUET: Optimizing LLM Training Data Mixtures via Noisy Feedback from Unseen, Downstream Evaluation Tasks

Zhiliang Chen ⋅ Gregory Kang Ruey Lau ⋅ Chuan Sheng Foo ⋅ Bryan Kian Hsiang Low

The performance of an LLM depends heavily on how well the training data matches the downstream evaluation task. However, in many practical settings, we typically do not know the data in the evaluation task (e.g., conversations between a chatbot and users are end-to-end encrypted). We refer to such tasks as unseen evaluation tasks. We can only deploy the LLM on these unseen evaluation tasks to gather multiple rounds of feedback on how well the model performs (e.g., gathering user ratings from a chatbot). In addition, this feedback can be noisy. How can we exploit such noisy feedback efficiently to optimize the LLM training data-mixture? Our paper presents DUET, a novel global-to-local algorithm that optimizes training data mixtures by interleaving data selection with Bayesian optimization to exploit coarse and noisy feedback from a downstream evaluation task. DUET is flexible enough to incorporate different data selection methods, each with different performance-compute tradeoffs. By analyzing DUET's cumulative regret, we theoretically show that DUET converges to the optimal training data mixture even without any fine-grained data information from an unseen task. Finally, our experiments across a variety of language tasks demonstrate that DUET attains substantial performance improvements over existing data selection and mixing methods in the unseen-task setting. Our library, which is flexible enough to optimize different LLM training ingredients, can be found at https://github.com/chenzhiliang94/BO-for-LLM.


Poster
P3-#306
Out-of-Distribution Graph Models Merging

Yidi Wang ⋅ Ziyue Qiao ⋅ Jiawei Gu ⋅ Xubin Zheng ⋅ Pengyang Wang ⋅ pei Xiaobing ⋅ Xiao Luo

This paper studies a novel problem of out-of-distribution graph models merging, which aims to construct a generalized model from multiple graph models pre-trained on different domains with distribution discrepancy. This problem is challenging because of the difficulty in learning domain-invariant knowledge implicitly in model parameters and consolidating expertise from potentially heterogeneous GNN backbones. In this work, we propose a graph generation strategy that instantiates the mixture distribution of multiple domains. Then, we merge and fine-tune the pre-trained graph models via a MoE module and a masking mechanism for generalized adaptation. Our framework is architecture-agnostic and can operate without any source/target domain data. Both theoretical analysis and experimental results demonstrate the effectiveness of our approach in addressing the model generalization problem.


Poster
P3-#422
Probabilistic Kernel Function for Fast Angle Testing

Kejing Lu ⋅ Chuan Xiao ⋅ Yoshiharu Ishikawa

In this paper, we study the angle testing problem in the context of similarity search in high-dimensional Euclidean spaces and propose two projection-based probabilistic kernel functions, one designed for angle comparison and the other for angle thresholding. Unlike existing approaches that rely on random projection vectors drawn from Gaussian distributions, our approach leverages reference angles and adopts a deterministic structure for the projection vectors. Notably, our kernel functions do not require asymptotic assumptions, such as the number of projection vectors tending to infinity, and can be theoretically and experimentally shown to outperform Gaussian-distribution-based kernel functions. We apply the proposed kernel function to Approximate Nearest Neighbor Search (ANNS) and demonstrate that our approach achieves a 2.5x to 3x higher query-per-second (QPS) throughput compared to the widely-used graph-based search algorithm HNSW. Our code and data are available at https://github.com/KejingLu-810/KS.


Poster
P3-#2003
Path Matters: Unveiling Geometric Implicit Bias via Curvature-Aware Sparse View Optimization

Canran Xiao ⋅ Liaoyuan Fan ⋅ Yanbin Li ⋅ Jing Tang ⋅ Peilai Yu

3D Gaussian Splatting (3DGS) has recently emerged as a powerful approach for novel view synthesis by reconstructing scenes as sets of Gaussian ellipsoids. Despite its success in scenarios with dense input images, 3DGS faces critical challenges in sparse-view settings, often resulting in geometric inaccuracies, inconsistencies across views, and degraded rendering quality. In this paper, we uncover and address two key implicit biases of the 3DGS reconstruction algorithm in sparse-view settings: (1) the model has a stronger demand for supervision signal toward regions of high curvature, and (2) the model is sensitive to the smoothness of the trajectory of the input views. To tackle these issues, we propose a novel framework that optimizes camera trajectories to maximize curvature coverage while enforcing smooth motion, and we further enhance the informativeness of data through a synthetic view generation process. Extensive experiments on Mip-NeRF 360, DTU, Blender, Tanks & Temples, and LLFF datasets show that our method substantially outperforms state-of-the-art solutions in sparse-view scenarios, both in rendering quality and geometric fidelity. Beyond these empirical gains, our investigation uncovers the subtle ways in which data representation and trajectory planning interact to shape 3DGS performance, offering deeper theoretical insights into the algorithm’s inherent biases.

Continual learning (CL) seeks models that acquire new skills without erasing prior knowledge. In exemplar-free class-incremental learning (EFCIL), this challenge is amplified because past data cannot be stored, making representation drift for old classes particularly harmful. Prototype-based EFCIL is attractive for its efficiency, yet prototypes drift as the embedding space evolves; thus, projection-based drift compensation has become a popular remedy. We show, however, that existing one-directional projections introduce systematic bias: they either retroactively distort the current feature geometry or align past classes only locally, leaving cycle inconsistencies that accumulate across tasks. We introduce bidirectional projector alignment during training: two maps, old$\to$new and new$\to$old, are trained during each new task with stop-gradient gating and a cycle-consistency objective so that transport and representation co-evolve. Analytically, we prove that the cycle loss contracts the singular spectrum toward unity in whitened space and that improved transport of class means/covariances yields smaller perturbations of classification log-odds, preserving old-class decisions and directly mitigating catastrophic forgetting. Empirically, across standard EFCIL benchmarks, our method achieves unprecedented reductions in forgetting while maintaining very high accuracy on new tasks, consistently outperforming state-of-the-art approaches. The code is available at https://github.com/HXuSz11/BiCyc_ICLR2026.


Poster
P3-#308
Prompt-Robust Vision-Language Models via Meta-Finetuning

Haohui Liang ⋅ Runlin Huang ⋅ Yingjun Du ⋅ Yujia Hu ⋅ Weifeng Su ⋅ Cees G Snoek

Vision-language models (VLMs) have demonstrated remarkable generalization across diverse tasks by leveraging large-scale image-text pretraining. However, their performance is notoriously unstable under variations in natural language prompts, posing a considerable challenge for reliable real-world deployment. To address this prompt sensitivity, we propose Promise, a meta-learning framework for prompt-Robust vision-language models via meta-finetuning, which explicitly learns to generalize across diverse prompt formulations. Our method operates in a dual-loop meta-finetuning setting: the inner loop adapts token embeddings based on a set of varied prompts, while the outer loop optimizes for generalization on unseen prompt variants. To further improve robustness, we introduce an adaptive prompt weighting mechanism that dynamically emphasizes more generalizable prompts and a token-specific learning rate module that fine-tunes individual prompt tokens based on contextual importance. We further establish that Promise’s weighted and preconditioned inner update provably (i) yields a one-step decrease of the outer empirical risk together with a contraction of across-prompt sensitivity, and (ii) tightens a data-dependent generalization bound evaluated at the post-inner initialization. Across 15 benchmarks spanning base-to-novel generalization, cross-dataset transfer, and domain shift, our approach consistently reduces prompt sensitivity and improves performance stability over existing prompt learning methods.


Poster
P3-#309
Clustering by Denoising: Latent plug-and-play diffusion for single-cell embeddings

Dominik Meier ⋅ Shixing Yu ⋅ Sagnik Nandy ⋅ PROMIT GHOSAL ⋅ Kyra Gan

Single-cell RNA sequencing (scRNA-seq) enables the study of cellular heterogeneity. Yet, clustering accuracy, and with it downstream analyses based on cell labels, remain challenging due to measurement noise and biological variability. In standard latent spaces (e.g., obtained through PCA), data from different cell types can be projected close together, making accurate clustering difficult. We introduce a latent plug-and-play diffusion framework that separates the observation and denoising spaces. This separation is operationalized through a novel Gibbs sampling procedure: the learned diffusion prior is applied in a low-dimensional latent space to perform denoising, while to steer this process, noise is reintroduced into the original high-dimensional observation space. This unique "input-space steering" ensures the denoising trajectory remains faithful to the original data structure. Our approach offers three key advantages: (1) adaptive noise handling via a tunable balance between prior and observed data; (2) uncertainty quantification through principled uncertainty estimates for downstream analysis; and (3) generalizable denoising by leveraging clean reference data to denoise noisier datasets, and via averaging, improve quality beyond the training set. We evaluate robustness on both synthetic and real single-cell genomics data. Our method improves clustering accuracy on synthetic data across varied noise levels and dataset shifts. On real-world single-cell data, our method demonstrates improved biological coherence in the resulting cell clusters, with cluster boundaries that better align with known cell type markers and developmental trajectories.


Poster
P3-#310
COMPASS: Robust Feature Conformal Prediction for Medical Segmentation Metrics

Matt Cheung ⋅ Ashok Veeraraghavan ⋅ Guha Balakrishnan

In clinical applications, the utility of segmentation models is often based on the accuracy of derived downstream metrics such as organ size, rather than on the pixel-level accuracy of the segmentation masks themselves. Thus, uncertainty quantification for such metrics is crucial for decision-making. Conformal prediction (CP) is a popular framework to derive such principled uncertainty guarantees, but applying CP naively to the final scalar metric is inefficient because it treats the complex, non-linear segmentation-to-metric pipeline as a black box. We introduce COMPASS, a practical framework that generates efficient, metric-based CP intervals for image segmentation models by leveraging the inductive biases of their underlying deep neural networks. COMPASS performs calibration directly in the model's representation space by perturbing intermediate features along low-dimensional subspaces maximally sensitive to the target metric. We prove that COMPASS achieves valid marginal coverage under the assumption of exchangeability. Empirically, we demonstrate that COMPASS produces significantly tighter intervals than traditional CP baselines on four medical image segmentation tasks for area estimation of skin lesions and anatomical structures. Furthermore, we show that leveraging learned internal features to estimate importance weights allows COMPASS to also recover target coverage under covariate shifts. COMPASS paves the way for practical, metric-based uncertainty quantification for medical image segmentation.


Poster
P3-#311
DiffBED: Scaling Bayesian Experimental Design to High-Dimensions

Adhithya Saravanan ⋅ Rik Knowles ⋅ Gavin Kerrigan ⋅ Tom Rainforth

Bayesian experimental design (BED) is a principled framework for intelligent data acquisition. However, current approaches do not scale to problems with high–dimensional designs, impeding its uptake. We show that this limitation arises predominantly from the difficulty in specifying a likelihood model that remains accurate throughout the design space, and that without this, standard design optimisation procedures lead to a reward-hacking-like behaviour that exploits deficiencies in the likelihood, producing implausible or unrealistic designs. To overcome this, we introduce DiffBED, an approach based on a novel BED objective that explicitly rewards realistic designs. Realism is captured by a diffusion model, which we guide using information-theoretic experimental design criteria to generate highly informative yet realistic designs. This enables BED at an unprecedented scale: while existing applications of BED have been restricted to design spaces with a handful of dimensions, we show that DiffBED can successfully scale to designing high–resolution images.


Poster
P3-#312
Neural Posterior Estimation with Latent Basis Expansions

Declan McNamara ⋅ Yicun Duan ⋅ Jeffrey Regier

Neural posterior estimation (NPE) is a likelihood-free amortized variational inference method that approximates projections of the posterior distribution. To date, NPE variational families have been either simple and interpretable (such as the Gaussian family) or highly flexible but black-box and potentially difficult to optimize (such as normalizing flows). In this work, we parameterize variational families via basis expansions of the latent variables. The log density of our variational distribution is a linear combination of latent basis functions (LBFs), which may be fixed a priori or adapted to the problem class of interest. Our training and inference procedures are computationally efficient even for problems with high-dimensional latent spaces, provided only a low-dimensional projection of the posterior is of interest, owing to NPE's automatic marginalization capabilities. In numerous inference problems, the proposed variational family exhibits better performance than existing variational families used with NPE, including mixtures of Gaussians (mixture density networks) and normalizing flows, as well as outperforming an existing basis expansion method for variational inference.


Poster
P3-#313
Post-hoc Probabilistic Vision-Language Models

Anton Baumann ⋅ Rui Li ⋅ Marcus Klasson ⋅ Santeri Mentu ⋅ Shyamgopal Karthik ⋅ Zeynep Akata ⋅ Arno Solin ⋅ Martin Trapp

Vision-language models (VLMs), such as CLIP and SigLIP, have found remarkable success in classification, retrieval, and generative tasks. For this, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed using the cosine similarity. However, a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.


Poster
P3-#314
Compositional amortized inference for large-scale hierarchical Bayesian models

Jonas Arruda ⋅ Vikas Pandey ⋅ Catherine Sherry ⋅ Margarida Barroso ⋅ Xavier Intes ⋅ Jan Hasenauer ⋅ Stefan Radev

Amortized Bayesian inference (ABI) with neural networks has emerged as a powerful simulation-based approach for estimating complex mechanistic models. However, extending ABI to hierarchical models, a cornerstone of modern Bayesian analysis, has been a major hurdle due to the need to simulate and process massive datasets. Our study tackles these challenges by extending compositional score matching (CSM), a divide-and-conquer strategy for Bayesian updating using diffusion models. We develop a new error-damping estimator to address previous stability issues of CSM when aggregating large numbers of data points. We first verified the numerical stability with up to 100,000 data points on a controlled benchmark. We then evaluated our method on a hierarchical AR model, achieving competitive performance to direct ABI baselines on smaller problem sizes while using less than one full model simulation for larger problem sizes. Finally, we address a large-scale inverse problem in advanced microscopy with over 750,000 parameters, demonstrating its relevance to real scientific applications.


Poster
P3-#315
Efficient Credal Prediction through Decalibration

Paul Hofman ⋅ Timo Löhr ⋅ Maximilian Muschalik ⋅ Yusuf Sale ⋅ Eyke Hüllermeier

A reliable representation of uncertainty is essential for the application of modern machine learning methods in safety-critical settings. In this regard, the use of credal sets (i.e., convex sets of probability distributions) has recently been proposed as a suitable approach to representing epistemic uncertainty. However, as with other approaches to epistemic uncertainty, training credal predictors is computationally complex and usually involves (re-)training an ensemble of models. The resulting computational complexity prevents their adoption for complex models such as foundation models and multi-modal systems. To address this problem, we propose an efficient method for credal prediction that is grounded in the notion of relative likelihood and inspired by techniques for the calibration of probabilistic classifiers. For each class label, our method predicts a range of plausible probabilities in the form of an interval. To produce the lower and upper bounds of these intervals, we propose a technique that we refer to as decalibration. Extensive experiments show that our method yields credal sets with strong performance across diverse tasks, including coverage–efficiency evaluation, out-of-distribution detection, and in-context learning. Notably, we demonstrate credal prediction on models such as TabPFN and CLIP—architectures for which the construction of credal sets was previously infeasible.


Poster
P3-#316
Conformal Prediction for Long-Tailed Classification

Tiffany Ding ⋅ Jean-Baptiste Fermanian ⋅ Joseph Salmon

Many real-world classification problems, such as plant identification, have extremely long-tailed class distributions. In order for prediction sets to be useful in such settings, they should (i) provide good class-conditional coverage, ensuring that rare classes are not systematically omitted from the prediction sets, and (ii) be a reasonable size, allowing users to easily verify candidate labels. Unfortunately, existing conformal prediction methods, when applied to the long-tailed setting, force practitioners to make a binary choice between small sets with poor class-conditional coverage or sets that have very good class-conditional coverage but are extremely large. We propose methods with marginal coverage guarantees that smoothly trade off set size and class-conditional coverage. First, we introduce a new conformal score function called prevalence-adjusted softmax that optimizes for macro-coverage, defined as the average class-conditional coverage across classes. Second, we propose a new procedure that interpolates between marginal and class-conditional conformal prediction by linearly interpolating their conformal score thresholds. We demonstrate our methods on Pl@ntNet-300K and iNaturalist-2018, two long-tailed image datasets with 1,081 and 8,142 classes, respectively.
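The second procedure described above, interpolating between marginal and class-conditional thresholds, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`conformal_quantile`, `lerp`, `interpolated_thresholds`), the `weight` parameter, and the toy calibration scores are all hypothetical, and the standard split-conformal quantile is used for both thresholds.

```python
import math
from typing import Dict, Hashable, List


def conformal_quantile(scores: List[float], alpha: float) -> float:
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    # Small tolerance guards against floating-point overshoot before ceil.
    rank = math.ceil((n + 1) * (1 - alpha) - 1e-9)
    if rank > n:
        return math.inf  # too few calibration points for this level
    return sorted(scores)[rank - 1]


def lerp(a: float, b: float, w: float) -> float:
    """Linear interpolation that avoids 0 * inf at the endpoints."""
    if w == 0.0:
        return a
    if w == 1.0:
        return b
    return (1 - w) * a + w * b


def interpolated_thresholds(
    scores: List[float], labels: List[Hashable], alpha: float, weight: float
) -> Dict[Hashable, float]:
    """Per-class thresholds: weight=0 is marginal, weight=1 is class-conditional."""
    marginal = conformal_quantile(scores, alpha)
    by_class: Dict[Hashable, List[float]] = {}
    for s, y in zip(scores, labels):
        by_class.setdefault(y, []).append(s)
    return {
        y: lerp(marginal, conformal_quantile(cs, alpha), weight)
        for y, cs in by_class.items()
    }


# Toy calibration set: scores of the true label, with class labels "a" and "b".
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
cal_labels = ["a", "a", "a", "b", "b", "b", "b", "b", "b"]
thresholds = interpolated_thresholds(cal_scores, cal_labels, alpha=0.5, weight=0.5)
```

A label would then be included in the prediction set when its conformal score does not exceed the threshold for that label. Moving `weight` from 0 to 1 trades smaller sets for better class-conditional coverage, which is the trade-off the abstract describes.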

Bayesian Optimization (BO) is a powerful tool for black-box optimization, but its application to high-dimensional permutation spaces is severely limited by the challenge of defining scalable representations. The current state-of-the-art BO approach for permutation spaces relies on an exhaustive $\Omega(n^2)$ pairwise comparison, inducing a dense representation that is impractical for large-scale permutations. To break this barrier, we introduce a novel framework for generating efficient permutation representations via kernel functions derived from sorting algorithms. Within this framework, the Mallows kernel can be viewed as a special instance derived from enumeration sort. Further, we introduce the Merge Kernel, which leverages the divide-and-conquer structure of merge sort to produce a compact $\Theta(n\log n)$ representation that achieves the lowest possible complexity with no information loss and effectively captures permutation structure. Our central thesis is that the Merge Kernel performs competitively with the Mallows kernel in low-dimensional settings, but significantly outperforms it in both optimization performance and computational efficiency as the dimension $n$ grows. Extensive evaluations on various permutation optimization benchmarks confirm our hypothesis, demonstrating that the Merge Kernel provides a scalable and more effective solution for Bayesian optimization in high-dimensional permutation spaces, thereby unlocking the potential for tackling previously intractable problems such as large-scale feature ordering and combinatorial neural architecture search.
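For context on the quadratic baseline mentioned above, the following sketch computes a Mallows kernel in its commonly used form, $k(\pi, \sigma) = \exp(-\lambda \, d(\pi, \sigma))$, where $d$ counts discordant pairs (the Kendall tau distance). The function names, the rank-vector encoding of permutations, and the default `lam=1.0` are assumptions for illustration; the Merge Kernel itself is not reproduced here because the abstract does not give its construction.

```python
import math
from itertools import combinations
from typing import Sequence


def kendall_tau_distance(p: Sequence[int], q: Sequence[int]) -> int:
    """Count item pairs that the two rankings order differently.

    p[i] and q[i] are the ranks assigned to item i. The loop visits all
    n * (n - 1) / 2 pairs, which is the quadratic cost of this baseline.
    """
    return sum(
        1
        for i, j in combinations(range(len(p)), 2)
        if (p[i] - p[j]) * (q[i] - q[j]) < 0
    )


def mallows_kernel(p: Sequence[int], q: Sequence[int], lam: float = 1.0) -> float:
    """Mallows kernel: exp(-lam * Kendall tau distance)."""
    return math.exp(-lam * kendall_tau_distance(p, q))


identity = [0, 1, 2]
reversed_perm = [2, 1, 0]
one_swap = [0, 2, 1]
k_same = mallows_kernel(identity, identity)
k_far = mallows_kernel(identity, reversed_perm)
```

Identical permutations give a kernel value of 1, and the value decays exponentially with the number of disagreeing pairs.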


Poster
P3-#318
Efficient Autoregressive Inference for Transformer Probabilistic Models

Conor Hassan ⋅ Nasrulloh Satrio ⋅ Cen-You Li ⋅ Daolang Huang ⋅ Paul Chang ⋅ Yang Yang ⋅ Francesco Silvestrin ⋅ Samuel Kaski ⋅ Luigi Acerbi

Set-based transformer models for amortized probabilistic inference and meta-learning, such as neural processes, prior-fitted networks, and tabular foundation models, excel at single-pass _marginal_ prediction. However, many applications require _joint distributions_ over multiple predictions. Purely autoregressive architectures generate these efficiently but sacrifice flexible set-conditioning. Obtaining joint distributions from set-based models requires re-encoding the entire context at each autoregressive step, which scales poorly. We introduce a _causal autoregressive buffer_ that combines the strengths of both paradigms. The model encodes the context once and caches it; a lightweight causal buffer captures dependencies among generated targets, with each new prediction attending to both the cached context and all previously predicted targets added to the buffer. This enables efficient batched autoregressive sampling and joint predictive density evaluation. Training integrates set-based and autoregressive modes through masked attention at minimal overhead. Across synthetic functions, EEG time series, a Bayesian model comparison task, and tabular regression, our method closely matches the performance of full context re-encoding while delivering up to $20\times$ faster joint sampling and density evaluation, and up to $7\times$ lower memory usage.
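The attention pattern described above can be sketched as a boolean mask. This is only an illustration under assumptions: the function name is hypothetical, and whether a target also attends to itself is a design choice the abstract does not specify (here it does not).

```python
import numpy as np

def buffer_attention_mask(num_context, num_targets):
    """Mask of shape (num_targets, num_context + num_targets).

    Entry [i, j] is True when target i may attend to key j.
    """
    # Every target sees the full cached context.
    context_part = np.ones((num_targets, num_context), dtype=bool)
    # Strictly lower-triangular: target i sees buffered targets 0..i-1 only.
    buffer_part = np.tril(np.ones((num_targets, num_targets), dtype=bool), k=-1)
    return np.concatenate([context_part, buffer_part], axis=1)
```

Because the context columns are identical for every target, the context keys and values can be computed once and reused at each autoregressive step.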

Time-series forecasting increasingly demands not only accurate observational predictions but also causal forecasting under interventional and counterfactual queries in multivariate systems. We present DoFlow, a flow-based generative model defined over a causal Directed Acyclic Graph (DAG) that delivers coherent observational and interventional predictions, as well as counterfactuals through the natural encoding–decoding mechanism of continuous normalizing flows (CNFs). We also provide a supporting counterfactual recovery theory under certain assumptions. Beyond forecasting, DoFlow provides explicit likelihoods of future trajectories, enabling principled anomaly detection. Experiments on synthetic datasets with various causal DAG structures and real-world hydropower and cancer-treatment time series show that DoFlow achieves accurate system-wide observational forecasting, enables causal forecasting over interventional and counterfactual queries, and effectively detects anomalies. This work contributes to the broader goal of unifying causal reasoning and generative modeling for complex dynamical systems.

This blogpost clarifies the practical usefulness of having a model with calibrated probabilities, something that is not often clearly stated in the calibration literature. We show that a calibrated model can be relied on to estimate average loss/reward; however, good calibration does not mean that a model is useful for per-sample decision making.
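A small toy example, not taken from the blogpost, illustrates the distinction. A predictor that always outputs the base rate is calibrated and estimates the average reward correctly, yet it cannot distinguish one sample from another. The data and reward values below are assumptions chosen for illustration.

```python
import numpy as np

# Toy data: 3 positives out of 10 (illustrative values).
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])

# A constant predictor that always outputs the base rate.
p = np.full(y.shape, y.mean())

# Calibration check: among samples scored 0.3, the positive rate is 0.3.
observed_rate = y[np.isclose(p, 0.3)].mean()

# Reward of acting on a sample: +1 if positive, -1 otherwise.
estimated_avg_reward = np.mean(p * 1.0 + (1 - p) * -1.0)
actual_avg_reward = np.mean(np.where(y == 1, 1.0, -1.0))

# Every sample gets the same score, so the model cannot rank or separate them.
distinct_scores = np.unique(p).size
```

The estimated and actual average rewards agree, but with a single distinct score the model offers no basis for treating any one sample differently from another.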


Poster
P3-#321
PU-Bench: A Unified Benchmark for Rigorous and Reproducible PU Learning

Qiuyi Chen ⋅ Haiyang Zhang ⋅ Leqi Zhang ⋅ Changchun Li ⋅ Jia Wang ⋅ Wei Wang

Positive-Unlabeled (PU) learning, a challenging paradigm for training binary classifiers from only positive and unlabeled samples, is fundamental to many applications. While numerous PU learning methods have been proposed, the research is systematically hindered by the lack of a standardized and comprehensive benchmark for rigorous evaluation. Inconsistent data generation, disparate experimental settings, and divergent metrics have led to irreproducible findings and unsubstantiated performance claims. To address this foundational challenge, we introduce PU-Bench, the first unified open-source benchmark for PU learning. PU-Bench provides: 1) a unified data generation pipeline to ensure consistent input across configurable sampling schemes, label ratios and labeling mechanisms; 2) an integrated framework of 18 state-of-the-art PU methods; and 3) standardized protocols for reproducible assessment. Through a large-scale empirical study on 8 diverse datasets (2880 evaluations in total), PU-Bench reveals a complex yet intuitive performance landscape, uncovering critical trade-offs between effectiveness and efficiency, and systematically mapping method robustness against variations in label frequency and selection bias. It is anticipated to serve as a foundational resource to catalyze reproducible, rigorous, and impactful research in the PU learning community. The source code is publicly available at .


Poster
P3-#322
When Shift Happens - Confounding Is to Blame

Gowtham Reddy Abbavaram ⋅ Celia Rubio-Madrigal ⋅ Rebekka Burkholz ⋅ Krikamol Muandet

Distribution shifts introduce uncertainty that undermines the robustness and generalization capabilities of machine learning models. While conventional wisdom suggests that learning causal-invariant representations enhances robustness to such shifts, recent empirical studies present two counterintuitive findings: (i) empirical risk minimization (ERM) can rival or even outperform state-of-the-art out-of-distribution (OOD) generalization methods, and (ii) OOD generalization performance improves when all available covariates, including non-causal ones, are utilized. We present theoretical and empirical explanations that attribute this phenomenon to hidden confounding. Shifts in hidden confounding induce changes in data distributions that violate assumptions commonly made by existing approaches. Under such conditions, we prove that generalization requires learning environment-specific relationships, rather than relying solely on invariant ones. Furthermore, we explain why models augmented with non-causal but informative covariates can mitigate the challenges posed by hidden confounding shifts. These findings offer new theoretical insights and practical guidance, serving as a roadmap for future research on OOD generalization and principled covariate-selection strategies.


Poster
P3-#323
Delving into Spectral Clustering with Vision-Language Representations

Bo Peng ⋅ Yuanwei Hu ⋅ Bo Liu ⋅ Ling Chen ⋅ Jie Lu ⋅ Zhen Fang

Spectral clustering is known as a powerful technique in unsupervised data analysis. The vast majority of approaches to spectral clustering are driven by a single modality, leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper enriches the landscape of spectral clustering from a single-modal to a multi-modal regime. In particular, we propose Neural Tangent Kernel Spectral Clustering, which leverages cross-modal alignment in pre-trained vision-language models. By anchoring the neural tangent kernel with positive nouns, i.e., those semantically close to the images of interest, we formulate the affinity between images as a coupling of their visual proximity and semantic overlap. We show that this formulation amplifies within-cluster connections while suppressing spurious ones across clusters, hence encouraging block-diagonal structures. In addition, we present a regularized affinity diffusion mechanism that adaptively ensembles affinity matrices induced by different prompts. Extensive experiments on 16 benchmarks (including classical, large-scale, fine-grained and domain-shifted datasets) show that our method consistently outperforms the state-of-the-art by a large margin.


Poster
P3-#324
Do LLM Agents Know How to Ground, Recover, and Assess? Evaluating Epistemic Competence in Information-Seeking Agents

Jiaqi Shao ⋅ Yuxiang Lin ⋅ Munish P Lohani ⋅ Yufeng Miao ⋅ Bing Luo

Recent work has explored training Large Language Model (LLM) search agents with reinforcement learning (RL) for open-domain question answering. However, most evaluations focus solely on final answer accuracy, overlooking how these agents reason with and act on external evidence. We introduce SeekBench, the first process-level evaluation framework for LLM search agents that operationalizes epistemic competence through metrics derived from an annotation schema. We develop and validate our annotation schema using an expert-annotated dataset of 190 traces (over 1,800 steps). To evaluate at scale, we introduce an LLM-as-judge pipeline. Our framework provides granular analysis of whether agents demonstrate: (1) groundedness, by generating reasoning steps supported by observed evidence; (2) recovery, by adaptively reformulating searches to recover from low-quality results; and (3) calibration, by correctly assessing whether current evidence is sufficient to provide an answer. By applying our evaluation framework to state-of-the-art search agents tuned on Qwen2.5-7B, we uncover critical behavioral gaps that answer-only metrics miss, as well as specialized skills such as Search-R1's synthesis abilities. These analyses highlight distinct epistemic competencies, offering actionable insights for the development of more capable and trustworthy agents. Code is available at https://github.com/SHAO-Jiaqi757/SeekBench.

We propose a simple yet novel data augmentation method for general data modalities based on energy-based modeling and principles from information geometry. Unlike most existing learning-based data augmentation methods, which rely on learning latent representations with generative models, our proposed framework enables an intuitive construction of a geometrically aware latent space that represents the structure of the data itself, supporting efficient and explicit encoding and decoding procedures. We then present and discuss how to design latent spaces that will subsequently control the augmentation with the proposed algorithm. Empirical results demonstrate that our data augmentation method achieves competitive performance in downstream tasks compared to other baselines, while offering fine-grained controllability that is lacking in the existing literature.

We introduce k-NLPmeans and k-LLMmeans, text-clustering variants of k-means that periodically replace numeric centroids with textual summaries. The key idea—summary-as-centroid—retains k-means assignments in embedding space while producing human-readable, auditable cluster prototypes. The method is LLM-optional: k-NLPmeans uses lightweight, deterministic summarizers, enabling offline, low-cost, and stable operation; k-LLMmeans is a drop-in upgrade that uses an LLM for summaries under a fixed per-iteration budget whose cost does not grow with dataset size. We also present a mini-batch extension for real-time clustering of streaming text. Across diverse datasets, embedding models, and summarization strategies, our approach consistently outperforms classical baselines and approaches the accuracy of recent LLM-based clustering without extensive LLM calls. Finally, we provide a case study on sequential text streams and release a StackExchange-derived benchmark for evaluating streaming text clustering.


Poster
P3-#421
SONATA: Synergistic Coreset Informed Adaptive Temporal Tensor Factorization

Maolin WANG ⋅ Zhiqi Li ⋅ Binhao Wang ⋅ Xuhui Chen ⋅ Tianshuo Wei ⋅ Wanyu Wang ⋅ Shikai Fang ⋅ Ruocheng Guo ⋅ Zenglin Xu ⋅ Xiangyu Zhao

Analyzing dynamic tensor streams is fundamentally challenged by complex, evolving temporal dynamics and the need to identify informative data from high-velocity streams. Existing methods often lack the expressiveness to model multi-scale temporal dependencies, limiting their ability to capture evolving patterns. We propose SONATA, a novel framework that unifies expressive dynamic embedding modeling with adaptive coreset selection. SONATA leverages principled machine learning techniques for efficient evaluation of each observation for uncertainty, novelty, influence, and information gain, and dynamically prioritizes learning from the most valuable data using Bellman-inspired optimization. Entity dynamics are modeled with Linear Dynamical Systems and expressive temporal kernels for fine-grained temporal representation. Experiments on synthetic and real-world datasets show that SONATA consistently outperforms state-of-the-art methods in modeling complex temporal patterns and improving predictive accuracy for dynamic tensor streams.


Poster
P3-#420
AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite

Jonathan Bragg ⋅ Mike D'Arcy ⋅ Nishant Balepur ⋅ Dan Bareket ⋅ Bhavana Dalvi Mishra ⋅ Sergey Feldman ⋅ Dany Haddad ⋅ Jena Hwang ⋅ Peter Jansen ⋅ Varsha Kishore ⋅ Bodhisattwa Prasad Majumder ⋅ Aakanksha Naik ⋅ Sigal Rahamimov ⋅ Kyle Richardson ⋅ Amanpreet Singh ⋅ Harshit Surana ⋅ Aryeh Tiktinsky ⋅ Rosni Vasu ⋅ Guy Wiener ⋅ Chloe Anastasiades ⋅ Stefanus Candra ⋅ Jason Dunkelberger ⋅ Daniel Emery ⋅ Rob Evans ⋅ Malachi Hamada ⋅ Regan Huff ⋅ Rodney Kinney ⋅ Matt Latzke ⋅ Jaron Lochner ⋅ Ruben Lozano-Aguilera ⋅ Ngoc-Uyen Nguyen ⋅ Smita Rao ⋅ Amber Tanaka ⋅ Brooke Vlahos ⋅ Peter Clark ⋅ Doug Downey ⋅ Yoav Goldberg ⋅ Ashish Sabharwal ⋅ Daniel Weld

AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, there are now many such agents, ranging from general-purpose "deep research" systems to specialized science-specific agents, such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they often (1) lack reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (2) do not account for confounding variables such as model cost and tool access; (3) do not provide standardized interfaces for quick agent prototyping and evaluation; (4) fail to provide holistic, product-informed measures of real-world use cases such as science research; and (5) lack comprehensive baseline agents necessary to identify true advances. In response, we define principles and tooling for more rigorously benchmarking agents. Using these, we present AstaBench, a suite that provides a holistic measure of agentic ability to perform scientific research, comprising 2400+ problems spanning the entire scientific discovery process and multiple scientific domains, and including many problems inspired by actual user requests to deployed Asta agents. Our suite comes with the first scientific research environment with production-grade search tools that enable controlled, reproducible evaluation, better accounting for confounders. Alongside, we provide a comprehensive suite of nine science-optimized classes of Asta agents and numerous baselines. Our extensive evaluation of 57 agents across 22 agent classes reveals several interesting findings, most importantly that despite meaningful progress on certain individual aspects, AI remains far from solving the challenge of science research assistance.


Poster
P3-#419
Sharing State Between Prompts and Programs

Ellie Cheng ⋅ Logan Weber ⋅ Tian Jin ⋅ Michael Carbin

The rise of large language models (LLMs) has introduced a new type of programming: natural language programming. Users write prompts, which are instructions in natural language, to direct LLMs to perform tasks such as natural language processing, code generation, reasoning, etc. An emerging area of research enables interoperability between prompts and programs. We present a novel programming abstraction, shared program state, that removes the manual work required to enable interoperability between prompts and program states. With shared program state, programmers can write prompts that directly access program variables, compute with program objects, and implement control flow in the program. We present a schema for specifying natural function interfaces that extend programming systems to support programs with prompts and leverage this schema to specify shared program state as a natural function interface. We implement shared program state in the Nightjar programming system. Nightjar enables programmers to write Python programs containing prompts that share the Python program state. We show that Nightjar programs achieve comparable or higher task accuracy than manually written implementations (+4-19%), while decreasing the lines of code by 39.6% on average. The tradeoff is that Nightjar may incur runtime overhead (0.4-4.3x manual implementations).


Poster
P3-#418
Generalizing Linear Autoencoder Recommenders with Decoupled Expected Quadratic Loss

Ruixin Guo ⋅ Xinyu Li ⋅ Hao Zhou ⋅ Yang Zhou ⋅ Ruoming Jin

Linear autoencoders (LAEs) have gained increasing popularity in recommender systems due to their simplicity and strong empirical performance. Most LAE models, including the Emphasized Denoising Linear Autoencoder (EDLAE) introduced by Steck (2020), use quadratic loss during training. However, the original EDLAE only provides closed-form solutions for the hyperparameter choice $b = 0$, which limits its capacity. In this work, we generalize the EDLAE objective function into a Decoupled Expected Quadratic Loss (DEQL). We show that DEQL simplifies the process of deriving EDLAE solutions and reveals solutions in a broader hyperparameter range $b > 0$, which were not derived in Steck’s original paper. Additionally, we propose an efficient algorithm based on Miller’s matrix inverse theorem to ensure computational tractability for the $b > 0$ case. Empirical results on benchmark datasets show that the $b > 0$ solutions provided by DEQL outperform the $b = 0$ EDLAE baseline, demonstrating that DEQL expands the solution space and enables the discovery of models with better testing performance.


Poster
P3-#417
FRABench and UFEval: Unified Fine-grained Evaluation with Task and Aspect Generalization

Shibo Hong ⋅ jiahao ying ⋅ Haiyuan Liang ⋅ Mengdi Zhang ⋅ Jun Kuang ⋅ Jiazheng Zhang ⋅ Yixin Cao

Evaluating open-ended outputs of Multimodal Large Language Models has become a bottleneck as model capabilities, task diversity, and modality rapidly expand. Existing "MLLM-as-a-Judge" evaluators, though promising, remain constrained to specific tasks and aspects (i.e., specific evaluation criteria such as fluency for text and image quality for images). In this paper, we argue that, on one hand, based on the interconnected nature of criteria, learning specific aspects can generalize to unseen aspects; on the other hand, jointly learning to assess multiple visual criteria and tasks may foster a synergistic effect. To this end, we propose UFEval, the first unified fine-grained evaluator with task and aspect generalization for four evaluation tasks: Natural Language Generation, Image Understanding, Image Generation, and Interleaved Text-and-Image Generation. However, training such a unified evaluator is hindered by the lack of a large-scale, multi-modal, and aspect-level resource. To address this gap, we introduce FRABench, a comprehensive fine-grained evaluation dataset. Specifically, (1) We first construct a hierarchical aspect taxonomy encompassing 112 distinct aspects across the aforementioned four tasks. (2) Based on this taxonomy, we create FRABench, comprising 60.4k pairwise samples with 325k evaluation labels obtained from a combination of human and GPT-4o annotations. (3) Finally, leveraging FRABench, we develop UFEval, a unified fine-grained evaluator. Experiments show that learning on specific aspects enables UFEval to generalize to unseen aspects, and joint learning to assess diverse visual tasks and aspects can lead to substantial mutual benefits.


Poster
P3-#416
LLM as an Algorithmist: Enhancing Anomaly Detectors via Programmatic Synthesis

Hangting Ye ⋅ Jinmeng Li ⋅ He Zhao ⋅ Mingchen Zhuge ⋅ Dandan Guo ⋅ Yi Chang ⋅ Hongyuan Zha

Existing anomaly detection (AD) methods for tabular data usually rely on some assumptions about anomaly patterns, leading to inconsistent performance in real-world scenarios. While Large Language Models (LLMs) show remarkable reasoning capabilities, their direct application to tabular AD is impeded by fundamental challenges, including difficulties in processing heterogeneous data and significant privacy risks. To address these limitations, we propose LLM-DAS, a novel framework that repositions the LLM from a data processor to an algorithmist. Instead of being exposed to raw data, our framework leverages the LLM's ability to reason about algorithms. It analyzes a high-level description of a given detector to understand its intrinsic weaknesses and then generates detector-specific, data-agnostic Python code to synthesize "hard-to-detect" anomalies that exploit these vulnerabilities. This generated synthesis program, which is reusable across diverse datasets, is then instantiated to augment training data, systematically enhancing the detector's robustness by transforming the problem into a more discriminative two-class classification task. Extensive experiments on 36 TAD benchmarks show that LLM-DAS consistently boosts the performance of mainstream detectors. By bridging LLM reasoning with classic AD algorithms via programmatic synthesis, LLM-DAS offers a scalable, effective, and privacy-preserving approach to patching the logical blind spots of existing detectors.

With the rapid development of modern Artificial Intelligence, especially the emergence of Large Language Models (LLMs), we face a growing epistemological crisis: our engineering capabilities have far surpassed our philosophical vocabulary. We have built systems that demonstrate emergent reasoning abilities, yet we struggle to articulate exactly what we have built. The traditional naming convention, _e.g._, lumping code, parameters, and behaviors together as a "Model", is no longer sufficient. It fails to capture the widening gap between human design intent and the resulting behavioral artifacts. Current discussions often oscillate between two extremes: a reductionist view that dismisses these systems as merely "stochastic parrots," and an anthropomorphic view that prematurely attributes consciousness to them. Both views stem from a lack of structural granularity when defining the ontological status of AI agents. This paper proposes to solve this problem through a "Five-Layer Model Hierarchy Ontology." Inspired by systems theory and cognitive science, we deconstruct the concept of a "Model" into five distinct layers: the Noumenal Model ($\mathcal{M}_N$), the Conceptual Model ($\mathcal{M}_C$), the Instantiated Model ($\mathcal{M}_I$), the Reachable Model ($\mathcal{M}_R$), and the Observable Model ($\mathcal{M}_O$). By tracing the evolution of these layers from classical machine learning to foundation models, we reveal how the transition from "Tabula Rasa" (blank slate) to "Artifact" has fundamentally changed. Furthermore, we apply this framework to reconstruct two classic philosophical problems, namely the nature of meaning (via the "Stochastic Chinese Room") and the nature of truth (via the "Paradox of the Two Poetics"), demonstrating that the essence of synthetic intelligence lies not in biological mimicry, but in the topological structure of statistical manifolds.


Poster
P3-#414
RADAR: Learning to Route with Asymmetry-aware Distance Representations

Hang Yi ⋅ Ziwei Huang ⋅ Yining Ma ⋅ Zhiguang Cao

Recent neural solvers have achieved strong performance on vehicle routing problems (VRPs), yet they mainly assume symmetric Euclidean distances, restricting applicability to real-world scenarios. A core challenge is encoding the relational features in asymmetric distance matrices of VRPs. Early attempts directly encoded these matrices but often failed to produce compact embeddings and generalized poorly at scale. In this paper, we propose RADAR, a scalable neural framework that augments existing neural VRP solvers with the ability to handle asymmetric inputs. RADAR addresses asymmetry from both static and dynamic perspectives. It leverages Singular Value Decomposition (SVD) on the asymmetric distance matrix to initialize compact and generalizable embeddings that inherently encode the static asymmetry in the inbound and outbound costs of each node. To further model dynamic asymmetry in embedding interactions during encoding, it replaces the standard softmax with Sinkhorn normalization that imposes joint row and column distance awareness in attention weights. Extensive experiments on synthetic and real-world benchmarks across various VRPs show that RADAR outperforms strong baselines on both in-distribution and out-of-distribution instances, demonstrating robust generalization and superior performance in solving asymmetric VRPs.
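The two ingredients named above, an SVD of the asymmetric distance matrix and Sinkhorn normalization in place of softmax, can be sketched in NumPy. This is a generic illustration under assumptions, not RADAR's actual architecture: the way the singular values are split between the two factors, the function names, and the iteration count are all hypothetical.

```python
import numpy as np

def svd_node_embeddings(dist, rank):
    """Split a truncated SVD of dist into per-node embeddings.

    Rows of `outbound` describe costs leaving a node; rows of `inbound`
    describe costs arriving at it. outbound @ inbound.T approximates dist.
    """
    u, s, vt = np.linalg.svd(dist, full_matrices=False)
    root = np.sqrt(s[:rank])
    outbound = u[:, :rank] * root
    inbound = vt[:rank, :].T * root
    return outbound, inbound

def sinkhorn_normalize(logits, num_iters=50):
    """Alternately normalize rows and columns of exp(logits)."""
    weights = np.exp(logits - logits.max())  # shift for numerical stability
    for _ in range(num_iters):
        weights = weights / weights.sum(axis=1, keepdims=True)
        weights = weights / weights.sum(axis=0, keepdims=True)
    return weights
```

Unlike a row-wise softmax, the Sinkhorn iteration pushes both row sums and column sums toward one, so each attention weight reflects the row and the column it belongs to.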


Poster
P3-#413
Combination-of-Experts with Knowledge Sharing for Cross-Task Vehicle Routing Problems

Zikang Yu ⋅ Jinbiao Chen ⋅ Jiahai Wang

Recent neural methods have shown promise in generalizing across various vehicle routing problems (VRPs). These methods adopt either a fully-shared dense model across all VRP tasks (i.e., variants) or a mixture-of-experts model that assigns node embeddings within each task instance to different experts. However, they both struggle to generalize from training tasks with basic constraints to out-of-distribution (OOD) tasks involving unseen constraint combinations and new basic constraints, as they overlook the fact that each VRP task is defined by a combination of multiple basic constraints. To address this, this paper proposes a novel model, combination-of-experts with knowledge sharing (CoEKS), which leverages the structural characteristic of VRP tasks. CoEKS enhances generalization to constraint combinations via two complementary components: a combination-of-experts architecture enabling flexible combinations via prior assignment of constraint-specific experts, and a knowledge sharing strategy strengthening generalization via automatic learning of transferable general knowledge across constraints. Moreover, CoEKS allows new experts to be plugged into the trained model for rapid adaptation to new constraints. Experiments demonstrate that CoEKS outperforms state-of-the-art methods on in-distribution tasks and delivers greater gains on OOD tasks, including unseen constraint combinations (relative improvement of 12% over SOTA) and new constraints (25% improvement).


Poster
P3-#412
Learning to Solve Orienteering Problem with Time Windows and Variable Profits

Songqun Gao ⋅ Zanxi Ruan ⋅ Patrick Floor ⋅ Marco Roveri ⋅ Luigi Palopoli ⋅ Daniele Fontanelli

The orienteering problem with time windows and variable profits (OPTWVP) is common in many real-world applications and involves continuous time variables. Current approaches fail to develop an efficient solver for this orienteering problem variant with discrete and continuous variables. In this paper, we propose a learning-based two-stage DEcoupled discrete-Continuous optimization with Service-time-guided Trajectory (DeCoST), which aims to effectively decouple the discrete and continuous decision variables in the OPTWVP problem, while enabling efficient and learnable coordination between them. In the first stage, a parallel decoding structure is employed to predict the path and the initial service time allocation. The second stage optimizes the service times through a linear programming (LP) formulation and provides a long-horizon learning of structure estimation. We rigorously prove the global optimality of the second-stage solution. Experiments on OPTWVP instances demonstrate that DeCoST outperforms both state-of-the-art constructive solvers and the latest meta-heuristic algorithms in terms of solution quality and computational efficiency, achieving up to 6.6x inference speedup on instances with fewer than 500 nodes. Moreover, the proposed framework is compatible with various constructive solvers and consistently enhances the solution quality for OPTWVP.


Journal Track Poster
P3-#411
Celo: Training Versatile Learned Optimizers on a Compute Diet

Abhinav Moudgil ⋅ Boris Knyazev ⋅ Guillaume Lajoie ⋅ Eugene Belilovsky

Learned optimization has emerged as a promising alternative to hand-crafted optimizers, with the potential to discover stronger learned update rules that enable faster, hyperparameter-free training of neural networks. A critical element for practically useful learned optimizers, that can be used off-the-shelf after meta-training, is strong meta-generalization: the ability to apply the optimizers to new tasks. Recent state-of-the-art work in learned optimizers, VeLO (Metz et al., 2022), requires a large number of highly diverse meta-training tasks along with massive computational resources, 4000 TPU months, to achieve meta-generalization. This makes further improvements to such learned optimizers impractical. In this work, we identify several key elements in learned optimizer architectures and meta-training procedures that can lead to strong meta-generalization. We also propose evaluation metrics to reliably assess quantitative performance of an optimizer at scale on a set of evaluation tasks. Our proposed approach, Celo, makes a significant leap in improving the meta-generalization performance of learned optimizers and also outperforms tuned state-of-the-art optimizers on a diverse set of out-of-distribution tasks, despite being meta-trained for just 24 GPU hours.


Poster
P3-#410
Bilevel Optimization with Lower-Level Uniform Convexity: Theory and Algorithm

Yuman Wu ⋅ Xiaochuan Gong ⋅ Jie Hao ⋅ Mingrui Liu

Bilevel optimization is a hierarchical framework where an upper-level optimization problem is constrained by a lower-level problem, commonly used in machine learning applications such as hyperparameter optimization. Existing bilevel optimization methods typically assume strong convexity or Polyak-Łojasiewicz (PL) conditions for the lower-level function to establish non-asymptotic convergence to a solution with a small hypergradient. However, these assumptions may not hold in practice, and recent work (Chen et al. 2024) has shown that bilevel optimization is inherently intractable for general convex lower-level functions with the goal of finding small hypergradients. In this paper, we identify a tractable class of bilevel optimization problems that interpolates between lower-level strong convexity and general convexity via lower-level uniform convexity. For uniformly convex lower-level functions with exponent $p\geq 2$, we establish a novel implicit differentiation theorem characterizing the hyperobjective's smoothness property. Building on this, we design a new stochastic algorithm, termed UniBiO, with provable convergence guarantees, based on an oracle that provides stochastic gradient and Hessian-vector product information for the bilevel problems. Our algorithm achieves an $\widetilde{O}(\epsilon^{-5p+6})$ oracle complexity bound for finding $\epsilon$-stationary points. Notably, our complexity bounds match the optimal rates in terms of the $\epsilon$ dependency for strongly convex lower-level functions ($p=2$), up to logarithmic factors. Our theoretical findings are validated through experiments on synthetic tasks and data hyper-cleaning, demonstrating the effectiveness of our proposed algorithm.
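For readers unfamiliar with the term, the block below states one common textbook form of uniform convexity and works through the exponent for $p = 2$. The symbols $g$ and $\mu$ and the constant $\mu/p$ are assumptions for illustration; the paper's exact condition may differ.

```latex
% Uniform convexity of the lower-level function g(x, \cdot) with exponent p \ge 2
% and modulus \mu > 0 (a standard textbook form; constants are an assumption):
g(x, y') \;\ge\; g(x, y) + \langle \nabla_y g(x, y),\, y' - y \rangle
          + \frac{\mu}{p}\, \lVert y' - y \rVert^{p}
\quad \text{for all } y, y'.

% Setting p = 2 recovers \mu-strong convexity, and the stated bound becomes
\widetilde{O}\!\left(\epsilon^{-5p+6}\right)\Big|_{p=2}
  = \widetilde{O}\!\left(\epsilon^{-(5\cdot 2 - 6)}\right)
  = \widetilde{O}\!\left(\epsilon^{-4}\right).
```

Larger exponents $p$ weaken the curvature requirement near the minimizer, and the bound grows accordingly: at $p = 3$ it is $\widetilde{O}(\epsilon^{-9})$.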


Poster
P3-#409
LogART: Pushing the Limit of Efficient Logarithmic Post-Training Quantization

Jiawei Xu ⋅ Yi Zheng ⋅ Chenghe Sun ⋅ Taiyu Zhou ⋅ Zuqi Zhang ⋅ Jie Li ⋅ Lirong Zheng ⋅ Zhuo Zou

Efficient deployment of deep neural networks increasingly relies on Post-Training Quantization (PTQ). Logarithmic PTQ, in particular, promises multiplier-free hardware efficiency, but its performance is often limited by the nonlinear and symmetric quantization grid and the standard rounding-to-nearest (RTN) approach. While learnable rounding has significantly advanced linear PTQ, its application to the nonlinear and often discrete nature of the logarithmic domain remains unexplored. This paper introduces learnable Logarithmic Adaptive Rounding Techniques (LogART) that pioneer task-aware learnable rounding specifically for the logarithmic domain. LogART further extends the learnable rounding strategy to flexibly support outlier-aware, asymmetric, and hardware-friendly dynamic logarithmic bases, determined in a distribution-aware manner using an efficient search strategy. Extensive experiments demonstrate that LogART achieves state-of-the-art accuracy while maintaining efficiency in quantizing models across various architectures and ultra-low bitwidths, outperforming existing logarithmic PTQ methods and paving the way for more effective hardware deployment. The code is available at https://github.com/logart-lab/logart.
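As background, the rounding-to-nearest baseline that the abstract contrasts with can be sketched as follows. This is a generic base-2 logarithmic quantizer, not LogART: the bit allocation (one sign bit, the rest for exponent levels), the clipping range, and the function name are assumptions for illustration.

```python
import numpy as np

def log2_quantize_rtn(weights, num_bits=4):
    """Map weights to signed powers of two by rounding log2(|w|) to the nearest integer.

    Assumes at least one nonzero weight. Exact zeros stay zero; other small
    magnitudes are clipped up to the smallest representable power of two.
    """
    sign = np.sign(weights)
    magnitude = np.abs(weights)
    max_exp = np.round(np.log2(magnitude.max()))
    min_exp = max_exp - (2 ** (num_bits - 1) - 1)
    safe = np.maximum(magnitude, 2.0 ** min_exp)  # avoids log2(0)
    exponent = np.clip(np.round(np.log2(safe)), min_exp, max_exp)
    return sign * 2.0 ** exponent
```

Since every quantized value is a power of two, multiplying by it reduces to a bit shift, which is the source of the multiplier-free efficiency mentioned above. The rounding happens in the log domain, so the grid is dense near zero and coarse for large magnitudes.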

Oblique decision trees combine the transparency of trees with the power of multivariate decision boundaries—but learning high-quality oblique splits is NP-hard, and practical methods still rely on slow search or theory-free heuristics. We present the Hinge Regression Tree (HRT), which reframes each split as a non-linear least-squares problem over two linear predictors whose max/min envelope induces ReLU-like expressive power. The resulting alternating fitting procedure is exactly equivalent to a damped Newton (Gauss–Newton) method within fixed partitions. We analyze this node-level optimization and, for a backtracking line-search variant, prove that the local objective decreases monotonically and converges; in practice, both fixed and adaptive damping yield fast, stable convergence and can be combined with optional ridge regularization. We further prove that HRT’s model class is a universal approximator with an explicit $O(\delta^2)$ approximation rate, and show on synthetic and real-world benchmarks that it matches or outperforms single-tree baselines with more compact structures.


Poster
P3-#407
On Smoothness Bounds for Non-Clairvoyant Scheduling with Predictions

Tianming Zhao ⋅ Albert Zomaya

Algorithms with predictions leverage predictions for unknown inputs in online decision-making. These algorithms are analyzed by consistency, i.e., competitive ratio under perfect predictions, and robustness, i.e., competitive ratio under worst-case predictions. Smoothly degrading performance with an increased prediction error is also desirable. This paper refines the notion of smoothness, a function of prediction error, defined as the competitive ratio over the problem instances where predictions are guaranteed to provide additional information. With our refined smoothness metric, we establish smoothness bounds for a few scheduling problems, including online total completion time minimization and makespan minimization. For a single machine to minimize the total completion time, we show a lower bound of $\eta$ and an $\eta^2$-smooth algorithm, where $\eta$ is the prediction error ($\eta \geq 1$); the bound holds for small errors. For parallel identical machines to minimize the makespan, we show a lower bound of $2 - O(\eta^{-2})$ and present an $O(\eta^2)$-smooth algorithm for small errors. Both bounds are tighter than the existing ones. For uniformly-related machines to minimize the makespan, we show a tight lower bound of $\lceil \log \eta \rceil$, matched by an $O(\log \eta)$-smooth algorithm.


Poster
P3-#406
Generative Bayesian Optimization: Generative Models as Acquisition Functions

Rafael Oliveira ⋅ Dan Steinberg ⋅ Edwin Bonilla

We present a general strategy for turning generative models into candidate solution samplers for batch Bayesian optimization (BO). The use of generative models for BO enables large batch scaling as generative sampling, optimization of non-continuous design spaces, and high-dimensional and combinatorial design. Inspired by the success of direct preference optimization (DPO), we show that one can train a generative model with noisy, simple utility values directly computed from observations to then form proposal distributions whose densities are proportional to the expected utility, i.e., BO's acquisition function values. Furthermore, this approach is generalizable beyond preference-based feedback to general types of reward signals and loss functions. This perspective avoids the construction of surrogate (regression or classification) models, common in previous methods that have used generative models for black-box optimization. Theoretically, we show that the generative models within the BO process follow a sequence of distributions which asymptotically approximate an optimal target under certain conditions. We also evaluate the performance through experiments on challenging optimization problems involving large batches in high dimensions.


Poster
P3-#405
Submodular Function Minimization with Dueling Oracle

Huaiyuan Xiao ⋅ Shinji Ito

We consider submodular function minimization using a *dueling oracle*, a noisy pairwise comparison oracle that provides relative feedback on function values between two queried sets. The oracle's responses are governed by a *transfer function*, which characterizes the relationship between differences in function values and the parameters of the response distribution. For a linear transfer function, we propose an algorithm that achieves an error rate of $O(n^{\frac{3}{2}}/\sqrt{T})$, where $n$ is the size of the ground set and $T$ denotes the number of oracle calls. We establish a lower bound: Under the constraint that differences between queried sets are bounded by a constant, any algorithm incurs an error of at least $\Omega(n^{\frac{3}{2}}/\sqrt{T})$. Without such a constraint, the lower bound becomes $\Omega(n/\sqrt{T})$. These results show that our algorithm is optimal up to constant factors for constrained algorithms. For a sigmoid transfer function, we design an algorithm with an error rate of $O(n^{\frac{7}{5}}/T^{\frac{2}{5}})$, and establish lower bounds analogous to the linear case.


Poster
P3-#404
Energy-Efficient Random Variate Generation via Compressed Lookup Tables

Johann Ukrow ⋅ Anna Kazachkova ⋅ Nicolas Alder ⋅ Sven Köhler ⋅ Rainer Schlosser ⋅ Ralf Herbrich

Generating (pseudo-)random variates lies at the core of probabilistic machine learning and prediction algorithms and yet remains a major bottleneck due to its high computational and energy cost. In this paper, we introduce a general and scalable sampling strategy that enables fast and energy-efficient random variate generation from arbitrary distributions. Our approach is based on compressed lookup tables (cLUT) combined with a fast index sampling scheme. Using only a handful of fast and energy-efficient compute operations on simple array structures, we achieve superior speed, energy efficiency, and precision at near-optimal entropy cost compared to state-of-the-art techniques. Microbenchmarking our approach with a C implementation shows up to 40% savings in time and 50% in energy compared to state-of-the-art approaches. Compared to commonly employed Python samplers, we achieve a 100$\times$ time improvement.
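For readers unfamiliar with table-based sampling, the sketch below shows a plain (uncompressed) lookup-table sampler in Python: each outcome occupies a number of table slots proportional to its probability, and sampling reduces to drawing one uniform index. This is only the textbook baseline idea; the compression scheme and the index sampler of the paper are not reproduced here, and the names `build_lut` and `sample` are hypothetical.

```python
import random

def build_lut(values, probs, size=256):
    """Fill a table of `size` slots so each value gets about probs[i] * size slots."""
    scaled = [p * size for p in probs]
    counts = [int(s) for s in scaled]               # floor of each ideal slot count
    leftover = size - sum(counts)                   # slots lost to flooring
    # Hand the leftover slots to the entries with the largest fractional parts.
    order = sorted(range(len(probs)),
                   key=lambda i: scaled[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    table = []
    for v, c in zip(values, counts):
        table.extend([v] * c)
    return table

def sample(table, rng):
    """Draw one variate: a single uniform index plus one array read."""
    return table[rng.randrange(len(table))]

if __name__ == "__main__":
    lut = build_lut(["a", "b", "c"], [0.5, 0.25, 0.25], size=8)
    rng = random.Random(0)
    print(lut, [sample(lut, rng) for _ in range(5)])
```

The table size bounds the precision of the represented probabilities, which is the usual trade-off that motivates compressing such tables.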


Poster
P3-#403
Constrained Decoding of Diffusion LLMs with Context-Free Grammars

Niels Mündler ⋅ Jasper Dekoninck ⋅ Martin Vechev

Large language models (LLMs) have shown promising performance across diverse domains. Many practical applications of LLMs, such as code completion and structured data extraction, require adherence to syntactic constraints specified by a formal language. Yet, due to their probabilistic nature, LLM output is not guaranteed to adhere to such formal languages. To address this, prior work has proposed constrained decoding to restrict LLM generation to particular formal languages. However, existing works are not applicable to the emerging paradigm of diffusion LLMs, as this requires supporting token generation in arbitrary order instead of the traditional left-to-right order. In this paper, we address this challenge and present the first constrained decoding method for diffusion models, one that can handle formal languages captured by context-free grammars. We begin by reducing constrained decoding to the more general additive infilling problem, which asks whether a partial output with holes can be completed to a valid word in the target language. This problem also naturally subsumes the previously unaddressed multi-region infilling constrained decoding. We then reduce this problem to the task of deciding whether the intersection of the target language and a regular language is empty and present an efficient algorithm to solve this task for context-free languages. Empirical results on various applications, such as C++ code infilling and structured data extraction in JSON, demonstrate that our method achieves near-perfect syntactic correctness while consistently preserving or improving functional correctness. Importantly, our efficiency optimizations ensure that the computational overhead remains practical.


Poster
P3-#402
Beyond Magic Words: Sharpness-Aware Prompt Evolving for Robust Large Language Models with TARE

Frank Wan ⋅ Lucheng Fu ⋅ Haoxin Liu ⋅ Yiqiao Jin ⋅ Hui Leong ⋅ Eric Jiang ⋅ Hejia Geng ⋅ Jinhe Bi ⋅ Yunpu Ma ⋅ Xiangru Tang ⋅ B. Aditya Prakash ⋅ Yizhou Sun ⋅ Wei Wang

The performance of Large Language Models (LLMs) hinges on carefully engineered prompts. However, prevailing prompt optimization methods, ranging from heuristic edits and reinforcement learning to evolutionary search, primarily target point-wise accuracy. Automated prompt search remains brittle: small, semantically preserving paraphrases often cause large performance swings. Existing methods seldom enforce paraphrase invariance or search stability, and therefore cannot remedy this brittleness in practice. We identify this brittleness as the textual sharpness of the prompt landscape. In this work, we provide the first formal treatment of textual sharpness in the discrete, semantic space of prompts, together with an operational robustness criterion over a semantic neighborhood; the design is black-box or API-only, requiring no gradients to update the model's parameters. Then we introduce TARE (Textual Sharpness-Aware Evolving), a derivative-free framework that alternates between an inner, sampling-based adversarial search that stresses a prompt with hard paraphrases and an outer, robust selection that prefers candidates whose neighborhoods remain strong. We further propose ATARE, which learns anisotropic weights to shape the semantic neighborhood and adapts its radius over time to balance exploration and fidelity. Across diverse tasks, our methods, which are designed to minimize the textual sharpness gap, yield prompts that preserve accuracy under paraphrasing, outperforming accuracy-only prompt search while remaining computationally practical. The code is available for anonymous access at https://anonymous.4open.science/r/ATARE_TARE/.


Poster
P3-#2002
Reversible Primitive–Composition Alignment for Continual Vision–Language Learning

Canran Xiao ⋅ Tianxiang Xu ⋅ SiYuan Ma ⋅ Yiyang Jiang ⋅ Haoyu Gao ⋅ Yuhan Wu

Vision-language (VL) models are increasingly deployed in non-stationary settings, yet under sequential adaptation they often preserve primitive recognition while losing compositional structure, especially with tight rehearsal budgets and no task IDs. We address this gap by asking how a continual VL system can maintain structurally dependable behaviour while safeguarding zero-shot performance. We introduce Compo-ReAlign, a structure-first recipe built around three components: a reversible composer that maps primitive embeddings to compositions by design, a multi-positive InfoNCE that jointly aligns textual and composed views of the same target, and a spectral trust region that clips updates when alignment sensitivity inflates. Across compositional DIL and multi-domain MTIL retrieval, Compo-ReAlign sets a new state of the art, improves over the strongest prior by +2.4 R@1, and reduces forgetting by 40%. We provide a compact, reversible alignment head with geometry-aware training for compositionally robust VL continual learning.


Poster
P3-#401
DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training

Xinwei Qiang ⋅ Hongmin chen ⋅ Shixuan Sun ⋅ Jingwen Leng ⋅ Xin Liu ⋅ Minyi Guo

Determinism is indispensable for reproducibility in large language model (LLM) training, yet it often exacts a steep performance cost. In widely used attention implementations such as FlashAttention-3, the deterministic backward pass can incur up to a 37.9% throughput reduction relative to its non‑deterministic counterpart, primarily because gradient accumulation operations must be serialized to guarantee numerical consistency. This performance loss stems from suboptimal scheduling of compute and gradient‑reduction phases, leading to significant hardware underutilization. To address this challenge, we formulate the backward pass of deterministic attention as a scheduling problem on a Directed Acyclic Graph (DAG) and derive schedules that minimize the critical path length. Building on this formulation, we present DASH (Deterministic Attention Scheduling for High-Throughput), which encapsulates two complementary scheduling strategies: (i) Descending Q‑Tile Iteration, a reversed query‑block traversal that shrinks pipeline stalls in causal attention, and (ii) Shift Scheduling, a theoretically optimal schedule within our DAG model that reduces pipeline stalls for both full and causal masks. Our empirical evaluations on NVIDIA H800 GPUs demonstrate that DASH narrows the performance gap of deterministic attention. The proposed strategies improve the throughput of the attention backward pass by up to 1.28$\times$ compared to the baseline, significantly advancing the efficiency of reproducible LLM training. Our code is open-sourced at https://github.com/SJTU-Liquid/deterministic-FA3.
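The reason accumulation order matters at all is that floating-point addition is not associative. The short Python sketch below (plain CPU floats, not GPU code and not the DASH scheduler; `accumulate` is a hypothetical helper) adds the same three partial values in two different orders and gets two different results, which is the effect a deterministic reduction order is meant to rule out.

```python
def accumulate(parts):
    """Sum partial results strictly left to right, as a serialized reduction would."""
    total = 0.0
    for p in parts:
        total += p
    return total

if __name__ == "__main__":
    order_a = accumulate([0.1, 0.2, 0.3])   # ((0.1 + 0.2) + 0.3)
    order_b = accumulate([0.3, 0.2, 0.1])   # ((0.3 + 0.2) + 0.1)
    print(order_a, order_b, order_a == order_b)
```

If the order in which tiles contribute their partial gradients varies from run to run, the summed gradient can differ in its last bits in the same way, so reproducibility requires fixing that order.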

We consider centralized distributed optimization in the classical federated learning setup, where $n$ workers jointly find an $\varepsilon$-stationary point of an $L$-smooth, $d$-dimensional nonconvex function $f$, having access only to unbiased stochastic gradients with variance $\sigma^2$. Each worker requires at most $h$ seconds to compute a stochastic gradient, and the communication times from the server to the workers and from the workers to the server are $\tau_{\textnormal{s}}$ and $\tau_{\textnormal{w}}$ seconds per coordinate, respectively. One of the main motivations for distributed optimization is to achieve scalability with respect to $n$. For instance, it is well known that the distributed version of SGD has a variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{n \varepsilon^2}$, which improves with the number of workers $n$, where $\Delta := f(x^0) - f^*$ and $x^0 \in \mathbb{R}^d$ is the starting point. Similarly, using unbiased sparsification compressors, it is possible to reduce both the variance-dependent runtime term and the communication runtime term from $\tau_{\textnormal{w}} d \frac{L \Delta}{\varepsilon}$ to $\frac{\tau_{\textnormal{w}} d L \Delta}{n \varepsilon} + \sqrt{\frac{\tau_{\textnormal{w}} d h \sigma^2}{n \varepsilon}} \cdot \frac{L \Delta}{\varepsilon}$, which also benefits from increasing $n$. However, once we account for the communication from the server to the workers $\tau_{\textnormal{s}}$, we prove that it becomes infeasible to design a method using unbiased random sparsification compressors that scales both the server-side communication runtime term $\tau_{\textnormal{s}} d \frac{L \Delta}{\varepsilon}$ and the variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{\varepsilon^2}$ better than poly-logarithmically in $n$, even in the homogeneous (i.i.d.) case, where all workers access the same function or distribution.
Indeed, when $\tau_{\textnormal{s}} \simeq \tau_{\textnormal{w}}$, our lower bound is $\tilde{\Omega}(\min[h (\frac{\sigma^2}{n \varepsilon} + 1) \frac{L \Delta}{\varepsilon} + {\tau_{\textnormal{s}} d \frac{L \Delta}{\varepsilon}},\; h \frac{L \Delta}{\varepsilon} + {h \frac{\sigma^2 L \Delta}{\varepsilon^2}}])$. To establish this result, we construct a new "worst-case" function and develop a new lower bound framework that reduces the analysis to the concentration of a random sum, for which we prove a concentration bound. These results reveal fundamental limitations in scaling distributed optimization, even under the homogeneous (i.i.d.) assumption.


Poster
P3-#502
Bi-LoRA: Efficient Sharpness-Aware Minimization for Fine-Tuning Large-Scale Models

Yuhang Liu ⋅ Tao Li ⋅ Zhehao Huang ⋅ Zuopeng Yang ⋅ Xiaolin Huang

Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning of large pre-trained models. Yet LoRA can face generalization challenges. One promising way to improve the generalization is Sharpness-Aware Minimization (SAM), which has proven effective for small-scale training scenarios. In this paper, we propose Bi-directional Low-Rank Adaptation (Bi-LoRA), which introduces an auxiliary adversarial LoRA module. This design explicitly decouples sharpness optimization, handled by the auxiliary module, from task adaptation, performed by the primary module. Such a separation yields two key benefits. First, it transforms the sequential computation of primary LoRA update and adversarial perturbation into a parallel form, which roughly halves the time and conquers the main obstacle of applying SAM in LoRA. Second, it provides perturbations from the auxiliary module that do not collapse into the restricted optimization subspace of the primary module, enabling broader sharpness exploration and flatter minima. Bi-LoRA simultaneously achieves both efficiency and effectiveness within a single framework, as verified by extensive experiments across diverse architectures and tasks.
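For context, a vanilla SAM step needs two sequential gradient evaluations: one to find the adversarial perturbation and one at the perturbed point. That sequential cost is what the abstract says Bi-LoRA turns into a parallel form. The Python sketch below shows plain SAM only, on a hypothetical two-parameter quadratic; the loss, `curv`, `lr`, and `rho` are illustrative assumptions, and this is not the Bi-LoRA algorithm.

```python
import math

def loss(w, curv=(1.0, 10.0)):
    """Toy quadratic loss 0.5 * sum(c_i * w_i^2), standing in for a training loss."""
    return 0.5 * sum(c * wi * wi for c, wi in zip(curv, w))

def grad(w, curv=(1.0, 10.0)):
    """Gradient of the toy loss."""
    return [c * wi for c, wi in zip(curv, w)]

def sam_step(w, lr=0.05, rho=0.1):
    """One vanilla SAM update: perturb along the gradient, then descend."""
    g = grad(w)                                          # first gradient pass
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0    # avoid division by zero
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad(w_adv)                                  # second, sequential pass
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

if __name__ == "__main__":
    w = [1.0, 1.0]
    for _ in range(3):
        w = sam_step(w)
        print(w, loss(w))
```

The second gradient call cannot start until the first has finished, so each step costs roughly two forward-backward passes.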


Poster
P3-#503
HiFo-Prompt: Prompting with Hindsight and Foresight for LLM-based Automatic Heuristic Design

ChentongChen ⋅ Mengyuan Zhong ⋅ Jialong Shi ⋅ Jianyong Sun ⋅ Ye Fan

This paper investigates the application of Large Language Models (LLMs) in Automated Heuristic Design (AHD), where their integration into evolutionary frameworks reveals a significant gap in global control and long-term learning. We propose the Hindsight-Foresight Prompt (HiFo-Prompt), a novel framework for LLM-based AHD designed to overcome these limitations. This is achieved through two synergistic strategies: Foresight and Hindsight. Foresight acts as a high-level meta-controller, monitoring population dynamics (e.g., stagnation and diversity collapse) to switch the global search strategy between exploration and exploitation explicitly. Hindsight builds a persistent knowledge base by distilling successful design principles from past generations, making this knowledge reusable. This dual mechanism ensures that the LLM is not just a passive operator but an active reasoner, guided by a global plan (Foresight) while continuously improving from its cumulative experience (Hindsight). Empirical results demonstrate that HiFo-Prompt significantly outperforms a comprehensive suite of state-of-the-art AHD methods, discovering higher-quality heuristics with substantially improved convergence speed and query efficiency. Our code is available at https://github.com/Challenger-XJTU/HiFo-Prompt.


Poster
P3-#504
The Unseen Frontier: Pushing the Limits of LLM Sparsity with Surrogate-Free ADMM

Kwanhee Lee ⋅ Hyeondo Jang ⋅ Dongyeop Lee ⋅ Dan Alistarh ⋅ Namhoon Lee

Neural network pruning is a promising technique to mitigate the excessive computational and memory requirements of large language models (LLMs). Despite its promise, however, progress in this area has diminished, as conventional methods are seemingly unable to surpass moderate sparsity levels (50-60%) without severely degrading model accuracy. This work breaks through the current impasse, presenting a principled and effective method called $\text{Elsa}$, which achieves extreme sparsity levels of up to 90% while retaining high model fidelity. This is done by identifying several limitations in current practice, all of which can be traced back to their reliance on a surrogate objective formulation. $\text{Elsa}$ tackles this issue directly and effectively via standard and well-established constrained optimization techniques based on ADMM. Our extensive experiments across a wide range of models and scales show that $\text{Elsa}$ achieves substantial improvements over existing methods; e.g., it achieves 7.8$\times$ lower perplexity than the best existing method on LLaMA-2-7B at 90% sparsity. Moreover, we show that $\text{Elsa}$ remains stable even at extreme sparsity (e.g., 95%), yielding up to 3.98$\times$ inference speedup and 7.80$\times$ memory compression over its dense counterpart. We also present $\text{Elsa}_{-L}$, a quantized variant that scales to extremely large models (27B), and establish its theoretical convergence guarantees. These results highlight meaningful progress in advancing the frontier of LLM sparsity and suggest that significant opportunities for further advancement may remain in directions that have so far attracted limited exploration.


Poster
P3-#505
RepSpec: Structural Re-parameterized Draft Model Training for Speculative Decoding

FEIYE HUO ⋅ Jianchao Tan ⋅ Jiahao Liu ⋅ Zixu Jiang ⋅ Jiacheng Li ⋅ Jingang Wang ⋅ Xunliang Cai ⋅ Shengli Sun

As the parameter size of large language models (LLMs) continues to grow, the latency of autoregressive inference increases due to memory-bound computational inefficiency. To address this, speculative decoding has been proposed, where a large target model verifies multiple tokens generated in parallel by a smaller draft model. However, the performance of speculative decoding is fundamentally limited by the draft model’s capacity, which stems from the parameter gap between the two models. To overcome this limitation, we propose RepSpec, which combines structural re-parameterization with draft model training. During training, redundant linear structures are introduced and later merged into the backbone network during inference, thus enhancing the draft model’s training effectiveness without increasing inference cost. By applying our method to improve the current state-of-the-art approach, EAGLE, we achieve a significant improvement in accepted sequence length. Furthermore, considering the specific characteristics of the speculative decoding scenario, we explore a hybrid training strategy that combines linear and nonlinear structures, which yields a further improvement in acceptance length.
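The claim that extra linear structure can be merged away at inference follows from linearity. The Python sketch below illustrates the generic idea with one assumed layout, a parallel linear branch added to a main linear layer; the abstract does not specify RepSpec's actual structure, so the branch layout and the helper names are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def merge_branches(W_main, W_extra):
    """Fold a parallel linear branch into the main weight: W_main + W_extra."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(W_main, W_extra)]

if __name__ == "__main__":
    W_main = [[1.0, 2.0], [3.0, 4.0]]
    W_extra = [[0.5, 0.0], [0.0, -1.0]]   # redundant branch used only in training
    x = [1.0, 2.0]

    # Training-time forward pass: two branches, outputs summed.
    two_branch = [a + b for a, b in zip(matvec(W_main, x), matvec(W_extra, x))]
    # Inference-time forward pass: one merged matrix, same output.
    merged = matvec(merge_branches(W_main, W_extra), x)
    print(two_branch, merged)
```

Because the merged layer has the same shape as the original one, the inference cost is unchanged, which matches the abstract's statement about not increasing inference cost.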


Poster
P3-#506
Tighter Performance Theory of FedExProx

Wojciech Anyszka ⋅ Kaja Gruntkowska ⋅ Alexander Tyurin ⋅ Peter Richtarik

We revisit FedExProx -- a recently proposed distributed optimization method designed to enhance convergence properties of parallel proximal algorithms via extrapolation. In the process, we uncover a surprising flaw: its known theoretical guarantees on quadratic optimization tasks are no better than those offered by the vanilla Gradient Descent (GD) method. Motivated by this observation, we develop a novel analysis framework, establishing a tighter linear convergence rate for non-strongly convex quadratic problems. By incorporating both computation and communication costs, we demonstrate that FedExProx can indeed provably outperform GD, in stark contrast to the original analysis. Furthermore, we consider partial participation scenarios and analyze two adaptive extrapolation strategies -- based on gradient diversity and Polyak stepsizes -- again significantly outperforming previous results. Moving beyond quadratics, we extend the applicability of our analysis to general functions satisfying the Polyak-Łojasiewicz condition, outperforming the previous strongly convex analysis while operating under weaker assumptions. Backed by empirical results, our findings point to a new and stronger potential of FedExProx, paving the way for further exploration of the benefits of extrapolation in federated learning.


Poster
P3-#507
Sobolev Gradient Ascent for Optimal Transport: Barycenter Optimization and Convergence Analysis

Kaheon Kim ⋅ Bohan Zhou ⋅ Changbo Zhu ⋅ Xiaohui Chen

This paper introduces a new constraint-free concave dual formulation for the Wasserstein barycenter. Tailoring the vanilla dual gradient ascent algorithm to the Sobolev geometry, we derive a scalable Sobolev gradient ascent (SGA) algorithm to compute the barycenter for input distributions supported on a regular grid. Despite the algorithmic simplicity, we provide a global convergence analysis that achieves the same rate as the classical subgradient descent methods for minimizing nonsmooth convex functions in the Euclidean space. A central feature of our SGA algorithm is that the computationally expensive $c$-concavity projection operator enforced on the Kantorovich dual potentials is unnecessary to guarantee convergence, leading to significant algorithmic and theoretical simplifications over all existing primal and dual methods for computing the exact barycenter. Our numerical experiments demonstrate the superior empirical performance of SGA over the existing optimal transport barycenter solvers.


Poster
P3-#508
AdaCache: Adaptive Caching and Context Augmentation for Efficient LLM Serving

Zihao Zeng ⋅ Siyi Li ⋅ Xinyu Yan ⋅ Lei Xiao ⋅ Wei Yang Bryan Lim

Retrieval-Augmented Generation (RAG) significantly enhances Large Language Models by integrating external knowledge sources, but at the cost of substantial computational overhead from extended input sequences. Current RAG systems exhibit two fundamental inefficiencies: redundant processing of frequently retrieved text chunks across multiple queries, and uniform deep retrieval that over-provisions context regardless of query complexity. We present AdaCache, an adaptive caching framework that addresses these limitations through dual optimization strategies. First, we introduce a cache-aware partial recomputation mechanism that profiles attention patterns to construct selective cache variants, enabling flexible reuse while preserving cross-chunk dependencies. Second, we develop adaptive context augmentation that dynamically determines optimal retrieval depth via lightweight confidence estimation, avoiding unnecessary overhead on simple queries. Comprehensive experiments across diverse datasets and LLMs demonstrate that AdaCache delivers substantial improvements in Time-To-First-Token compared to state-of-the-art RAG caching systems, while preserving generation quality.


Poster
P3-#509
MnemoDyn: Learning Resting State Dynamics from $40$K FMRI sequences

Sourav Pal ⋅ Viet Luong ⋅ Hoseok Lee ⋅ Tingting Dan ⋅ Guorong Wu ⋅ Richard Davidson ⋅ Won Hwa Kim ⋅ Vikas Singh

We present a dynamical-systems based model for resting-state functional magnetic resonance imaging (rs-fMRI), trained on a dataset of roughly $40$K rs-fMRI sequences covering a wide variety of public and available-by-permission datasets. While most existing proposals use transformer backbones, we utilize multi-resolution temporal modeling of the dynamics across parcellated brain regions. We show that MnemoDyn is compute efficient and generalizes very well across diverse populations and scanning protocols. When benchmarked against current state-of-the-art transformer-based approaches, MnemoDyn consistently delivers superior reconstruction quality. Overall, we find that with such large-scale pre-training on (non-proprietary) rs-fMRI datasets, we get a highly performant model for various downstream tasks. Our results also provide evidence of the efficacy of the model on small-sample-size studies, which has implications for neuroimaging studies at large, where resting-state fMRI is a commonly acquired imaging modality.


Poster
P3-#510
Iterative Training of Physics-Informed Neural Networks with Fourier-enhanced Features

yulun wu ⋅ Miguel Aguiar ⋅ Karl H. Johansson ⋅ Matthieu Barreau

Spectral bias, the tendency of neural networks to learn low-frequency features first, is a well-known issue with many training algorithms for physics-informed neural networks (PINNs). To overcome this issue, we propose IFeF-PINN, an algorithm for iterative training of PINNs with Fourier-enhanced features. The key idea is to enrich the latent space using high-frequency components through Random Fourier Features. This creates a two-stage training problem: (i) estimate a basis in the feature space, and (ii) perform regression to determine the coefficients of the enhanced basis functions. For an underlying linear model, it is shown that the latter problem is convex, and we prove that the iterative training scheme converges. Furthermore, we empirically establish that Random Fourier Features enhance the expressive capacity of the network, enabling accurate approximation of high-frequency PDEs. Through extensive numerical evaluation on classical benchmark problems, the superior performance of our method over state-of-the-art algorithms is shown, and the improved approximation across the frequency domain is illustrated.
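A minimal Python sketch of the two ingredients named above: a Random Fourier Feature map, and a model that is linear in its coefficients once the features are fixed (which is why fitting those coefficients by least squares is a convex problem). The cosine/sine form with Gaussian frequencies is one common variant and an assumption here; `make_rff`, `scale`, and `linear_model` are hypothetical names, and this is not the IFeF-PINN training loop.

```python
import math
import random

def make_rff(in_dim, n_features, scale, seed=0):
    """Return phi(x) = [cos(Bx), sin(Bx)] with Gaussian random frequencies B."""
    rng = random.Random(seed)
    B = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
         for _ in range(n_features)]

    def phi(x):
        proj = [sum(b * xi for b, xi in zip(row, x)) for row in B]
        return [math.cos(p) for p in proj] + [math.sin(p) for p in proj]

    return phi

def linear_model(coeffs, features):
    """Output is linear in `coeffs` for fixed features."""
    return sum(c * f for c, f in zip(coeffs, features))

if __name__ == "__main__":
    phi = make_rff(in_dim=1, n_features=4, scale=10.0)
    feats = phi([0.3])
    print(len(feats), linear_model([0.1] * len(feats), feats))
```

A larger `scale` draws higher frequencies, which is the knob that lets such features represent rapidly oscillating targets.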


Poster
P3-#511
Sign-SGD via Parameter-Free Optimization

Daniil Medyakov ⋅ Stanko Sergey ⋅ Gleb Molodtsov ⋅ Philip Zmushko ⋅ Grigoriy Evseev ⋅ Egor Petrov ⋅ Aleksandr Beznosikov

Large language models have achieved major advances across domains, yet training them remains extremely resource-intensive. We revisit Sign-SGD, which serves both as a memory-efficient optimizer for single-node training and as a gradient compression mechanism for distributed learning. This paper addresses a central limitation: the effective stepsize cannot be determined a priori because it relies on unknown, problem-specific quantities. We present a parameter-free Sign-SGD that removes manual stepsize selection. We analyze the deterministic single-node case, and extend the method to stochastic single-node training and multi-node settings. We also incorporate the momentum technique into our algorithms and propose a memory-efficient variant that stores only gradient signs instead of full gradients. We evaluate our methods on pre-training LLaMA models (130M and 350M) and fine-tuning a Swin Transformer (28M). Across considered tasks, the proposed methods match the performance of tuned Sign-SGD and AdamW (grid-searched stepsizes with a cosine schedule), while avoiding tuning overhead. Employing parameter-free training yields approximately $1.5\times$ end-to-end speedup compared to runs with grid-searched stepsizes.
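To show why the stepsize is the sticking point, the Python sketch below implements vanilla Sign-SGD with a fixed, hand-picked stepsize on a hypothetical one-dimensional quadratic. It is not the parameter-free method of the paper; `sign_sgd_step`, `run`, and the toy objective are assumptions for illustration.

```python
def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)

def sign_sgd_step(x, grad, stepsize):
    """Move every coordinate by exactly `stepsize` against its gradient sign."""
    return [xi - stepsize * sign(gi) for xi, gi in zip(x, grad)]

def run(x0, target, stepsize, steps):
    """Minimize sum((x_i - target_i)^2) with fixed-stepsize Sign-SGD."""
    x = list(x0)
    for _ in range(steps):
        g = [2.0 * (xi - ti) for xi, ti in zip(x, target)]
        x = sign_sgd_step(x, g, stepsize)
    return x

if __name__ == "__main__":
    print(run([1.0], [0.25], stepsize=0.1, steps=20))
    print(run([1.0], [0.25], stepsize=0.01, steps=20))
```

Every update has the same magnitude regardless of the gradient size, so a stepsize that is too large leaves the iterate bouncing around the optimum while one that is too small makes slow progress; a good value depends on problem-specific scales that are not known in advance.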


Poster
P3-#512
Stable-LoRA: Stabilizing Feature Learning of Low-Rank Adaptation

Yize Wu ⋅ KE GAO ⋅ Ling Li ⋅ Yanjun WU

Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient method for fine-tuning Large Language Models. It updates the weight matrix as $W=W_0+sBA$, where $W_0$ is the original frozen weight, $s$ is a scaling factor and $A$, $B$ are trainable low-rank matrices. Despite its robust empirical effectiveness, the theoretical foundations of LoRA remain insufficiently understood, particularly with respect to feature learning stability. In this paper, we first establish that LoRA can, in principle, naturally achieve and sustain stable feature learning (i.e., be self-stabilized) under appropriate hyper-parameters and initializations of $A$ and $B$. However, we also uncover a fundamental limitation that the necessary non-zero initialization of $A$ compromises self-stability, leading to suboptimal performance. To address this challenge, we propose Stable-LoRA, a weight-shrinkage optimization strategy that dynamically enhances stability of LoRA feature learning. By progressively shrinking $A$ during the earliest training steps, Stable-LoRA is both theoretically and empirically validated to effectively eliminate instability of LoRA feature learning while preserving the benefits of the non-zero start. Experiments show that Stable-LoRA consistently outperforms other baselines across diverse models and tasks, with no additional memory usage and only negligible computation overheads. The code is available at https://github.com/Yize-Wu/Stable-LoRA.
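The update rule $W=W_0+sBA$ can be sketched in a few lines of Python. The example uses tiny hand-written matrices and the standard LoRA initialization ($B=0$, $A$ non-zero), under which the adapted weight starts out equal to $W_0$. The helper names are hypothetical, and the shrinkage schedule of Stable-LoRA is not shown.

```python
def matmul(X, Y):
    """Plain nested-list matrix product X @ Y."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def lora_weight(W0, B, A, s):
    """Effective weight W = W0 + s * B @ A; W0 stays frozen, only A and B train."""
    BA = matmul(B, A)
    return [[w + s * d for w, d in zip(row_w, row_d)]
            for row_w, row_d in zip(W0, BA)]

if __name__ == "__main__":
    W0 = [[1.0, 0.0], [0.0, 1.0]]      # frozen 2x2 weight
    A = [[3.0, 4.0]]                   # rank-1 factor, shape 1x2, non-zero start
    B_init = [[0.0], [0.0]]            # shape 2x1, zero start
    B_trained = [[1.0], [2.0]]         # made-up values after some training

    print(lora_weight(W0, B_init, A, s=0.5))     # equals W0 at initialization
    print(lora_weight(W0, B_trained, A, s=0.5))
```

With rank $r$, the trainable factors hold $r(d_{\text{in}}+d_{\text{out}})$ numbers instead of the $d_{\text{in}} d_{\text{out}}$ entries of a full update, which is where the parameter efficiency comes from.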


Poster
P3-#513
Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

Ziyan Wang ⋅ Zheng Wang ⋅ Xingwei Qu ⋅ Qi Cheng ⋅ Jie Fu ⋅ Shengpu Tang ⋅ Minjia Zhang ⋅ Xiaoming Huo

Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient mechanism that addresses these limitations by decomposing each iteration into three stages: a short fast trajectory of inner steps on the same batch, a reposition step to control off-policy drift, and a final slow correction. This reposition-before-update design leaves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces the number of rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on math reasoning benchmarks. It also achieves up to 4.93$\times$ fewer rollouts and a 4.19$\times$ reduction in wall-clock time to match GRPO’s best accuracy.


Poster
P3-#514
SkillFactory: Self-Distillation for Learning Cognitive Behaviors

Zayne Sprague ⋅ Jack Lu ⋅ Manya Wadhwa ⋅ Sedrick Keh ⋅ Mengye Ren ⋅ Greg Durrett

Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, training that model further with reinforcement learning (RL) can learn to leverage them. How can we get models to leverage skills that aren't exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These "silver" SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from SkillFactory SFT initialization helps a model to generalize to harder variants of a task post-RL, despite lower performance pre-RL; (2) cognitive skills are indeed used by the model; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.


Poster
P3-#515
RLAP-CLIP: Continual Multimodal Learning with Prototype Adaptation and Difficulty-Aware Routing

Ruikun Luo ⋅ Jiarui Wang ⋅ Yuan Gao ⋅ Jing Yang ⋅ Jieming Yang ⋅ Song Wu ⋅ Hai Jin ⋅ Xiaoyu Xia

Vision-language models, such as CLIP, achieve strong zero-shot performance through contrastive pre-training but face significant challenges in class-incremental image classification scenarios. When learning new tasks sequentially, current methods suffer from degradation in prototype quality due to passive averaging and underutilize their visual adaptation capabilities. We propose RLAP-CLIP, which addresses these limitations through three components. First, Reinforcement Learning-based Prototype Optimization (RLPO) formulates prototype construction as a reinforcement learning problem to actively optimize class separability rather than relying on simple averaging. Second, difficulty-aware cross-modal fusion uses a mixture-of-experts to route samples through specialized processing pathways based on complexity. Third, dual-modal prompting balances visual and textual adaptation. Experiments on eight image classification benchmarks demonstrate consistent improvements, with RLAP-CLIP achieving average accuracy gains of 3.72-4.46 points and final accuracy improvements of 0.49-4.48 points over other methods, validating that RLAP-CLIP achieves state-of-the-art performance.


Poster
P3-#516
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource

Houyi Li ⋅ Ka Man Lo ⋅ Shijie Xuyang ⋅ Ziqi Wang ⋅ Wenzhen Zheng ⋅ Haocheng Zhang ⋅ Zhao Li ⋅ Shuigeng Zhou ⋅ Xiangyu Zhang ⋅ Daxin Jiang

Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute. However, can MoEs surpass dense architectures under strictly equal resource constraints, that is, when the total parameter count, training compute, and data budget are identical? This question remains under-explored despite its significant practical value and potential. In this paper, we propose a novel perspective and methodological framework to study this question thoroughly. First, we comprehensively investigate the architecture of MoEs and arrive at an optimal model design that maximizes performance. Based on this, we subsequently find that an MoE model with an activation rate in an optimal region is able to outperform its dense counterpart under the same total parameter count, training compute, and data budget. More importantly, this optimal region remains consistent across different model sizes. Although the enhanced performance comes at the cost of additional data, we show that this can be resolved by reusing data. We validate our findings through extensive experiments, training nearly 200 language models at 2B scale and over 50 at 7B scale, cumulatively processing 50 trillion tokens. All code and models will be released publicly.


Poster
P3-#517
ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models

Federico Danieli ⋅ Pau Rodriguez ⋅ Miguel Sarabia ⋅ Xavier Suau ⋅ Luca Zappella

Recurrent Neural Networks (RNNs) laid the foundation for sequence modeling, but their intrinsic sequential nature restricts parallel computation, creating a fundamental barrier to scaling. This has led to the dominance of parallelizable architectures like Transformers and, more recently, State Space Models (SSMs). While SSMs achieve efficient parallelization through structured linear recurrences, this linearity constraint limits their expressive power and precludes modeling complex, nonlinear sequence-wise dependencies. To address this, we present ParaRNN, a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Building on prior work, we cast the sequence of nonlinear recurrence relationships as a single system of equations, which we solve in parallel using Newton's iterations combined with custom parallel reductions. Our implementation achieves speedups of up to $665\times$ over naïve sequential application, allowing training nonlinear RNNs at unprecedented scales. To showcase this, we apply ParaRNN to adaptations of LSTM and GRU architectures, successfully training models of 7B parameters that attain perplexity comparable to similarly-sized Transformers and Mamba2 architectures. To accelerate research in efficient sequence modeling, we release the ParaRNN codebase as an open-source framework for automatic training-parallelization of nonlinear RNNs, enabling researchers and practitioners to explore new nonlinear RNN models at scale.
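
The idea of treating a whole nonlinear recurrence as one system of equations can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the ParaRNN implementation: it uses a hypothetical scalar recurrence $h_t = \tanh(w\,h_{t-1} + x_t)$, made-up inputs, and a plain loop for the linear recurrence inside each Newton step, where a parallel reduction would be used in practice.

```python
import math

def sequential_rnn(x, w):
    """Reference: h_t = tanh(w * h_{t-1} + x_t), evaluated one step at a time."""
    h, prev = [], 0.0
    for x_t in x:
        prev = math.tanh(w * prev + x_t)
        h.append(prev)
    return h

def newton_rnn(x, w, tol=1e-12):
    """Solve all T equations h_t - tanh(w * h_{t-1} + x_t) = 0 jointly with Newton's method."""
    T = len(x)
    h = [0.0] * T                      # initial guess for the whole trajectory
    iterations = 0
    for _ in range(T):                 # for this triangular system, T steps always suffice
        iterations += 1
        prev_h = [0.0] + h[:-1]        # h_{t-1}, with the state before the first step set to 0
        pre = [w * p + x_t for p, x_t in zip(prev_h, x)]
        residual = [h_t - math.tanh(a) for h_t, a in zip(h, pre)]
        jac = [w * (1.0 - math.tanh(a) ** 2) for a in pre]  # d tanh(.) / d h_{t-1}
        # Newton step: delta_t = jac_t * delta_{t-1} - residual_t, a linear recurrence.
        # A plain loop solves it here; this is the part a parallel scan could replace.
        delta, d_prev = [], 0.0
        for j_t, r_t in zip(jac, residual):
            d_prev = j_t * d_prev - r_t
            delta.append(d_prev)
        h = [h_t + d_t for h_t, d_t in zip(h, delta)]
        if max(abs(d) for d in delta) < tol:
            break
    return h, iterations

x = [0.5, -1.0, 0.3, 0.8, -0.2, 0.1, 0.9, -0.7]  # hypothetical inputs
w = 0.9                                          # hypothetical recurrent weight
h_seq = sequential_rnn(x, w)
h_newton, n_iter = newton_rnn(x, w)
print(n_iter, max(abs(a - b) for a, b in zip(h_seq, h_newton)))
```

Each Newton step only needs a linear recurrence over the sequence, which is the kind of operation that admits a parallel scan; the nonlinearity is confined to elementwise evaluations of `tanh`.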


Poster
P3-#518
Equilibrium Language Models

Yikun Jiang ⋅ Huanyu Wang ⋅ Tianhong Ding ⋅ Wenhu Zhang ⋅ Yiming Wu ⋅ Hanbin Zhao ⋅ John C.S. Lui

Large Language Models (LLMs) excel across diverse applications but remain impractical for edge deployment due to severe memory bottlenecks on edge devices. We propose Equilibrium Language Models (ELMs), a novel compression framework that replaces groups of Transformer layers with a lightweight fixed-point network, reinterpreting deep computation as solving for an equilibrium state. To realize ELMs, we introduce Group Pruning Policy Optimization, which automatically learns optimal pruning intervals. Moreover, to enable effective deployment on edge devices, we propose One-Step KV-Cache, which drastically reduces memory overhead by storing only the final iteration cache without compromising accuracy. Across different tasks such as common sense reasoning, mathematical problem solving, and code generation, ELMs prune 28% of parameters while retaining 99% of the accuracy of dense fine-tuned LLMs, establishing a new direction for memory-efficient edge deployment of large models.


Poster
P3-#519
DiaBlo: Diagonal Blocks Are Sufficient For Finetuning

Selcuk Gurses ⋅ Aozhong Zhang ⋅ Yanxia Deng ⋅ Xun Dong ⋅ Xin Li ⋅ Naigang Wang ⋅ Penghang Yin ⋅ Zi Yang

Fine-tuning is a critical step for adapting large language models (LLMs) to domain-specific downstream tasks. To mitigate the substantial computational and memory costs of full-model fine-tuning, Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed to update only a small subset of model parameters. However, performance gaps between PEFT approaches and full-model fine-tuning still exist. In this work, we present DiaBlo, a simple yet effective PEFT approach that updates only the diagonal blocks of selected model weight matrices. Unlike Low-Rank Adaptation (LoRA) and its variants, DiaBlo eliminates the need for low-rank matrix products, thereby avoiding the reliance on auxiliary initialization schemes or customized optimization strategies to improve convergence. This design leads to stable and robust convergence while maintaining comparable memory efficiency and training speed to LoRA. Moreover, we provide theoretical guarantees showing that, under mild low-rank conditions, DiaBlo is more expressive than LoRA in the linear problem and converges to a stationary point of the general nonlinear full fine-tuning. Through extensive experiments across a range of tasks—including commonsense reasoning, arithmetic reasoning, code generation, and safety alignment—we show that fine-tuning only diagonal blocks is sufficient for strong and consistent performance. DiaBlo not only achieves competitive accuracy but also preserves high memory efficiency and fast fine-tuning speed. Codes are available at https://github.com/ziyangjoy/DiaBlo.


Poster
P3-#520
Towards Understanding The Calibration Benefits of Sharpness-Aware Minimization

Chengli Tan ⋅ Yubo Zhou ⋅ Haishan Ye ⋅ Guang Dai ⋅ Junmin Liu ⋅ Zengjie Song ⋅ Jiangshe Zhang ⋅ Zixiang Zhao ⋅ Yunda Hao ⋅ Yong Xu

Deep neural networks have been increasingly used in safety-critical applications such as medical diagnosis and autonomous driving. However, many studies suggest that they are prone to being poorly calibrated and have a propensity for overconfidence, which may have disastrous consequences. In this paper, unlike standard training such as stochastic gradient descent, we show that the recently proposed sharpness-aware minimization (SAM) counteracts this tendency towards overconfidence. The theoretical analysis suggests that SAM allows us to learn models that are already well-calibrated by implicitly maximizing the entropy of the predictive distribution. Inspired by this finding, we further propose a variant of SAM, coined as CSAM, to ameliorate model calibration. Extensive experiments on various datasets, including ImageNet-1K, demonstrate the benefits of SAM in reducing calibration error. Meanwhile, CSAM performs even better than SAM and consistently achieves lower calibration error than other approaches.


Poster
P3-#521
Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries

Divyat Mahajan ⋅ Sachin Goyal ⋅ Badr Youbi Idrissi ⋅ Mohammad Pezeshki ⋅ Ioannis Mitliagkas ⋅ David Lopez-Paz ⋅ Kartik Ahuja

Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, with these limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generations. We explore two variants of FSP: handcrafted summaries, for example, a bag of words summary of the future of the sequence, and learned summaries, which use embeddings produced by a reverse language model trained from right to left. Large-scale pretraining experiments (3B and 8B-parameter models) demonstrate that FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.


Poster
P3-#522
Toward Principled Flexible Scaling for Self-Gated Neural Activation

Sudong Cai ⋅ Shuyuan Zheng ⋅ Bingzhi Chen ⋅ Shuai Yuan ⋅ Chuan Xiao ⋅ Jianbin Qin ⋅ Bing WANG

Neural networks necessitate nonlinearities to achieve universal approximability. Traditional activation functions introduce nonlinearities through rigid feature rectifications. Recent self-gated variants improve traditional methods in fitting flexibility by incorporating learnable content-aware factors and non-local dependencies, enabling dynamic adjustments to activation curves via adaptive translation and scaling. While SOTA approaches achieve notable gains in conventional CNN layers, they struggle to enhance Transformer layers, where fine-grained context is inherently modeled, severely reducing the effectiveness of non-local dependencies leveraged in activation processes. We refer to this critical yet unexplored challenge as the non-local tension of activation. Drawing on a decision-making perspective, we systematically analyze the origins of the non-local tension problem and explore the initial solution to foster a more discriminative and generalizable neural activation methodology. This is achieved by rethinking how non-local cues are encoded and transformed into adaptive scaling coefficients, which in turn recalibrate the contributions of features to filter updates through neural activation. Grounded in these insights, we present FleS, a novel self-gated activation model for discriminative pattern recognition. Extensive experiments on various popular benchmarks validate our interpretable methodology for improving neural activation modeling.


Poster
P3-#523
UNITE: Universal kNowledge Integration from Task-specific Experts

Shuxia Lin ⋅ Qiufeng Wang ⋅ xu yang ⋅ Xin Geng

Large language models (LLMs) with Mixture-of-Experts (MoE) architectures achieve strong performance under sparse activation. However, their expertise is often fragmented across experts and redundant across layers. Prior studies primarily diagnosed redundancy or parameter importance, revealing overlaps but lacking mechanisms to transform them into reusable knowledge. In contrast, human learning succeeds not by memorizing isolated facts but by reusing shared strategies across domains, which motivates the question: do MoE models similarly encode universal knowledge that can be systematically extracted and reused? We propose Universal kNowledge Integration from Task-specific Experts (UNITE), a framework that consolidates experts through Fisher-weighted fusion and then applies Tucker decomposition to disentangle shared low-rank input/output subspaces as universal knowledge from layer-specific variations. This universal component provides a compact basis for reconstructing target models with flexible depth, enabling lightweight yet competitive adaptation across tasks. To assess effectiveness, we evaluate data efficiency, convergence speed, and generalization across multiple MoE-based LLMs and diverse datasets. The results show that UNITE not only extracts universal knowledge but also enables once-for-all extraction and flexible target model construction that generalize across domains.


Poster
P3-#524
Compute-Optimal Quantization-Aware Training

Aleksandr Dremov ⋅ David Grangier ⋅ Angelos Katharopoulos ⋅ Awni Hannun

Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior accuracy compared to QAT alone. However, the optimal allocation of compute between the FP and QAT phases remains unclear. We conduct extensive experiments with various compute budgets, QAT bit widths, and model sizes from 86.0M to 2.2B to investigate how different QAT durations impact final performance. We demonstrate that, contrary to previous findings, the loss-optimal ratio of QAT to FP training increases with the total amount of compute. Moreover, the optimal fraction can be accurately predicted for a wide range of model sizes and quantization widths using the tokens-per-parameter-byte statistic. From experimental data, we derive a loss scaling law that predicts both optimal QAT ratios and final model performance across different QAT/FP compute allocation strategies and QAT bit widths. We use the scaling law to make further predictions, which we verify experimentally, including which QAT bit width is optimal under a given memory constraint and how QAT accuracy with different bit widths compares to full-precision model accuracy. Additionally, we propose a novel cooldown and QAT fusion approach that performs learning rate decay jointly with quantization-aware training, eliminating redundant full-precision model updates and achieving significant compute savings. These findings provide practical insights into efficient QAT planning and enable the training of higher-quality quantized models with the same compute budget.


Poster
P3-#525
LANE: Label-Aware Noise Elimination for Fine-Grained Text Classification

Tiberiu Sosea ⋅ Cornelia Caragea

In this paper, we propose Label-Aware Noise Elimination (LANE), a new approach to learning with noisy labels. At its core, LANE introduces a new metric, the label-aware margin, aimed at quantifying the degree of noise of each training example (or quality thereof). LANE leverages the semantic relations between classes and monitors the training dynamics of the model on each training example to dynamically lower the weight of training examples that are perceived to have noisy labels. We test the effectiveness of LANE on multiple text classification tasks and benchmark our approach on a wide variety of datasets with various numbers of classes and amounts of label noise. LANE considerably outperforms strong baselines on all datasets and settings, obtaining significant improvements ranging from an average improvement of 2.88% in F1 on manually annotated datasets to a considerable average improvement of 4.75% in F1 on datasets with a high level of injected label noise. We carry out a comprehensive analysis of LANE and identify the key components that lead to its success.


Poster
P3-#526
Intrinsic Lorentz Neural Network

Xianglong Shi ⋅ Ziheng Chen ⋅ Yunhan Jiang ⋅ Nicu Sebe

Real-world data frequently exhibit latent hierarchical structures, which can be naturally represented by hyperbolic geometry. Although recent hyperbolic neural networks have demonstrated promising results, many existing architectures remain only partially intrinsic, mixing Euclidean operations with hyperbolic ones or relying on extrinsic parameterizations. To address this, we propose the Intrinsic Lorentz Neural Network (ILNN), a fully intrinsic hyperbolic architecture that conducts all computations within the Lorentz model. At its core, the network introduces a novel point-to-hyperplane fully connected (FC) layer, replacing traditional Euclidean affine logits with closed-form hyperbolic distances from features to learned Lorentz hyperplanes, thereby ensuring that the resulting geometric decision functions respect the inherent curvature. Around this fundamental layer, we design intrinsic modules: GyroLBN, a Lorentz batch normalization that couples gyro-centering with gyro-scaling, consistently outperforming both LBN and GyroBN while reducing training time. We additionally propose a gyro-additive bias for the FC output, a Lorentz patch-concatenation operator that aligns the expected log-radius across feature blocks via a digamma-based scale, and a Lorentz dropout layer. Extensive experiments conducted on CIFAR-10/100 and two genomic benchmarks (TEB and GUE) illustrate that ILNN achieves state-of-the-art performance and computational cost among hyperbolic models and consistently surpasses strong Euclidean baselines.


Poster
P3-#626
Boolean Satisfiability via Imitation Learning

Zewei Zhang ⋅ Huan Liu ⋅ YUANHAO YU ⋅ Jun Chen ⋅ Xiangyu Xu

We propose ImitSAT, a branching policy for conflict-driven clause learning (CDCL) solvers based on imitation learning for the Boolean satisfiability problem (SAT). Unlike previous methods that predict instance-level signals to improve CDCL branching indirectly, or rely on reinforcement learning and insufficient CDCL information to enhance branching, ImitSAT learns from an expert KeyTrace that collapses a full run into the sequence of surviving decisions. Replaying a KeyTrace on the same instance is nearly conflict-free, providing dense decision-level supervision and directly reducing propagations, the dominant contributor to wall-clock time. This prefix-conditioned supervision enables ImitSAT to reproduce high-quality branches without exploration, yielding faster convergence, stable training, and seamless integration into CDCL. Extensive experiments demonstrate that ImitSAT reduces propagation counts and runtime, outperforming state-of-the-art learned approaches. We released the source code and trained model at https://github.com/zewei-Zhang/ImitSAT.


Poster
P3-#625
SAES-SVD: Self-Adaptive Suppression of Accumulated and Local Errors for SVD-based LLM Compression

Xing Hu ⋅ Dawei Yang ⋅ Yuan Cheng ⋅ Zhixuan Chen ⋅ Zukang Xu

The rapid growth in the parameter scale of large language models (LLMs) has created a high demand for efficient compression techniques. As a hardware-agnostic and highly compatible technique, low-rank compression has been widely adopted. However, existing methods typically compress each layer independently by minimizing per-layer reconstruction error, overlooking a critical limitation: the reconstruction error propagates and accumulates through the network, which leads to amplified global deviations from the full-precision baseline. To address this, we propose Self-Adaptive Error Suppression SVD (SAES-SVD), an LLM compression framework that jointly optimizes intra-layer reconstruction and inter-layer error compensation. SAES-SVD is composed of two novel components: (1) Cumulative Error-Aware Layer Compression (CEALC), which formulates the compression objective as a combination of local reconstruction and weighted cumulative error compensation. Based on it, we derive a closed-form low-rank solution that relies on second-order activation statistics and explicitly aligns each layer's output with its full-precision counterpart to compensate for accumulated errors. (2) Adaptive Collaborative Error Suppression (ACES), which automatically adjusts the weighting coefficient to enhance the low-rank structure of the compression objective in CEALC. Specifically, the coefficient is optimized to maximize the ratio between the Frobenius norm of the compressed layer's output and that of the compression objective under a fixed rank, thus ensuring that the rank budget is utilized effectively. Extensive experiments across multiple LLM architectures and tasks show that, without fine-tuning or additional tricks, SAES-SVD consistently improves post-compression performance. For example, at a 0.2 compression ratio on LLaMA-7B, existing methods exhibit an average accuracy drop exceeding 0.05, whereas SAES-SVD restricts the drop to only 0.02. These improvements underscore the potential of SAES-SVD to effectively narrow the gap between compressed models and their full-precision counterparts, paving the way for more reliable compression of LLMs.


Poster
P3-#623
Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models

Jinsong Li ⋅ Xiaoyi Dong ⋅ Yuhang Zang ⋅ Yuhang Cao ⋅ Jiaqi Wang ⋅ Dahua Lin

Diffusion Large Language Models (DLLMs) are emerging as a powerful alternative to the dominant Autoregressive Large Language Models, offering efficient parallel generation and capable global context modeling. However, the practical application of DLLMs is hindered by a critical architectural constraint: the need for a statically predefined generation length. This static length allocation leads to a problematic trade-off: insufficient lengths cripple performance on complex tasks, while excessive lengths incur significant computational overhead and sometimes result in performance degradation. While the inference framework is rigid, we observe that the model itself possesses internal signals that correlate with the optimal response length for a given task. To bridge this gap, we leverage these latent signals and introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion for Diffusion Large Language Models. DAEDAL operates in two phases: 1) Before the denoising process, DAEDAL starts from a short initial length and iteratively expands it to a coarse task-appropriate length, guided by a sequence completion metric. 2) During the denoising process, DAEDAL dynamically intervenes by pinpointing and expanding insufficient generation regions through mask token insertion, ensuring the final output is fully developed. Extensive experiments on DLLMs demonstrate that DAEDAL achieves performance comparable, and in some cases superior, to meticulously tuned fixed-length baselines, while simultaneously enhancing computational efficiency by achieving a higher effective token ratio. By resolving the static length constraint, DAEDAL unlocks new potential for DLLMs, bridging a critical gap with their Autoregressive counterparts and paving the way for more efficient and capable generation.


Poster
P3-#622
(U)NFV: (Un)Supervised Neural Finite Volume Methods for Solving Hyperbolic PDEs

Nathan Lichtlé ⋅ Alexi Canesse ⋅ Zhe Fu ⋅ HOSSEIN MATIN ⋅ Maria Delle Monache ⋅ Alexandre M Bayen

We introduce (U)NFV, a modular neural network architecture that generalizes classical finite volume (FV) methods for solving hyperbolic conservation laws. Hyperbolic partial differential equations (PDEs) are challenging to solve, particularly conservation laws whose physically relevant solutions contain shocks and discontinuities. FV methods are widely used for their mathematical properties: convergence to entropy solutions, flow conservation, or total variation diminishing, but often lack accuracy and flexibility in complex settings. Neural Finite Volume addresses these limitations by learning update rules over extended spatial and temporal stencils while preserving conservation structure. It supports both supervised training on solution data (NFV) and unsupervised training via weak-form residual loss (UNFV). Applied to first-order conservation laws, (U)NFV achieves up to 10x lower error than Godunov's method, outperforms ENO/WENO, and rivals discontinuous Galerkin solvers with lower implementation burden. On traffic modeling problems, both from PDEs and from experimental highway data, (U)NFV captures nonlinear wave dynamics with significantly higher fidelity and scalability than traditional FV approaches.


Poster
P3-#209
IceCache: Memory-Efficient KV-cache Management for Long-Sequence LLMs

Yuzhen Mao ⋅ Qitong Wang ⋅ Martin Ester ⋅ Ke Li

Key-Value (KV) cache plays a crucial role in accelerating inference in large language models (LLMs) by storing intermediate attention states and avoiding redundant computation during autoregressive generation. However, its memory footprint scales linearly with sequence length, often leading to severe memory bottlenecks on resource-constrained hardware. Prior work has explored offloading KV-cache to the CPU while retaining only a subset on the GPU, but these approaches often rely on imprecise token selection and suffer performance degradation in long-generation tasks such as chain-of-thought reasoning. In this paper, we propose a novel KV-cache management strategy, IceCache, which integrates semantic token clustering with PagedAttention. By organizing semantically related tokens into contiguous memory regions managed by a hierarchical, dynamically updatable data structure, our method enables more efficient token selection and better utilization of memory bandwidth during CPU–GPU transfers. Experimental results on LongBench show that, with a 256-token budget, IceCache maintains 99% of the original accuracy achieved by the full KV-cache model. Moreover, compared to other offloading-based methods, IceCache attains competitive or even superior latency and accuracy while using only 25% of the KV-cache token budget, demonstrating its effectiveness in long-sequence scenarios. The code is available on our project website at https://yuzhenmao.github.io/IceCache/.


Poster
P3-#621
Dataset Color Quantization: A Training-Oriented Framework for Dataset-Level Compression

YU CHENYUE ⋅ Lingao Xiao ⋅ Jinhong Deng ⋅ Ivor Tsang ⋅ Yang He

Large-scale image datasets are fundamental to deep learning, but their high storage demands pose challenges for deployment in resource-constrained environments. While existing approaches reduce dataset size by discarding samples, they often ignore the significant redundancy within each image, particularly in the color space. To address this, we propose Dataset Color Quantization (DCQ), a unified framework that compresses visual datasets by reducing color-space redundancy while preserving information crucial for model training. DCQ achieves this by enforcing consistent palette representations across similar images, selectively retaining semantically important colors guided by model perception, and maintaining structural details necessary for effective feature learning. Extensive experiments across CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that DCQ significantly improves training performance under aggressive compression, offering a scalable and robust solution for dataset-level storage reduction.


Poster
P3-#620
MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

Xiaodong Chen ⋅ Mingming Ha ⋅ Zhenzhong Lan ⋅ Jing Zhang ⋅ Jianguo Li

The Mixture-of-Experts (MoE) architecture has become a predominant paradigm for scaling large language models (LLMs). Despite offering strong performance and computational efficiency, large MoE-based LLMs like DeepSeek-V3-0324 and Kimi-K2-Instruct present serious challenges due to substantial memory requirements in deployment. While recent works have explored MoE compression to address this issue, existing methods often suffer from considerable accuracy drops (e.g., 7-14% in relative terms) even at modest compression rates. This paper introduces a novel Mixture-of-Basis-Experts (MoBE) method that achieves model compression while incurring minimal accuracy drops. Specifically, each up/gate matrix in an expert is decomposed via a rank decomposition as $W = AB$, where matrix $A$ is unique to each expert. The relatively larger matrix $B$ is further reparameterized as a linear combination of basis matrices $\{B_i\}$ shared across all experts within a given MoE layer. The factorization is learned by minimizing the reconstruction error relative to the original weight matrices. Experiments demonstrate that MoBE achieves notably lower accuracy drops compared to prior works. For instance, MoBE can reduce the parameter counts of Qwen3-235B-A22B-2507, DeepSeek-V3-0324 (671B) and Kimi-K2-Instruct (1T) by 24%-30% with only a 1%-2% accuracy drop (about a 2% drop when measured in relative terms).
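
To make the parameter saving of a shared-basis factorization concrete, the sketch below counts parameters for one MoE layer and forms one expert's $B$ as a linear combination of shared bases. All sizes, the function names `mobe_param_counts` and `combine_bases`, and the assumption of one scalar mixing coefficient per (expert, basis) pair are hypothetical choices for illustration. The abstract only states that $B$ is a linear combination of shared basis matrices, so details of the actual method may differ.

```python
def mobe_param_counts(num_experts, d, p, r, num_bases):
    """Parameter counts for the (d x p) matrices of one MoE layer, before and after factorization."""
    dense = num_experts * d * p                # one full matrix W per expert
    per_expert_a = num_experts * d * r         # A: (d x r), unique to each expert
    shared_bases = num_bases * r * p           # B_i: (r x p), shared within the layer
    mixing_coeffs = num_experts * num_bases    # assumed: one scalar per (expert, basis)
    return dense, per_expert_a + shared_bases + mixing_coeffs

def combine_bases(coeffs, bases):
    """Return sum_i coeffs[i] * bases[i] for matrices stored as lists of rows."""
    rows, cols = len(bases[0]), len(bases[0][0])
    return [[sum(c * basis[j][k] for c, basis in zip(coeffs, bases)) for k in range(cols)]
            for j in range(rows)]

# Hypothetical toy sizes: 8 experts, W of shape 4 x 16, inner dimension 4, 2 shared bases.
dense, factored = mobe_param_counts(num_experts=8, d=4, p=16, r=4, num_bases=2)
print(dense, factored, factored / dense)

# One expert's B built from two tiny 1 x 2 bases.
print(combine_bases([0.5, 2.0], [[[1.0, 0.0]], [[0.0, 1.0]]]))
```

The saving comes from the `shared_bases` term: it is paid once per layer rather than once per expert, so it matters most when the number of bases is much smaller than the number of experts.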


Poster
P3-#619
ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization

Lawrence Liu ⋅ Alexander Liu ⋅ Mengdi Wang ⋅ Tuo Zhao ⋅ Lin Yang

Large language models (LLMs) present significant deployment challenges due to their immense computational and memory requirements. While semi-structured pruning, particularly 2:4 sparsity, offers a path to practical hardware acceleration, existing methods often incur substantial performance degradation. To bridge this gap, we introduce ARMOR (Adaptive Representation with Matrix-factORization), a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block diagonal matrices. These wrappers act as efficient pre- and post-transformation error correctors, offering greater flexibility to preserve model quality compared to conventional 2:4 pruning techniques. The sparse core and block diagonal wrappers are chosen through a block coordinate descent algorithm that minimizes a layer-wise proxy loss. We prove this optimization is guaranteed to converge to a solution with a proxy loss less than or equal to state-of-the-art pruning algorithms. Experiments on Llama (Touvron et al., 2023; Dubey et al., 2024) and Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations, and generalizes to provide improvements for general N:M patterns and unstructured sparsity. ARMOR achieves this superior performance while retaining the inference speedups and substantial memory usage reductions of 2:4 pruning, establishing a more effective trade-off between model compression and task accuracy.
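
For readers unfamiliar with the 2:4 pattern mentioned above, the following sketch shows the constraint itself: in every consecutive group of four weights, at most two are non-zero. It uses simple magnitude-based selection, which is a conventional baseline and not the ARMOR algorithm; the function name `prune_2_of_4` and the example weights are made up for illustration.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude entries in every consecutive group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest magnitudes (ties go to the earlier position).
        keep = set(sorted(range(4), key=lambda i: -abs(group[i]))[:2])
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned

weights = [0.9, -0.1, 0.05, -1.2, 0.3, 0.31, -0.02, 0.0]  # hypothetical weight row
sparse = prune_2_of_4(weights)
print(sparse)
```

Because the positions of the surviving entries are constrained per group of four, the pattern can be stored compactly and accelerated by hardware that supports it, which is the motivation for keeping a 2:4 sparse core.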


Poster
P3-#618
Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

Song Bian ⋅ Tao Yu ⋅ Shivaram Venkataraman ⋅ Youngsuk Park

Scaling the number of parameters and the size of training data has proven to be an effective strategy for improving large language model (LLM) performance. Yet, as these models grow increasingly powerful and widely deployed, the cost of inference has become a pressing concern. Despite its importance, the trade-off between model accuracy and inference efficiency remains underexplored. In this work, we examine how key architectural factors, namely hidden size, the allocation of parameters between MLP and attention (mlp-to-attention ratio), and grouped-query attention (GQA), influence both inference cost and accuracy. We introduce a conditional scaling law that augments the Chinchilla framework with architectural information, along with a search framework for identifying architectures that are simultaneously inference-efficient and accurate. To validate our approach, we train more than 200 models spanning 80M to 3B parameters and 8B to 100B training tokens, and fit the proposed conditional scaling law. Our results show that the conditional scaling law reliably predicts optimal architectural choices and that the resulting models outperform existing open-source baselines. Under the same training budget, optimized architectures achieve up to 2.1% higher accuracy and 42% greater inference throughput compared to LLaMA-3.2.


Poster
P3-#616
FreqKV: Key-Value Compression in Frequency Domain for Context Window Extension

Jushi Kai ⋅ Yixuan Wang ⋅ Boyi Zeng ⋅ Haoli Bai ⋅ Bo Jiang ⋅ Ziwei He ⋅ Zhouhan Lin

Existing key-value (KV) cache compression methods for large language models (LLMs) often rely on token eviction, which risks losing critical local information in both long prefilling and decoding scenarios. When extrapolating beyond the pretrained context length, their performance degrades sharply on long-context benchmarks. Motivated by the observation in the frequency domain that the context information is concentrated in the low-frequency components, we propose FreqKV, a parameter-free and architecture-agnostic approach. It iteratively compresses the increasing KV cache in the frequency domain, allowing models to process lengthy contexts efficiently. With minimal training at 8K length, FreqKV extends the context window of LLaMA-2-7B up to 256K tokens while maintaining stable perplexity. Extensive experiments on both prefilling and decoding stages demonstrate that FreqKV enables robust context window extension and consistently outperforms existing KV cache compression methods, highlighting its effectiveness for both understanding and generation in long contexts.
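The idea of shortening a sequence by keeping only its low-frequency content can be illustrated on a single channel. The sketch below is an illustration under stated assumptions, not the FreqKV implementation: it assumes an orthonormal DCT along the sequence axis and a `sqrt(keep / n)` rescaling, and the names `dct`, `idct`, and `compress_low_freq` are hypothetical.

```python
import math
from typing import List


def dct(x: List[float]) -> List[float]:
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n)
                               for t in range(n)))
    return out


def idct(coeffs: List[float]) -> List[float]:
    """Inverse of dct (orthonormal DCT-III)."""
    n = len(coeffs)
    return [sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * coeffs[k]
                * math.cos(math.pi * (t + 0.5) * k / n) for k in range(n))
            for t in range(n)]


def compress_low_freq(x: List[float], keep: int) -> List[float]:
    """Shorten x to `keep` entries by retaining its lowest-frequency DCT terms."""
    n = len(x)
    low = dct(x)[:keep]
    gain = math.sqrt(keep / n)  # keeps the mean level unchanged after shortening
    return [gain * v for v in idct(low)]


# One channel of a cached key over 8 positions, compressed to 4 entries.
short = compress_low_freq([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], keep=4)
print(len(short))  # 4
```

In a real cache the same transform would be applied to every key and value channel; that step is omitted here.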


Poster
P3-#615
On learning linear dynamical systems in context with attention layers

Maria-Luiza Vlǎdǎrean ⋅ Xuhui Zhang ⋅ Suvrit Sra

This paper studies the expressive power of linear attention layers for in-context learning (ICL) of linear dynamical systems (LDS). We consider training on sequences of inexact observations produced by noise-corrupted LDSs, with all perturbations being Gaussian. Importantly, this non-i.i.d. data setting is a significant step towards modeling real-world scenarios. We provide the optimal weight construction for a single linear-attention layer and show its equivalence to one step of Gradient Descent relative to an autoregression objective of window size one. Guided by experiments, we uncover a connection to a generalization of the Preconditioned Conjugate Gradient method for larger window sizes. We back our findings with numerical evidence. These results add to the existing understanding of transformers’ expressivity as in-context learners and offer plausible hypotheses for recent observations that place their performance on par with that of the Kalman Filter — the optimal model-dependent learner for this setting.


Poster
P3-#614
Identifying and Evaluating Inactive Heads in Pretrained LLMs

Pedro Sandoval-Segura ⋅ Xijun Wang ⋅ Ashwinee Panda ⋅ Micah Goldblum ⋅ Ronen Basri ⋅ Tom Goldstein ⋅ David Jacobs

Attention is foundational to large language models (LLMs), enabling different heads to have diverse focus on relevant input tokens. However, learned behaviors like attention sinks, where the first token receives the most attention despite limited semantic importance, suggest some heads may be inactive, and point to a significant source of computational redundancy. To analyze this phenomenon, we evaluate 12 score functions that measure different ways a head can be inactive. Thresholding these scores allows us to analyze different sets of potentially inactive attention heads. We evaluate whether identified heads are inactive through model interventions, finding that more than 12% of attention heads are inactive on average, and can be ablated in specific contexts while maintaining MMLU accuracy to within 1% of the pretrained LLM. Across 3 model families, our score functions that measure the average norm of a head's output consistently identify inactive heads that would not have been found by score functions that rely solely on attention weights. We establish that relying on a score function that measures a first token attention sink would underestimate the prevalence of inactive heads, failing to identify more than 7% of inactive heads on average. We also show how measuring score distributions can provide insights into attention behavior. For instance, we find evidence that finetuning causes little to no change in attention behavior, and that even within the same model family, large model scales present different attention behaviors.


Poster
P3-#613
Understanding Transformers for Time Series: Rank Structure, Flow-of-ranks, and Compressibility

Annan Yu ⋅ Danielle Maddix ⋅ Boran Han ⋅ Xiyuan Zhang ⋅ Abdul Fatir Ansari ⋅ Oleksandr Shchur ⋅ Christos Faloutsos ⋅ Andrew Gordon Wilson ⋅ Michael W Mahoney ⋅ Bernie Wang

Transformers are widely used across data modalities, and yet the principles distilled from text models often transfer imperfectly. In this paper, we analyze Transformers through the lens of rank structure. Our focus is on the time series setting, where the structural properties of the data differ markedly from those of text or vision. Time-series embeddings, unlike text or vision, exhibit sharply decaying singular spectra: small patch sizes and smooth continuous mappings concentrate the data into low-rank subspaces. From this, we prove that the associated $Q/K/V$ projections admit accurate low-rank approximations, and that attention layers become compressible in proportion to the decay of the embedding spectrum. We introduce the concept of *flow-of-ranks*, a mechanism by which nonlinear mixing across depth inflates the rank, explaining why early layers are most amenable to compression and why rank schedules should grow with depth. Guided by these results, we compress Chronos, a large time series foundation model, achieving a reduction of 65% in inference time and 81% in memory without loss of accuracy. These findings provide principled guidance for allocating width, depth, and heads in time series foundation models, and for exploiting their inherent compressibility. Our code is available at https://github.com/amazon-science/tsfm-compression.


Poster
P3-#612
A State-Transition Framework for Efficient LLM Reasoning

Liang Zhang ⋅ Yu Zhao ⋅ Longyue Wang ⋅ Tianqi Shi ⋅ Weihua Luo ⋅ Kaifu Zhang ⋅ Jinsong Su

While Long Chain-of-Thought (CoT) reasoning significantly improves the performance of Large Language Models (LLMs) on complex reasoning tasks, the substantial computational and memory costs of generating long CoT sequences limit their efficiency and practicality. Existing studies usually enhance the reasoning efficiency of LLMs by compressing CoT sequences. However, this approach conflicts with test-time scaling, limiting the reasoning capacity of LLMs. In this paper, we propose an efficient reasoning framework that models the reasoning process of LLMs as a state-transition process. Specifically, we first apply a linear attention mechanism to estimate the LLM’s reasoning state, which records the historical reasoning information from previous reasoning steps. Then, based on the query prompt and the reasoning state, the LLM can efficiently perform the current reasoning step and update the state. With the linear attention, each token in the current reasoning step can directly retrieve relevant historical reasoning information from the reasoning state, without explicitly attending to tokens in previous reasoning steps. In this way, the computational complexity of attention is reduced from quadratic to linear, significantly improving the reasoning efficiency of LLMs. In addition, we propose a state-based reasoning strategy to mitigate the over-thinking issue caused by noisy reasoning steps. Extensive experiments across multiple datasets and model sizes demonstrate that our framework not only improves the reasoning efficiency of LLMs but also enhances their reasoning performance.
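How a fixed-size state can stand in for past tokens is easiest to see in generic linear attention. The sketch below is unnormalized, uses an identity feature map, and is not the paper's state estimator. The state is a d_k x d_v matrix that accumulates key-value outer products, and a query reads from it with one matrix-vector product. The names `update_state` and `read_state` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def update_state(state: Matrix, key: List[float], value: List[float]) -> Matrix:
    """Return state + outer(key, value); cost does not depend on history length."""
    return [[s + k * v for s, v in zip(row, value)]
            for row, k in zip(state, key)]


def read_state(state: Matrix, query: List[float]) -> List[float]:
    """Return query^T @ state, i.e. the sum over past tokens of (q . k_t) * v_t."""
    d_v = len(state[0])
    return [sum(q * row[j] for q, row in zip(query, state)) for j in range(d_v)]


state: Matrix = [[0.0, 0.0], [0.0, 0.0]]
state = update_state(state, key=[1.0, 0.0], value=[2.0, 3.0])
state = update_state(state, key=[0.0, 1.0], value=[5.0, 7.0])
print(read_state(state, [1.0, 0.0]))  # [2.0, 3.0]
print(read_state(state, [1.0, 1.0]))  # [7.0, 10.0]
```

Each new token touches only the fixed-size state, so the per-token cost stays constant and the total cost grows linearly with sequence length.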


Poster
P3-#611
Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis

Qingyue Yang ⋅ Jie Wang ⋅ Xing Li ⋅ Yinqi Bai ⋅ Tong Xialiang ⋅ Huiling Zhen ⋅ Jianye Hao ⋅ Mingxuan Yuan ⋅ Bin Li

Attention patterns play a crucial role in both training and inference of large language models (LLMs). Prior works have identified individual patterns such as retrieval heads, sink heads, and diagonal traces, yet these observations remain fragmented and lack a unifying explanation. To bridge this gap, we introduce Temporal Attention Pattern Predictability Analysis (TAPPA), a unifying framework that explains diverse attention patterns by analyzing their underlying mathematical formulations from a temporally continuous perspective. TAPPA both deepens the understanding of attention behavior and guides inference acceleration approaches. Specifically, TAPPA characterizes attention patterns as predictable patterns with clear regularities and unpredictable patterns that appear effectively random. Our analysis further reveals that this distinction can be explained by the degree of query self-similarity along the temporal dimension. Focusing on the predictable patterns, we further provide a detailed mathematical analysis of three representative cases through the joint effect of queries, keys, and Rotary Positional Embeddings (RoPE). We validate TAPPA by applying its insights to KV cache compression and LLM pruning tasks. Across these tasks, a simple metric motivated by TAPPA consistently improves performance over baseline methods. The code is available at https://github.com/MIRALab-USTC/LLM-TAPPA.


Poster
P3-#610
Scaling Attention via Feature Sparsity

Yan Xie ⋅ Tiansheng Wen ⋅ Tang Da Huang ⋅ Bo Chen ⋅ Chenyu You ⋅ Stefanie Jegelka ⋅ Yifei Wang

Scaling Transformers to ultra-long contexts is bottlenecked by the $O(n^2 d)$ cost of self-attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token-level sparsity, but these approaches consistently degrade accuracy. In this paper, we instead explore an orthogonal axis: feature sparsity. We propose Sparse Feature Attention (SFA), where queries and keys are represented as $k$-sparse codes that preserve high-dimensional expressivity while reducing the cost of attention from $\Theta(n^2 d)$ to $\Theta(n^2 k^2/d)$. To make this efficient at scale, we introduce FlashSFA, an IO-aware kernel that extends FlashAttention to operate directly on sparse overlaps without materializing dense score matrices. Across GPT-2 and Qwen3 pretraining, SFA matches dense baselines while improving speed by up to $2.5\times$ and reducing FLOPs and KV-cache by nearly 50%. On synthetic and downstream benchmarks, SFA preserves retrieval accuracy and robustness at long contexts, outperforming short-embedding baselines that collapse feature diversity. These results establish feature-level sparsity as a complementary and underexplored axis for efficient attention, enabling Transformers to scale to orders-of-magnitude longer contexts with minimal quality loss.
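The $k^2/d$ factor has a simple reading: two random $k$-element supports drawn from $d$ features share $k^2/d$ indices on average, so a query-key score only needs that many multiplications. The sketch below illustrates the idea on one query and one key. It assumes top-$k$ magnitude selection as the sparsifier, which the abstract does not specify, and the names `top_k_code` and `sparse_score` are hypothetical.

```python
from typing import Dict, List


def top_k_code(vec: List[float], k: int) -> Dict[int, float]:
    """Keep the k largest-magnitude features of vec as {index: value}."""
    top = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    return {i: vec[i] for i in top}


def sparse_score(query: Dict[int, float], key: Dict[int, float]) -> float:
    """Dot product that only touches features active in both codes."""
    return sum(v * key[i] for i, v in query.items() if i in key)


q = top_k_code([0.1, 2.0, -0.2, 1.5], k=2)   # active features: 1 and 3
kv = top_k_code([3.0, 0.0, 0.1, -1.0], k=2)  # active features: 0 and 3
print(sparse_score(q, kv))  # -1.5
```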


Poster
P3-#609
From Collapse to Control: Understanding and Extending Context Length in Emerging Hybrid Models via Universal Position Interpolation

Haochen Shen ⋅ Davis Wertheimer ⋅ Zheng Wang ⋅ Garrett Goon ⋅ Derrick Liu ⋅ Naigang Wang ⋅ Mudhakar Srivatsa ⋅ Raghu Ganti ⋅ Minjia Zhang

Hybrid Mamba-Transformer models have emerged as promising alternatives to pure Transformers, offering efficiency and competitive performance. However, they struggle to generalize beyond their training context windows, collapsing on long-context tasks. We provide the first systematic analysis of this failure, showing that it arises from uncontrolled state growth and uneven receptive field contributions across the hybrid architecture. Guided by this understanding, we introduce Universal Position Interpolation (UPI), a closed-form, training-free scaling method that unifies Mamba's cumulative decay with Transformer rotary frequency scaling. UPI selectively stabilizes unstable Mamba dynamics while rescaling Transformer encodings, controlling state growth and enabling reliable long-context generalization, with only a few auxiliary forward passes. Evaluation shows that UPI extends multiple state-of-the-art hybrid and pure Mamba models from 4K to up to 64K tokens on PG-19 perplexity, LongBench and RULER benchmarks, without sacrificing short-context accuracy. These findings establish the first principled bridge between Transformers and state-space models and open a new direction for training-free context extension methods for emerging hybrid models.


Poster
P3-#608
Mamba-3: Improved Sequence Modeling using State Space Principles

Aakash Sunil Lahoti ⋅ Kevin Li ⋅ Berlin Chen ⋅ Caitlin Wang ⋅ Aviv Bick ⋅ Zico Kolter ⋅ Tri Dao ⋅ Albert Gu

Scaling inference-time compute has emerged as an important driver of LLM performance, making inference efficiency a central focus of model design alongside model quality. While current Transformer models deliver strong quality, their quadratic compute and linear memory requirements make inference expensive. This has spurred the development of sub-quadratic models with reduced compute and constant memory requirements. However, many recent linear models trade off model quality and capability for algorithmic efficiency, failing on tasks such as state tracking. Moreover, their theoretically linear inference remains hardware-inefficient in practice. Guided by an inference-first perspective, we introduce three core methodological improvements inspired by the state space model (SSM) viewpoint of linear models. We combine: (1) a more expressive recurrence derived from SSM discretization, (2) a complex-valued state update rule enabling richer state tracking, and (3) a multi-input, multi-output (MIMO) formulation that improves model performance without increasing decode latency. Together with architectural refinements, Mamba-3 achieves significant gains across retrieval, state-tracking, and downstream language modeling tasks. At the 1.5B scale, Mamba-3 improves average downstream accuracy by 0.6 percentage points compared to the next best model (Gated DeltaNet), with the MIMO variant further improving accuracy by an additional 1.2 points, for a total gain of 1.8 points. Across state-size experiments, Mamba-3 achieves comparable perplexity to Mamba-2 despite using half the state size. These results demonstrate that Mamba-3 advances the performance–efficiency frontier.


Poster
P3-#607
Smooth Reading: Bridging the Gap of Recurrent LLM to Self-Attention LLM on Long-Context Understanding

Kai Liu ⋅ Zhan Su ⋅ Peijie Dong ⋅ Fengran Mo ⋅ Jianfei Gao ⋅ Shaoting Zhang ⋅ Kai Chen

Recurrent large language models (Recurrent LLMs) offer linear computational complexity as efficient alternatives to quadratic self-attention-based LLMs (Self-Attention LLMs). However, Recurrent LLMs underperform on long-context tasks due to limited fixed-size memory. Previous research focused on architectural innovations to enhance memory capacity, but failed to match Self-Attention LLM performance. We argue this limitation stems from processing entire contexts at once being ill-suited for Recurrent LLMs. We propose Smooth Reading, a co-design of recurrent architecture and inference method. It introduces an end-to-end multi-round inference method that processes context incrementally and iteratively summarizes information, reducing memory demands. Methodologically, we reveal that architecture-inference interactions play an important role in performance, efficiency and scalability, shedding light on future Recurrent LLM design. In addition, our method substantially bridges the performance gap between Recurrent and Self-Attention LLMs on long-context tasks while preserving efficiency advantages. Smooth Reading boosts SWA-3B-4k from 5.68% lower to 3.61% higher performance than Self-Attention LLMs on LongBench, while maintaining 2.5× faster training and 2× faster inference at 64k context.


Poster
P3-#606
FASA: Frequency-Aware Sparse Attention

Yifei Wang ⋅ Yueqi Wang ⋅ Zhenrui Yue ⋅ Huimin Zeng ⋅ Yong Wang ⋅ Ismini Lourentzou ⋅ Zhengzhong Tu ⋅ Xiangxiang Chu ⋅ Julian McAuley

The deployment of Large Language Models (LLMs) faces a critical bottleneck when handling lengthy inputs: the prohibitive memory footprint of the Key Value (KV) cache. To address this bottleneck, the token pruning paradigm leverages attention sparsity to selectively retain a small, critical subset of tokens. However, existing approaches fall short, with static methods risking irreversible information loss and dynamic strategies employing heuristics that insufficiently capture the query-dependent nature of token importance. We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. FASA stems from a novel insight into RoPE: the discovery of functional sparsity at the frequency-chunk (FC) level. Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head. This provides a robust and computationally free proxy for identifying salient tokens. Building on this insight, FASA first identifies a critical set of tokens using dominant FCs, and then performs focused attention computation solely on this pruned subset. Across a spectrum of long-context tasks, from sequence modeling to complex CoT reasoning, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy, demonstrating remarkable robustness even under constrained budgets. Notably, on LongBench-V1, FASA reaches nearly 100% of full-KV performance when only keeping 256 tokens, and achieves a 2.56× speedup using just 18.9% of the cache on AIME24.


Poster
P3-#605
NRGPT: An Energy-based Alternative for GPT

Nima Dehmamy ⋅ Benjamin Hoover ⋅ Bishwajit Saha ⋅ Leo Kozachkov ⋅ Jean-Jacques Slotine ⋅ Dmitry Krotov

Generative Pre-trained Transformer (GPT) architectures are the most popular design for language modeling. Energy-based modeling (EBM) is a different paradigm that views inference as a dynamical process operating on an energy landscape. We propose a minimal modification of the GPT setting to unify it with the EBM framework. The inference step of our model, which we call eNeRgy-GPT (NRGPT), is conceptualized as an exploration of the tokens on the energy landscape. We prove, and verify empirically, that under certain circumstances this exploration becomes gradient descent, although such circumstances do not necessarily lead to the best-performing models. We demonstrate that our model performs well for simple language (Shakespeare dataset), algebraic ListOPS tasks, and richer settings such as OpenWebText language modeling. We also observe that our models may be more resistant to overfitting, doing so only during very long training.


Poster
P3-#604
Probing Rotary Position Embeddings through Frequency Entropy

Yui Oka ⋅ Kentaro Hanafusa ⋅ Taku Hasegawa ⋅ Kyosuke Nishida ⋅ Kuniko Saito

Rotary Position Embeddings (RoPE) are widely used in Transformers to encode positional information in token representations, yet the internal frequency structure of RoPE remains poorly understood. Previous studies have reported conflicting findings on the roles of high- and low-frequency dimensions, offering empirical observations but no unifying explanation. In this paper, we present a systematic framework that bridges these disparate results. We introduce Frequency Entropy (FE), a metric that quantifies the effective utilization of each RoPE frequency dimension, and we provide an analysis of how RoPE’s sinusoidal components contribute to model representations on a per-dimension basis. Based on an analysis of the Llama-4 model, which incorporates both RoPE and NoPE layers, we find that the periodicity captured by FE appears in RoPE layers but not in NoPE layers. Furthermore, FE identifies dimensions in which energy concentrates under RoPE. These characteristics are observed across the spectrum rather than being confined to specific dimensions. Moreover, attenuating extreme-entropy dimensions at inference yields downstream accuracy that is statistically indistinguishable from the baseline, with modest perplexity improvements on average, suggesting that such dimensions are often redundant. Overall, FE provides a simple, general diagnostic for RoPE with implications for analysis and design.


Journal Track Poster
P3-#603
Encoder-only Next Token Prediction

Ethan Ewer ⋅ Daewon Chae ⋅ Thomas Zeng ⋅ Jinkyu Kim ⋅ Kangwook Lee

Next-token prediction is conventionally done using decoder-only Transformers with causal attention, as this approach allows for efficient reuse of keys and values. If we were not compute-limited, should we still use decoder-only Transformers? In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP in settings with unbounded compute. We introduce the $\operatorname{Count3}$ task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate the superior performance of ENTP across representative tasks where next-token-prediction-based Transformers can be evaluated, including addition, in-context learning, and language modeling.


Poster
P3-#602
World-In-World: World Models in a Closed-Loop World

Jiahan Zhang ⋅ Muqing Jiang ⋅ Nanru Dai ⋅ Taiming Lu ⋅ Arda Uzunoglu ⋅ Shunchi Zhang ⋅ Yana Wei ⋅ Jiahao Wang ⋅ Vishal Patel ⋅ Paul Liang ⋅ Daniel Khashabi ⋅ Cheng Peng ⋅ Rama Chellappa ⋅ Tianmin Shu ⋅ Alan Yuille ⋅ Yilun Du ⋅ Jieneng Chen

Generative world models (WMs) can now simulate worlds with striking visual realism, which naturally raises the question of whether they can endow embodied agents with predictive perception for decision making. Progress on this question has been limited by fragmented evaluation: most existing benchmarks adopt open-loop protocols that emphasize visual quality in isolation, leaving the core issue of embodied utility unresolved, i.e., do WMs actually help agents succeed at embodied tasks? To address this gap, we introduce World-In-World, the first open platform that benchmarks WMs in a closed-loop setting that mirrors real agent-environment interactions. World-In-World provides a unified online planning strategy and a standardized action API, enabling heterogeneous WMs for decision making. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the common focus on visual quality; we also present the first data scaling law for world models in embodied settings. Our study uncovers three surprises: (1) visual quality alone does not guarantee task success—controllability matters more; (2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generators; and (3) allocating more inference-time compute allows WMs to substantially improve closed-loop performance. By centering evaluation on closed-loop outcomes, World-In-World establishes a new benchmark for the systematic assessment of WMs.


Poster
P3-#601
Equivariant Splitting: Self-supervised learning from incomplete data

Victor Sechaud ⋅ Jérémy Scanvic ⋅ Quentin Barthélemy ⋅ Patrice Abry ⋅ Julián Tachella

Self-supervised learning for inverse problems makes it possible to train a reconstruction network from noisy and/or incomplete data alone. These methods have the potential to enable learning-based solutions when obtaining ground-truth references for training is expensive or even impossible. In this paper, we propose a new self-supervised learning strategy devised for the challenging setting where measurements are observed via a single incomplete observation model. We introduce a new definition of equivariance in the context of reconstruction networks, and show that the combination of self-supervised splitting losses and equivariant reconstruction networks results in unbiased estimates of the supervised loss. Through a series of experiments on image inpainting, accelerated magnetic resonance imaging, sparse-view computed tomography, and compressive sensing, we demonstrate that the proposed loss achieves state-of-the-art performance in settings with highly rank-deficient forward models.


Poster
P3-#701
REAP the Experts: Why Pruning Prevails for One-Shot MoE compression

Mike Lasby ⋅ Ivan Lazarevich ⋅ Nish Sinnadurai ⋅ Sean Lie ⋅ Yani Ioannou ⋅ Vithursan Thangarasa

Sparsely-activated Mixture-of-Experts (SMoE) models offer efficient pre-training and low latency but their large parameter counts create significant memory overhead, motivating research into expert compression. Contrary to recent findings favouring expert merging on discriminative benchmarks, we find that expert pruning is a superior strategy for generative tasks. We demonstrate that existing merging techniques introduce an irreducible error due to the loss of fine-grained routing control over experts. Leveraging this insight, we propose Router-weighted Expert Activation Pruning (REAP), a novel pruning criterion that considers both router gate-values and expert activation norms to minimize the reconstruction error bound. Across a diverse set of SMoE models ranging from 20B to 1T parameters, REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation tasks with Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.
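A criterion that combines router gate-values with expert activation norms can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code. It assumes the saliency of an expert is the average of gate value times output norm over the tokens routed to it, and that the lowest-scoring experts are removed. The record layout and the names `expert_saliency` and `experts_to_prune` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def expert_saliency(records: Iterable[Tuple[int, float, float]]) -> Dict[int, float]:
    """Average gate_value * activation_norm over the tokens routed to each expert."""
    total: Dict[int, float] = defaultdict(float)
    count: Dict[int, int] = defaultdict(int)
    for expert_id, gate_value, activation_norm in records:
        total[expert_id] += gate_value * activation_norm
        count[expert_id] += 1
    # Experts that never received a token do not appear in the result.
    return {e: total[e] / count[e] for e in total}


def experts_to_prune(saliency: Dict[int, float], fraction: float) -> List[int]:
    """Return the lowest-saliency experts, `fraction` of all scored experts."""
    num_pruned = int(len(saliency) * fraction)
    return sorted(saliency, key=saliency.get)[:num_pruned]


# (expert_id, router gate value, L2 norm of the expert output) per routed token.
calibration = [(0, 0.5, 4.0), (0, 0.5, 2.0), (1, 0.25, 2.0),
               (2, 0.5, 8.0), (3, 0.125, 8.0)]
scores = expert_saliency(calibration)
print(experts_to_prune(scores, fraction=0.5))  # [1, 3]
```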


Poster
P3-#702
Hierarchical Multi-Scale Molecular Conformer Generation

Jiapeng Hu ⋅ Weizhi Gao ⋅ Zhichao Hou ⋅ Xiaorui Liu

Molecular conformer generation is a fundamental task for drug discovery and material design. Although deep generative models have progressed in this area, existing methods often overlook the hierarchical structural organization inherent to molecules, leading to poor-quality generated conformers. To address this challenge, we demonstrate that capturing the spatial arrangement of key substructures, such as scaffolds, is essential, as they serve as anchors that define the overall molecular distribution. In this paper, we propose a hierarchical multi-scale molecular conformer generation framework (MSGEN), designed to enhance key substructure awareness by leveraging spatially informed guidance. Our framework initiates the generation process from coarse-grained key substructures, progressively refining the conformer by utilizing these coarser-scale structures as conditional guidance for subsequent finer-scale stages. To bridge scale discrepancies between stages, we introduce a molecular upsampling technique that aligns the structural scales, ensuring smooth propagation of geometric guidance. Extensive experiments on standard benchmarks demonstrate that our framework integrates seamlessly with a wide range of existing molecular generative models and consistently generates more stable and chemically plausible molecular conformers.


Poster
P3-#703
Generalized Parallel Scaling with Interdependent Generations

Harry Dong ⋅ David Brandfonbrener ⋅ Eryk Helenowski ⋅ Yun He ⋅ Mrinal Kumar ⋅ Han Fang ⋅ Yuejie Chi ⋅ Karthik Abinav Sankararaman

Parallel LLM inference scaling involves sampling a set of $N>1$ responses for a single input prompt. However, these $N$ parallel responses tend to be generated independently of each other, partitioning compute resources and leaving potentially useful information in one generation untapped by others. This is in contrast to response length scaling, where past computation is used in all future steps. For higher quality responses and response sets, we propose Bridge to generate interdependent responses in parallel by rethinking batched LLM hidden states as holistic tensors rather than independent slices. With only a small amount (2.8%-5.1%) of new parameters, Bridge improves the relative mean accuracy gains from reinforcement learning with verifiable rewards by up to 39% and boosts consistency of correct responses. Trained once, Bridge scales to any generation width, all with greater performance than independent generations, unlocking a more general mode of parallel scaling that effectively leverages information between sequences, compatible with any post-generation aggregation technique.

Recent advancements in Large Language Models (LLMs) have shifted from explicit Chain-of-Thought (CoT) reasoning to more efficient latent reasoning, where intermediate thoughts are represented as vectors rather than text. However, latent reasoning can be brittle on challenging, out-of-distribution tasks where robust reasoning is most critical. To overcome these limitations, we introduce Latent Thought Policy Optimization (LTPO), a parameter-free framework that enhances LLM reasoning entirely at test time, without requiring model parameter updates. LTPO treats intermediate latent "thought" vectors as dynamic parameters that are actively optimized for each problem instance. It employs an online policy gradient method guided by an intrinsic, confidence-based reward signal computed directly from the frozen LLM's own output distributions, eliminating the need for external supervision or expensive text generation during optimization. Extensive experiments on five reasoning benchmarks show that LTPO not only matches or surpasses strong baselines on standard tasks but also demonstrates remarkable robustness where others fail. Most notably, on highly challenging AIME benchmarks where existing latent reasoning baselines collapse to near-zero accuracy, LTPO delivers substantial improvements, showcasing a unique capability for complex reasoning.


Poster
P3-#705
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

Chia-Yu Hung ⋅ Navonil Majumder ⋅ Zhifeng Kong ⋅ Ambuj Mehrish ⋅ Amir Zadeh ⋅ Chuan Li ⋅ Rafael Valle ⋅ Bryan Catanzaro ⋅ Soujanya Poria

We introduce TangoFlux, an efficient Text-to-Audio (TTA) generative model with 515M parameters, capable of generating up to 30 seconds of 44.1 kHz audio in 3.7 seconds on an A40 GPU. A key challenge in aligning TTA models lies in creating preference pairs, as TTA lacks structured mechanisms like verifiable rewards or gold-standard answers available for Large Language Models (LLMs). To address this, we propose CLAP-Ranked Preference Optimization (CRPO), a novel framework that iteratively generates and optimizes preference data to enhance TTA alignment. We show that the audio preference dataset generated using CRPO outperforms the static alternatives. With this framework, TangoFlux achieves state-of-the-art performance across both objective and subjective benchmarks. Model-generated audio samples are available for comparison at https://tangoflux.github.io/.


Poster
P3-#706
Adaptive Moments are Surprisingly Effective for Plug-and-Play Diffusion Sampling

Christian Belardi ⋅ Justin Lovelace ⋅ Kilian Weinberger ⋅ Carla Gomes

Guided diffusion sampling relies on approximating often intractable likelihood scores, which introduces significant noise into the sampling dynamics. We propose using adaptive moment estimation to stabilize these noisy likelihood scores during sampling. Despite its simplicity, our approach achieves state-of-the-art results on image restoration and class-conditional generation tasks, outperforming more complicated methods, which are often computationally more expensive. We provide empirical analysis of our method on both synthetic and real data, demonstrating that mitigating gradient noise through adaptive moments offers an effective way to improve alignment.
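
As a rough illustration of adaptive moment estimation applied to a noisy signal, the sketch below runs the standard Adam-style moment updates over a sequence of scalar values standing in for noisy likelihood-score estimates. The function name `adam_smooth`, the scalar setting, and the hyperparameter defaults are assumptions made for illustration; the abstract does not say how the moments are integrated into the sampler, so this is not the authors' implementation.

```python
import math


def adam_smooth(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return Adam-style normalized updates for a sequence of noisy scalar gradients."""
    m = 0.0  # running first moment (exponential average of gradients)
    v = 0.0  # running second moment (exponential average of squared gradients)
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
        updates.append(m_hat / (math.sqrt(v_hat) + eps))
    return updates


# Hypothetical noisy score estimates collected over six sampling steps.
noisy_scores = [1.0, -0.2, 1.4, 0.7, -0.5, 1.1]
smoothed = adam_smooth(noisy_scores)
```

Because each update divides the averaged gradient by a running estimate of its magnitude, a single outlier has less influence on the step direction than it would when the raw estimate is used directly.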


Poster
P3-#707
Improving Classifier-Free Guidance in Masked Diffusion: Low-Dim Theoretical Insights with High-Dim Impact

Kevin Rojas ⋅ Ye He ⋅ Chieh-Hsin Lai ⋅ Yuhta Takida ⋅ Yuki Mitsufuji ⋅ Molei Tao

Classifier-Free Guidance (CFG) is a widely used technique for conditional generation and improving sample quality in continuous diffusion models, and its extensions to discrete diffusion have recently started to be investigated. In order to improve the algorithms in a principled way, this paper starts by analyzing the exact effect of CFG in the context of a low-dimensional masked diffusion model, with a special emphasis on the guidance schedule. Our analysis shows that high guidance early in sampling (when inputs are heavily masked) harms generation quality, while late-stage guidance has a larger effect. These findings provide a theoretical explanation for empirical observations in recent studies on guidance schedules. The analysis also reveals an imperfection of the current CFG implementations. These implementations can unintentionally cause imbalanced transitions, such as unmasking too rapidly during the early stages of generation, which degrades the quality of the resulting samples. To address this, we draw insight from the analysis and propose a novel classifier-free guidance mechanism. Intuitively, our method smooths the transport between the data distribution and the initial (masked) distribution, which results in improved sample quality. Remarkably, our method is achievable via a simple one-line code change. Experiments on conditional image and text generation empirically confirm the efficacy of our method.


Poster
P3-#118
STAT: Skill-Targeted Adaptive Training

Yinghui He ⋅ Abhishek Panigrahi ⋅ Yong Lin ⋅ Sanjeev Arora

Language models often show little to no improvement (i.e., “saturation”) when trained via vanilla supervised fine-tuning (SFT) on data similar to what they saw in their training set (e.g., MATH). We introduce a new fine-tuning strategy, STAT, to train such a student model by using the metacognition ability of a stronger large language model (LLM) as the teacher. The teacher uses the task dataset to create a list of skills needed for the task, and then labels each data point with its required skills (Didolkar et al., 2024). By monitoring the student’s answers, the teacher creates a Missing-Skill-Profile for the student, tracking how often the student failed to apply each skill in its responses. We use this idea to build a modified training set in one of two ways. In STAT-Sel, the teacher uses an existing set of training examples but adaptively reweights them according to the Missing-Skill-Profile. In STAT-Syn, the teacher synthesizes additional examples involving missing skills. Across extensive experiments on Llama and Qwen models, our methods yield improvements of up to 7.5% on MATH, whereas SFT provides only limited gains. Furthermore, STAT enhances performance on out-of-distribution benchmarks (e.g., AIME24/25 and AMC23) by an average of 4.6%. Crucially, we find that STAT is complementary to RL via GRPO (Shao et al., 2024): after the model is improved using STAT to address skill gaps, GRPO continues to add further gains. We conclude that skill-targeted adaptive training should broadly improve current training pipelines.

We introduce String Seed of Thought (SSoT), a novel prompting method for LLMs that improves Probabilistic Instruction Following (PIF). We define PIF as a task requiring an LLM to select its answer from a predefined set of options, each associated with a specific probability, such that the empirical distribution of the generated answers aligns with the target distribution when prompted multiple times. While LLMs excel at tasks with single, deterministic answers, they often fail at PIF, exhibiting biases problematic for applications requiring non-deterministic behaviors, such as human-behavior simulation, content diversification, and multiplayer games. It also harms the diversity of generated responses, a crucial factor in test-time scaling, by causing the outputs to collapse into a limited set of answers. To address this, we propose SSoT, a simple prompting method that instructs an LLM to first output a random string to generate sufficient entropy. SSoT also instructs the LLM to extract randomness by manipulating this string to derive a final answer, thereby preserving diversity while adhering to specific constraints. We demonstrate that SSoT significantly improves the PIF performance of LLMs, approaching the ideal performance of a pseudo-random number generator. Notably, our experiments on NoveltyBench show SSoT's benefits extend beyond closed-set tasks to open-ended tasks by enhancing response diversity.
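
To make the "extract randomness from a string" step concrete, the sketch below maps a seed string to one option so that integer weights determine the selection proportions. The extraction rule (sum of character codes modulo the total weight) and the name `pick_option` are hypothetical choices made for illustration. In SSoT the language model itself performs the manipulation inside its response, and the abstract does not prescribe a specific rule.

```python
from itertools import accumulate


def pick_option(seed, options, weights):
    """Map a seed string to one option; positive integer weights set the proportions."""
    if not options or len(options) != len(weights):
        raise ValueError("options and weights must be non-empty and the same length")
    total = sum(weights)
    # Hypothetical extraction rule: sum of character codes modulo the total weight.
    r = sum(ord(ch) for ch in seed) % total
    # Walk the cumulative weights until the residue falls inside an option's bucket.
    for option, upper in zip(options, accumulate(weights)):
        if r < upper:
            return option
    raise AssertionError("unreachable when all weights are positive")


# A 75% / 25% target split expressed as integer weights 3 and 1.
choice = pick_option("qzkxw", ["left", "right"], [3, 1])
```

If the seed strings are close to uniformly random, the residue is close to uniform over `range(total)`, so each option is selected with probability roughly proportional to its weight.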


Poster
P3-#709
Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification

Kimia Hamidieh ⋅ Veronika Thost ⋅ Walter Gerych ⋅ Mikhail Yurochkin ⋅ Marzyeh Ghassemi

Large language models (LLMs) often produce confident yet incorrect responses, and uncertainty quantification is one potential solution to more robust usage. Recent works routinely rely on self-consistency to estimate aleatoric uncertainty (AU), yet this proxy collapses when models are overconfident and produce the same incorrect answer across samples. We analyze this regime and show that cross-model semantic disagreement is higher on incorrect answers precisely when AU is low. Motivated by this, we introduce an epistemic uncertainty (EU) term that operates in the black-box access setting: EU uses only generated text from a small, scale-matched ensemble and is computed as the gap between inter-model and intra-model sequence-semantic similarity. We then define total uncertainty (TU) as the sum of AU and EU. In a comprehensive study across five 7-9B instruction-tuned models and ten long-form tasks, TU improves ranking calibration and selective abstention relative to AU, and EU reliably flags confident failures where AU is low. We further characterize when EU is most useful via agreement and complementarity diagnostics.
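
The decomposition described above can be sketched as follows. Several details here are assumptions, not taken from the abstract: the sign convention (EU as mean intra-model similarity minus mean inter-model similarity), the token-overlap `jaccard` function used as a crude stand-in for a semantic similarity model, and all function names. AU is treated as a number supplied by a separate self-consistency estimator.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-overlap similarity; a crude stand-in for a semantic similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def epistemic_gap(samples_by_model):
    """samples_by_model maps a model name to a list of its sampled answers."""
    # Average similarity between answers drawn from the same model.
    intra = mean(
        jaccard(a, b)
        for answers in samples_by_model.values()
        for a, b in combinations(answers, 2)
    )
    # Average similarity between answers drawn from different models.
    inter = mean(
        jaccard(a, b)
        for (_, xs), (_, ys) in combinations(samples_by_model.items(), 2)
        for a in xs
        for b in ys
    )
    return intra - inter  # assumed sign: large when models disagree with each other


def total_uncertainty(au, eu):
    """TU is the sum of AU and EU, as stated in the abstract."""
    return au + eu


samples = {"model_a": ["paris", "paris"], "model_b": ["london", "london"]}
eu = epistemic_gap(samples)
tu = total_uncertainty(0.1, eu)
```

In this toy input each model is perfectly self-consistent, so a self-consistency proxy would report low AU, while the cross-model gap is at its maximum.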


Poster
P3-#710
Exploring the Design Space of Transition Matching

Uriel Singer ⋅ Yaron Lipman

Transition Matching (TM) is an emerging paradigm for generative modeling that generalizes diffusion and flow-matching models as well as continuous-state autoregressive models. TM, similar to previous paradigms, gradually transforms noise samples to data samples; however, it uses a second "internal" generative model to implement the transition steps, making the transitions more expressive compared to diffusion and flow models. To make this paradigm tractable, TM employs a large backbone network and a smaller "head" module to efficiently execute the generative transition step. In this work, we present a large-scale, systematic investigation into the design, training and sampling of the head in TM frameworks, focusing on its time-continuous bidirectional variant. Through comprehensive ablations and experimentation involving training 56 different 1.7B text-to-image models (resulting in 549 unique evaluations), we evaluate the effect of the head module architecture and modeling during training as well as a useful family of stochastic TM samplers. We analyze the impact on generation quality, training, and inference efficiency. We find that TM with an MLP head, trained with a particular time weighting and sampled with a high-frequency sampler, provides the best ranking across all metrics, reaching state-of-the-art among all tested baselines, while a Transformer head with sequence scaling and low-frequency sampling is a runner-up that excels at image aesthetics. Lastly, we believe the experiments presented highlight the design aspects that are likely to provide the most quality and efficiency gains, while at the same time indicating which design choices are not likely to provide further gains.


Poster
P3-#711
Dynamic Weight Grafting: Localizing Finetuned Factual Knowledge in Transformers

Todd Nief ⋅ David Reber ⋅ Sean Richardson ⋅ Ari Holtzman

When an LLM learns a new fact during finetuning (e.g., a new movie release or a newly elected pope), where does this information go? Are entities enriched with relation information immediately, or do models recall information just-in-time before a prediction? Or, are "all of the above" true, with LLMs implementing multiple redundant heuristics? Existing localization approaches (e.g., activation patching) are ill-suited for this analysis because they usually replace parts of the residual stream, thus overriding previous information. To fill this interpretability gap, we propose dynamic weight grafting, an analysis technique that selectively grafts subsets of weights from a finetuned model onto a pretrained model. Using this technique, we show two separate pathways for retrieving finetuned relation information: 1) "enriching" the residual stream with relation information while processing the tokens that correspond to an entity (e.g., "Zendaya" in "Zendaya co-starred with Timothée Chalamet") and 2) "recalling" this information at the final token position before generating a target fact. In some cases, models need information from both of these pathways to correctly generate finetuned facts while, in other cases, either the "enrichment" or "recall" pathway alone is sufficient. We localize the "recall" pathway to model components, finding that "recall" occurs via both task-specific attention mechanisms and an entity-specific extraction step in the feedforward networks of the final layers before prediction. By targeting model components and parameters, as opposed to just activations, we are able to understand the mechanisms by which finetuned knowledge is retrieved during generation.


Poster
P3-#712
The Path of Least Resistance: Guiding LLM Reasoning Trajectories with Prefix Consensus

Ishan Jindal ⋅ Sai Prashanth Akuthota ⋅ Jayant Taneja ⋅ Sachin Sharma

Large language models achieve strong reasoning performance, but inference strategies such as Self-Consistency (SC) are computationally expensive, as they fully expand all reasoning traces. We introduce PoLR (Path of Least Resistance), the first inference-time method to leverage prefix self-consistency for compute-efficient reasoning. PoLR clusters short prefixes of reasoning traces, identifies the dominant cluster, and expands only a subset of promising paths, preserving the accuracy benefits of SC while substantially reducing token usage and latency. Our theoretical analysis, framed via mutual information and entropy, explains why early reasoning steps encode strong signals predictive of final correctness. Empirically, PoLR consistently matches or exceeds SC across GSM8K, Math500, AIME 2024/2025, and GPQA-Diamond, reducing token usage by up to 60% and wall-clock latency by up to 50%. Moreover, PoLR is fully complementary to adaptive inference methods (e.g., Adaptive Consistency, Early-Stopping SC) and can serve as a drop-in pre-filter, making SC substantially more efficient and scalable without requiring model fine-tuning.
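
The prefix-selection step can be sketched as below. The clustering here is a deliberately simple stand-in (grouping prefixes by their first few lowercase tokens), and `select_paths` and `key_tokens` are hypothetical names. The abstract does not specify the clustering method or how many paths from the dominant cluster are expanded.

```python
from collections import Counter


def select_paths(prefixes, key_tokens=3):
    """Return indices of traces whose prefix falls in the most common cluster."""
    if not prefixes:
        return []
    # Stand-in clustering: group prefixes by their first `key_tokens` lowercase tokens.
    keys = [tuple(p.lower().split()[:key_tokens]) for p in prefixes]
    dominant, _ = Counter(keys).most_common(1)[0]
    # Only traces in the dominant cluster are kept for full expansion.
    return [i for i, key in enumerate(keys) if key == dominant]


prefixes = [
    "Let x be the number of apples",
    "let x be unknown for now",
    "First compute the area",
    "Let x be 5",
]
to_expand = select_paths(prefixes)
```

Only the short prefixes are generated for every trace; the remaining tokens are spent on the selected indices, which is where the savings over fully expanding all traces would come from.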


Poster
P3-#713
SparseD: Sparse Attention for Diffusion Language Models

Zeqing Wang ⋅ Gongfan Fang ⋅ Xinyin Ma ⋅ Xingyi Yang ⋅ Xinchao Wang

While diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs), existing open-source DLMs suffer from high inference latency. This bottleneck is mainly due to the attention’s quadratic complexity with respect to context length in computing all query–key pairs. Intuitively, to reduce this complexity, a natural strategy is to restrict attention to sparse patterns that retain only the most relevant connections. Such approaches are well-established in ARs, where attention follows fixed and clearly defined sparse patterns. However, in DLMs, we observe distinct sparsity behaviors: (1) attention patterns vary across heads, (2) attention patterns in each head remain highly similar across denoising steps, and (3) early denoising steps are critical for generation. These findings render sparse attention methods designed for ARs largely incompatible with DLMs, as they fail to capture head-specific structures and risk degrading generation when applied in early denoising steps. To address these challenges, we propose **SparseD**, a novel sparse attention method for DLMs. Leveraging the observations, SparseD only requires pre-computing head-specific sparse patterns one time, and reuses them across all steps. This prevents recomputing sparse patterns at each denoising step. Meanwhile, SparseD uses full attention in the early steps, then switches to sparse attention later to maintain generation quality. Together, these establish SparseD as a practical and efficient solution for deploying DLMs in long-context applications. Experimental results demonstrate that SparseD achieves lossless acceleration, delivering up to $1.50\times$ speedup over FlashAttention at a 64k context length with 1,024 denoising steps. Code is available at https://github.com/INV-WZQ/SparseD.


Poster
P3-#714
Reformulation for Pretraining Data Augmentation

Hao Xintong ⋅ Rui-Jie Zhu ⋅ Ge Zhang ⋅ Ke Shen ⋅ Chenggang Li

Despite the impressive capabilities of large language models across various tasks, their continued scaling is severely hampered not only by data scarcity but also by the performance degradation associated with excessive data repetition during training. To overcome this critical bottleneck, we introduce the Massive Genre-Audience (MGA) reformulation method, a framework designed to augment corpora in a way that supports more effective model performance scaling. Instead of relying on complex, predefined seed systems, MGA systematically reformulates existing corpora into diverse, contextually-rich variations by adaptively generating genre-audience pairs. We present this framework and the resulting 770 billion token MGACorpus, created as a practical instantiation of our methodology. We experimentally validate MGA's core benefits by demonstrating superior scaling properties, in terms of both model size and data budget, against data repetition and upsampling (up to 13B parameters). Furthermore, our comprehensive analysis investigates the role of synthesis principles in generation quality and reveals nuances in evaluating model capabilities using standard loss metrics. Our work shows that a systematic framework like MGA provides a reliable pathway to substantially augment training datasets, effectively alleviating repetition bottlenecks and enabling more efficient scaling of large language models.


Poster
P3-#715
SoFlow: Solution Flow Models for One-Step Generative Modeling

Tianze Luo ⋅ Haotian Yuan ⋅ Zhuang Liu

The multi-step denoising process in diffusion and Flow Matching models causes major efficiency issues, which motivates research on few-step generation. We present Solution Flow Models (SoFlow), a framework for one-step generation from scratch. By analyzing the relationship between the velocity function and the solution function of the velocity ordinary differential equation (ODE), we propose a Flow Matching loss and a solution consistency loss to train our models. The Flow Matching loss allows our models to provide estimated velocity fields for Classifier-Free Guidance (CFG) during training, which improves generation performance. Notably, our consistency loss does not require the calculation of the Jacobian-vector product (JVP), a common requirement in recent works that is not well-optimized in deep learning frameworks like PyTorch. Experimental results indicate that, when trained from scratch using the same Diffusion Transformer (DiT) architecture and an equal number of training epochs, our models achieve better FID-50K scores than MeanFlow models on the ImageNet 256x256 dataset.


Poster
P3-#716
Structurally Human, Semantically Biased: Detecting LLM-Generated References with Embeddings and GNNs

Melika Mobini ⋅ Vincent Holst ⋅ Floriano Tori ⋅ Andres Algaba ⋅ Vincent Ginis

Large language models are increasingly used to curate bibliographies, raising the question: are their reference lists distinguishable from human ones? We build paired citation graphs, ground truth and GPT-4o-generated (from parametric knowledge), for 10,000 focal papers ($\approx$ 275k references) from SciSciNet, and add a field-matched random baseline that preserves out-degree and field distributions while breaking latent structure. We compare (i) structure-only node features (degree/closeness/eigenvector centrality, clustering, edge count) with (ii) 3072-D title/abstract embeddings, using a random forest (RF) on graph-level aggregates and Graph Neural Networks (GNNs) with node features. Structure alone barely separates GPT from ground truth (RF accuracy $\approx$ 0.60) despite cleanly rejecting the random baseline ($\approx$ 0.89--0.92). By contrast, embeddings sharply increase separability: RF on aggregated embeddings reaches $\approx$ 0.83, and GNNs with embedding node features achieve 93% test accuracy on GPT vs. ground truth. We show the robustness of our findings by replicating the pipeline with Claude Sonnet 4.5 and with multiple embedding models (OpenAI and SPECTER), with RF separability for ground truth vs. Claude $\approx 0.77$ and clean rejection of the random baseline. Thus, LLM bibliographies, generated purely from parametric knowledge, closely mimic human citation topology, but leave detectable semantic fingerprints; detection and debiasing should target content signals rather than global graph structure.


Poster
P3-#624
TEN-DM: Topology-Enhanced Diffusion Model for Spatio-Temporal Event Prediction

Yuxin Liu ⋅ Kaiming Wang ⋅ Chenguang Yang ⋅ Yulia Gel ⋅ Yuzhou Chen

Spatio-temporal point process (STPP) data appear in many domains. A natural way to model them is to describe how the instantaneous event rate varies over space and time given the observed history, which enables interpretation, interaction detection, and forecasting. Traditional parametric kernel-based models, while historically dominant, struggle to capture complex nonlinear patterns. In contrast, deep learning methods leverage the representational power of neural networks to aggregate historical events and integrate spatio-temporal point processes. However, existing deep learning methods often process space and time independently, overlooking the spatio-temporal dependencies. To address this limitation, we propose a novel method called Topology-ENhanced Diffusion Model (TEN-DM), which comprises two key components: spatio-temporal graph construction and multimodal topological feature representation learning. Further, we use a temporal query technique to capture periodic temporal patterns and learn effective temporal representations. Extensive experiments show the effectiveness of TEN-DM on multiple STPP datasets compared to state-of-the-art methods.


Poster
P3-#717
e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs

Amrith Setlur ⋅ Matthew Yang ⋅ Charlie Snell ⋅ Jeremiah Greer ⋅ Ian Wu ⋅ Virginia Smith ⋅ Max Simchowitz ⋅ Aviral Kumar

Test-time scaling offers a promising path to improve LLM reasoning by utilizing more compute at inference time; however, the true promise of this paradigm lies in extrapolation (i.e., improvement in performance on hard problems as LLMs keep "thinking" for longer, beyond the maximum token budget they were trained on). Surprisingly, we find that most existing reasoning models do not extrapolate well. We show that one way to enable extrapolation is by training the LLM to perform in-context exploration: training the LLM to effectively spend its test-time budget by chaining operations (such as generation, verification, and refinement), or testing multiple hypotheses before it commits to an answer. To enable in-context exploration, we identify three key ingredients as part of our recipe e3: (1) chaining skills that the base LLM has asymmetric competence in, e.g., chaining verification (easy) with generation (hard), as a way to implement in-context search; (2) leveraging "negative" gradients from incorrect traces to amplify exploration during RL, resulting in longer search traces that chain additional asymmetries; and (3) coupling task difficulty with training token budget during training via a specifically designed curriculum to structure in-context exploration. Our recipe e3 produces the best known 1.7B model according to AIME'25 and HMMT'25 scores, and extrapolates to 2x the training token budget. Our e3-1.7B model not only attains high pass@1 scores, but also improves pass@k over the base model.


Poster
P3-#718
Energy-Based Transformers are Scalable Learners and Thinkers

Alexi Gladstone ⋅ Ganesh Nanduru ⋅ Md Mofijul Islam ⋅ Peixuan Han ⋅ Hyeonjeong Ha ⋅ Aman Chadha ⋅ Yilun Du ⋅ Heng Ji ⋅ Jundong Li ⋅ Tariq Iqbal

Inference-time computation, analogous to human System 2 Thinking, has recently become popular for improving model performance. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question “Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?” We find the answer is yes, by learning to explicitly verify the compatibility between inputs and candidate-predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers (EBTs)---a new class of Energy-Based Models (EBMs)---to assign an energy value to every input and candidate-prediction, enabling predictions through energy minimization until convergence. To support this approach, we introduce several key techniques for stable and parallelizable training, which enable the emergence of strong System 2 Thinking capabilities and scalable EBMs. Across discrete and continuous modalities, we find EBTs outperform the Transformer++ approach, scaling up to 35% faster during pretraining, and improving inference-time performance by up to 29%. EBTs also surpass Diffusion Transformers on image denoising while requiring 99% fewer forward passes. Moreover, System 2 Thinking with EBTs yields larger performance gains on data that is farther out-of-distribution, and EBTs achieve better results than existing models on most downstream tasks despite achieving the same or worse pretraining performance, enabling EBTs to generalize better than existing approaches. Consequently, EBTs are a flexible and exciting new approach for scaling both the learning and thinking capabilities of models.


Poster
P3-#719
DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation

Wei Pan ⋅ Huiguo He ⋅ Hiuyi Cheng ⋅ Yilin Shi ⋅ Lianwen Jin

Deep generative models have advanced text-to-online handwriting generation (TOHG), which aims to synthesize realistic pen trajectories conditioned on textual input and style references. However, most existing methods still primarily focus on character- or word-level generation, resulting in inefficiency and a lack of holistic structural modeling when applied to full text lines. To address these issues, we propose DiffInk, the first latent diffusion Transformer framework for full-line handwriting generation. We first introduce InkVAE, a novel sequential variational autoencoder enhanced with two complementary latent-space regularization losses: (1) an OCR-based loss enforcing glyph-level accuracy, and (2) a style-classification loss preserving writing style. This dual regularization yields a semantically structured latent space where character content and writer styles are effectively disentangled. We then introduce InkDiT, a novel latent diffusion Transformer that integrates target text and reference styles to generate coherent pen trajectories. Experimental results demonstrate that DiffInk outperforms existing state-of-the-art (SOTA) methods in both glyph accuracy and style fidelity, while significantly improving generation efficiency.


Poster
P3-#720
Retrieval-of-Thought: Efficient Reasoning via Reusing Thoughts

Ammar Ahmed ⋅ Azal Ahmad Khan ⋅ Ayaan Ahmad ⋅ Sheng Di ⋅ Zirui Liu ⋅ Ali Anwar

Large reasoning models improve accuracy by producing long reasoning traces, but this inflates latency and cost, motivating inference-time efficiency. We propose Retrieval-of-Thought (RoT), which reuses prior reasoning as composable "thought" steps to guide new problems. RoT organizes steps into a thought graph with sequential and semantic edges to enable fast retrieval and flexible recombination. At inference, RoT retrieves query-relevant nodes and applies reward-guided traversal to assemble a problem-specific template that guides generation. This dynamic template reuse reduces redundant exploration and, therefore, reduces output tokens while preserving accuracy. We evaluate RoT on reasoning benchmarks with multiple models, measuring accuracy, token usage, latency, and memory overhead. Findings show small prompt growth but substantial efficiency gains, with RoT reducing output tokens by up to 40%, inference latency by 82%, and cost by 59% while maintaining accuracy. RoT establishes a scalable paradigm for efficient LRM reasoning via dynamic template construction through retrieval.


Poster
P3-#721
A Study of Posterior Stability in Time-Series Latent Diffusion

Yangming Li ⋅ Yixin Cheng ⋅ Mihaela van der Schaar

Latent diffusion has achieved remarkable success in image generation, with high sampling efficiency. However, this framework might suffer from posterior collapse when applied to time series. In this work, we first show that latent diffusion with a collapsed posterior degenerates into a much weaker generative model, a variational autoencoder (VAE). This finding highlights the significance of addressing the problem. We then introduce a principled method: dependency measures, which quantify the sensitivity of a recurrent decoder to input variables. Through this method, we confirm that posterior collapse seriously affects latent time-series diffusion on real time series. For example, the latent variable has an exponentially decreasing impact on the decoder over time. Building on our theoretical and empirical studies, we finally introduce a new framework: posterior-stable latent diffusion, which interprets the diffusion process as a type of variational inference. In this way, it eliminates the use of risky KL regularization and penalizes decoder insensitivity. Extensive experiments on multiple real time-series datasets show that our new framework has a highly stable posterior and notably outperforms previous baselines in time series synthesis.


Poster
P3-#722
Pareto-Conditioned Diffusion Models for Offline Multi-Objective Optimization

Jatan Shrestha ⋅ Santeri Heiskanen ⋅ Kari Hepola ⋅ Severi Rissanen ⋅ Pekka Jääskeläinen ⋅ Joni Pajarinen

Multi-objective optimization (MOO) arises in many real-world applications where trade-offs between competing objectives must be carefully balanced. In the offline setting, where only a static dataset is available, the main challenge is generalizing beyond observed data. We introduce Pareto-Conditioned Diffusion (PCD), a novel framework that formulates offline MOO as a conditional sampling problem. By conditioning directly on desired trade-offs, PCD avoids the need for explicit surrogate models. To effectively explore the Pareto front, PCD employs a reweighting strategy that focuses on high-performing samples and a reference-direction mechanism to guide sampling towards novel, promising regions beyond the training data. Experiments on standard offline MOO benchmarks show that PCD achieves highly competitive performance and, importantly, demonstrates greater consistency across diverse tasks than existing offline MOO approaches.


Poster
P3-#723
The Diffusion Duality, Chapter II: $\Psi$-Samplers and Efficient Curriculum

Justin Deschenaux ⋅ Caglar Gulcehre ⋅ Subham Sekhar Sahoo

Uniform-state discrete diffusion models excel at few-step generation and guidance due to their ability to self-correct, making them preferred over autoregressive or Masked diffusion models in these settings. However, their sampling quality plateaus with ancestral samplers as the number of steps increases. We introduce a family of Predictor-Corrector (PC) samplers for discrete diffusion that generalize prior methods and apply to arbitrary noise processes. When paired with uniform-state diffusion, our samplers outperform ancestral sampling on both language and image modeling, achieving lower generative perplexity at matched unigram entropy on OpenWebText and better FID/IS scores on CIFAR10. Crucially, unlike conventional samplers, our PC methods continue to improve with more sampling steps. Taken together, these findings call into question the assumption that Masked diffusion is the inevitable future of diffusion-based language modeling. Beyond sampling, we develop a memory-efficient curriculum for the Gaussian relaxation training phase, reducing training time by 25% and memory by 33% compared to Duo while maintaining comparable perplexity on OpenWebText and LM1B and strong downstream performance. We release code, checkpoints, and a video tutorial at https://s-sahoo.github.io/duo-ch2/


Poster
P3-#724
Generative Modeling from Black-Box Corruptions via Self-Consistent Stochastic Interpolants

Chirag Modi ⋅ Jiequn Han ⋅ Eric Vanden-Eijnden ⋅ Joan Bruna

Transport-based methods have emerged as a leading paradigm for building generative models from large, clean datasets. However, in many scientific and engineering domains, clean data are often unavailable: instead, we only observe measurements corrupted through a noisy, ill-conditioned channel. A generative model for the original data thus requires solving an inverse problem at the level of distributions. In this work, we introduce a novel approach to this task based on Stochastic Interpolants: we iteratively update a transport map between corrupted and clean data samples using only access to the corrupted dataset as well as black-box access to the corruption channel. Under appropriate conditions, this iterative procedure converges towards a self-consistent transport map that effectively inverts the corruption channel, thus enabling a generative model for the clean data. We refer to the resulting method as the self-consistent stochastic interpolant (SCSI). It (i) is computationally efficient compared to variational alternatives, (ii) is highly flexible, handling arbitrary nonlinear forward models with only black-box access, and (iii) enjoys theoretical guarantees. We demonstrate superior performance on inverse problems in natural image processing and scientific reconstruction, and establish convergence guarantees of the scheme under appropriate assumptions.


Poster
P3-#1904
FlashDLM: Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion

Zhanqiu Hu ⋅ Jian Meng ⋅ Yash Akhauri ⋅ Mohamed Abdelfattah ⋅ Jae-sun Seo ⋅ Zhiru Zhang ⋅ Udit Gupta

Diffusion language models offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling compared to autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. While they match the quality of similarly sized Autoregressive (AR) Models (e.g., Qwen2.5 7B, Llama3 8B), their iterative denoising requires multiple full-sequence forward passes, resulting in high computational costs and latency, particularly for long input prompts and long-context scenarios. Furthermore, parallel token generation introduces token incoherence problems, and current sampling heuristics suffer from significant quality drops with decreasing denoising steps. We address these limitations with two training-free techniques. First, we propose *FreeCache*, a Key-Value (KV) approximation caching technique that reuses stable KV projections across denoising steps, effectively reducing the computational cost of DLM inference. Second, we introduce *Guided Diffusion*, a training-free method that uses a lightweight pretrained autoregressive model to supervise token unmasking, dramatically reducing the total number of denoising iterations without sacrificing quality. We conduct extensive evaluations on open-source reasoning benchmarks, and our combined methods deliver an average 12.14× end-to-end speedup across various tasks with negligible accuracy degradation. For the first time, diffusion language models achieve latency comparable to, or even lower than, that of widely adopted autoregressive models. Our work paves the way for scaling up diffusion language models to a broader range of applications across different domains. Our code and implementation are available at https://github.com/ZhanqiuHu/flash-dlm-experimental.


Poster
P3-#725
Learning to Reason Efficiently with Discounted Reinforcement Learning

Alex Ayoub ⋅ Kavosh Asadi ⋅ Dale Schuurmans ⋅ Csaba Szepesvari ⋅ Karim Bouyarmane

Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency. More broadly, in goal-reaching sequential decision problems we often want to reach the goal quickly, and LRM reasoning can be viewed through this lens. We challenge the assumption that longer responses improve accuracy. By penalizing reasoning tokens using a discounted reinforcement learning setup (interpretable as a small token cost) and analyzing Blackwell optimality in restricted policy classes, we encourage concise yet accurate reasoning, analogous to preferring shorter successful trajectories in a stochastic shortest path problem. Experiments confirm our theoretical results that this approach shortens chains of thought while preserving accuracy.
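The effect of discounting on response length can be illustrated with a minimal sketch. It assumes a single terminal reward of 1 for a correct answer and a hypothetical discount factor of 0.999; neither value comes from the abstract, and this is not the authors' training setup.

```python
def discounted_return(num_tokens: int, correct: bool, gamma: float = 0.999) -> float:
    """Return of a response whose only reward (1 if correct) arrives after num_tokens tokens."""
    return gamma ** num_tokens if correct else 0.0


def implied_token_cost(gamma: float) -> float:
    """First-order per-token cost: gamma**T is roughly 1 - T * (1 - gamma) when T * (1 - gamma) is small."""
    return 1.0 - gamma


# Hypothetical response lengths, chosen only for illustration.
short_correct = discounted_return(200, correct=True)
long_correct = discounted_return(2000, correct=True)
short_wrong = discounted_return(50, correct=False)
```

Under these assumptions a correct short response scores higher than a correct long one, and both score higher than an incorrect response of any length, which is the ordering the abstract appeals to.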


Poster
P3-#726
Flower: A Flow-Matching Solver for Inverse Problems

Mehrsa Pourya ⋅ Bassam El Rawas ⋅ Michael Unser

We introduce Flower, a solver for linear inverse problems. It leverages a pre-trained flow model to produce reconstructions that are consistent with the observed measurements. Flower operates through an iterative procedure over three steps: (i) a flow-consistent destination estimation, where the velocity network predicts a denoised target; (ii) a refinement step that projects the estimated destination onto a feasible set defined by the forward operator; and (iii) a time-progression step that re-projects the refined destination along the flow trajectory. We provide a theoretical analysis that demonstrates how Flower approximates Bayesian posterior sampling, thereby unifying perspectives from plug-and-play methods and generative inverse solvers. On the practical side, Flower achieves state-of-the-art reconstruction quality while using nearly identical hyperparameters across various linear inverse problems. Our code is available at https://github.com/mehrsapo/Flower.
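The three steps above can be sketched as a short loop. This is a sketch under stated assumptions, not the released implementation: it assumes the linear path x_t = (1 - t) * noise + t * data, noiseless measurements y = A x (so the refinement becomes a hard projection via the pseudoinverse), and a toy closed-form velocity in place of a trained network. The names `flower_like_solve` and `toy_velocity` are hypothetical.

```python
import numpy as np


def flower_like_solve(velocity, A, y, dim, n_steps=10, seed=0):
    """Iterate destination estimation, projection, and time progression."""
    rng = np.random.default_rng(seed)
    A_pinv = np.linalg.pinv(A)
    ts = np.linspace(0.0, 1.0, n_steps + 1)
    x = rng.standard_normal(dim)  # start from pure noise at t = 0
    for t, t_next in zip(ts[:-1], ts[1:]):
        # (i) destination estimate: follow the velocity to the end of the path
        x1_hat = x + (1.0 - t) * velocity(x, t)
        # (ii) refinement: project onto {x : A x = y} (assumes noiseless data)
        x1_ref = x1_hat - A_pinv @ (A @ x1_hat - y)
        # (iii) time progression: re-noise the refined destination to time t_next
        eps = rng.standard_normal(dim)
        x = (1.0 - t_next) * eps + t_next * x1_ref
    return x


# Toy setup: the "prior" is a point mass at x_star, whose exact velocity is known.
x_star = np.array([1.0, -2.0, 0.5])


def toy_velocity(x, t):
    return (x_star - x) / (1.0 - t)


A = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # observe the first two coordinates
y = np.array([3.0, 4.0])
x_rec = flower_like_solve(toy_velocity, A, y, dim=3)
```

In this toy case the observed coordinates are pulled to the measurements while the unobserved coordinate is filled in from the prior.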


Poster
P3-#826
Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding

Daisuke Oba ⋅ Danushka Bollegala ⋅ Masahiro Kaneko ⋅ Naoaki Okazaki

Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. However, they still recompute the attention and feed-forward blocks for every token position at every step, even when many unmasked tokens are essentially fixed, resulting in substantial wasted compute. We propose **SureLock**: when the posterior at an unmasked position has stabilized across steps (our *sure* condition), we *lock* that position, thereafter skipping its query projection and feed-forward sublayers, while caching its attention keys and values so other positions can continue to attend to it. This reduces the dominant per-iteration computational cost from $O(N^2d)$ to $O(MNd)$, where $N$ is the sequence length, $M$ is the number of unlocked token positions, and $d$ is the model dimension. In practice, $M$ decreases as the iteration progresses, yielding substantial savings. On LLaDA-8B, SureLock reduces algorithmic FLOPs by 30–50% relative to the same sampler without locking, while maintaining comparable generation quality. We also provide a theoretical analysis to justify the design rationale of SureLock: monitoring only the local KL at the lock step suffices to bound the deviation in final token probabilities. Our project page is available at https://daioba.github.io/surelock.
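A minimal sketch of the locking rule follows, assuming the *sure* condition is a per-position KL divergence between consecutive-step posteriors falling below a threshold. The threshold value and the function names are hypothetical, and the sketch only tracks which positions are locked; it does not implement the cached attention itself.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """Row-wise KL(p || q) for arrays of shape (positions, vocab)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)


def update_locks(prev_probs, curr_probs, unmasked, locked, threshold=1e-3):
    """Lock positions that are already unmasked and whose posterior has stabilized."""
    stable = kl_divergence(curr_probs, prev_probs) < threshold
    return locked | (unmasked & stable)


def attention_cost_ratio(locked):
    """Ratio O(M N d) / O(N^2 d) = M / N, with M the number of unlocked positions."""
    return float((~locked).sum()) / locked.size


uniform = [1 / 3, 1 / 3, 1 / 3]
prev_probs = np.array([[0.9, 0.05, 0.05], [0.4, 0.3, 0.3], [0.2, 0.7, 0.1], uniform])
curr_probs = np.array([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.2, 0.7, 0.1], uniform])
unmasked = np.array([True, True, False, False])
locked = update_locks(prev_probs, curr_probs, unmasked, np.zeros(4, dtype=bool))
ratio = attention_cost_ratio(locked)
```

Only the first position is locked here: the second is unmasked but still changing, and the last two are stable but still masked.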


Poster
P3-#825
Sample Reward Soups: Query-efficient Multi-Reward Guidance for Text-to-Image Diffusion Models

Yinghua Yao ⋅ Yuangang Pan ⋅ Guoji Fu ⋅ Ivor Tsang

Recent advances in inference-time alignment of diffusion models have shown reduced susceptibility to reward over-optimization. However, when aligning with multiple black-box reward functions, the number of required queries grows exponentially with the number of reward functions, making the alignment process highly inefficient. To address the challenge, we propose the first inference-time soup strategy, named Sample Reward Soups (SRSoup), for Pareto-optimal sampling across the entire space of preferences. Specifically, at each denoising step, we independently steer multiple denoising distributions using reward-guided search gradients (one for each reward function) and then linearly interpolate their search gradients. This design is effective because sample rewards can be shared when two denoising distributions are close, particularly during the early stages of the denoising process. As a result, SRSoup significantly reduces the number of queries required in the early stages without sacrificing performance. Extensive experiments demonstrate the effectiveness of SRSoup in aligning T2I models with diverse reward functions, establishing a practical and scalable solution. The code is available at https://github.com/EvaFlower/Sample-Reward-Soups-ICLR26.


Poster
P3-#824
ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models

Akshat Ramachandran ⋅ Marina Neseem ⋅ Charbel Sakr ⋅ Rangharajan Venkatesan ⋅ Brucek Khailany ⋅ Tushar Krishna

The long-output context generation of large reasoning models enables extended chain of thought (CoT) but also drives rapid growth of the key–value (KV) cache, quickly overwhelming GPU memory. To address this challenge, we propose ThinKV, a thought-adaptive KV cache compression framework. ThinKV is based on the observation that attention sparsity reveals distinct thought types with varying importance within the CoT. It applies a hybrid quantization–eviction strategy, assigning token precision by thought importance and progressively evicting tokens from less critical thoughts as reasoning trajectories evolve. Furthermore, to implement ThinKV, we design a kernel that extends PagedAttention to enable efficient reuse of evicted tokens' memory slots, eliminating compaction overheads. Extensive experiments on DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason across mathematics and coding benchmarks show that ThinKV achieves near-lossless accuracy with less than 5% of the original KV cache, while improving performance with up to 5.8x higher inference throughput over SoTA baselines.


Poster
P3-#823
Soft-Masked Diffusion Language Models

Michael Hersche ⋅ Samuel Moor-Smith ⋅ Thomas Hofmann ⋅ Abbas Rahimi

Diffusion models have demonstrated strong potential in language modeling, offering various advantages over traditional autoregressive approaches. Their ability to generate and revise entire responses in parallel enables faster generation and built-in self-correction mechanisms. Most modern diffusion-based language models employ masked diffusion, where decoding involves iteratively processing masked tokens based on a binary decision: either retaining the mask or replacing it with the predicted token. However, this binary choice discards valuable predictive information when the mask is retained. To address this limitation, we introduce soft-masking (SM), a novel method that dynamically blends the embedding of the mask token with the embeddings of the top-k predicted tokens from the previous decoding step, for each retained mask. This provides the model with a more informative prior, preserving context from earlier computations and allowing partial information about masked tokens to propagate beyond a single step. We propose a training methodology that efficiently adapts masked diffusion language models to incorporate SM. We demonstrate that training a 169M parameter model from scratch with SM yields superior perplexity and MAUVE scores compared to binary masking baselines. Similarly, a pretrained model can be enhanced with SM through continued pretraining. Finally, we finetune two state-of-the-art diffusion models, Dream-7B and Dream-Coder-7B, with SM. SM consistently improves performance across multiple coding benchmarks, particularly in high-throughput settings.
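The blending step can be sketched as follows. This is an illustration under assumptions: the top-k embeddings are weighted by their renormalized probabilities and combined with the mask embedding through a fixed coefficient `mix`, whereas the abstract describes the blend as dynamic. The function name and parameters are hypothetical.

```python
import numpy as np


def soft_mask_embedding(mask_emb, token_embs, probs, k=3, mix=0.5):
    """Blend the mask embedding with the top-k predicted token embeddings.

    mask_emb:   (d,) embedding of the mask token
    token_embs: (vocab, d) embedding table
    probs:      (vocab,) predicted distribution from the previous decoding step
    """
    top = np.argsort(probs)[-k:]            # indices of the k most likely tokens
    weights = probs[top] / probs[top].sum()  # renormalize over the top-k
    predicted = weights @ token_embs[top]
    return (1.0 - mix) * mask_emb + mix * predicted


token_embs = np.eye(4)  # toy embedding table: one-hot rows
mask_emb = np.zeros(4)
probs = np.array([0.1, 0.6, 0.2, 0.1])
blended = soft_mask_embedding(mask_emb, token_embs, probs, k=2, mix=0.5)
```

Setting `mix=0` recovers ordinary binary masking, where a retained mask carries no information from the previous step.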


Poster
P3-#822
Diverse Text-to-Image Generation via Contrastive Noise Optimization

Byungjun Kim ⋅ Soobin Um ⋅ Jong Chul YE

Text-to-image (T2I) diffusion models have demonstrated impressive performance in generating high-fidelity images, largely enabled by text-guided inference. However, this advantage often comes with a critical drawback: limited diversity, as outputs tend to collapse into similar modes under strong text guidance. Existing approaches typically optimize intermediate latents or text conditions during inference, but these methods deliver only modest gains or remain sensitive to hyperparameter tuning. In this work, we introduce Contrastive Noise Optimization, a simple yet effective method that addresses the diversity issue from a distinct perspective. Unlike prior techniques that adapt intermediate latents, our approach shapes the initial noise to promote diverse outputs. Specifically, we develop a contrastive loss defined in the Tweedie data space and optimize a batch of noise latents. Our contrastive optimization repels instances within the batch to maximize diversity while keeping them anchored to a reference sample to preserve fidelity. We further provide theoretical insights into the mechanism of this preprocessing to substantiate its effectiveness. Extensive experiments across multiple T2I backbones demonstrate that our approach achieves a superior quality-diversity Pareto frontier while remaining robust to hyperparameter choices.


Poster
P3-#821
Shift-and-Sum Quantization for Visual Autoregressive Models

Jaehyeon Moon ⋅ Bumsub Ham

Post-training quantization (PTQ) enables efficient deployment of deep networks using a small set of data. Its application to visual autoregressive models (VAR), however, remains relatively unexplored. We identify two key challenges for applying PTQ to VAR: (i) large reconstruction errors in attention–value products, especially at coarse scales where high attention scores occur more frequently; and (ii) a discrepancy between the sampling frequencies of codebook entries and their predicted probabilities due to limited calibration data. To address these challenges, we propose a PTQ framework tailored for VAR. First, we introduce a shift-and-sum quantization method that reduces reconstruction errors by aggregating quantized results from symmetrically shifted duplicates of value tokens. Second, we present a resampling strategy for calibration data that aligns sampling frequencies of codebook entries with their predicted probabilities. Experiments on class-conditional image generation, in-painting, out-painting, and class-conditional editing show consistent improvements across VAR architectures, establishing a new state of the art in PTQ for VAR.


Poster
P3-#820
RADAR: Reasoning-Ability and Difficulty-Aware Routing for Reasoning LLMs

Nigel Steven Fernandez ⋅ Branislav Kveton ⋅ Ryan Rossi ⋅ Andrew Lan ⋅ Jack Wang

Reasoning language models have demonstrated remarkable performance on many challenging tasks in math, science, and coding. Choosing the right reasoning model for practical deployment involves a performance-cost trade-off at two key levels: model size and reasoning budget, where larger models and higher reasoning budgets lead to better performance but incur greater cost and latency. In this work, we tackle this tradeoff from the angle of model configuration routing for different queries, and present RADAR (Reasoning–Ability and Difficulty-Aware Routing), a lightweight, interpretable, and scalable routing framework. Inspired by psychometrics, RADAR learns an item response model from model responses with different budgets to different queries, with interpretable parameters including query difficulties and model-budget abilities. RADAR then routes queries with higher difficulty to model-budget pairs with higher ability, and vice versa. We conduct extensive experiments on 8 widely used challenging reasoning benchmarks, demonstrating the superior performance of RADAR compared to state-of-the-art model routing methods. RADAR also exhibits query generalization capabilities, achieving strong performance on out-of-distribution queries on all benchmarks. RADAR is also scalable and can efficiently integrate additional models by dynamically selecting a small set of evaluation queries to estimate their abilities.
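A minimal sketch of difficulty-aware routing with an item response model is shown below. It assumes the simplest one-parameter (Rasch) form, where the probability of a correct answer is a sigmoid of ability minus difficulty, and a hypothetical rule that picks the cheapest configuration meeting a target probability. The configuration names, abilities, costs, and target are invented for illustration.

```python
import math


def p_correct(ability: float, difficulty: float) -> float:
    """Rasch-style item response: probability that a configuration answers correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def route(difficulty: float, configs, target: float = 0.7) -> str:
    """Pick the cheapest (name, ability, cost) config expected to meet the target.

    Falls back to the most able config when none meets the target.
    """
    good_enough = [c for c in configs if p_correct(c[1], difficulty) >= target]
    if good_enough:
        return min(good_enough, key=lambda c: c[2])[0]
    return max(configs, key=lambda c: c[1])[0]


# Hypothetical model-budget pairs: (name, ability, relative cost).
configs = [
    ("small-low", 0.0, 1.0),
    ("small-high", 1.0, 3.0),
    ("large-high", 3.0, 10.0),
]
```

With these numbers, `route(-1.5, configs)` selects the cheapest pair, while harder queries escalate to pairs with higher ability.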


Poster
P3-#819
Pedagogically-Inspired Data Synthesis for Language Model Knowledge Distillation

Bowei He ⋅ Yankai Chen ⋅ Xiaokun Zhang ⋅ Linghe Kong ⋅ Philip Yu ⋅ Xue Liu ⋅ Chen Ma

Knowledge distillation from Large Language Models (LLMs) to smaller models has emerged as a critical technique for deploying efficient AI systems. However, current methods for distillation via synthetic data lack pedagogical awareness, treating knowledge transfer as a one-off data synthesis and training task rather than a systematic learning process. In this paper, we propose a novel pedagogically-inspired framework for LLM knowledge distillation that draws from fundamental educational principles. Our approach introduces a three-stage pipeline (Knowledge Identifier, Organizer, and Adapter; IOA) that systematically identifies knowledge deficiencies in student models, organizes knowledge delivery through progressive curricula, and adapts representations to match the cognitive capacity of student models. We integrate Bloom's Mastery Learning Principles and Vygotsky's Zone of Proximal Development to create a dynamic distillation process where student models approach the teacher model's performance on prerequisite knowledge before advancing, and new knowledge is introduced with controlled, gradual difficulty increments. Extensive experiments using LLaMA-3.1/3.2 and Qwen2.5 as student models demonstrate that IOA achieves significant improvements over baseline distillation methods, with student models retaining 94.7% of teacher performance on DollyEval while using less than 1/10th of the parameters. Our framework particularly excels in complex reasoning tasks, showing 19.2% improvement on MATH and 22.3% on HumanEval compared with state-of-the-art baselines.


Poster
P3-#818
Composition of Pretrained Diffusion Models: A Logic-Based Calculus

Peter Blohm ⋅ Vikas Garg

Composing pretrained diffusion models provides a cost-effective mechanism to encode constraints and unlock complex generative capabilities. Prior work relies on crafting compositional operators that seek to extend set-theoretic notions such as union and intersection to diffusion models, e.g., using a product or mixture of the underlying energy functions. We expose the inadequacy and inconsistency of combining these operators in terms of limited mode coverage, biased sampling, instability under negation queries, and failure to satisfy basic compositional laws such as idempotency and distributivity. We introduce a principled calculus grounded in fuzzy logic that resolves these issues. Specifically, we define a general class of conjunction, disjunction, and negation operators that generalize the classical mixtures, illustrating how they circumvent various pathologies and enable precise combinatorial reasoning with score models. Beyond existing methods, the proposed Dombi operators yield complex generative outcomes, such as the Exclusive-OR (XOR) of individual scores. We establish rigorous theoretical guarantees on the stability and temperature scaling of Dombi compositions, and derive Feynman-Kac correctors to mitigate the sampling bias in score composition. Empirical results on image generation with stable diffusion and multi-objective molecular generation substantiate the conceptual, theoretical, and methodological benefits. Overall, this work lays the foundation for systematic design, analysis, and deployment of diffusion ensembles. Code is available at https://github.com/Aalto-QuML/logic-diffusion-composition
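For orientation, the classical Dombi operators from fuzzy logic can be written down on membership values strictly between 0 and 1. The sketch below shows only these scalar operators with a shape parameter `lam`; how the paper lifts them to score functions of diffusion models is not reproduced here, and the XOR construction is one standard way to build it from AND, OR, and negation rather than necessarily the authors' definition.

```python
def dombi_and(a: float, b: float, lam: float = 1.0) -> float:
    """Dombi t-norm (conjunction) for a, b in the open interval (0, 1)."""
    s = ((1 - a) / a) ** lam + ((1 - b) / b) ** lam
    return 1.0 / (1.0 + s ** (1.0 / lam))


def dombi_or(a: float, b: float, lam: float = 1.0) -> float:
    """Dombi t-conorm (disjunction) for a, b in the open interval (0, 1)."""
    s = ((1 - a) / a) ** (-lam) + ((1 - b) / b) ** (-lam)
    return 1.0 / (1.0 + s ** (-1.0 / lam))


def negate(a: float) -> float:
    """Standard fuzzy negation."""
    return 1.0 - a


def dombi_xor(a: float, b: float, lam: float = 1.0) -> float:
    """(a AND NOT b) OR (NOT a AND b), built from the operators above."""
    return dombi_or(dombi_and(a, negate(b), lam), dombi_and(negate(a), b, lam), lam)
```

The two binary operators are De Morgan duals under this negation, so `dombi_or(a, b)` equals `1 - dombi_and(1 - a, 1 - b)` for the same `lam`.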


Poster
P3-#817
Antithetic Noise in Diffusion Models

Jing Jia ⋅ Sifan Liu ⋅ Bowen Song ⋅ Wei Yuan ⋅ Liyue Shen ⋅ Guanyang Wang

We systematically study antithetic initial noise in diffusion models, discovering that pairing each noise sample with its negation consistently produces strong negative correlation. This universal phenomenon holds across datasets, model architectures, conditional and unconditional sampling, and even other generative models such as VAEs and Normalizing Flows. To explain it, we combine experiments and theory and propose a *symmetry conjecture* that the learned score function is approximately affine antisymmetric (odd symmetry up to a constant shift), supported by empirical evidence. This negative correlation leads to substantially more reliable uncertainty quantification with up to 90% narrower confidence intervals. We demonstrate these gains on tasks including estimating pixel-wise statistics and evaluating diffusion inverse solvers. We also provide extensions with randomized quasi-Monte Carlo noise designs for uncertainty quantification, and explore additional applications of the antithetic noise design to improve image editing and generation diversity. Our framework is training-free, model-agnostic, and adds no runtime overhead. Code is available at https://github.com/jjia131/Antithetic-Noise-in-Diffusion-Models-page.
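The variance-reduction mechanism behind antithetic pairing can be illustrated with a toy Monte Carlo experiment. In this sketch a simple monotone function `f` stands in for a scalar statistic of a generated sample; it is a hypothetical placeholder, not a diffusion model, so the numbers say nothing about the paper's reported gains.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(20000)


def f(noise):
    """Hypothetical stand-in for a scalar statistic of the sample generated from `noise`."""
    return np.exp(0.5 * noise)


# Correlation between outputs from a noise sample and its negation.
corr = np.corrcoef(f(z), f(-z))[0, 1]

half = len(z) // 2
# Pair averages from two independent draws versus a draw and its negation.
independent_pairs = 0.5 * (f(z[:half]) + f(z[half:]))
antithetic_pairs = 0.5 * (f(z[:half]) + f(-z[:half]))

var_independent = independent_pairs.var()
var_antithetic = antithetic_pairs.var()
```

Because the paired outputs are negatively correlated, the antithetic pair average has lower variance than the average of two independent draws at the same number of function evaluations, which is what narrows the confidence interval.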


Poster
P3-#816
Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function

Hyeongyu Kang ⋅ Jaewoo Lee ⋅ Woocheol Shin ⋅ Kiyoung Om ⋅ Jinkyoo Park

Diffusion models excel at generating high-likelihood samples but often require alignment with downstream objectives. Existing fine-tuning methods for diffusion models significantly suffer from reward over-optimization, resulting in high-reward but unnatural samples and degraded diversity. To mitigate over-optimization, we propose Soft Q-based Diffusion Finetuning (SQDF), a novel KL-regularized RL method for diffusion alignment that applies a reparameterized policy gradient of a training-free, differentiable estimation of the soft Q-function. SQDF is further enhanced with three innovations: a discount factor for proper credit assignment in the denoising process, the integration of consistency models to refine Q-function estimates, and the use of an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Our experiments demonstrate that SQDF achieves superior target rewards while preserving diversity in text-to-image alignment. Furthermore, in online black-box optimization, SQDF attains high sample efficiency while maintaining naturalness and diversity. Our code is available at https://github.com/Shin-woocheol/SQDF.


Poster
P3-#814
SESaMo: Symmetry-Enforcing Stochastic Modulation for Normalizing Flows

Janik Kreit ⋅ Dominic Schuh ⋅ Kim A. Nicoli ⋅ Lena Funcke

Deep generative models have recently garnered significant attention across various fields, from physics to chemistry, where sampling from unnormalized Boltzmann-like distributions represents a fundamental challenge. In particular, autoregressive models and normalizing flows have become prominent due to their appealing ability to yield closed-form probability densities. Moreover, it is well-established that incorporating prior knowledge, such as symmetries, into deep neural networks can substantially improve training performance. In this context, recent advances have focused on developing symmetry-equivariant generative models, achieving remarkable results. Building upon these foundations, this paper introduces Symmetry-Enforcing Stochastic Modulation (SESaMo). Similar to equivariant normalizing flows, SESaMo enables the incorporation of inductive biases (e.g., symmetries) into normalizing flows through a novel technique called *stochastic modulation*. This approach enhances the flexibility of the generative model by enforcing exact symmetries while, for the first time, enabling the model to learn broken symmetries during training. Our numerical experiments benchmark SESaMo in different scenarios, including an 8-Gaussian mixture model and physically relevant field theories, such as the $\phi^4$ theory and the Hubbard model.

The Schrödinger bridge (SB) has gained renewed attention in generative machine learning for its successful applications in various areas, including unsupervised image-to-image translation and particle crowd modeling. Recently, a promising algorithm dubbed GSBM was proposed to solve the generalized SB (GSB) problem, an extension of SB to deal with additional path constraints. Therein the SB is formulated as a minimal kinetic energy conditional flow matching problem, and an additional task-specific stage cost is introduced as the conditional stochastic optimal control (CondSOC) problem. The GSB is an emerging problem with considerable room for research contributions, and we introduce a novel Gaussian process pinned marginal path posterior inference as a contribution in this area. Our main motivation is that the stage cost in GSBM, typically representing task-specific obstacles in the particle paths and other congestion penalties, can be noisy and uncertain. Whereas the current GSBM approach regards this stage cost as a noise-free deterministic quantity in the CondSOC optimization, we instead model it as a stochastic quantity. Specifically, we impose a Gaussian process (GP) prior on the pinned marginal path, view the CondSOC objective as a (noisy) likelihood function, and infer the posterior path via sparse variational free-energy GP approximate inference. The main benefit is more flexible marginal path modeling that takes into account the uncertainty in the stage cost, such as more realistic noisy observations. On some image-to-image translation and crowd navigation problems under noisy scenarios, we show that our proposed GP-based method yields more robust solutions than the original GSBM.

In many real-world scenarios, obtaining fully observed samples is prohibitively expensive or even infeasible, while partial and noisy observations are comparatively easy to collect. In this work, we study distribution restoration with abundant noisy samples, assuming the corruption process is available as a black-box generator. We show that this task can be framed as a one-sided entropic optimal transport problem and solved via an EM-like algorithm. We further provide a test criterion to determine whether the true underlying distribution is recoverable under per-sample information loss, and show that in otherwise unrecoverable cases, a small number of clean samples can render the distribution largely recoverable. Building on these insights, we introduce SFBD-OMNI, a bridge model-based framework that maps corrupted sample distributions to the ground-truth distribution. Our method generalizes Stochastic Forward-Backward Deconvolution (SFBD; Lu et al., 2025) to handle arbitrary measurement models beyond Gaussian corruption. Experiments across benchmark datasets and diverse measurement settings demonstrate significant improvements in both qualitative and quantitative performance.


Poster
P3-#811
MrRoPE: Mixed-radix Rotary Position Embedding

Qingyuan Tian ⋅ Wenhong Zhu ⋅ Xiaoran Liu ⋅ Xiaofeng Wang ⋅ Rui Wang

Rotary Position Embedding (RoPE)-extension refers to modifying or generalizing the Rotary Position Embedding scheme to handle longer sequences than those encountered during pre-training. However, current extension strategies are highly diverse and lack a unified theoretical foundation. In this paper, we propose ***MrRoPE (Mixed-radix RoPE)***, a generalized encoding formulation based on a radix system conversion perspective, which elegantly unifies various RoPE-extension approaches as distinct radix conversion strategies. Based on this theory, we introduce two training-free extensions, ***MrRoPE-Uni*** and ***MrRoPE-Pro***, which leverage uniform and progressive radix conversion strategies, respectively, to achieve “train short, test long” generalization. Without fine-tuning, MrRoPE-Pro sustains over 85% recall in the 128K-context Needle-in-a-Haystack test and achieves more than double YaRN’s accuracy on Infinite-Bench retrieval and dialogue subsets. Theoretical analysis confirms that MrRoPE-Pro effectively raises the upper bound of RoPE's attainable encoding length, which further validates the reliability and utility of our theory and methodology.

Diffusion models achieve state-of-the-art image quality. However, sampling is costly at inference time because it requires a large number of function evaluations (NFEs). To reduce NFEs, classical ODE numerical methods have been adopted. Yet, the choice of prediction type and integration domain leads to different sampling behaviors. To address these issues, we introduce Dual-Solver, which generalizes multistep samplers through learnable parameters that continuously (i) interpolate among prediction types, (ii) select the integration domain, and (iii) adjust the residual terms. It retains the standard predictor-corrector structure while preserving second-order local accuracy. These parameters are learned via a classification-based objective using a frozen pretrained classifier (e.g., MobileNet or CLIP). For ImageNet class-conditional generation (DiT, GM-DiT) and text-to-image generation (SANA, PixArt-α), Dual-Solver improves FID and CLIP scores in the low-NFE regime ($3 \le \text{NFE} \le 9$) across backbones.


Poster
P3-#809
Learning Ordinal Probabilistic Reward from Preferences

Longze Chen ⋅ Lu Wang ⋅ Renke Shan ⋅ Ze Gong ⋅ Run Luo ⋅ Jiaming Li ⋅ Jing Luo ⋅ Qiyao Wang ⋅ Min Yang

Reward models are crucial for aligning large language models (LLMs) with human values and intentions. Existing approaches follow either Generative (GRMs) or Discriminative (DRMs) paradigms, yet both suffer from limitations: GRMs typically demand costly point-wise supervision, while DRMs produce uncalibrated relative scores that lack probabilistic interpretation. To address these challenges, we introduce a novel reward modeling paradigm: Probabilistic Reward Model (PRM). Instead of modeling reward as a deterministic scalar, our approach treats it as a random variable, learning a full probability distribution for the quality of each response. To make this paradigm practical, we present its closed-form, discrete realization: the Ordinal Probabilistic Reward Model (OPRM), which discretizes the quality score into a finite set of ordinal ratings. Building on OPRM, we propose a data-efficient training strategy called Region Flooding Tuning (RgFT). It enables rewards to better reflect absolute text quality by incorporating quality-level annotations, which guide the model to concentrate the probability mass within corresponding rating sub-regions. Experiments on various reward model benchmarks show that our method improves accuracy by 2.9% to 7.4% compared to prior reward models, demonstrating strong performance and data efficiency. Analysis of the score distribution provides evidence that our method captures not only relative rankings but also absolute quality.
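Treating reward as a distribution over ordinal ratings can be sketched as follows. The softmax head, the rating scale 1..K, and the tie-splitting preference rule with independent ratings are assumptions made for illustration; the abstract does not specify the closed form used by OPRM.

```python
import numpy as np


def rating_distribution(logits):
    """Softmax over K ordinal ratings (numerically stabilized)."""
    e = np.exp(logits - np.max(logits))
    return e / e.sum()


def expected_rating(probs):
    """Mean rating on the scale 1..K."""
    ratings = np.arange(1, len(probs) + 1)
    return float(ratings @ probs)


def prob_preferred(p_a, p_b):
    """P(rating_a > rating_b) + 0.5 * P(tie), assuming the two ratings are independent."""
    joint = np.outer(p_a, p_b)  # joint[i, j] = P(a has rating i+1, b has rating j+1)
    return float(np.tril(joint, -1).sum() + 0.5 * np.trace(joint))


p_good = np.array([0.0, 0.0, 1.0])
p_bad = np.array([1.0, 0.0, 0.0])
p_mixed = np.array([0.2, 0.3, 0.5])
```

A distribution of this kind yields both an absolute score (the expected rating) and a calibrated pairwise preference probability from the same output.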


Poster
P3-#808
Dynamic Multi-sample Mixup with Gradient Exploration for Open-set Graph Anomaly Detection

Caiyang Yu ⋅ Wei Ju ⋅ Haixin Wang ⋅ Yifan Wang ⋅ Ziyue Qiao

This paper studies the problem of open-set graph anomaly detection, which aims to generalize a graph neural network (GNN) trained with a small number of both normal and abnormal nodes to detect unseen anomalies different from training anomalies during inference. This problem is highly challenging due to both the data scarcity of unseen anomalies and the label scarcity for training nodes. Towards this end, we propose a novel approach named Dynamic Multi-sample Mixup with Gradient Exploration (DEMO) for open-set graph anomaly detection. The core of our proposed DEMO is to leverage a dynamic framework to adapt the optimization procedure with high generalizability. In particular, our DEMO first adaptively fuses multiple seen nodes to simulate the unseen anomalies, which expands the decision boundary for the detection model with enhanced generalizability. Moreover, we dynamically adjust sample weights based on their energy gradients to prioritize uncertain and informative nodes, ensuring a robust optimization procedure. To further address both label scarcity and severe class imbalance, we maintain a memory bank of historical records to guide the pseudo-labeling process of unlabeled nodes. Extensive experiments on various benchmark datasets validate the superiority of the proposed DEMO in comparison to various baselines.


Poster
P3-#807
Adaptive Mixture of Disentangled Experts for Dynamic Graph Out-of-Distribution Generalization

Chen Haibo ⋅ Xin Wang ⋅ Guanheng Chen ⋅ Yuan Meng ⋅ Haoyang Li ⋅ Yang Yao ⋅ Zeyang Zhang ⋅ Zhiqiang Zhang ⋅ JUN ZHOU ⋅ Ling Feng ⋅ Wenwu Zhu

Dynamic graph out-of-distribution (OOD) generalization has drawn an increasing amount of attention in the research community, given its wide applicability in real-world scenarios. Existing methods typically employ a fixed-architecture design to extract invariant patterns. However, there may exist evolving distribution shifts in dynamic graphs, leading to suboptimal performance of fixed-architecture designs. To address this issue, we propose what is, to the best of our knowledge, the first adaptive-architecture design to handle evolving distribution shifts over time. The proposed adaptive-architecture design introduces an adaptive mixture of architecture experts to capture invariant patterns under evolving distribution shifts, which poses three challenges: 1) How to detect and characterize evolving distribution shifts to inform architectural decisions; 2) How to dynamically route different expert architectures to handle varying distribution characteristics; 3) How to ensure that the adaptive mixture of experts effectively discovers invariant patterns. To solve these challenges, we propose a novel Adaptive Mixture of Disentangled Experts (AdaMix) model to adaptively route architecture experts to varying distribution shifts and jointly learn spatio-temporal invariant patterns. Specifically, we propose a spatio-temporal distribution detector to infer evolving distribution shifts by jointly leveraging historical and current information. Building upon this, we develop a prototype-guided mixture of disentangled experts that adaptively routes experts with disentangled factors to different distribution shifts. Finally, we design a distribution-aware intervention mechanism that discovers invariant patterns based on expert selection of nodes. Extensive experiments on both synthetic and real-world datasets demonstrate that our proposed AdaMix model significantly outperforms state-of-the-art baselines.

Dynamic Text-Attribute Graphs (DyTAGs), characterized by time-evolving graph interactions and associated text attributes, are prevalent in real-world applications. Existing methods, such as Graph Neural Networks (GNNs) and Large Language Models (LLMs), mostly focus on static TAGs. Extending these existing methods to DyTAGs is challenging as they largely neglect the *recent-global temporal semantics*: the recent semantic dependencies among interaction texts and the global semantic evolution of nodes over time. Furthermore, applying LLMs to the abundant and evolving text in DyTAGs faces efficiency issues. To tackle these challenges, we propose $\underline{Dy}$namic $\underline{G}$lobal-$\underline{R}$ecent $\underline{A}$daptive $\underline{S}$emantic $\underline{P}$rocessing (DyGRASP), a novel method that leverages LLMs and temporal GNNs to efficiently and effectively reason on DyTAGs. Specifically, we first design a node-centric implicit reasoning method together with a sliding window mechanism to efficiently capture recent temporal semantics. In addition, to capture global semantic dynamics of nodes, we leverage explicit reasoning with tailored prompts and an RNN-like chain structure to infer long-term semantics. Lastly, we intricately integrate the recent and global temporal semantics as well as the dynamic graph structural information using updating and merging layers. Extensive experiments on DyTAG benchmarks demonstrate DyGRASP's superiority, achieving up to 34% improvement in Hit@10 on the destination node retrieval task. Besides, DyGRASP exhibits strong generalization across different temporal GNNs and LLMs.


Poster
P3-#805
Controllable Logical Hypothesis Generation for Abductive Reasoning in Knowledge Graphs

Yisen Gao ⋅ Jiaxin Bai ⋅ Tianshi Zheng ⋅ Ziwei Zhang ⋅ Qingyun Sun ⋅ Xingcheng Fu ⋅ Jianxin Li ⋅ Yangqiu Song

Abductive reasoning in knowledge graphs aims to generate plausible logical hypotheses from observed entities, with broad applications in areas such as clinical diagnosis and scientific discovery. However, due to a lack of controllability, a single observation may yield numerous plausible but redundant or irrelevant hypotheses on large-scale knowledge graphs. To address this limitation, we introduce the task of controllable hypothesis generation to improve the practical utility of abductive reasoning. This task faces two key challenges when controlling for generating long and complex logical hypotheses: hypothesis space collapse and hypothesis reward oversensitivity. To address these challenges, we propose CtrlHGen, a Controllable logical Hypothesis Generation framework for abductive reasoning over knowledge graphs, trained in a two-stage paradigm including supervised learning and subsequent reinforcement learning. To mitigate hypothesis space collapse, we design a dataset augmentation strategy based on sub-logical decomposition, enabling the model to learn complex logical structures by leveraging semantic patterns in simpler components. To address hypothesis reward oversensitivity, we incorporate smoothed semantic rewards including Dice and Overlap scores, and introduce a condition-adherence reward to guide the generation toward user-specified control constraints. Extensive experiments on three benchmark datasets demonstrate that our model not only better adheres to control conditions but also achieves superior semantic similarity performance compared to baselines. Our code is available at https://github.com/HKUST-KnowComp/CtrlHGen.


Poster
P3-#804
PolyGraph Discrepancy: a classifier-based metric for graph generation

Markus Krimmel ⋅ Philip Hartout ⋅ Karsten Borgwardt ⋅ Dexiong Chen

Existing methods for evaluating graph generative models primarily rely on Maximum Mean Discrepancy (MMD) metrics based on graph descriptors. While these metrics can rank generative models, they do not provide an absolute measure of performance. Their values are also highly sensitive to extrinsic parameters, namely kernel and descriptor parametrization, making them incomparable across different graph descriptors. We introduce PolyGraph Discrepancy (PGD), a new evaluation framework that addresses these limitations. It approximates the Jensen-Shannon (JS) distance of graph distributions by fitting binary classifiers to distinguish between real and generated graphs, featurized by these descriptors. The data log-likelihood of these classifiers approximates a variational lower bound on the JS distance between the two distributions. Resulting scores are constrained to the unit interval $[0,1]$ and are comparable across different graph descriptors. We further derive a theoretically grounded summary score that combines these individual metrics to provide a maximally tight lower bound on the distance for the given descriptors. Thorough experiments demonstrate that PGD provides a more robust and insightful evaluation compared to MMD metrics. A reference implementation of PGD is available at https://github.com/BorgwardtLab/polygraph-benchmark.
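
Editor's sketch: the classifier-based bound described above can be illustrated on toy data. The snippet below is not the reference implementation. It assumes a plain logistic-regression classifier trained by gradient descent, generic descriptor vectors in place of real graph descriptors, and the standard variational bound on the JS divergence, JS(P, Q) >= log 2 + 0.5 E_P[log D(x)] + 0.5 E_Q[log(1 - D(x))], normalized by log 2 so that it lies in [0, 1]. All function names are hypothetical.

```python
import numpy as np


def fit_logistic(x, y, steps=2000, lr=0.1):
    """Fit logistic regression (with bias) by plain gradient descent."""
    X = np.hstack([x, np.ones((len(x), 1))])  # append a bias column
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w


def predict_proba(w, x):
    """Probability that each row of x is a 'real' sample."""
    X = np.hstack([x, np.ones((len(x), 1))])
    return 1.0 / (1.0 + np.exp(-X @ w))


def js_lower_bound(real_train, gen_train, real_test, gen_test):
    """Normalized variational lower bound on the JS divergence.

    The classifier is fit on the training descriptors and the bound is
    evaluated on held-out descriptors; the result is clipped to [0, 1].
    """
    x = np.vstack([real_train, gen_train])
    y = np.concatenate([np.ones(len(real_train)), np.zeros(len(gen_train))])
    w = fit_logistic(x, y)
    eps = 1e-12
    d_real = np.clip(predict_proba(w, real_test), eps, 1 - eps)
    d_gen = np.clip(predict_proba(w, gen_test), eps, 1 - eps)
    bound = (np.log(2)
             + 0.5 * np.mean(np.log(d_real))
             + 0.5 * np.mean(np.log(1 - d_gen)))
    return float(np.clip(bound / np.log(2), 0.0, 1.0))
```

A score near 0 means the classifier cannot tell generated descriptors from real ones; a score near 1 means they are almost perfectly separable.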


Poster
P3-#803
Contraction and Hourglass Persistence for Learning on Graphs, Simplices, and Cells

Mattie Ji ⋅ Indradyumna Roy ⋅ Vikas Garg

Persistent homology (PH) encodes global information, such as cycles, and is thus increasingly integrated into graph neural networks (GNNs). PH methods in GNNs typically traverse an increasing sequence of subgraphs. In this work, we first expose limitations of this inclusion procedure. To remedy these shortcomings, we analyze contractions as a principled topological operation, in particular, for graph representation learning. We study the persistence of contraction sequences, which we call Contraction Homology (CH). We establish that forward PH and CH differ in expressivity. We then introduce Hourglass Persistence, a class of topological descriptors that interleave a sequence of inclusions and contractions to boost expressivity, learnability, and stability. We also study related families parametrized by two paradigms. We also discuss how our framework extends to simplicial and cellular networks. We further design efficient algorithms that are pluggable into end-to-end differentiable GNN pipelines, enabling consistent empirical improvements over many PH methods across standard real-world graph datasets. Code is available at https://github.com/Aalto-QuML/Hourglass.


Poster
P3-#802
TGM: A Modular and Efficient Library for Machine Learning on Temporal Graphs

Jacob Chmura ⋅ Shenyang(Andy) Huang ⋅ Tran Gia Bao Ngo ⋅ Ali Parviz ⋅ Farimah Poursafaei ⋅ Jure Leskovec ⋅ Michael Bronstein ⋅ Guillaume Rabusseau ⋅ Matthias Fey ⋅ Reihaneh Rabbany

Well-designed open-source software drives progress in Machine Learning (ML) research. While static graph ML enjoys mature frameworks like PyTorch Geometric and DGL, ML for temporal graphs (TG), networks that evolve over time, lacks comparable infrastructure. Existing TG libraries are often tailored to specific architectures, hindering support for diverse models in this rapidly evolving field. Additionally, the divide between continuous- and discrete-time dynamic graph methods (CTDG and DTDG) limits direct comparisons and idea transfer. To address these gaps, we introduce Temporal Graph Modelling (TGM), a research-oriented library for ML on temporal graphs, the first to unify CTDG and DTDG approaches. TGM offers first-class support for dynamic node features, time-granularity conversions, and native handling of link-, node-, and graph-level tasks. Empirically, TGM achieves an average 7.8× speedup across multiple models, datasets, and tasks compared to the widely used DyGLib, and an average 175× speedup on graph discretization relative to available implementations. Beyond efficiency, we show in our experiments how TGM unlocks entirely new research possibilities by enabling dynamic graph property prediction and time-driven training paradigms, opening the door to questions previously impractical to study.

Canonicalization is a widely used strategy in equivariant machine learning, enforcing symmetry in neural networks by mapping each input to a standard form. Yet, it often introduces discontinuities that can affect stability during training, limit generalization, and complicate universal approximation theorems. In this paper, we address this by introducing adaptive canonicalization, a general framework in which the canonicalization depends both on the input and the network. Specifically, we present the adaptive canonicalization based on prior maximization, where the standard form of the input is chosen to maximize the predictive confidence of the network. We prove that this construction yields continuous and symmetry-respecting models that admit universal approximation properties. We propose two applications of our setting: (i) resolving eigenbasis ambiguities in spectral graph neural networks, and (ii) handling rotational symmetries in point clouds. We empirically validate our methods on molecular and protein classification, as well as point cloud classification tasks. Our adaptive canonicalization outperforms the three other common solutions to equivariant machine learning: data augmentation, standard canonicalization, and equivariant architectures.
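
Editor's sketch: the idea of choosing the standard form that maximizes the network's predictive confidence can be illustrated for 2D point clouds. The snippet assumes a finite set of candidate rotations (multiples of 2π/n) and uses the maximum softmax probability as the confidence score; both are simplifying assumptions, and the paper's construction and its continuity guarantees are more general. Function names are hypothetical.

```python
import numpy as np


def rotation(theta):
    """2x2 rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def adaptive_canonicalize(points, network, n_rot=8):
    """Pick the rotated copy of `points` on which `network` is most confident.

    points  : (n, 2) array, a 2D point cloud
    network : callable mapping an (n, 2) array to a vector of class logits
    Returns the chosen canonical form and the class probabilities on it.
    """
    best_conf, best_probs, best_points = -1.0, None, None
    for k in range(n_rot):
        candidate = points @ rotation(2 * np.pi * k / n_rot).T
        probs = softmax(network(candidate))
        if probs.max() > best_conf:
            best_conf, best_probs, best_points = probs.max(), probs, candidate
    return best_points, best_probs
```

Because the candidate rotations form a cyclic group, rotating the input by any of them only permutes the candidate set, so the selected prediction is unchanged. This holds for an arbitrary `network`, which does not itself need to be equivariant.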


Poster
P3-#901
Is Graph Unlearning Ready for Practice? A Benchmark on Efficiency, Utility, and Forgetting

Samyak Jain ⋅ Ronak Kalvani ⋅ sainyam galhotra ⋅ Sayan Ranu

Graph Neural Networks (GNNs) are increasingly being deployed in sensitive, user-centric applications where regulations such as the GDPR mandate the ability to remove data upon request. This has spurred interest in graph unlearning, the task of removing the influence of specific training data from a trained GNN without retraining from scratch. While several unlearning techniques have recently emerged, the field lacks a principled benchmark to assess whether these methods truly provide a practical alternative to retraining and, if so, how to choose among them for different workloads. In this work, we present the first systematic benchmark for GNN unlearning, structured around three core desiderata: *efficiency* (is unlearning faster than retraining?), *utility* (does the unlearned model preserve predictive performance and align with the retrained gold standard?), and *forgetting* (does the model genuinely eliminate the influence of removed data?). Through extensive experiments across diverse datasets and deletion scenarios, we deliver a unified assessment of existing approaches, surfacing their trade-offs and limitations. Crucially, our findings show that most unlearning techniques are not yet practical for large-scale graphs. At the same time, our benchmarking yields actionable guidelines on when unlearning can be a viable alternative to retraining and how to select among methods for different workloads, thereby charting a path for future research toward more practical, scalable, and trustworthy graph unlearning.


Poster
P3-#902
Towards Quantifying Long-Range Interactions in Graph Machine Learning: a Large Graph Dataset and a Measurement

Huidong Liang ⋅ Haitz Sáez de Ocáriz Borde ⋅ Baskaran Sripathmanathan ⋅ Michael Bronstein ⋅ Xiaowen Dong

Long-range dependencies are critical for effective graph representation learning, yet most existing datasets focus on small graphs tailored to inductive tasks, offering limited insight into long-range interactions. Current evaluations primarily compare models employing global attention (e.g., graph transformers) with those using local neighborhood aggregation (e.g., message-passing neural networks) without a direct measurement of long-range dependency. In this work, we introduce $\texttt{City-Networks}$, a novel large-scale transductive learning dataset derived from real-world city road networks. This dataset features graphs with over $10^5$ nodes and significantly larger diameters than those in existing benchmarks, naturally embodying long-range information. We annotate the graphs based on local node eccentricities, ensuring that the classification task inherently requires information from distant nodes. Furthermore, we propose a generic measurement based on the Jacobians of neighbors from distant hops, offering a principled quantification of long-range dependencies. Finally, we provide theoretical justifications for both our dataset design and the proposed measurement—particularly by focusing on over-smoothing and influence score dilution—which establishes a robust foundation for further exploration of long-range interactions in graph neural networks.
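
Editor's sketch: a Jacobian-based view of long-range dependency can be illustrated with a linear message-passing model, where the Jacobian is available in closed form. For H = Â^L X W, the Jacobian of node v's output with respect to node u's input is (Â^L)[v, u] · Wᵀ, so its magnitude is proportional to |(Â^L)[v, u]|. The snippet groups these magnitudes by hop distance. The linear model, the symmetric normalization with self-loops, and the per-node averaging are assumptions of this sketch, not the paper's exact measurement.

```python
import numpy as np


def normalized_adjacency(adj):
    """Symmetrically normalized adjacency with self-loops."""
    a = adj + np.eye(len(adj))
    d = a.sum(axis=1)
    return a / np.sqrt(np.outer(d, d))


def hop_distances(adj):
    """All-pairs hop counts via frontier expansion (-1 if unreachable)."""
    n = len(adj)
    dist = np.full((n, n), -1)
    np.fill_diagonal(dist, 0)
    reached = np.eye(n, dtype=bool)
    frontier = np.eye(n, dtype=bool)
    neighbors = (adj > 0).astype(int)
    k = 0
    while frontier.any():
        k += 1
        expanded = (frontier.astype(int) @ neighbors) > 0
        frontier = expanded & ~reached  # nodes first reached at hop k
        dist[frontier] = k
        reached |= frontier
    return dist


def influence_by_hop(adj, num_layers):
    """Average Jacobian magnitude (up to the factor ||W||) received per hop."""
    prop = np.linalg.matrix_power(normalized_adjacency(adj), num_layers)
    dist = hop_distances(adj)
    return [float(np.abs(prop[dist == k]).sum() / len(adj))
            for k in range(num_layers + 1)]
```

In this linear setting an L-layer model receives exactly zero influence from nodes more than L hops away, which is one way to see why tasks that depend on distant nodes are hard for shallow local aggregation.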


Poster
P3-#903
: One LLM Token for Explicit Graph Structural Understanding

Jingyao Wu ⋅ Bin Lu ⋅ Zijun Di ⋅ Xiaoying Gan ⋅ Meng Jin ⋅ Luoyi Fu ⋅ Xinbing Wang ⋅ Chenghu Zhou

Large language models show great potential in unstructured data understanding, but still face significant challenges with graphs due to their structural hallucination. Existing approaches mainly either verbalize graphs into natural language, which leads to excessive token consumption and scattered attention, or transform graphs into trainable continuous embeddings (i.e., soft prompts), but exhibit severe misalignment with original text tokens. To solve this problem, we propose to incorporate one special token to fully represent the **S**tructure **O**f **G**raph within a unified token space, facilitating explicit topology input and structural information sharing. Specifically, we propose a topology-aware structural tokenizer that maps each graph topology into a highly selective single token. Afterwards, we construct a set of hybrid structure Question-Answering corpora to align new structural tokens with existing text tokens. This approach empowers LLMs to understand, generate, and reason in a concise and accurate manner. Extensive experiments on five graph-level benchmarks demonstrate the superiority of our method, achieving a performance improvement of 9.9–41.4% compared to the baselines while exhibiting interpretability and consistency. Furthermore, our method provides a flexible extension to node-level tasks, enabling both global and local structural understanding. The code is publicly available at https://anonymous.4open.science/r/SOG-8432.


Poster
P3-#904
Differentiable Lifting for Topological Neural Networks

Jorge Franco ⋅ Gabriel Duarte ⋅ Alexander Nikitin ⋅ Moacir Ponti ⋅ Diego Mesquita ⋅ Amauri Souza

Topological neural networks (TNNs) enable leveraging higher-order structures on graphs (e.g., cycles and cliques) to boost the expressive power of message-passing neural networks. However, these structures are typically identified a priori through an unsupervised graph lifting operation. This choice is nonetheless crucial and may have a drastic impact on a TNN's performance on downstream tasks. To circumvent this issue, we propose ∂lift (DiffLift), a general framework for learning graph liftings to hypergraphs and cellular, simplicial, and combinatorial complexes in an end-to-end fashion. In particular, our approach leverages learned vertex-level latent representations to identify and parameterize distributions over candidate higher-order cells for inclusion. This results in a scalable model which can be readily integrated into any TNN. Our experiments show that ∂lift outperforms existing lifting methods on multiple benchmarks for graph and node classification across different TNN architectures, with TNN+∂lift combinations surpassing standard GNN baselines. Notably, our approach leads to gains of up to 45% over static liftings, including both connectivity- and feature-based ones.


Poster
P3-#905
Temporal Graph Thumbnail: Robust Representation Learning with Global Evolutionary Skeleton

Weining Shi ⋅ Zhisen Wen ⋅ Qinggang Zhang ⋅ Chentao Zhang ⋅ Zhihong Zhang

Temporal graphs are commonly employed as conceptual models for capturing time-evolving interactions in real-world systems. Representation learning on such non-Euclidean data typically depends on aggregating information from neighbors, and the presence of temporal dynamics further complicates this process. However, neighbors often contain noisy information in practice, which makes knowledge propagation unreliable and may even lead to model failure. Although existing methods employ adaptive spatiotemporal neighbor sampling strategies or temporal dependency modeling frameworks to enhance model robustness, their constrained sampling scope limits handling of severe noise and long-term dependencies. This limitation can be attributed to a fundamental cause: neglecting global evolution inherently overlooks the temporal regularities encoded in continuous dynamics. To address this, we propose the Temporal Graph Thumbnail (TGT), encapsulating a temporal graph’s global evolutionary skeleton as a thumbnail to characterize temporal regularities and enhance model robustness. Specifically, we model the thumbnail by leveraging von Neumann graph entropy and node mutual information to extract the essential evolutionary skeleton from the raw temporal graph, and subsequently use it to guide optimization for model learning. In addition to rigorous theoretical derivation, extensive experiments demonstrate that TGT achieves superior capability and robustness compared to baselines, particularly in rapidly evolving and noisy environments. The code is available at https://anonymous.4open.science/r/TGT-BDF2.
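
Editor's sketch: von Neumann graph entropy, one of the two quantities mentioned above, has a short standard definition. The combinatorial Laplacian is scaled to unit trace so it can be treated as a density matrix, and the Shannon entropy of its eigenvalues is taken. The snippet covers only this static quantity for a single undirected snapshot with at least one edge; how the paper combines it with node mutual information over time is not shown. The function name is hypothetical.

```python
import numpy as np


def von_neumann_entropy(adj):
    """Von Neumann entropy of an undirected graph with at least one edge.

    adj : symmetric (n, n) adjacency matrix with non-negative weights.
    """
    adj = np.asarray(adj, dtype=float)
    laplacian = np.diag(adj.sum(axis=1)) - adj
    rho = laplacian / np.trace(laplacian)  # unit-trace "density matrix"
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]  # treat 0 * log(0) as 0
    return float(-(eigvals * np.log(eigvals)).sum())
```

For the complete graph on n nodes the scaled Laplacian has n - 1 equal nonzero eigenvalues, so the entropy is log(n - 1), which is the maximum for n nodes; sparser, more irregular structures such as a path score lower.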


Poster
P3-#906
Paradigm Shift of GNN Explainer from Label Space to Prototypical Representation Space

Jun Yin ⋅ Senzhang Wang ⋅ Ziluowen Luo ⋅ Peng Huo ⋅ Hao Yan ⋅ Hao Miao ⋅ Chaozhuo Li ⋅ Shirui Pan ⋅ Chengqi Zhang

Post-hoc instance-level graph neural network (GNN) explainers are developed to identify a compact subgraph (i.e., explanation) that encompasses the most influential components for each input graph. A fundamental limitation of existing methods lies in the insufficient utilization of structural information during GNN explainer optimization. They typically optimize the explainer by aligning the GNN predictions of input graph and its explanation in the graph label space which inherently lacks expressiveness to describe various graph structures. Motivated by the powerful structural expression ability of vectorized graph representations, we for the first time propose to shift the GNN explainer optimization from the graph label space to the graph representation space. However, the paradigm shift is challenging due to both the entanglement between the explanatory and non-explanatory substructures, and the distributional discrepancy between the input graph and the explanation subgraph. To this end, we meticulously design IDEA, a universal dual-stage optimization framework grounded in a prototypical graph representation space, which can generalize across diverse existing GNN explainer architectures. Specifically, in the Structural Information Disentanglement stage, a graph tokenizer equipped with a structure-aware disentanglement objective is designed to disentangle the explanatory substructures and encapsulate them into explanatory prototypes. In the Explanatory Prototype Alignment stage, IDEA aligns the representational distributions of the input graph and its explanation unified in the prototypical representation space, to optimize the GNN explainer. Comprehensive experiments on real-world and synthetic datasets demonstrate the effectiveness of IDEA, with the average improvements of ROC-AUC by 4.45% and precision by 48.71%. We further integrate IDEA with diverse explainer architectures and achieve an improvement by up to 10.70%, which verifies its generalizability.


Poster
P3-#907
A Scalable Inter-edge Correlation Modeling in CopulaGNN for Link Sign Prediction

Jinkyu Sung ⋅ Myunggeum Jee ⋅ Joonseok Lee

Link sign prediction on a signed graph is a task to determine whether the relationship represented by an edge is positive or negative. Since the presence of negative edges violates the graph homophily assumption that adjacent nodes are similar, regular graph methods have not been applicable without auxiliary structures to handle them. We aim to directly model the latent statistical dependency among edges with the Gaussian copula and its corresponding correlation matrix, extending CopulaGNN (Ma et al., 2021). However, a naive modeling of edge-edge relations is computationally intractable even for a graph with moderate scale. To address this, we propose to 1) represent the correlation matrix as a Gramian of edge embeddings, significantly reducing the number of parameters, and 2) reformulate the conditional probability distribution to dramatically reduce the inference cost. We theoretically verify the scalability of our method by proving its linear convergence. Also, our extensive experiments demonstrate that it achieves significantly faster convergence than baselines, while maintaining prediction performance competitive with state-of-the-art models.
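
Editor's sketch: representing a correlation matrix as a Gramian of embeddings can be shown in a few lines. If each edge has a d-dimensional embedding scaled to unit norm, the matrix of pairwise inner products is symmetric, positive semidefinite, and has a unit diagonal, so it is a valid correlation matrix parameterized by m·d numbers instead of m(m-1)/2. The small identity blend used to make the matrix strictly positive definite is an assumption of this sketch, not something stated in the abstract, and the function name is hypothetical.

```python
import numpy as np


def gramian_correlation(edge_embeddings, jitter=1e-6):
    """Build a valid correlation matrix from per-edge embeddings.

    edge_embeddings : (m, d) array with nonzero rows, one row per edge.
    jitter          : weight of the identity blend that keeps the result
                      strictly positive definite (sketch assumption).
    """
    norms = np.linalg.norm(edge_embeddings, axis=1, keepdims=True)
    z = edge_embeddings / norms          # unit-norm rows
    gram = z @ z.T                       # PSD with ones on the diagonal
    return (1.0 - jitter) * gram + jitter * np.eye(len(z))
```

With m = 50 edges and d = 4, for example, the embeddings hold 200 parameters, whereas a free correlation matrix would need 1225 off-diagonal entries.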


Poster
P3-#908
Rethinking the Gold Standard: Why Discrete Curvature Fails to Fully Capture Over-squashing in GNNs?

Jialong Chen ⋅ Bowen Deng ⋅ Zibin Zheng ⋅ Chuan Chen

As a topological invariant for discrete structures, discrete curvature has been widely adopted in the study of complex networks and graph neural networks. A prevailing viewpoint posits that edges with highly negative curvature will induce graph bottlenecks and the over-squashing phenomenon. In this paper, we critically re-examine this view and put forward our central claim: **high negative curvature is a sufficient but not a necessary condition for over-squashing**. We first construct a family of counterexamples demonstrating the failure of discrete curvature, where some edges are severely squashed, but the curvature still appears positive. Furthermore, extensive experiments demonstrate that the most commonly used discrete curvature measure, Ollivier–Ricci curvature, fails to detect as many as 30% to 40% of over-squashed edges. To alleviate this limitation, we propose Weighted Augmented Forman-3 Curvature ($\mathsf{WAF3}$), which significantly improves the detection of over-squashed edges. Additionally, we develop a highly efficient approximation algorithm for $\mathsf{WAF3}$, enabling curvature computation on graphs with five million edges in only 23.6 seconds, which is 133.7 times faster than the existing algorithm with the lowest complexity for curvatures.
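
Editor's sketch: for readers unfamiliar with discrete curvature, the snippet below computes the standard augmented Forman curvature of an edge in an unweighted graph, 4 - deg(u) - deg(v) + 3 · (number of triangles containing the edge). This is a well-known baseline curvature and not the $\mathsf{WAF3}$ measure proposed above, whose definition the abstract does not give. Function names are hypothetical.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def augmented_forman(adj, u, v):
    """Augmented Forman curvature of edge (u, v) in an unweighted graph."""
    triangles = len(adj[u] & adj[v])  # common neighbors close a triangle
    return 4 - len(adj[u]) - len(adj[v]) + 3 * triangles


# Two triangles joined by a single bridge edge (2, 3).
barbell = build_adjacency(
    [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
)
bridge_curvature = augmented_forman(barbell, 2, 3)
triangle_curvature = augmented_forman(barbell, 0, 1)
```

In this toy graph the bridge is the bottleneck and receives negative curvature, while edges inside a triangle receive positive curvature, matching the prevailing view that the abstract re-examines.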

After a renaissance phase in which researchers revisited the message-passing paradigm through the lens of deep learning, the graph machine learning community shifted its attention towards a deeper and more practical understanding of message-passing's benefits and limitations. In this paper, we notice how the fast pace of progress around the topics of oversmoothing and oversquashing, the homophily-heterophily dichotomy, and long-range tasks came with the consolidation of commonly accepted beliefs and assumptions (in the form of universal statements) that are neither always true nor easy to distinguish from each other. We argue that this has led to ambiguities around the investigated problems, preventing researchers from focusing on and addressing precise research questions while causing a good deal of misunderstanding. Our contribution is to make such common beliefs explicit and encourage critical thinking around these topics, refuting universal statements via simple yet formally sufficient counterexamples. The end goal is to clarify conceptual differences, helping researchers address more clearly defined and targeted problems. The hope is to clarify the distinction between the different issues and promote separate but intertwined research directions to address them.


Poster
P3-#910
A Graph Meta-Network for Learning on Kolmogorov–Arnold Networks

Guy Bar-Shalom ⋅ Ami Tavory ⋅ Itay Evron ⋅ Maya Bechler-Speicher ⋅ Ido Guy ⋅ Haggai Maron

Weight-space models learn directly from the parameters of neural networks, enabling tasks such as predicting their accuracy on new datasets. Naive methods, like applying MLPs to flattened parameters, perform poorly, making the design of better weight-space architectures a central challenge. While prior work leveraged permutation symmetries in standard networks to guide such designs, no analogous analysis or tailored architecture yet exists for Kolmogorov–Arnold Networks (KANs). In this work, we show that KANs share the same permutation symmetries as MLPs, and propose the KAN-graph, a graph representation of their computation. Building on this, we develop WS-KAN, the first weight-space architecture that learns on KANs, which naturally accounts for their symmetry. We analyze WS-KAN’s expressive power, showing it can replicate an input KAN’s forward pass, a standard approach for assessing expressiveness in weight-space architectures. We construct a comprehensive "zoo" of trained KANs spanning diverse tasks, which we use as benchmarks to empirically evaluate WS-KAN. Across all tasks, WS-KAN consistently outperforms structure-agnostic baselines, often by a substantial margin.


Poster
P3-#911
Federated Graph-Level Clustering Network with Dual Knowledge Separation

Xiaobao Wang ⋅ Renda Han ⋅ Ronghao Fu ⋅ Di Jin

Federated Graph-level Clustering (FGC) offers a promising framework for analyzing distributed graph data while ensuring privacy protection. However, existing methods fail to simultaneously consider intra-client and inter-client knowledge heterogeneity, and still attempt to share as much knowledge as possible, resulting in consensus failure on the server. To solve these issues, we propose a novel Federated Graph-level Clustering Network with Dual Knowledge Separation (FGCN-DKS). The core idea is to decouple differentiated subgraph patterns and optimize them separately on the client, and then leverage cluster-oriented patterns to guide personalized knowledge aggregation on the server. Specifically, on the client, we separate personalized subgraphs and cluster-oriented subgraphs for each graph. Then the former are retained locally for further refinement of the clustering process, while pattern digests are extracted from the latter for uploading to the server. On the server, we calculate the relation of inter-cluster patterns to adaptively aggregate cluster-oriented prototypes and parameters. Finally, the server generates personalized guidance signals for each cluster of clients, which are then fed back to local clients to enhance overall clustering performance. Extensive experiments on multiple graph benchmark datasets demonstrate the superiority of the proposed FGCN-DKS over the SOTA methods.


Poster
P3-#912
The Logical Expressiveness of Topological Neural Networks

Amirreza Akbari ⋅ Amauri Souza ⋅ Vikas Garg

Graph neural networks (GNNs) are the standard for learning on graphs, yet they have limited expressive power, often expressed in terms of the Weisfeiler-Leman (WL) hierarchy or within the framework of first-order logic. In this context, topological neural networks (TNNs) have recently emerged as a promising alternative for graph representation learning. By incorporating higher-order relational structures into message-passing schemes, TNNs offer higher representational power than traditional GNNs. However, a fundamental question remains open: _what is the logical expressiveness of TNNs?_ Answering this allows us to characterize precisely which binary classifiers TNNs can represent. In this paper, we address this question by analyzing isomorphism tests derived from the underlying mechanisms of general TNNs. We introduce and investigate the power of higher-order variants of WL-based tests for combinatorial complexes, called the $k$-CCWL test. In addition, we introduce the topological counting logic $\text{TC}_{k}$, an extension of standard counting logic featuring a novel pairwise counting quantifier $\exists^{N}(x_i,x_j)\, \varphi(x_i,x_j)$, which explicitly quantifies pairs $(x_i, x_j)$ satisfying property $\varphi$. We rigorously prove the exact equivalence: $k\text{-CCWL} \equiv \text{TC}_{k{+}2} \equiv \text{Topological }(k{+}2)\text{-pebble game}$. These results establish a logical expressiveness theory for TNNs.


Poster
P3-#913
Relational Graph Transformer

Vijay Prakash Dwivedi ⋅ Sri Jaladi ⋅ Yangyi Shen ⋅ Federico Lopez ⋅ Charilaos Kanatsoulis ⋅ Rishi Puri ⋅ Matthias Fey ⋅ Jure Leskovec

Relational Deep Learning (RDL) is a promising approach for building state-of-the-art predictive models on multi-table relational data by representing it as a heterogeneous temporal graph. However, commonly used Graph Neural Network models suffer from fundamental limitations in capturing complex structural patterns and long-range dependencies that are inherent in relational data. While Graph Transformers have emerged as powerful alternatives to GNNs on general graphs, applying them to relational entity graphs presents unique challenges: (i) Traditional positional encodings fail to generalize to massive, heterogeneous graphs; (ii) existing architectures cannot model the temporal dynamics and schema constraints of relational data; (iii) existing tokenization schemes lose critical structural information. Here we introduce the Relational Graph Transformer (RelGT), the first graph transformer architecture designed specifically for relational tables. RelGT employs a novel multi-element tokenization strategy that decomposes each node into five components (features, type, hop distance, time, and local structure), enabling efficient encoding of heterogeneity, temporality, and topology without expensive precomputation. Our architecture combines local attention over sampled subgraphs with global attention to learnable centroids, incorporating both local and database-wide representations. Across 21 tasks from the RelBench benchmark, RelGT consistently matches or outperforms GNN baselines by up to 18%, establishing Graph Transformers as a powerful architecture for Relational Deep Learning.


Journal Track Poster
P3-#914
Adaptive Mesh Quantization for Neural PDE Solvers

Winfried van den Dool ⋅ Maksim Zhdanov ⋅ Yuki M. Asano ⋅ Max Welling

Physical systems commonly exhibit spatially varying complexity, presenting a significant challenge for neural PDE solvers. While Graph Neural Networks can handle the irregular meshes required for complex geometries and boundary conditions, they still apply uniform computational effort across all nodes regardless of the underlying physics complexity. This leads to inefficient resource allocation where computationally simple regions receive the same treatment as complex phenomena. We address this challenge by introducing Adaptive Mesh Quantization: spatially adaptive quantization across mesh node, edge and cluster features, dynamically adjusting the bit-width used by a quantized model. We propose an adaptive bit-width allocation strategy driven by a lightweight auxiliary model that identifies high-loss regions in the input mesh. This enables dynamic resource distribution in the main model, where regions of higher difficulty are allocated increased bit-width, optimizing computational resource utilization. We demonstrate our framework's effectiveness by integrating it with two state-of-the-art models, MP-PDE and GraphViT, to evaluate performance across multiple tasks: 2D Darcy flow, large-scale unsteady fluid dynamics in 2D, steady-state Navier–Stokes simulations in 3D, and a 2D hyper-elasticity problem. Our framework demonstrates consistent Pareto improvements over uniformly quantized baselines, yielding up to 50% improvements in performance at the same cost.
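
Editor's sketch: the allocation idea above (more bits where the mesh is harder) can be illustrated with uniform min-max quantization of node features. In the paper the difficulty signal comes from a lightweight auxiliary model that identifies high-loss regions; here it is simply an input array. The two-level bit choice (4 vs. 8 bits), the top-fraction threshold, and all function names are assumptions of this sketch.

```python
import numpy as np


def quantize(x, bits):
    """Uniform min-max quantization of x to 2**bits levels (dequantized)."""
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    scale = (hi - lo) / (2 ** bits - 1)
    return np.round((x - lo) / scale) * scale + lo


def allocate_bits(difficulty, low_bits=4, high_bits=8, top_fraction=0.25):
    """Give `high_bits` to the hardest `top_fraction` of nodes."""
    threshold = np.quantile(difficulty, 1.0 - top_fraction)
    return np.where(difficulty >= threshold, high_bits, low_bits)


def adaptive_quantize(features, difficulty, **kwargs):
    """Quantize each node's feature row at its allocated bit-width.

    features   : (n_nodes, n_features) array
    difficulty : (n_nodes,) per-node difficulty scores (higher = harder)
    """
    bits = allocate_bits(difficulty, **kwargs)
    out = np.empty_like(features, dtype=float)
    for b in np.unique(bits):
        mask = bits == b
        out[mask] = quantize(features[mask], int(b))
    return out, bits
```

Nodes flagged as difficult keep finer resolution and therefore a smaller quantization error, while the rest are stored more coarsely.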


Poster
P3-#915
Bilateral Information-aware Test-time Adaptation for Vision-Language Models

Jingwei Sun ⋅ Jianing ZHU ⋅ Jiangchao Yao ⋅ Gang Niu ⋅ Masashi Sugiyama ⋅ Bo Han

Test-time adaptation (TTA) fine-tunes models using new data encountered during inference, which enables vision-language models to handle test data with covariate shifts. Unlike training-time adaptation, TTA does not require a test-distributed validation set or consider the worst-case distribution within a given tolerance. However, previous methods primarily focused on adaptation-objective design, while the data tend to be fully utilized or simply filtered through a fixed low-entropy selection criterion. In this paper, we analyze the weakness of the previous selection criterion and find that selecting only a fixed proportion of low-entropy samples fails to ensure optimal performance across various datasets and can lead the model to become over-confident in wrongly classified samples, showing unexpected overfitting to atypical features and compromising effective adaptation. To improve upon this, we propose *Bilateral Information-aware Test-Time Adaptation* (BITTA), which simultaneously leverages two distinct parts of the test inputs during adaptation. Specifically, a dynamic proportion of low-entropy samples is used to learn the core representation under covariate shifts, while high-entropy samples are adopted to unlearn atypical features. This dual approach prevents the model from undesired memorization and ensures extensive optimal performance. Comprehensive experiments validate the effectiveness in various datasets and model architectures. The code is publicly available at: https://github.com/tmlr-group/BITTA.
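
Editor's sketch: the split of a test batch into low-entropy and high-entropy parts can be written in a few lines. The snippet computes the Shannon entropy of each predicted class distribution and partitions the batch at a quantile. The proportion is a plain argument here, whereas the method above chooses it dynamically; how the two parts enter the adaptation objective is not shown. Function names are hypothetical.

```python
import numpy as np


def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy of each row of class probabilities."""
    p = np.clip(probs, eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)


def split_by_entropy(probs, low_fraction):
    """Indices of low-entropy and high-entropy samples in a batch.

    probs        : (batch, classes) predicted probabilities
    low_fraction : fraction of the batch treated as low-entropy
    """
    entropy = prediction_entropy(probs)
    threshold = np.quantile(entropy, low_fraction)
    low = np.where(entropy <= threshold)[0]
    high = np.where(entropy > threshold)[0]
    return low, high
```

Confident predictions land in the first group and near-uniform predictions in the second, so the two groups can be given different roles during adaptation.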


Poster
P3-#916
Dual Randomized Smoothing: Beyond Global Noise Variance

Chenhao Sun ⋅ Yuhao Mao ⋅ Martin Vechev

Randomized Smoothing (RS) is a prominent technique for certifying the robustness of neural networks against adversarial perturbations. With RS, achieving high accuracy at small radii requires a small noise variance, while achieving high accuracy at large radii requires a large noise variance. However, the global noise variance used in the standard RS formulation leads to a fundamental limitation: there exists no global noise variance that simultaneously achieves strong performance at both small and large radii. To break through the global variance limitation, we propose a dual RS framework which enables input-dependent noise variances. To achieve that, we first prove that RS remains valid with input-dependent noise variances, provided the variance is locally constant around each input. Building on this result, we introduce two components which form our dual RS framework: (i) a variance estimator first predicts an optimal noise variance for each input, (ii) this estimated variance is then used by a standard RS classifier. The variance estimator is independently smoothed via RS to ensure local constancy, enabling flexible design. We also introduce efficient training strategies to iteratively optimize the two components involved in the framework. Extensive experiments on the CIFAR-10 dataset demonstrate that our dual RS method provides strong performance for both small and large radii (unattainable with global noise variance) while incurring only a 60% computational overhead at inference. Moreover, it consistently outperforms prior input-dependent noise approaches across most radii, with particularly large gains at radii 0.5, 0.75, and 1.0, achieving relative improvements of 15.6%, 20.0%, and 15.7%, respectively. On ImageNet, dual RS remains effective across all radii, with 8.6%, 17.1% and 9.1% performance advantages at radii 0.5, 1.0 and 1.5 respectively. Additionally, the proposed dual RS framework naturally provides a routing perspective for certified robustness, improving the accuracy-robustness trade-off with off-the-shelf expert RS models. Our code is available at https://github.com/eth-sri/Dual-Randomized-Smoothing.
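
To make the role of the noise level concrete, the sketch below computes the widely used L2 certificate for Gaussian smoothing with a single global standard deviation, radius = sigma * Phi^{-1}(p_lower). It shows only that standard global-variance setting, not the dual RS framework; the function name `certified_radius` and its arguments are illustrative assumptions.

```python
from statistics import NormalDist


def certified_radius(p_lower, sigma):
    """L2 certified radius sigma * Phi^{-1}(p_lower) for Gaussian smoothing.

    p_lower: lower confidence bound on the probability that the base
             classifier returns the top class under N(0, sigma^2 I) noise.
    sigma:   global noise standard deviation.
    Returns 0.0 (no certificate) when p_lower does not exceed 0.5.
    """
    if not 0.0 < p_lower < 1.0:
        raise ValueError("p_lower must lie strictly between 0 and 1")
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)
```

For a fixed `p_lower` the radius grows linearly with `sigma`, but in practice `p_lower` tends to fall as `sigma` grows, which is the small-radius versus large-radius tension the abstract describes.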


Poster
P3-#917
Contamination Detection for VLMs Using Multi‑Modal Semantic Perturbations

Jaden Park ⋅ Mu Cai ⋅ Feng Yao ⋅ Jingbo Shang ⋅ Soochahn Lee ⋅ Yong Jae Lee

Recent advances in Vision–Language Models (VLMs) have achieved state-of-the-art performance on numerous benchmark tasks. However, the use of internet-scale, often proprietary, pretraining corpora raises a critical concern for both practitioners and users: inflated performance due to test-set leakage. While prior works have proposed mitigation strategies such as decontamination of pretraining data and benchmark redesign for LLMs, the complementary direction of developing detection methods for contaminated VLMs remains underexplored. To address this gap, we deliberately contaminate open-source VLMs on popular benchmarks and show that existing detection approaches either fail outright or exhibit inconsistent behavior. We then propose a simple yet effective detection method based on multi-modal semantic perturbation, demonstrating that contaminated models fail to generalize under controlled perturbations. Finally, we validate our approach across multiple realistic contamination strategies, confirming its robustness and effectiveness. The code and perturbed dataset are released here: https://github.com/jadenpark0/mm-perturb.


Poster
P3-#918
Are Reasoning LLMs Robust to Interventions on their Chain-of-Thought?

Alexander von Recum ⋅ Leander Girrbach ⋅ Zeynep Akata

Reasoning LLMs (RLLMs) generate step-by-step chains of thought (CoTs) before giving an answer, which improves performance on complex tasks and makes reasoning transparent. But how robust are these reasoning traces to disruptions that occur within them? To address this question, we introduce a controlled evaluation framework that perturbs a model’s own CoT at fixed timesteps. We design seven interventions (benign, neutral, and adversarial) and apply them to multiple open-weight RLLMs across MATH, SCIENCE, and LOGIC tasks. Our results show that RLLMs are generally robust, reliably recovering from diverse perturbations, with robustness improving with model size and degrading when interventions occur early. However, robustness is not style-invariant: paraphrasing suppresses doubt-like expressions and reduces performance, while other interventions trigger doubt and support recovery. Recovery also carries a cost: neutral and adversarial noise can inflate CoT length by more than 200%, whereas paraphrasing shortens traces but harms accuracy. These findings provide new evidence on how RLLMs maintain reasoning integrity, identify doubt as a central recovery mechanism, and highlight trade-offs between robustness and efficiency that future training methods should address.

Bridging the gap between the formal precision of system specifications and the nuances of human language is critical for reliable engineering, robotics, and AI safety, but it remains a major bottleneck. Prior efforts in grounding formal logic remain fragmented, resulting in datasets that are small-scale (~2-5k examples), domain-specific, or that translate logic into overly technical forms rather than context-rich natural language (NL), and thus fail to adequately bridge formal methods and practical NLP. To address this gap, we introduce VERIFY, the first large-scale dataset meticulously designed to unify these elements. This dataset contains more than 200k rigorously generated triplets, each comprising a Linear Temporal Logic (LTL) formula, a structured, human-readable 'Intermediate Technical Language' (ITL) representation designed as a bridge between logic and text, and a domain-specific NL description contextualized across 13 diverse domains. VERIFY's construction pipeline ensures high fidelity: LTL formulas are enumerated and verified via model checking, mapped to the novel ITL representation using a provably complete formal grammar, and then translated into context-aware NL via LLM-driven generation. We guarantee data quality through extensive validation protocols, i.e., manual expert verification of 10,000 diverse samples. Furthermore, automated semantic consistency checks judged by Llama 3.3 confirmed an estimated >97% semantic correctness. From the initial experiments, we demonstrate VERIFY's scalability, logical complexity, and contextual diversity, significantly challenging standard models such as T5 and Llama 3.


Poster
P3-#920
Spurious Correlation-Aware Embedding Regularization for Worst-Group Robustness

Subeen Park ⋅ JOOWANG KIM ⋅ Hakyung Lee ⋅ Sunjae yoo ⋅ Kyungwoo Song

Deep learning models achieve strong performance across various domains but often rely on spurious correlations, making them vulnerable to distribution shifts. This issue is particularly severe in subpopulation shift scenarios, where models struggle in underrepresented groups. While existing methods have made progress in mitigating this issue, their performance gains are still constrained. They lack a theoretical motivation connecting the embedding space representations with worst-group error. To address this limitation, we propose Spurious Correlation-Aware Embedding Regularization for Worst-Group Robustness (SCER), a novel approach that directly regularizes feature representations to suppress spurious cues. We theoretically show that worst-group error is influenced by how strongly the classifier relies on spurious versus core directions, as identified from differences in group-wise mean embeddings across domains and classes. By imposing theoretical constraints at the embedding level, SCER encourages models to focus on core features while reducing sensitivity to spurious patterns. Through systematic evaluation on multiple vision and language tasks, we show that SCER outperforms prior state-of-the-art methods in worst-group accuracy. Our code is available at https://github.com/MLAI-Yonsei/SCER.


Poster
P3-#921
Robust Deep Reinforcement Learning against Adversarial Behavior Manipulation

Shojiro Yamabe ⋅ Kazuto Fukuchi ⋅ Jun Sakuma

This study investigates behavior-targeted attacks on reinforcement learning and their countermeasures. Behavior-targeted attacks aim to manipulate the victim's behavior as desired by the adversary through adversarial interventions in state observations. Existing behavior-targeted attacks have some limitations, such as requiring white-box access to the victim's policy. To address this, we propose a novel attack method using imitation learning from adversarial demonstrations, which works under limited access to the victim's policy and is environment-agnostic. In addition, our theoretical analysis proves that the policy's sensitivity to state changes impacts defense performance, particularly in the early stages of the trajectory. Based on this insight, we propose time-discounted regularization, which enhances robustness against attacks while maintaining task performance. To the best of our knowledge, this is the first defense strategy specifically designed for behavior-targeted attacks.


Poster
P3-#922
Capability-Based Scaling Trends for LLM-Based Red-Teaming

Alexander Panfilov ⋅ Paul Kassianik ⋅ Maksym Andriushchenko ⋅ Jonas Geiping

As large language models grow in capability and agency, identifying vulnerabilities through red-teaming becomes vital for safe deployment. However, traditional prompt-engineering approaches may prove ineffective once red-teaming turns into a weak-to-strong problem, where target models surpass red-teamers in capabilities. To study this shift, we frame red-teaming through the lens of the capability gap between attacker and target. We evaluate more than 600 attacker-target pairs using LLM-based jailbreak attacks that mimic human red-teamers across diverse families, sizes, and capability levels. Three strong trends emerge: (i) more capable models are better attackers, (ii) attack success drops sharply once the target's capability exceeds the attacker's, and (iii) attack success rates correlate with high performance on social science splits of the MMLU-Pro benchmark. From these observations, we derive a jailbreaking scaling curve that predicts attack success for a fixed target based on attacker-target capability gap. These findings suggest that fixed-capability attackers (e.g., humans) may become ineffective against future models, increasingly capable open-source models amplify risks for existing systems, and model providers must accurately measure and control models' persuasive and manipulative abilities to limit their effectiveness as attackers.

Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to training instability. While previous work has studied different scaling regimes for transformers, an asymptotically exact, down-to-the-constant prescription for how to initialise transformers has so far been lacking. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation, skip connections and MLPs. Our theory yields a simple algorithm to compute trainability diagrams that identify the correct choice of initialisation hyper-parameters for a given architecture. We overcome the key challenge, an exact treatment of the self-attention layer, by establishing a formal parallel with the Random Energy Model from statistical physics. We also analyse gradients in the backward path and determine the regime where gradients vanish at initialisation. We demonstrate the versatility of our framework through three case studies. Our theoretical framework gives a unified perspective on the two failure modes of self-attention and gives quantitative predictions on the scale of both weights and residual connections that guarantee smooth training.

For overparameterized linear regression with isotropic Gaussian design and the minimum-$\ell_p$ interpolator with $p\in(1,2]$, we give a unified, high-probability characterization for the scaling of the family of parameter norms $\{ \lVert \widehat{w_p} \rVert_r \}_{r \in [1,p]}$ with sample size. We solve this basic but unresolved question through a simple dual-ray analysis, which reveals a competition between a signal spike and a bulk of null coordinates in $X^\top Y$, yielding closed-form predictions for (i) a data-dependent transition $n_\star$ (the "elbow"), and (ii) a universal threshold $r_\star=2(p-1)$ that separates the norms $\lVert \widehat{w_p} \rVert_r$ that plateau from those that continue to grow with an explicit exponent. This unified solution resolves the scaling of all $\ell_r$ norms within the family $r\in [1,p]$ under $\ell_p$-biased interpolation, and explains in one picture which norms saturate and which increase as $n$ grows. We then study diagonal linear networks (DLNs) trained by gradient descent. By calibrating the initialization scale $\alpha$ to an effective $p_{\mathrm{eff}}(\alpha)$ via the DLN separable potential, we show empirically that DLNs inherit the same elbow/threshold laws, providing a predictive bridge between explicit and implicit bias. Given that many generalization proxies depend on $\lVert \widehat{w_p} \rVert_r$, our results suggest that their predictive power will depend sensitively on which $\ell_r$ norm is used.

Local loss geometry in machine learning is inherently a two-operator concept. While a single loss is locally characterized by its Hessian spectrum, practical learning depends on both training and test losses, whose joint geometry is determined not only by their spectra but by the alignment of their eigenspaces. We establish general foundations for this two-loss geometry by deriving a universal local fluctuation law: the expected test-loss increment under small training perturbations is a trace combining train and test spectral data with a precise factor quantifying eigenvector overlap. We further prove a transfer law describing how overlaps transform under noise. As a solvable model, we apply these results to ridge regression under arbitrary covariate shift, where operator-valued free probability yields asymptotically exact overlap decompositions that identify overlaps as the natural quantities for specifying shift, and resolve multiple descent: error peaks are governed by eigenspace misalignment rather than Hessian ill-conditioning alone. We then validate the fluctuation law in multilayer perceptrons, develop scalable estimators for overlap functionals based on subspace iteration and kernel polynomial methods, and apply them to a ResNet-20 trained on CIFAR-10, showing that class imbalance reshapes train–test geometry through induced misalignment. Together, these results establish eigenvector overlaps as the fundamental missing ingredient in local loss geometry, providing both theoretical foundations and practical tools for analyzing generalization in modern neural networks.


Poster
P3-#926
Convergence Dynamics of Over-Parameterized Score Matching for a Single Gaussian

Yiran Zhang ⋅ Weihang Xu ⋅ Mo Zhou ⋅ Maryam Fazel ⋅ Simon Du

Score matching has become a central training objective in modern generative modeling, particularly in diffusion models, where it is used to learn high-dimensional data distributions through the estimation of score functions. Despite its empirical success, the theoretical understanding of the optimization behavior of score matching, particularly in over-parameterized regimes, remains limited. In this work, we study gradient descent for training over-parameterized models to learn a single Gaussian distribution. Specifically, we use a student model with $n$ learnable parameters, motivated by the structure of a Gaussian mixture model, and train it on data generated from a single ground-truth Gaussian using the population score matching objective. We analyze the optimization dynamics under multiple regimes. When the noise scale is sufficiently large, we prove a global convergence result for gradient descent, which resembles the known behavior of gradient EM in over-parameterized settings. In the low-noise regime, we identify the existence of a stationary point, highlighting the difficulty of proving global convergence in this case. Nevertheless, we show convergence under certain initialization conditions: when the parameters are initialized to be exponentially small, gradient descent ensures convergence of all parameters to the ground truth. We further give an example where, without the exponentially small initialization, the parameters may not converge to the ground truth. Finally, we consider the case of random initialization, where parameters are sampled from a Gaussian distribution far from the ground truth. We prove that, with high probability, only one parameter converges while the others diverge to infinity, yet the loss still converges to zero with a $1/\tau$ rate, where $\tau$ is the number of iterations. We also establish a nearly matching lower bound on the convergence rate in this regime. This is the first work to establish global convergence guarantees for Gaussian mixtures with at least three components under the score matching framework.


Poster
P3-#1026
Memorizing Long-tail Data Can Help Generalization Through Composition

Mo Zhou ⋅ Haoyang Ma ⋅ Rong Ge

Deep learning has led researchers to rethink the relationship between memorization and generalization. In many settings, memorization does not hurt generalization due to implicit regularization and may help by memorizing long-tailed examples. In this paper, we consider the synergy between memorization and simple composition, i.e., the ability to make correct predictions on a combination of long-tailed features. Theoretically, we show that for a linear setting, memorization together with composition can help the model make correct predictions on rare test examples that require a combination of long-tailed features, even if such combinations were never observed in the training data. Experiments with neural network architectures on simple data show that the theoretical insight extends beyond the linear setting, and we further observe that the composition capability of the model depends on its architecture.

The high-dimensional parameter space of deep neural networks (the neuromanifold) is endowed with a unique metric tensor defined by the Fisher information. Reliable and scalable computation of this metric tensor is valuable for theorists and practitioners. Focusing on neural classifiers, we return to a low-dimensional space of probability distributions, which we call the core space, and examine the spectrum and envelopes of its Fisher information matrix. We extend our discoveries there to deterministic bounds for the metric tensor on the neuromanifold. We introduce an unbiased random estimator based on Hutchinson's trace method and derive related bounds. It can be evaluated efficiently with a single backward pass per batch, with a standard deviation bounded by the true value up to scaling.
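
For readers unfamiliar with Hutchinson's trace method mentioned above, a generic sketch follows. It estimates the trace of any square matrix from matrix-vector products alone; it is not the paper's Fisher-information estimator, and the names `hutchinson_trace` and `dense_matvec` are hypothetical helpers introduced for illustration.

```python
import random


def hutchinson_trace(matvec, dim, num_probes=100, seed=0):
    """Estimate trace(A) using only matrix-vector products v -> A v.

    Uses Rademacher probes z (entries +1 or -1), for which
    E[z^T A z] = trace(A), so the sample mean is an unbiased estimate.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_probes):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        az = matvec(z)
        total += sum(zi * ai for zi, ai in zip(z, az))
    return total / num_probes


def dense_matvec(matrix):
    """Wrap a list-of-lists matrix as a matvec closure (illustrative helper)."""
    def matvec(v):
        return [sum(a * x for a, x in zip(row, v)) for row in matrix]
    return matvec
```

In a neural-network setting the `matvec` callback would typically be implemented with automatic differentiation rather than an explicit matrix, which is what makes the approach scale; that wiring is left out here.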


Poster
P3-#1024
The Effect of Attention Head Count on Transformer Approximation

Penghao Yu ⋅ Haotian Jiang ⋅ Zeyu Bao ⋅ Ruoxi Yu ⋅ Qianxiao Li

The Transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we study the approximation properties of transformers, with particular emphasis on the role of the number of attention heads. Our analysis begins with the introduction of a generalized $D$-retrieval task, which we prove to be dense in the space of continuous functions, thereby providing the basis for our theoretical framework. We then establish both upper and lower bounds on the parameter complexity required for $\epsilon$-approximation. Specifically, we show that transformers with sufficiently many heads admit efficient approximation, whereas with too few heads, the number of parameters must scale at least as $O(1/\epsilon^{cT})$, for some constant $c$ and sequence length $T$. To the best of our knowledge, this constitutes the first rigorous lower bound of this type in a nonlinear and practically relevant setting. We further examine the single-head case and demonstrate that an embedding dimension of order $O(T)$ allows complete memorization of the input, resulting in the approximation being achieved entirely by the feed-forward block. Finally, we validate our theoretical findings with experiments on both synthetic data and real-world tasks, illustrating the practical relevance of our results.


Poster
P3-#1023
Theoretical Modeling of Large Language Model Self-Improvement Training Dynamics Through Solver-Verifier Gap

Yifan Sun ⋅ Yushan Liang ⋅ Zhen Zhang ⋅ Xin Liu ⋅ Jiaye Teng

Self-improvement is a significant technique within the realm of large language models (LLMs), aiming to enhance LLM performance without relying on external data. Despite its significance, how LLM performance generally evolves during the self-improvement process remains underexplored. In this paper, we theoretically model the training dynamics of self-improvement via the concept of the solver-verifier gap. This is inspired by the conjecture that the performance enhancement of self-improvement stems from the gap between an LLM's solver capability and verifier capability. Based on the theoretical framework, we further show how to model the entire training trajectory. This framework allows quantifying the capability limit of self-improvement by fitting the theoretical model to the experimental results. We validate the effectiveness of the theoretical framework on various LLMs and datasets. Beyond self-improvement, we extend our analysis to investigate how external data influences these dynamics within the framework. Notably, we find that under limited external data regimes, such external data can be utilized at any stage without significantly affecting final performance, which accords with the empirical observations.


Poster
P3-#1022
Language Models are Injective and Hence Invertible

Giorgos Nikolaou ⋅ Tommaso Mencattini ⋅ Donato Crisostomi ⋅ Andrea Santilli ⋅ Yannis Panagakis ⋅ Emanuele Rodolà

Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model’s representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.


Poster
P3-#1021
Training-Free Determination of Network Width via Neural Tangent Kernel

Tatsumi Sunada ⋅ Toshihiko Yamasaki ⋅ Atsuto Maki

Determining an appropriate size for an artificial neural network under computational constraints is a fundamental challenge. This paper introduces a practical metric, derived from the Neural Tangent Kernel (NTK), for estimating the minimum necessary network width with respect to test loss prior to training. We provide both theoretical and empirical evidence that the smallest eigenvalue of the NTK strongly influences test loss in wide but finite-width neural networks. Based on this observation, we define an NTK-based metric computed at initialization to identify what we call the cardinal width, i.e., the width of a network at which generalization performance saturates. Our experiments across multiple datasets and architectures demonstrate the effectiveness of this metric in estimating the cardinal width.


Poster
P3-#1020
Deep Hierarchical Learning with Nested Subspace Networks for Large Language Models

Paulius Rauba ⋅ Mihaela van der Schaar

Large neural networks are typically trained for a fixed computational budget, creating a rigid trade-off between performance and efficiency that is ill-suited for deployment in resource-constrained or dynamic environments. Existing approaches to this problem present a difficult choice: training a discrete collection of specialist models is computationally prohibitive, while dynamic methods like slimmable networks often lack the flexibility to be applied to large, pre-trained foundation models. In this work, we propose Nested Subspace Networks (NSNs), a novel architectural paradigm that enables a single model to be dynamically and granularly adjusted across a continuous spectrum of compute budgets at inference time. The core of our approach is to re-parameterize linear layers to satisfy a nested subspace property, such that the function computed at a given rank is a strict subspace of the function at any higher rank. We show that this entire hierarchy of models can be optimized jointly via an uncertainty-aware objective that learns to balance the contributions of different ranks based on their intrinsic difficulty. We demonstrate empirically that NSNs can be surgically applied to pre-trained LLMs and unlock a smooth and predictable compute-performance frontier. For example, a single NSN-adapted model can achieve a 50% reduction in inference FLOPs with only a 5 percentage point loss in accuracy. Our findings establish NSNs as a powerful framework for creating the next generation of adaptive foundation models.


Poster
P3-#1019
MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling

Yu Zhang ⋅ Huiling Zhen ⋅ Mingxuan Yuan ⋅ Bei Yu

Training large language models with FP8 formats offers significant efficiency gains. However, the reduced numerical precision of FP8 poses challenges for stable and accurate training. Current frameworks preserve training performance using mixed-granularity quantization, i.e., applying per-group quantization for activations and per-tensor/block quantization for weights. While effective, per-group quantization requires scaling along the inner dimension of matrix multiplication, introducing additional dequantization overhead. Moreover, these frameworks often rely on just-in-time scaling to dynamically adjust scaling factors based on the current data distribution. However, this online quantization is inefficient for FP8 training, as it involves multiple memory reads and writes that negate the performance benefits of FP8. To overcome these limitations, we propose MOSS, a novel FP8 training framework that ensures both efficiency and numerical stability. MOSS introduces two key innovations: (1) a two-level microscaling strategy for quantizing sensitive activations, which balances precision and dequantization cost by combining a high-precision global scale with compact, power-of-two local scales; and (2) automatic scaling for weights in linear layers, which eliminates the need for costly max-reduction operations by predicting and adjusting scaling factors during training. Leveraging these techniques, MOSS enables efficient FP8 training of a 7B parameter model, achieving performance comparable to the BF16 baseline while delivering up to 34% higher training throughput.


Poster
P3-#1018
CPiRi: Channel Permutation-Invariant Relational Interaction for Multivariate Time Series Forecasting

Jiyuan Xu ⋅ Wenyu Zhang ⋅ Xin Jing ⋅ Jiahao Nie ⋅ Shuai Chen ⋅ Shuai Zhang

Current methods for multivariate time series forecasting can be classified into channel-dependent and channel-independent models. Channel-dependent models learn cross-channel features but often overfit the channel ordering, which hampers adaptation when channels are added or reordered. Channel-independent models treat each channel in isolation to increase flexibility, yet this neglects inter-channel dependencies and limits performance. To address these limitations, we propose CPiRi, a channel permutation invariant (CPI) framework that infers cross-channel structure from data rather than memorizing a fixed ordering, enabling deployment in settings with structural and distributional co-drift without retraining. CPiRi couples a spatio-temporal decoupling architecture with a permutation-invariant regularization training strategy: a frozen pretrained temporal encoder extracts high-quality temporal features, a lightweight spatial module learns content-driven inter-channel relations, and a channel shuffling strategy enforces CPI during training. We further ground CPiRi in theory by analyzing permutation equivariance in multivariate time series forecasting. Experiments on multiple benchmarks show state-of-the-art results. CPiRi remains stable when channel orders are shuffled and exhibits strong inductive generalization to unseen channels even when trained on only half of the channels, while maintaining practical efficiency on large-scale datasets. The source code is released at https://github.com/JasonStraka/CPiRi.


Poster
P3-#1017
From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning

Chen Shani ⋅ Liron Soffer ⋅ Dan Jurafsky ⋅ Yann LeCun ⋅ Ravid Shwartz-Ziv

Humans organize knowledge into compact conceptual categories that balance compression with semantic richness. Large Language Models (LLMs) exhibit impressive linguistic abilities, but whether they navigate this same compression-meaning trade-off remains unclear. We apply an Information Bottleneck framework to compare human conceptual structure with embeddings from 40+ LLMs using classic categorization benchmarks (Rosch, 1973a; 1975; McCloskey & Glucksberg, 1978). We find that LLMs broadly agree with human category boundaries, yet fall short on fine-grained semantic distinctions. Unlike humans, who maintain "inefficient" representations that preserve contextual nuance, LLMs aggressively compress, achieving more optimal information-theoretic compression at the cost of semantic richness. Surprisingly, encoder models outperform much larger decoder models in agreement with human categories, suggesting that understanding and generation rely on distinct representational mechanisms. Training-dynamics analysis reveals a two-phase trajectory: rapid initial concept formation followed by architectural reorganization, during which semantic processing migrates from deep to mid-network layers as the model discovers increasingly efficient, sparser encodings. These divergent strategies, where LLMs optimize for compression and humans for adaptive utility, reveal fundamental differences between artificial and natural intelligence. This highlights the need for models that preserve the conceptual "inefficiencies" essential for human-like understanding.


Poster
P3-#1016
Gradient Intrinsic Dimensionality Alignment: Narrowing The Gap Between Low-Rank Adaptation and Full Fine-Tuning

Jingqi Ye ⋅ Haonan He ⋅ Minglei Li ⋅ Fujun Han ⋅ Tao Chen ⋅ Peng Ye

Parameter-Efficient Fine-Tuning (PEFT) techniques, such as Low-Rank Adaptation (LoRA) and its variants, have emerged as critical tools for adapting large pretrained models under limited computational resources. However, a notable performance gap persists between these LoRA methods and Full Fine-Tuning (FFT). In this paper, we investigate a key yet overlooked cause of this gap: the relationship between LoRA's low-rank adaptation subspace and the true effective update directions of FFT gradients, which we define as the gradient intrinsic dimensionality. To systematically quantify this dimension, we first propose a novel entropy-based estimator, uncovering substantial discrepancies (up to more than 100x) between the rank of LoRA and the gradient intrinsic dimensionality. Motivated by this finding, we introduce RaLoRA, which adaptively aligns the ranks of LoRA adapters with layer-specific gradient intrinsic dimensions, without increasing the number of overall parameters. We further extend this approach into RaLoRA-Pro, integrating intra-layer rank alignment and inter-layer parameter reallocation guided by loss sensitivity, enabling finer-grained capacity relocation under comparable parameters. Extensive experiments demonstrate the effectiveness of our methods. Specifically, compared to vanilla LoRA, our methods achieve more than +5% improvement on GLUE, +0.57 on MT-Bench, +5.23% on GSM8K, +5.69% on HumanEval, and +1.58% on image classification, confirming consistent and substantial performance gains across diverse tasks and modalities.


Poster
P3-#1015
Cutting the Skip: Training Residual-Free Transformers

Yiping Ji ⋅ James Martens ⋅ Jianqiao Zheng ⋅ Ziqin Zhou ⋅ Peyman Moghadam ⋅ Xinyu Zhang ⋅ Hemanth Saratchandran ⋅ Simon Lucey

Transformers have achieved remarkable success across a wide range of applications, a feat often attributed to their scalability. Yet training them without residual (skip) connections remains notoriously difficult. While skips stabilize optimization, they also disrupt the hierarchical structure of representations, raising the long-standing question of whether transformers can be trained efficiently without them. In this work, we address this problem by analyzing the Jacobian of a skipless transformer block, showing why residuals improve conditioning and revealing that their stabilization benefits can be recovered through a principled initialization strategy. Building on this insight, we introduce the first method that enables stable and efficient training of skipless transformers without altering the standard architecture. We validate our approach on Vision Transformers (ViTs) in both supervised and self-supervised settings, demonstrating that skipless ViTs trained with our initialization overcome the usual optimization barriers, learn richer hierarchical representations, and outperform strong residual baselines on dense prediction benchmarks. These results show that skip connections are not a fundamental requirement for training ViTs and open new avenues for hierarchical representation learning in vision models.


Poster
P3-#1014
Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

Dongyang Fan ⋅ Diba Hashemi ⋅ Sai Karimireddy ⋅ Martin Jaggi

Incorporating metadata in Large Language Model (LLM) pretraining has recently emerged as a promising approach to accelerate training. However, prior work highlighted only one useful signal, URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find that other types of metadata, such as fine-grained indicators of document quality, can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting appropriate metadata as an auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with masked loss can recover part of the speedup by inducing quality-aware latent structure. Using probing, we analyze latent representations to understand how metadata shapes learning. Together, these results yield practical guidelines for integrating metadata to improve both the efficiency and effectiveness of LLM pretraining.


Poster
P3-#1013
AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Changzhi Zhou ⋅ Ao Liu ⋅ Yuchi Deng ⋅ Zhiying Zeng ⋅ Tao Zhang ⋅ Haotian Zhu ⋅ Jianwei Cai ⋅ Yue Mao ⋅ Chenchen Zhang ⋅ Lingyun Tan ⋅ ZiyanXU ⋅ Bohui Zhai ⋅ HengyiLIu ⋅ Speed Zhu ⋅ Wiggin Zhou ⋅ Fengzong Lian

Large Language Models (LLMs) have shown impressive performance across diverse domains, with code generation emerging as a particularly prominent application. However, existing benchmarks designed to evaluate code generation exhibit several critical limitations. First, most rely on manual annotations, which are time-consuming and difficult to scale across programming languages and problem complexities. Second, the majority focus primarily on Python, while the few multilingual benchmarks suffer from limited difficulty and imbalanced language coverage. To overcome these challenges, we present AutoCodeGen, an automated framework for constructing high-difficulty, multilingual code generation datasets without manual annotations. Our approach guarantees correctness and completeness by generating test inputs with LLMs, obtaining test outputs within a multilingual sandbox, and further enhancing quality through reverse problem generation and multi-stage filtering. Based on this novel method, we introduce AutoCodeBench, a large-scale benchmark suite spanning 20 programming languages with balanced coverage. AutoCodeBench is designed to rigorously evaluate LLMs on diverse, challenging, and realistic multilingual programming tasks. Extensive experiments reveal that even state-of-the-art models struggle on these tasks, particularly in low-resource languages. Besides, we release complementary training and evaluation resources, including a large-scale, verifiable multilingual instruction dataset generated via the same pipeline, as well as a multilingual sandbox with high-concurrency support. We hope these contributions will provide a solid foundation for future research and inspire the community to explore more automatic and scalable approaches to multilingual code generation, with a particular emphasis on advancing progress in low-resource languages.


Poster
P3-#1012
FAME: Formal Abstract Minimal Explanation for Neural Networks

Ryma Boumazouza ⋅ Raya Elsaleh ⋅ Melanie Ducoffe ⋅ Shahaf Bassan ⋅ Guy Katz

We propose $\textbf{FAME}$ (Formal Abstract Minimal Explanations), a new class of abductive explanations grounded in abstract interpretation. FAME is the first method to scale to large neural networks while reducing explanation size. Our main contribution is the design of dedicated perturbation domains that eliminate the need for traversal order. FAME progressively shrinks these domains and leverages LiRPA-based bounds to discard irrelevant features, ultimately converging to a $\textbf{formal abstract minimal explanation}$. To assess explanation quality, we introduce a procedure that measures the worst-case distance between an abstract minimal explanation and a true minimal explanation. This procedure combines adversarial attacks with an optional $VERI{\large X}+$ refinement step. We benchmark FAME against $VERI{\large X}+$ and demonstrate consistent gains in both explanation size and runtime on medium- to large-scale neural networks.


Poster
P3-#1011
Spinning Straw into Gold: Relabeling LLM Agent Trajectories in Hindsight for Successful Demonstrations

Zichao Li ⋅ Gang Wu ⋅ Jack Wang ⋅ Ruiyi Zhang ⋅ Wanrong Zhu ⋅ Ryan Rossi ⋅ Vlad Morariu ⋅ Jihyung Kil

Large language model agents operate in partially observable, long-horizon settings where obtaining supervision remains a major bottleneck. We address this by utilizing a source of supervision overlooked in existing post-training methods: unintended yet successful goals embedded within agent rollouts. Specifically, we introduce Hindsight Supervised Learning (HSL), where an auxiliary LLM reviews each completed trajectory and relabels it with all of the natural-language goals the agent actually achieved. HSL then pairs the trajectory with its relabeled goals and uses these pairs for additional fine-tuning. To mitigate suboptimality in the relabeled data, we propose two learning techniques for HSL, irrelevant-action masking and sample reweighting. Our experiments show that HSL is flexible and compatible with existing post-training pipelines. It improves both SFT and DPO, with larger gains on long-horizon tasks with more diverse goal spaces. Moreover, HSL is sample-efficient: on ALFWorld, it surpasses baselines trained on the full dataset while using only one quarter of the ground-truth demonstrations.


Poster
P3-#1010
SHE-LoRA: Selective Homomorphic Encryption for Federated Tuning with Heterogeneous LoRA

Jianmin Liu ⋅ Li Yan ⋅ Borui Li ⋅ Lei Yu ⋅ Chao Shen

Federated fine-tuning is critical for improving the performance of large language models (LLMs) in handling domain-specific tasks while keeping training data decentralized and private. However, prior work has shown that clients' private data can actually be recovered via gradient inversion attacks. Existing privacy preservation techniques against such attacks typically entail performance degradation and high costs, making them ill-suited for clients with heterogeneous data distributions and device capabilities. In this paper, we propose SHE-LoRA, which integrates selective homomorphic encryption (SHE) and low-rank adaptation (LoRA) to enable efficient and privacy-preserving federated tuning of LLMs in cross-device environments. Based on model parameter sensitivity assessment, heterogeneous clients adaptively negotiate and select a subset of model parameters for homomorphic encryption. To ensure accurate model aggregation, we design a column-aware secure aggregation method and customized reparameterization techniques to align the aggregation results with the heterogeneous device capabilities of clients. Extensive experiments demonstrate that SHE-LoRA maintains performance comparable to non-private baselines, achieves strong resistance to state-of-the-art attacks, and significantly reduces communication overhead by 99.71\% and encryption time by 99.87\%, compared to HE baselines.


Poster
P3-#1009
FlexLinearAttention: Compiling a Unified Abstraction into Scalable Kernels for Linear Attention

Haojie Duanmu ⋅ Size Zheng ⋅ Ningxin Zheng ⋅ Jianqiao Lu ⋅ Xuegui Zheng ⋅ Xingcheng Zhang ⋅ Li-Wen Chang ⋅ Xin Liu ⋅ Dahua Lin

The quadratic complexity of softmax attention poses a major bottleneck for long-context modeling, motivating a surge of linear attention variants with linear complexity. Unlike softmax attention, which benefits from optimized kernels, linear attention lacks general-purpose, hardware-efficient support and scalable distributed implementations. We introduce Flexible Linear Attention (FlexLA), a domain-specific compiler that automates the generation of high-performance, scalable kernels for a wide range of linear attention models directly from high-level PyTorch code. At its core, FlexLA employs an intuitive programming abstraction that decomposes any linear attention algorithm into three canonical phases: intra-chunk computation, inter-chunk state propagation, and output merging. This unified abstraction enables FlexLA to perform domain-specific optimizations, automatically generating kernels that fuse computation and communication at a fine-grained tile level and eliminating host synchronization. Our evaluation demonstrates that FlexLA combines programmability with performance: a wide range of linear attention variants can be implemented in just a few dozen lines of code, while the generated kernels deliver 1.01x-4.9x the performance of state-of-the-art expert-optimized libraries and scale with near-linear efficiency on scalar gated linear attention to 16 million tokens on 128 GPUs, surpassing the state-of-the-art distributed baseline by up to 7.2x.
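The three phases named in the abstract can be illustrated on plain, ungated causal linear attention. The NumPy sketch below is not FlexLA's programming interface; the function names, the chunk size, and the absence of gating or normalization are all simplifying assumptions. It only shows that intra-chunk computation, inter-chunk state propagation, and output merging together reproduce the quadratic reference.

```python
import numpy as np

def naive_linear_attention(q, k, v):
    """Quadratic reference: o_t = sum over s <= t of (q_t . k_s) v_s."""
    return np.tril(q @ k.T) @ v

def chunked_linear_attention(q, k, v, chunk=4):
    """Same result, computed chunk by chunk with a running state."""
    seq_len, d_k = q.shape
    d_v = v.shape[1]
    state = np.zeros((d_k, d_v))  # sum of k_s^T v_s over all earlier chunks
    out = np.empty((seq_len, d_v))
    for start in range(0, seq_len, chunk):
        end = min(start + chunk, seq_len)
        qc, kc, vc = q[start:end], k[start:end], v[start:end]
        # Phase 1: intra-chunk computation (causal within the chunk).
        intra = np.tril(qc @ kc.T) @ vc
        # Phase 2: contribution of the state propagated from earlier chunks.
        inter = qc @ state
        # Phase 3: output merging.
        out[start:end] = intra + inter
        # Propagate the state to the next chunk.
        state = state + kc.T @ vc
    return out
```

Because the running state has a fixed size of d_k by d_v, the chunked form costs time linear in sequence length, which is what makes the decomposition attractive for long contexts.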


Poster
P3-#1008
Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

Ang Lv ⋅ Jin Ma ⋅ Yiyuan Ma ⋅ Siyuan Qiao

Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain intermediate activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert's capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on $n^2$ activations, where $n$ is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.
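The two constraints can be read as row-wise and column-wise diagonal dominance of an n-by-n activation matrix. The sketch below uses a hinge penalty to express that reading; the hinge form, the `margin` parameter, the normalization, and the name `erc_style_loss` are assumptions for illustration, since the abstract does not give the exact loss.

```python
import numpy as np

def erc_style_loss(acts: np.ndarray, margin: float = 0.0) -> float:
    """Hinge-style penalty on an n x n activation matrix.

    acts[i, j] is the activation of expert j on proxy token i
    (the perturbed router embedding of expert i). Hypothetical form.
    """
    n = acts.shape[0]
    diag = np.diag(acts)
    off_diag = ~np.eye(n, dtype=bool)
    # Constraint (1): expert j responds most strongly to its own proxy token,
    # so acts[i, j] should stay below acts[j, j] for every i != j.
    viol_expert = np.maximum(0.0, acts - diag[None, :] + margin)
    # Constraint (2): proxy token i is served best by its own expert,
    # so acts[i, j] should stay below acts[i, i] for every j != i.
    viol_token = np.maximum(0.0, acts - diag[:, None] + margin)
    total = viol_expert[off_diag].sum() + viol_token[off_diag].sum()
    return float(total / (n * n))
```

The input is always n by n regardless of how many tokens are in the batch, which matches the abstract's point that the cost is fixed with respect to batch size.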


Poster
P3-#1007
The Curious Case of In-Training Compression of State Space Models

Makram Chahine ⋅ Philipp Nazari ⋅ Daniela Rus ⋅ T. Konstantin Rusch

State Space Models (SSMs), developed to tackle long sequence modeling tasks efficiently, offer both parallelizable training and fast inference. At their core are recurrent dynamical systems that maintain a hidden state, with update costs scaling with the state dimension. A key design challenge is striking the right balance between maximizing expressivity and limiting this computational burden. Control theory, and more specifically Hankel singular value analysis, provides a potent framework for the measure of energy for each state, as well as the balanced truncation of the original system down to a smaller representation with performance guarantees. Leveraging the eigenvalue stability properties of Hankel matrices, we apply this lens to SSMs $\textit{during training}$, where only dimensions of high influence are identified and preserved. Our approach, CompreSSM, applies to Linear Time-Invariant SSMs such as Linear Recurrent Units, but is also extendable to selective models. Experiments show that in-training reduction significantly accelerates optimization while preserving expressivity, with compressed models retaining task-critical structure lost by models trained directly at smaller dimension. In other words, SSMs that begin large and shrink during training achieve computational efficiency while maintaining higher performance. Project code is available at https://github.com/camail-official/compressm.
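For intuition, Hankel singular values of a stable discrete-time LTI system with a diagonal state matrix can be computed in closed form from the two Gramians. The sketch below is a minimal illustration, not the CompreSSM code; the function names and the cumulative-energy rule in `rank_for_energy` are assumptions.

```python
import numpy as np

def hankel_singular_values(a, B, C):
    """HSVs of x[t+1] = diag(a) x[t] + B u[t], y[t] = C x[t], with |a_i| < 1.

    a: (n,), B: (n, m), C: (p, n). Diagonal A permits closed-form Gramians.
    """
    a = np.asarray(a, dtype=complex)
    B = np.asarray(B, dtype=complex)
    C = np.asarray(C, dtype=complex)
    # Controllability Gramian solves P = A P A^H + B B^H.
    P = (B @ B.conj().T) / (1.0 - np.outer(a, a.conj()))
    # Observability Gramian solves Q = A^H Q A + C^H C.
    Q = (C.conj().T @ C) / (1.0 - np.outer(a.conj(), a))
    # HSVs are the square roots of the eigenvalues of P Q.
    eig = np.linalg.eigvals(P @ Q)
    hsv = np.sqrt(np.clip(eig.real, 0.0, None))
    return np.sort(hsv)[::-1]

def rank_for_energy(hsv, keep=0.99):
    """Smallest number of states whose HSVs hold a `keep` share of the total.

    The energy threshold is a hypothetical selection rule.
    """
    cum = np.cumsum(hsv) / hsv.sum()
    return int(np.searchsorted(cum, keep) + 1)
```

A state with a near-zero Hankel singular value is either hard to reach from the input or barely visible at the output, so dropping it changes the input-output map very little; this is the sense in which only high-influence dimensions need to be preserved.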


Poster
P3-#1006
Denoising Neural Reranker for Recommender Systems

Wenyu Mao ⋅ Shuchang Liu ⋅ HailanYang ⋅ Xiaobei Wang ⋅ Xiaoyu Yang ⋅ Xu Gao ⋅ Xiang Li ⋅ Lantao Hu ⋅ Han Li ⋅ Kun Gai ⋅ An Zhang ⋅ Xiang Wang

For multi-stage recommenders in industry, a user request first triggers a simple and efficient retriever module that selects and ranks a list of relevant items; the recommender then calls a slower but more sophisticated reranking model that refines the item list exposed to the user. To consistently optimize the two-stage retrieval reranking framework, most efforts have focused on learning reranker-aware retrievers. In contrast, there has been limited work on how to achieve a retriever-aware reranker. In this work, we provide evidence that the retriever scores from the previous stage are informative signals that have been underexplored. Specifically, we first empirically show that the reranking task under the two-stage framework is naturally a noise reduction problem on the retriever scores, and theoretically show the limitations of naive utilization techniques of the retriever scores. Following this notion, we derive an adversarial framework, DNR, that associates the denoising reranker with a carefully designed noise generation module. The resulting DNR solution extends the conventional score error minimization loss with three augmented objectives: 1) a denoising objective that aims to denoise the noisy retriever scores to align with the user feedback; 2) an adversarial retriever score generation objective that improves the exploration in the retriever score space; and 3) a distribution regularization term that aims to align the distribution of generated noisy retriever scores with the real ones. We conduct extensive experiments on three public datasets and an industrial recommender system, together with analytical support, to validate the effectiveness of the proposed DNR.


Poster
P3-#1005
Cut Less, Fold More: Model Compression through the Lens of Projection Geometry

Olga Saukh ⋅ Dong Wang ⋅ Haris Šikić ⋅ Yun Cheng ⋅ Lothar Thiele

Compressing neural networks without retraining is vital for deployment at scale. We study calibration-free compression through the lens of projection geometry: structured pruning is an axis-aligned projection, whereas model folding performs a low-rank projection via weight clustering. We formalize both as orthogonal operators and show that, within a rank distance of one, folding provably yields smaller parameter reconstruction error, and under mild smoothness assumptions, smaller functional perturbations than pruning. At scale, we evaluate >1'000 checkpoints spanning ResNet18, PreActResNet18, ViT-B/32, and CLIP ViT-B/32 on CIFAR-10 and ImageNet-1K, covering diverse training hyperparameters (optimizers, learning rates, augmentations, regularization, sharpness-aware training). We show that folding typically achieves higher post-compression accuracy, with the largest gains at moderate–high compression. The gap narrows and occasionally reverses at specific training setups. Our results position folding as a geometry-aware, calibration-free alternative to pruning that is often superior in practice and principled in theory.


Poster
P3-#1004
The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation

Ruichen Zhang ⋅ Rana Muhammad Shahroz Khan ⋅ Zhen Tan ⋅ Dawei Li ⋅ Song Wang ⋅ Tianlong Chen

Data-centric distillation, including data augmentation, selection, and mixing, offers a promising path to creating smaller, more efficient student Large Language Models (LLMs) that retain strong reasoning abilities. However, a comprehensive benchmark to systematically assess the effect of each distillation approach is still lacking. This paper introduces DC-CoT, the first data-centric benchmark that investigates data manipulation in chain-of-thought (CoT) distillation from method, model and data perspectives. Utilizing various teacher models (e.g., o4-mini, Gemini-Pro, Claude-3.5) and student architectures (e.g., 3B, 7B parameters), we rigorously evaluate the impact of these data manipulations on student model performance across multiple reasoning datasets, with a focus on in-distribution (IID) and out-of-distribution (OOD) generalization, and cross-domain transfer. Our findings aim to provide actionable insights and establish best practices for optimizing CoT distillation through data-centric techniques, ultimately facilitating the development of more accessible and capable reasoning models. The anonymous codebase can be accessed at https://anonymous.4open.science/r/DC-COT-FF4C/


Poster
P3-#1003
DynamicInfer: Runtime-Aware Sparse Offloading for LLMs Inference on a Consumer-Grade GPU

Zhui Zhu ⋅ Weichen Zhang ⋅ Zhenghan Zhou ⋅ Yunhao Liu ⋅ Fan Dang

Large Language Models (LLMs) have achieved remarkable success in various NLP tasks, but their enormous memory footprints pose significant challenges for deployment on consumer-grade GPUs. Prior solutions, such as PowerInfer, combine offloading and sparse activation to reduce memory and computational overhead, but suffer from static neuron partitioning, leading to suboptimal GPU utilization and increased latency. In this work, we present DynamicInfer, a runtime neuron offloading framework that dynamically adapts neuron scheduling based on input-dependent activation patterns. DynamicInfer introduces (1) a hierarchical neural caching strategy, (2) a load-aware neuron activation mechanism tailored to heterogeneous hardware, and (3) an activation-aware prefetching pipeline that overlaps data transfer with computation. Extensive experiments on ReluLLaMA and Prosparse models across multiple hardware platforms demonstrate that DynamicInfer achieves up to 253\% speedup over llama.cpp and 59\% over PowerInfer, while retaining model accuracy. Our approach offers a practical and scalable solution for high-performance LLM inference on resource-constrained devices.

As increasingly large pre-trained models are released, deploying them on edge devices for privacy-preserving applications requires effective compression. Recent works combine quantization with the fine-tuning of high-precision LoRA adapters, which can substantially reduce model size while mitigating the accuracy loss from quantization. However, edge devices have inherently heterogeneous capabilities, while performing configuration-wise fine-tuning for every quantization setting is computationally prohibitive. In this paper, we propose CoA-LoRA, a method that dynamically adjusts the LoRA adapter to arbitrary quantization configurations (i.e., the per-layer bit-width choices of a pre-trained model) without requiring repeated fine-tuning. This is accomplished via a configuration-aware model that maps each configuration to its low-rank adjustments. The effectiveness of this model critically depends on the training configuration set, a collection of configurations chosen to cover different total bit-width budgets. However, constructing a high-quality configuration set is non-trivial. We therefore design a Pareto-based configuration search that iteratively optimizes the training configuration set, yielding more precise low-rank adjustments. Our experiments demonstrate that, unlike the state-of-the-art methods that require fine-tuning a separate LoRA adapter for each configuration, CoA-LoRA incurs no additional time cost while achieving comparable or even superior performance to those methods.


Poster
P3-#1001
OpenThoughts: Data Recipes for Reasoning Models

Etash Guha ⋅ Ryan Marten ⋅ Sedrick Keh ⋅ Negin Raoof ⋅ Georgios Smyrnis ⋅ Hritik Bansal ⋅ Marianna Nezhurina ⋅ Jean Mercat ⋅ Trung Vu ⋅ Zayne Sprague ⋅ Ashima Suvarna ⋅ Benjamin Feuer ⋅ Leon Liangyu Chen ⋅ Zaid Khan ⋅ Eric Frankel ⋅ Sachin Grover ⋅ Caroline Choi ⋅ Niklas Muennighoff ⋅ Shiye Su ⋅ Wanjia Zhao ⋅ John Yang ⋅ Shreyas Pimpalgaonkar ⋅ Kartik sharma ⋅ Charlie Ji ⋅ Yichuan Deng ⋅ Sarah Pratt ⋅ Vivek Ramanujan ⋅ Jon Saad-Falcon ⋅ Stutee Acharya ⋅ Jeffrey Li ⋅ Achal Dave ⋅ Alon Albalak ⋅ Kushal Arora ⋅ Blake Wulfe ⋅ Chinmay Hegde ⋅ Greg Durrett ⋅ Sewoong Oh ⋅ Mohit Bansal ⋅ Saadia Gabriel ⋅ Aditya Grover ⋅ Kai-Wei Chang ⋅ Vaishaal Shankar ⋅ Aaron Gokaslan ⋅ Mike Merrill ⋅ Tatsunori Hashimoto ⋅ Yejin Choi ⋅ Jenia Jitsev ⋅ Reinhard Heckel ⋅ Maheswaran Sathiamoorthy ⋅ Alex Dimakis ⋅ Ludwig Schmidt

Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. Our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond – improvements of 15.3, 17.2, and 20.5 percentage points compared to the DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available on openthoughts.ai.


Poster
P3-#1101
Chessformer: A Unified Architecture for Chess Modeling

Daniel Monroe ⋅ George Eilender ⋅ Philip Chalmers ⋅ Zhenwei Tang ⋅ Ashton Anderson

Chess has played a uniquely important role as a testbed domain for artificial intelligence. Applying new architectures to improve absolute chess performance, and more recently to predict human moves at specified skill levels, has therefore garnered attention in the machine learning literature. Current approaches to these problems employ transformer models with widely varying architectural designs, and use unintuitive tokenization schemes that are not amenable to interpretability techniques, which hinders their applicability for teaching and human-AI interaction. We introduce Chessformer, a novel chess transformer model design that consists of an encoder-only model which processes chessboard squares as input tokens, instead of moves or the entire position, a dynamic positional encoding scheme that allows the model to flexibly adapt to the unique geometries present in chess, and an attention-based policy output design. We show that Chessformer advances the state of the art in all three major chess modeling goals: it significantly improves the chess-playing performance of a state-of-the-art chess engine, it surpasses the previous best human move-matching prediction performance with a much smaller model, and it enables substantial interpretability benefits. Our unified approach constitutes a broad advance across several important tasks in chess AI, and also demonstrates the benefits of carefully adapting transformers' tokenization systems, output systems, and positional encodings to reflect the structure of a domain of interest.


Poster
P3-#1102
LaSeR: Reinforcement Learning with Last-Token Self-Rewarding

Wenkai Yang ⋅ Weijie Liu ⋅ Ruobing Xie ⋅ Yiju Guo ⋅ Lulu Wu ⋅ Saiyong Yang ⋅ Yankai Lin

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). To address the lack of verification signals at test time after RLVR, prior studies incorporate the training of the model's self-verification capabilities into the standard RLVR process, thereby unifying reasoning and verification capabilities within a single LLM. However, previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which doubles the inference cost per sample and significantly reduces efficiency. In this work, we theoretically reveal that the closed-form solution to the RL objective of self-verification training can be approximately reduced to a remarkably simple form: the true reasoning reward of a solution is equal to its last-token self-rewarding score, which is computed as the difference between the policy model's next-token log-probability assigned to any pre-specified token at the solution's last token and a pre-calculated constant, scaled by the KL coefficient. Based on this insight, we propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with a Mean Squared Error (MSE) loss that aligns the last-token self-rewarding scores with the verifier-based reasoning rewards, and jointly optimizes the reasoning and self-rewarding capabilities of LLMs. The optimized self-rewarding scores serve as auxiliary reward signals in both training and testing to enhance model performance. Notably, our algorithm derives these scores from the predicted next-token probability distribution of the last solution token immediately after solution generation, thereby incurring only the minimal extra cost of at most one additional token inference. Experimental results show that our method not only improves the reasoning performance of the model but also equips it with remarkable self-rewarding capability, thereby further boosting its inference-time scaling performance.
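The score described in the abstract can be written as r = beta * (log pi(z | x, y) - c), where z is the pre-specified token, c the pre-calculated constant, and beta the KL coefficient. The NumPy sketch below computes this score and the auxiliary MSE term from last-position logits; the function names and argument layout are assumptions, and the RLVR loss itself is omitted.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def last_token_self_reward(last_logits, token_id, ref_const, kl_coef):
    """Score = KL coefficient * (log-prob of the chosen token - constant).

    last_logits: (batch, vocab) next-token logits at each solution's last token.
    token_id, ref_const, kl_coef: hypothetical names for the pre-specified
    token, the pre-calculated constant, and the KL coefficient.
    """
    logp = log_softmax(last_logits)[..., token_id]
    return kl_coef * (logp - ref_const)

def self_reward_mse(last_logits, token_id, ref_const, kl_coef, verifier_rewards):
    """Auxiliary MSE aligning self-reward scores with verifier rewards."""
    scores = last_token_self_reward(last_logits, token_id, ref_const, kl_coef)
    return float(np.mean((scores - np.asarray(verifier_rewards)) ** 2))
```

Since the score needs only the next-token distribution at the final solution position, it costs at most one extra forward step per sample, consistent with the abstract's efficiency claim.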


Blog Track Poster
P3-#1103
Dynamic Parameter Reuse Augments Reasoning via Latent Chain of Thought

Kaitlin Maile ⋅ Joao Sacramento

Standard language models often rely on massive parameter counts for their performance, utilizing each parameter only once per inference pass. This prompts consideration of recurrent structures, where models reuse parameters across sequential time, depth, or training progression to achieve improved performance and reduced training cost. We draw connections in the landscape of parameter reuse, from growing models via stacking to recurrent looping, and postulate that these architectural priors act as a form of Latent Chain of Thought (LCoT), allowing models to reason in a continuous state space. By shifting towards deeper and dynamic computation, grown and recurrent architectures offer a path toward improved reasoning in compact networks, ascending beyond scaling laws of standard architectures.


Poster
P3-#1104
Learning Flexible Forward Trajectories for Masked Molecular Diffusion

Hyunjin Seo ⋅ Taewon Kim ⋅ Sihyun Yu ⋅ Sungsoo Ahn

Masked diffusion models (MDMs) have achieved notable progress in modeling discrete data, while their potential in molecular generation remains underexplored. In this work, we explore their potential and introduce the surprising result that naively applying standard MDMs to molecules leads to severe performance degradation. We trace this critical issue to a state-clashing problem, where the forward diffusion trajectories of distinct molecules collapse into a common state, resulting in a mixture of reconstruction targets that cannot be learned with a typical reverse diffusion with unimodal predictions. To mitigate this, we propose Masked Element-wise Learnable Diffusion (MELD) that orchestrates per-element corruption trajectories to avoid collisions between different molecular graphs. This is realized through a parameterized noise scheduling network that learns distinct corruption rates for individual graph elements, i.e., atoms and bonds. Across extensive experiments, MELD achieves 100\% chemical validity in unconditional generation on QM9 and ZINC250K datasets, while markedly improving distributional and property alignment over standard MDMs on both conditional and unconditional generation.

Designing protein sequences that fold into a target 3-D structure, termed the inverse folding problem, is central to protein engineering. However, it remains challenging due to the vast sequence space and the importance of local structural constraints. Existing deep learning approaches achieve strong recovery rates but lack explicit mechanisms to reuse fine-grained structure-sequence patterns conserved across natural proteins. To mitigate this, we present PRISM, a multimodal retrieval-augmented generation framework for inverse folding. PRISM retrieves fine-grained representations of potential motifs from known proteins and integrates them with a hybrid self-cross attention decoder. PRISM is formulated as a latent-variable probabilistic model and implemented with an efficient approximation, combining theoretical grounding with practical scalability. Experiments across multiple benchmarks, including CATH-4.2, TS50, TS500, CAMEO 2022, and the PDB date split, demonstrate the fine-grained multimodal retrieval efficacy of PRISM in yielding SoTA perplexity and amino acid recovery, while also improving the foldability metrics (RMSD, TM-score, pLDDT).


Poster
P3-#1106
Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery

Xinzhe Yuan ⋅ Zhuo Chen ⋅ Jianshu Zhang ⋅ Huan Xiong ⋅ Nanyang Ye ⋅ Yuqiang Li ⋅ Qinying Gu

Scientific discovery is increasingly constrained by costly experiments and limited budgets, making efficient optimization essential for AI for science. Bayesian Optimization (BO), while widely adopted for balancing exploration and exploitation, suffers from slow cold-start performance and poor scalability in high-dimensional settings, limiting its effectiveness in real-world scientific applications. To address these challenges, we propose LLM-Guided Bayesian Optimization (LGBO), the first LLM preference-guided BO framework that continuously integrates the semantic reasoning of large language models (LLMs) into the optimization loop. Unlike prior works that use LLMs only for warm-start initialization or candidate generation, LGBO introduces a region-lifted preference mechanism that embeds LLM-driven preferences into every iteration, shifting the surrogate mean in a stable and controllable way. Theoretically, we prove that LGBO does not perform significantly worse than standard BO in the worst case, while achieving significantly faster convergence when preferences align with the objective. Empirically, LGBO achieves consistent improvements across diverse dry benchmarks in physics, chemistry, biology, and materials science. Most notably, in a new wet-lab optimization of Fe–Cr battery electrolytes, LGBO reaches 90\% of the best observed value within 6 iterations, whereas standard BO and existing LLM-augmented baselines require more than 10 iterations. Together, the results suggest that LGBO offers a promising direction for integrating LLMs into scientific optimization workflows.


Poster
P3-#815
A Derandomization Framework for Structure Discovery: Applications in Neural Networks and Beyond

Nikos Tsikouras ⋅ Yorgos Pantis ⋅ Ioannis Mitliagkas ⋅ Christos Tzamos

Understanding the dynamics of feature learning in neural networks (NNs) remains a significant challenge. Mousavi-Hosseini et al. (2023) analyze a multiple-index teacher-student setting and show that a two-layer student attains a low-rank structure in its first-layer weights when trained with stochastic gradient descent (SGD) and a strong regularizer. This structural property is known to reduce the sample complexity of generalization. Indeed, in a second step, the same authors establish algorithm-specific learning guarantees under additional assumptions. In this paper, we focus exclusively on the structure discovery aspect and study it under weaker assumptions. More specifically, we allow (a) NNs of arbitrary size and depth, (b) with all parameters trainable, (c) under any smooth loss function, (d) with tiny regularization, and (e) trained by any method that attains a second-order stationary point (SOSP), e.g. perturbed gradient descent (PGD). At the core of our approach is a key $\textit{derandomization}$ lemma, which states that optimizing the function $E_{x} \left[g_{\theta}(Wx + b)\right]$ converges to a point where $W = 0$, under mild conditions. The fundamental nature of this lemma directly explains structure discovery and has immediate applications in other domains, including an end-to-end approximation for MAXCUT and computing Johnson-Lindenstrauss embeddings.

Graph generation plays an important role in various domains such as molecular design, protein prediction, and drug discovery. However, generating graph-structured data poses challenges due to the complex dependencies inherent in graphs, spanning from intricate local substructures to broad global topologies. Although recent advances in graph-generative models have made notable progress, traditional node-level generative paradigms may have difficulty simultaneously capturing the multiscale dependencies in graphs. To address these challenges, we propose a unified latent diffusion model that jointly learns local and global topological information, enabling effective and efficient graph generation. Besides, our approach introduces a dual conditioning mechanism designed to promote dynamic interaction between local and global information, equipping the generative model with global and local awareness to better capture the coupled dependencies within graphs. Our method can largely promote the joint modeling of global and local information and substantially improve the quality of the generated graphs. Extensive experiments consistently demonstrate the effectiveness of our method.


Poster
P3-#1109
FALCON: Few-step Accurate Likelihoods for Continuous Flows

Danyal Rehman ⋅ Tara Akhound-Sadegh ⋅ Artem Gazizov ⋅ Yoshua Bengio ⋅ Alexander Tong

Scalable sampling of molecular states in thermodynamic equilibrium is a long-standing challenge in statistical physics. Boltzmann Generators tackle this problem by pairing a generative model, capable of exact likelihood computation, with importance sampling to obtain consistent samples under the target distribution. Current Boltzmann Generators primarily use continuous normalizing flows (CNFs) trained with flow matching for efficient training of powerful models. However, likelihood calculation for these models is extremely costly, requiring thousands of function evaluations per sample, severely limiting their adoption. In this work, we propose Few-Step Accurate Likelihoods for Continuous Flows (FALCON), a method which allows for few-step sampling with a likelihood accurate enough for importance sampling applications by introducing a hybrid training objective that encourages invertibility. We show FALCON outperforms state-of-the-art normalizing flow models for molecular Boltzmann sampling and is two orders of magnitude faster than the equivalently performing CNF model. FALCON code is available at: https://github.com/danyalrehman/FALCON.
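The reweighting step that Boltzmann Generators rely on can be sketched on a toy problem. The example below is a minimal illustration of self-normalized importance sampling, not FALCON itself: the harmonic energy, the Gaussian proposal (standing in for a generative model with a tractable likelihood), and all function names are assumptions made for this sketch.

```python
import math
import random


def energy(x):
    # Hypothetical toy energy (harmonic well); exp(-energy) is proportional to N(0, 1).
    return 0.5 * x * x


def proposal_logpdf(x, mu, sigma):
    # Exact log-density of the Gaussian proposal, standing in for a model likelihood.
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def snis_expectation(f, n, mu=0.5, sigma=1.5, seed=0):
    """Self-normalized importance sampling estimate of E_target[f(x)]."""
    rng = random.Random(seed)
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    # Unnormalized log-weights: log target density minus log proposal density.
    log_w = [-energy(x) - proposal_logpdf(x, mu, sigma) for x in xs]
    shift = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    estimate = sum(wi * f(x) for wi, x in zip(w, xs)) / total
    # Effective sample size: a common diagnostic of proposal quality.
    ess = total * total / sum(wi * wi for wi in w)
    return estimate, ess


if __name__ == "__main__":
    second_moment, ess = snis_expectation(lambda x: x * x, 20000)
    print(second_moment, ess)
```

The estimate is only as good as the proposal log-density: an inaccurate likelihood biases the weights, which is why the abstract stresses likelihoods that are accurate enough for importance sampling.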


Poster
P3-#1110
Learning Molecular Chirality via Chiral Determinant Kernels

Runhan Shi ⋅ Zhicheng Zhang ⋅ Letian Chen ⋅ Gufeng Yu ⋅ Yang Yang

Chirality is a fundamental molecular property that governs stereospecific behavior in chemistry and biology. Capturing chirality in machine learning models remains challenging due to the geometric complexity of stereochemical relationships and the limitations of traditional molecular representations that often lack explicit stereochemical encoding. Existing approaches to chiral molecular representation primarily focus on central chirality, relying on handcrafted stereochemical tags or limited 3D encodings, and thus fail to generalize to more complex forms, such as axial chirality. In this work, we introduce ChiDeK (Chiral Determinant Kernels), a framework that systematically integrates stereogenic information into molecular representation learning. We propose the chiral determinant kernel to encode the SE(3)-invariant chirality matrix and employ cross-attention to integrate stereochemical information from local chiral centers into the global molecular representation. This design enables explicit modeling of chiral-related features within a unified architecture, capable of jointly encoding central and axial chirality. To support the evaluation of axial chirality, we construct a new benchmark for electronic circular dichroism (ECD) and optical rotation (OR) prediction. Across four tasks, including R/S configuration classification, enantiomer ranking, ECD spectrum prediction, and OR prediction, ChiDeK achieves substantial improvements over state-of-the-art baselines, most notably yielding over 7% higher accuracy on axially chiral tasks on average.


Poster
P3-#1111
Multi-state Protein Sequence Design with DynamicMPNN

Alex Abrudan ⋅ Sebastian Pujalte Ojeda ⋅ Chaitanya Joshi ⋅ Matthew Greenig ⋅ Felipe Engelberger ⋅ Alena Khmelinskaia ⋅ Jens Meiler ⋅ Michele Vendruscolo ⋅ Tuomas Knowles

Structural biology has long been dominated by the one sequence, one structure, one function paradigm, yet many critical biological processes, from enzyme catalysis to membrane transport, depend on proteins that adopt multiple conformational states. Existing multi-state design approaches rely on post-hoc aggregation of single-state predictions, achieving poor experimental success rates compared to single-state design. We introduce DynamicMPNN, an inverse folding model explicitly trained to generate sequences compatible with multiple conformations through joint learning across conformational ensembles. Trained on 46,033 conformational pairs covering 75% of CATH superfamilies and evaluated using AlphaFold 3, DynamicMPNN outperforms ProteinMPNN by up to 31% on decoy-normalized RMSD and by 12% on sequence recovery across our challenging multi-state protein benchmark.


Poster
P3-#1112
Controllable diffusion-based generation for multi-channel biological data

Haoran Zhang ⋅ Mingyuan Zhou ⋅ Wesley Tansey

Biological profiling technologies, such as imaging mass cytometry (IMC) and spatial transcriptomics (ST), generate multi-channel data with strong spatial alignment and complex inter-channel relationships. Modeling such data requires generative frameworks that jointly model spatial structure and inter-channel dependencies and generalize across arbitrary subsets of observed and missing channels. Existing generative models typically assume low-dimensional inputs (e.g., RGB images) and rely on simple conditioning mechanisms that disrupt spatial correspondence and overlook inter-channel dependencies. This work proposes a unified multi-channel diffusion (MCD) framework for controllable generation of structured biological data with complex inter-channel relationships. Our model introduces two key innovations: (1) a hierarchical feature injection mechanism that enables multi-resolution conditioning on spatially aligned observed channels, and (2) two complementary channel attention modules to capture inter-channel relationships and recalibrate latent features. To support flexible conditioning and generalization to arbitrary sets of observed channels, we train the model using a random channel masking strategy, enabling it to reconstruct missing channels given any combination of observed channels as the spatial condition. We demonstrate state-of-the-art performance across both spatial and non-spatial biological data generation tasks, including imputation in spatial proteomics and clinical imaging, as well as gene-to-protein translation in single-cell datasets, and show strong generalizability to unseen conditional configurations.


Poster
P3-#1113
Physically Valid Biomolecular Interaction Modeling with Gauss-Seidel Projection

Siyuan Chen ⋅ Minghao Guo ⋅ Caoliwen Wang ⋅ Anka He Chen ⋅ Yikun Zhang ⋅ Jingjing Chai ⋅ Yin Yang ⋅ Wojciech Matusik ⋅ Peter Yichen Chen

Biomolecular interaction modeling has been substantially advanced by foundation models, yet they often produce all-atom structures that violate basic steric feasibility. We address this limitation by enforcing physical validity as a strict constraint during both training and inference with a unified module. At its core is a differentiable projection that maps the provisional atom coordinates from the diffusion model to the nearest physically valid configuration. This projection is achieved using a Gauss-Seidel scheme, which exploits the locality and sparsity of the constraints to ensure stable and fast convergence at scale. By using implicit differentiation to obtain gradients, our module integrates seamlessly into existing frameworks for end-to-end finetuning. With our Gauss-Seidel projection module in place, two denoising steps are sufficient to produce biomolecular complexes that are both physically valid and structurally accurate. Across six benchmarks, our $2$-step model achieves the same structural accuracy as state-of-the-art $200$-step diffusion baselines, delivering ${\sim}10\times$ wall-clock speedups while guaranteeing physical validity. The code is available at https://github.com/chensiyuan030105/ProteinGS.git.
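To illustrate what a Gauss-Seidel style constraint sweep looks like, the sketch below resolves pairwise minimum-distance (steric clash) violations one constraint at a time, reusing each updated position immediately. It is a toy under stated assumptions: the single `d_min` threshold, the symmetric half-and-half correction, and the function name are hypothetical. It does not reproduce the paper's module, is not differentiable, and does not guarantee the nearest valid configuration.

```python
import math


def gauss_seidel_project(points, d_min, max_sweeps=100, tol=1e-9):
    """Push points apart until every pair is at least d_min apart (up to tol).

    Returns the adjusted points and the number of sweeps performed before a
    sweep found no violation.
    """
    pts = [list(p) for p in points]
    n = len(pts)
    for sweep in range(max_sweeps):
        worst = 0.0
        for i in range(n):
            for j in range(i + 1, n):
                delta = [a - b for a, b in zip(pts[i], pts[j])]
                dist = math.sqrt(sum(c * c for c in delta))
                violation = d_min - dist
                if violation <= tol:
                    continue  # constraint already satisfied
                worst = max(worst, violation)
                if dist < 1e-12:
                    # Coincident points: choose an arbitrary separation axis.
                    direction = [1.0] + [0.0] * (len(delta) - 1)
                else:
                    direction = [c / dist for c in delta]
                # Gauss-Seidel: apply the correction now, so later constraints
                # in the same sweep see the updated coordinates.
                half = 0.5 * violation
                for k in range(len(direction)):
                    pts[i][k] += half * direction[k]
                    pts[j][k] -= half * direction[k]
        if worst == 0.0:
            return pts, sweep
    return pts, max_sweeps


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]
    fixed, sweeps = gauss_seidel_project(atoms, d_min=1.0)
    print(fixed, sweeps)
```

Because each correction touches only two points, a sweep's cost scales with the number of active constraints, which is the locality and sparsity the abstract refers to. A practical version would restrict the inner loop to neighboring atoms rather than all pairs.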


Poster
P3-#1114
Doloris: Dual Conditional Diffusion Implicit Bridges with Sparsity Masking Strategy for Unpaired Single-Cell Perturbation Estimation

Changxi Chi ⋅ Jun Xia ⋅ Yufei Huang ⋅ Zhuoli Ouyang ⋅ Cheng Tan ⋅ Yunfan Liu ⋅ Jingbo Zhou ⋅ Chang Yu ⋅ Liangyu Yuan ⋅ Siyuan Li ⋅ Zelin Zang ⋅ Stan Z Li

Estimating single-cell responses across various perturbations facilitates the identification of key genes and enhances drug screening, significantly boosting experimental efficiency. However, single-cell sequencing is a destructive process, making it impossible to capture the same cell's phenotype before and after perturbation. Consequently, data collected under perturbed and unperturbed conditions are inherently unpaired, creating a critical yet unresolved problem in single-cell perturbation modeling. Moreover, the high dimensionality and sparsity of single-cell expression make direct modeling prone to focusing on zeros and neglecting meaningful patterns. To address these problems, we propose a new paradigm for single-cell perturbation modeling. Specifically, we leverage dual diffusion models to learn the control and perturbed distributions separately, and implicitly align them through a shared Gaussian latent space, without requiring explicit cell pairing. Furthermore, we introduce a sparsity masking strategy in which the mask model learns to predict zero-expressed genes, allowing the diffusion model to focus on capturing meaningful patterns among expressed genes and thereby preserving diversity in high-dimensional sparse data. We introduce Doloris, a generative framework that defines a new paradigm for modeling unpaired, high-dimensional, and sparse single-cell perturbation data. It leverages dual conditional diffusion models for separate learning of control and perturbed distributions, complemented by a sparsity masking strategy to enhance prediction of zero-valued genes. The results on publicly available datasets show that our model effectively captures the diversity of single-cell perturbations and achieves state-of-the-art performance. To facilitate reproducibility, we include the code in the supplementary materials; it is also available at https://github.com/ChangxiChi/Doloris.


Poster
P3-#1115
VCWorld: A Biological World Model for Virtual Cell Simulation

Zhijian Wei ⋅ Runze Ma ⋅ Zichen Wang ⋅ Zhongmin Li ⋅ Shuotong Song ⋅ Shuangjia Zheng

Virtual cell modeling aims to predict cellular responses to perturbations. Existing virtual cell models rely heavily on large-scale single-cell datasets, learning explicit mappings between gene expression and perturbations. Although recent models attempt to incorporate multi-source biological information, their generalization remains constrained by data quality, coverage, and batch effects. More critically, these models often function as black boxes, offering predictions without interpretability or consistency with biological principles, which undermines their credibility in scientific research. To address these challenges, we present VCWorld, a cell-level white-box simulator that integrates structured biological knowledge with the iterative reasoning capabilities of large language models to instantiate a biological world model. VCWorld operates in a data-efficient manner to reproduce perturbation-induced signaling cascades and generates interpretable, stepwise predictions alongside explicit mechanistic hypotheses. In drug perturbation benchmarks, VCWorld achieves state-of-the-art predictive performance, and the inferred mechanistic pathways are consistent with publicly available biological evidence. Our code is publicly available at https://anonymous.4open.science/r/VCWorld-B970.


Poster
P3-#1116
Controllable Sequence Editing for Biological and Clinical Trajectories

Michelle M. Li ⋅ Kevin Li ⋅ Yasha Ektefaie ⋅ Ying Jin ⋅ Yepeng Huang ⋅ Shvat Messica ⋅ Tianxi Cai ⋅ Marinka Zitnik

Conditional generation models for longitudinal sequences can produce new or modified trajectories given a conditioning input. However, they often lack control over when the condition should take effect (timing) and which variables it should influence (scope). Most methods either operate only on univariate sequences or assume that the condition alters all variables and time steps. In scientific and clinical settings, interventions instead begin at a specific moment, such as the time of drug administration or surgery, and influence only a subset of measurements while the rest of the trajectory remains unchanged. CLEF learns temporal concepts that encode how and when a condition alters future sequence evolution. These concepts allow CLEF to apply targeted edits to the affected time steps and variables while preserving the rest of the sequence. We evaluate CLEF on 8 datasets spanning cellular reprogramming, patient health, and sales, comparing against 9 state-of-the-art baselines. CLEF improves immediate sequence editing accuracy by 16.28% (MAE) on average against their non-CLEF counterparts. Unlike prior models, CLEF enables one-step conditional generation at arbitrary future times, outperforming their non-CLEF counterparts in delayed sequence editing by 26.73% (MAE) on average. We test CLEF under counterfactual inference assumptions and show up to 62.84% (MAE) improvement on zero-shot conditional generation of counterfactual trajectories. In a case study of patients with type 1 diabetes mellitus, CLEF identifies clinical interventions that generate realistic counterfactual trajectories shifted toward healthier outcomes.


Poster
P3-#1117
Quantifying Cross-Attention Interaction in Transformers for Interpreting TCR-pMHC Binding

Jiarui Li ⋅ Zixiang Yin ⋅ Haley Smith ⋅ Zhengming Ding ⋅ Samuel Landry ⋅ Ramgopal Mettu

CD8+ “killer” T cells and CD4+ “helper” T cells play a central role in the adaptive immune system by recognizing antigens presented by Major Histocompatibility Complex (pMHC) molecules via T Cell Receptors (TCRs). Modeling binding between T cells and the pMHC complex is fundamental to understanding basic mechanisms of human immune response as well as in developing therapies. While transformer-based models such as TULIP have achieved impressive performance in this domain, their black-box nature precludes interpretability and thus limits a deeper mechanistic understanding of T cell response. Most existing post-hoc explainable AI (XAI) methods are confined to encoder-only, co-attention, or model-specific architectures and cannot handle encoder-decoder transformers used in TCR-pMHC modeling. To address this gap, we propose Quantifying Cross-Attention Interaction (QCAI), a new post-hoc method designed to interpret the cross-attention mechanisms in transformer decoders. Quantitative evaluation is a challenge for XAI methods; we have compiled TCR-XAI, a benchmark consisting of 274 experimentally determined TCR-pMHC structures to serve as ground truth for binding. Using these structures, we compute physical distances between relevant amino acid residues in the TCR-pMHC interaction region and evaluate how well our method and others estimate the importance of residues in this region across the dataset. We show that QCAI achieves state-of-the-art performance on both interpretability and prediction accuracy under the TCR-XAI benchmark.


Poster
P3-#1118
scDFM: Distributional Flow Matching Model for Robust Single-Cell Perturbation Prediction

Chenglei Yu ⋅ Chuanrui Wang ⋅ Bangyan Liao ⋅ Tailin Wu

A central goal in systems biology and drug discovery is to predict the transcriptional response of cells to perturbations. This task is challenging due to the noisy, sparse nature of single-cell measurements and the fact that perturbations often induce population-level shifts rather than changes in individual cells. Existing deep learning methods typically assume cell-level correspondences, limiting their ability to capture such global effects. We present **scDFM**, a generative framework based on conditional flow matching that models the full distribution of perturbed cells conditioned on control states. By incorporating an MMD objective, our method aligns perturbed and control populations beyond cell-level correspondences. To further improve robustness to sparsity and noise, we propose the Perturbation-Aware Differential Transformer architecture (PAD-Transformer), a backbone that leverages gene interaction graphs and differential attention to capture context-specific expression changes. **scDFM** outperforms prior methods across multiple genetic and drug perturbation benchmarks, excelling in both unseen and combinatorial settings. In the combinatorial setting, it reduces MSE by 19.6% over the strongest baseline. These results highlight the importance of distribution-level generative modeling for robust $\textit{in silico}$ perturbation prediction. The code is available at https://github.com/AI4Science-WestlakeU/scDFM.
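For readers unfamiliar with maximum mean discrepancy (MMD), the sketch below computes the standard biased estimate of squared MMD between two sample sets. It illustrates why MMD needs no pairing between individual samples. The RBF kernel, the fixed bandwidth, and the helper names are assumptions for illustration and are not taken from scDFM.

```python
import math
import random


def rbf(x, y, bandwidth):
    # Gaussian (RBF) kernel between two vectors.
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))


def mmd2(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets xs and ys."""
    def mean_kernel(a, b):
        return sum(rbf(u, v, bandwidth) for u in a for v in b) / (len(a) * len(b))
    # Only population-level averages appear: no sample in xs is matched
    # to a particular sample in ys.
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def sample_gaussian(n, dim, mean, rng):
    # Hypothetical stand-in for a population of expression profiles.
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    control = sample_gaussian(60, 2, 0.0, rng)
    similar = sample_gaussian(60, 2, 0.0, rng)
    shifted = sample_gaussian(60, 2, 3.0, rng)
    print(mmd2(control, similar), mmd2(control, shifted))
```

Two sets drawn from the same distribution give a value near zero, while a shifted population gives a clearly larger one, so the quantity can serve as a distribution-level training signal.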


Poster
P3-#1119
Fast Proteome-Scale Protein Interaction Retrieval via Residue-Level Factorization

Jianan Zhao ⋅ Zhihao Zhan ⋅ Narendra Chaudhary ⋅ Xinyu Yuan ⋅ Zuobai Zhang ⋅ Qian Cong ⋅ Jian Zhou ⋅ Sanchit Misra ⋅ Jian Tang

Protein-protein interactions (PPIs) are mediated at the residue level. Most sequence-based PPI models consider residue-residue interactions across two proteins, which can yield accurate interaction scores but are too slow to scale. At proteome scale, identifying candidate PPIs requires evaluating nearly *all possible protein pairs*. For $N$ proteins of average length $L$, exhaustive all-against-all search requires $\mathcal{O}(N^2L^2)$ computation, rendering conventional approaches computationally impractical. We introduce RaftPPI, a scalable framework that approximates residue-level PPI modeling while enabling efficient large-scale retrieval. RaftPPI represents residue interactions with a Gaussian kernel, approximated efficiently via structured random Fourier features, and applies a low-rank factorized attention mechanism that admits pooling into a compact embedding per protein. Each protein is encoded once into an indexable embedding, allowing approximate nearest-neighbor search to replace exhaustive pairwise scoring, reducing proteome-wide retrieval from *months* to *minutes* on a single GPU or CPU. On the human proteome with the D-SCRIPT dataset, RaftPPI retrieves the top 20% pairs from $\sim$200M candidate pairs in 5.7 minutes on an A100 GPU, or 3.3 minutes on an Intel Xeon 6980P CPU, covering 75.1% of the true interacting pairs, compared to 4.9 GPU months for the best prior method (61.2%). Across seven benchmarks with sequence- and degree-controlled splits, RaftPPI achieves state-of-the-art PPI classification and retrieval performance, while enabling residue-aware, retrieval-friendly screening at proteome scale.
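The idea of replacing all residue-pair kernel evaluations with one embedding per protein can be illustrated with plain random Fourier features. The sketch below is not RaftPPI: it uses ordinary i.i.d. Gaussian frequencies rather than the structured variant named in the abstract, and the sum pooling and all function names are assumptions chosen to show why a kernel sum over residue pairs factorizes into a single dot product.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth):
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))


def make_rff(dim, num_features, bandwidth, seed=0):
    """Return a feature map z with z(x) . z(y) approximating the Gaussian kernel."""
    rng = random.Random(seed)
    # Frequencies follow the kernel's spectral density, N(0, 1 / bandwidth^2).
    freqs = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
             for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def features(x):
        return [scale * math.cos(sum(w * xi for w, xi in zip(ws, x)) + b)
                for ws, b in zip(freqs, phases)]

    return features


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pool(residues, features):
    # Sum residue features into one vector per protein (hypothetical pooling).
    # By linearity, pool(A) . pool(B) equals the sum of z(a) . z(b) over all pairs.
    mapped = [features(r) for r in residues]
    return [sum(col) for col in zip(*mapped)]


if __name__ == "__main__":
    feats = make_rff(dim=3, num_features=8000, bandwidth=1.0)
    prot_a = [[0.3, -0.2, 0.5], [0.1, 0.4, -0.1], [0.0, 0.0, 0.0]]
    prot_b = [[0.2, 0.2, 0.2], [-0.5, 0.1, 0.3]]
    exact = sum(gaussian_kernel(u, v, 1.0) for u in prot_a for v in prot_b)
    print(exact, dot(pool(prot_a, feats), pool(prot_b, feats)))
```

Each protein is mapped once, so scoring a pair costs one dot product instead of a loop over all residue pairs, and the pooled vectors can be placed in a nearest-neighbor index.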

The transcriptional response to genetic perturbation reveals fundamental insights into complex cellular systems. While current approaches have made progress in predicting genetic perturbation responses, they provide limited biological understanding and cannot systematically refine existing knowledge. Overcoming these limitations requires an end-to-end integration of data-driven learning and existing knowledge. However, this integration is challenging due to inconsistencies between data and knowledge bases, such as noise, misannotation, and incompleteness. To address this challenge, we propose ALIGNED (Adaptive aLignment for Inconsistent Genetic kNowledgE and Data), a neuro-symbolic framework based on the Abductive Learning (ABL) paradigm. This end-to-end framework aligns neural and symbolic components and performs systematic knowledge refinement. We introduce a balanced consistency metric to evaluate the predictions' consistency against both data and knowledge. Our results show that ALIGNED outperforms state-of-the-art methods by achieving the highest balanced consistency, while also re-discovering biologically meaningful knowledge. Our work advances beyond existing methods to enable both the transparency and the evolution of mechanistic biological understanding.


Poster
P3-#1121
GALAX: Graph-Augmented Language Model for Explainable Reinforcement-Guided Subgraph Reasoning in Precision Medicine

Heming Zhang ⋅ Di Huang ⋅ Wenyu Li ⋅ Michael Province ⋅ Yixin Chen ⋅ Philip Payne ⋅ Fuhai Li

In precision medicine, quantitative multi-omic features, topological context, and textual biological knowledge play vital roles in identifying disease-critical signaling pathways and targets, guiding the discovery of novel therapeutics and effective treatment strategies. Existing pipelines capture only one or two of these: numerical omics ignore topological context, text-centric LLMs lack quantitative grounded reasoning, and graph-only models underuse rich node semantics and the generalization power of LLMs, thereby limiting mechanistic interpretability. Although Process Reward Models (PRMs) aim to guide reasoning in LLMs, they remain limited by coarse step definitions, unreliable intermediate evaluation, and vulnerability to reward hacking with added computational cost. These gaps motivate jointly integrating quantitative multi-omic signals, topological structure with node annotations, and literature-scale text via LLMs, using subgraph reasoning as the principal bridge linking numeric evidence, topological knowledge, and language context. To resolve this challenge, we propose GALAX (Graph Augmented LAnguage model with eXplainability), an innovative framework that integrates pretrained Graph Neural Networks (GNNs) into Large Language Models (LLMs) via reinforcement learning guided by a Graph Process Reward Model (GPRM), which generates disease-relevant subgraphs in a step-wise manner initiated by an LLM and iteratively evaluated by a pretrained GNN and schema-based rule check, enabling process-level supervision without explicit labels.
As an application, we also introduce Target-QA, a benchmark combining CRISPR-identified targets, multi-omic profiles, and biomedical graph knowledge across diverse cancer cell lines, which enables GNN pretraining for supervising step-wise graph construction and supports long-context reasoning over text-numeric graphs (TNGs), providing a scalable and biologically grounded framework for explainable, reinforcement-guided subgraph reasoning toward reliable and interpretable target and pathway discovery in precision medicine.


Poster
P3-#1122
One protein is all you need

Anton Bushuiev ⋅ Roman Bushuiev ⋅ Olga Pimenova ⋅ Nikola Zadorozhny ⋅ Raman Samusevich ⋅ Elisabet Manaskova ⋅ Rachel Seongeun Kim ⋅ Hannes Stärk ⋅ Jiri Sedlar ⋅ Martin Steinegger ⋅ Tomas Pluskal ⋅ Josef Sivic

Generalization beyond training data remains a central challenge in machine learning for biology. A common way to enhance generalization is self-supervised pre-training on large datasets. However, aiming to perform well on all possible proteins can limit a model’s capacity to excel on any specific one, whereas practitioners typically need accurate predictions for individual proteins they study, often not covered in training data. To address this limitation, we propose a method that enables self-supervised customization of protein language models to one target protein at a time, on the fly, and without assuming any additional data. We show that our Protein Test-Time Training (ProteinTTT) method consistently enhances generalization across different models, their sizes, and datasets. ProteinTTT improves structure prediction for challenging targets, achieves new state-of-the-art results on protein fitness prediction, and enhances function prediction on two tasks. We also demonstrate ProteinTTT on two challenging case studies. We show that customization via ProteinTTT enables more accurate antibody–antigen loop modeling and improves 19% of structures in the Big Fantastic Virus Database, delivering improved predictions where general-purpose AlphaFold2 and ESMFold struggle.


Poster
P3-#1123
Tokenization to Transfer: Do Genomic Foundation Models Learn Good Representations?

Kirill Vishniakov ⋅ Karthik Viswanathan ⋅ Aleksandr Medvedev ⋅ Praveenkumar Kanithi ⋅ Marco Pimentel ⋅ Ronnie Rajan ⋅ Shadab Khan

The success of Large Language Models has inspired the development of Genomic Foundation Models (GFMs) through similar pretraining techniques. However, the relationship between pretraining performance and effectiveness in downstream genomic tasks remains unclear. Additionally, the high computational cost of pretraining raises questions about its cost-efficiency. To assess the usefulness of pretraining in genomics, we evaluated seven different GFMs across 52 diverse genomic tasks, comparing them to their counterparts with randomly initialized weights. Across benchmarks, we find that randomly initialized models provide surprisingly strong baselines, and that tokenizer and architecture choices strongly shape both these baselines and the gains from pretraining. Specifically, character‑token models often match or exceed the performance of larger pretrained k‑mer or BPE models, whereas subword models appear to benefit from pretraining. We also find that the evaluated GFMs fail to capture clinically relevant genetic mutations, with embeddings and log‑likelihood ratios showing limited sensitivity to annotated variants. For the tasks we study, these results suggest that current NLP‑style pretraining strategies provide modest, tokenizer‑gated improvements over strong random baselines and motivate more biologically informed tokenization and variant‑aware objectives. Our code is available at https://github.com/m42-health/gfm-random-eval.


Poster
P3-#1124
From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity

Tianxi Wan ⋅ Jiaming Luo ⋅ Siyuan Chen ⋅ Kunyao Lan ⋅ Jianhua Chen ⋅ Haiyang Geng ⋅ Mengyue Wu

Psychiatric comorbidity is clinically significant yet challenging due to the complexity of multiple co-occurring disorders. To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation. We create 502 synthetic EMRs for common comorbid conditions using a pipeline that ensures clinical relevance and diversity. Our multi-agent framework transfers the clinical interview protocol into a hierarchical state machine and context tree, supporting over 130 diagnostic states while maintaining clinical standards. Through this rigorous process, we construct PsyCoTalk, the first large-scale dialogue dataset supporting comorbidity, containing 3,000 multi-turn diagnostic dialogues validated by psychiatrists. This dataset enhances diagnostic accuracy and treatment planning, offering a valuable resource for psychiatric comorbidity research. Compared to real-world clinical transcripts, PsyCoTalk exhibits high structural and linguistic fidelity in terms of dialogue length, token distribution, and diagnostic reasoning strategies. Licensed psychiatrists confirm the realism and diagnostic validity of the dialogues. This dataset enables the development and evaluation of models capable of multi-disorder psychiatric screening in a single conversational pass.


Poster
P3-#1125
Bridging Radiology and Pathology Foundation Models via Concept-Based Multimodal Co-Adaptation

Yihang Chen ⋅ Yanyan Huang ⋅ Fuying Wang ⋅ Maximus Yeung ⋅ Yuming Jiang ⋅ Shujun Wang ⋅ Lequan Yu

Pretrained medical foundation models (FMs) have shown strong generalization across diverse imaging tasks, such as disease classification in radiology and tumor grading in histopathology. While recent advances in parameter-efficient finetuning have enabled effective adaptation of FMs to downstream tasks, these approaches are typically designed for a single modality. In contrast, many clinical workflows rely on joint diagnosis from heterogeneous domains, such as radiology and pathology, where fully leveraging the representation capacity of multiple FMs remains an open challenge. To address this gap, we propose Concept Tuning and Fusing (CTF), a parameter-efficient framework that uses clinically grounded concepts as a shared semantic interface to enable cross-modal co-adaptation before fusion. By incorporating task-specific concepts that are relevant across modalities, CTF aligns radiology and pathology representations, thereby enhancing their complementarity and enabling interpretation. We further design a Global–Context–Shared Prompt (GCSP) mechanism, which employs a small set of learnable tokens to capture domain-specific priors, shared patient-level information, and cross-domain context. The resulting concept alignment scores from each modality are then fused to produce a final prediction. Extensive experiments demonstrate that CTF outperforms strong unimodal, latent-fusion, and adapter-based baselines (e.g., AUC 0.903 on TCGA-GBMLGG). Notably, CTF achieves these gains without finetuning the full FMs, requiring only 0.15% additional parameters, thus highlighting the effectiveness of concept-based multimodal co-adaptation. Our code is available at: https://github.com/HKU-MedAI/CTF.


Poster
P3-#1126
MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science

Ran Xu ⋅ Yuchen Zhuang ⋅ Yishan Zhong ⋅ Yue Yu ⋅ Zifeng Wang ⋅ Xiangru Tang ⋅ Hang Wu ⋅ May Dongmei Wang ⋅ Peifeng Ruan ⋅ Donghan Yang ⋅ Tao Wang ⋅ Guanghua Xiao ⋅ Xin Liu ⋅ Carl Yang ⋅ Yang Xie ⋅ Wenqi Shi

We introduce MedAgentGym, a scalable and interactive training environment designed to enhance coding-based biomedical reasoning capabilities in large language model (LLM) agents. MedAgentGym comprises 72,413 task instances across 129 categories derived from 12 authentic real-world biomedical scenarios. Tasks are encapsulated within executable sandbox environments, each featuring detailed task specifications, interactive feedback mechanisms, verifiable ground truth annotations, and scalable training trajectory generation. Extensive benchmarking of 29 LLMs reveals substantial performance disparities in biomedical data science between commercial and open-source LLMs. Leveraging efficient multi-threaded and multi-turn trajectory sampling in MedAgentGym, Med-Copilot achieves performance gains of +43.02% and +45.28% from offline and online reinforcement learning, respectively, demonstrating MedAgentGym as an effective training ground while establishing itself as a cost-effective, privacy-preserving alternative competitive with proprietary LLMs (gpt-4o). By offering a unified execution environment with a comprehensive benchmark and accessible, extensible training resources, MedAgentGym delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical data science.


Poster
P3-#1226
Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis

Pengfei ZHANG ⋅ Tianxin Xie ⋅ Minghao Yang ⋅ Li Liu

Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A²CA). Unlike static pipelines, Thinker-A²CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a modality-weaving Diagnoser that weaves clinical text with audio tokens via strategic global attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a flow matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for this work, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent.


Poster
P3-#1225
Structural Prognostic Event Modeling for Multimodal Cancer Survival Analysis

Yilan Zhang ⋅ Li Nanbo ⋅ Changchun Yang ⋅ Jürgen Schmidhuber ⋅ Xin Gao

The integration of histology images and gene profiles has shown great promise for improving survival prediction in cancer. However, current approaches often struggle to model intra- and inter-modal interactions efficiently and effectively due to the high dimensionality and complexity of the inputs. A major challenge is capturing critical prognostic events that, though few, underlie the complexity of the observed inputs and largely determine patient outcomes. These events, manifested as high-level structural signals such as spatial histologic patterns or pathway co-activations, are typically sparse, patient-specific, and unannotated, making them inherently difficult to uncover. To address this, we propose SlotSPE, a slot-based framework for structural prognostic event modeling. Specifically, inspired by the principle of factorial coding, we compress each patient’s multimodal inputs into compact, modality-specific sets of mutually distinctive slots using slot attention. By leveraging these slot representations as encodings for prognostic events, our framework enables both efficient and effective modeling of complex intra- and inter-modal interactions, while also facilitating seamless incorporation of biological priors that enhance prognostic relevance. Extensive experiments on ten cancer benchmarks show that SlotSPE outperforms existing methods in 8 out of 10 cohorts, achieving an overall improvement of 2.9%. It remains robust under missing genomic data and delivers markedly improved interpretability through structured event decomposition.


Poster
P3-#1224
M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding

Juntao Jiang ⋅ Jiangning Zhang ⋅ Yali bi ⋅ BAI Jinsheng ⋅ Weixuan Liu ⋅ Weiwei Jin ⋅ Zhucun Xue ⋅ Yong Liu ⋅ Xiaobin Hu ⋅ Shuicheng YAN

Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. An opaque process lacks reliable bases for judgment, making it difficult to assist doctors in diagnosis. To address this gap, we introduce a new M3CoTBench benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features (1) a diverse, multi-level difficulty dataset covering 24 examination types, (2) 13 varying-difficulty tasks, (3) a suite of CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency) tailored to clinical reasoning, and (4) a performance analysis of multiple MLLMs. M3CoTBench systematically evaluates CoT reasoning across diverse medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning, and aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare.


Poster
P3-#1223
MASAM: Multimodal Adaptive Sharpness-Aware Minimization for Heterogeneous Data Fusion

ZIjie CHEN ⋅ Kejing Yin ⋅ Wenfang Yao ⋅ William Kwok-wai Cheung ⋅ Jing Qin

Multimodal learning requires integrating heterogeneous modalities, such as structured records, visual imagery, and temporal signals. It has been revealed that this heterogeneity causes modality encoders to converge at different rates, making the multimodal learning imbalanced. We empirically observe that such an imbalance is related to the sharpness of the solution. Modality encoders that converge faster could be dragged into sharp regions due to inter-modal interference, degrading the generalization capability of the unimodal features learned. Sharpness-Aware Minimization (SAM) is effective in improving generalization via finding solutions in flat regions. However, its application in multimodal scenarios is challenging: 1) SAM overemphasizes the dominant modality, inducing misaligned perturbations in weaker modalities, and 2) the perturbation gradient calculation is affected by interference from other modalities. To address these issues, we propose Multimodal Adaptive Sharpness-Aware Minimization (MASAM), which optimizes different modalities based on their dominance. We design an Adaptive Perturbation Score (APS) using convergence speed and gradient alignment to identify dominant modalities for SAM application. Our Modality-Decoupled Perturbation Scaling (MDPS) then reduces inter-modal interference during optimization, better aligning each modality with shared information. Extensive empirical evaluations on five multimodal datasets and six downstream tasks demonstrate that MASAM consistently attains flatter solutions, achieves balanced multimodal learning, and subsequently surpasses state-of-the-art methods across diverse datasets and tasks. Code is available at https://github.com/Orange2107/MASAM-Multimodal-Adaptive-SAM.
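
For readers unfamiliar with the base optimizer, the following is a minimal sketch of one plain SAM update on a toy quadratic loss. It illustrates only standard single-objective SAM, not the APS or MDPS components of MASAM; the loss, the function names, and the values of `lr` and `rho` are illustrative assumptions.

```python
import numpy as np

def loss(w):
    # Toy quadratic with one sharp direction (curvature 10) and one flat direction (0.1).
    return 0.5 * (10.0 * w[0] ** 2 + 0.1 * w[1] ** 2)

def grad(w):
    # Analytic gradient of the toy loss above.
    return np.array([10.0 * w[0], 0.1 * w[1]])

def sam_step(w, lr=0.05, rho=0.05):
    # Step 1: move to the (first-order) worst-case point within an L2 ball of radius rho.
    g = grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Step 2: descend using the gradient evaluated at the perturbed point.
    g_sam = grad(w + eps)
    return w - lr * g_sam

w = np.array([1.0, 1.0])
for _ in range(200):
    w = sam_step(w)
final_loss = loss(w)
```

With `rho=0.0` the perturbation vanishes and `sam_step` reduces to an ordinary gradient-descent step, which is a convenient sanity check.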


Poster
P3-#1222
Unified Brain Surface and Volume Registration

Mazdak Abulnaga ⋅ Andrew Hoopes ⋅ Malte Hoffmann ⋅ Robin Magnet ⋅ Maks Ovsjanikov ⋅ Lilla Zollei ⋅ John Guttag ⋅ Bruce Fischl ⋅ Adrian Dalca

Accurate registration of brain MRI scans is fundamental for cross-subject analysis in neuroscientific studies. This involves aligning both the cortical surface of the brain and the interior volume. Traditional methods treat volumetric and surface-based registration separately, which often leads to inconsistencies that limit downstream analyses. We propose a deep learning framework, UCS, that registers 3D brain MRI images by jointly aligning both cortical and subcortical regions, through a unified volume-and-surface-based representation. Our approach leverages an intermediate spherical coordinate space to bridge anatomical surface topology with volumetric anatomy, enabling consistent and anatomically accurate alignment. By integrating spherical registration into the learning, our method ensures geometric coherence between volume and surface domains. In a series of experiments on both in-domain and out-of-domain datasets, our method consistently outperforms both classical and machine learning-based registration methods, improving the Dice score by up to 7 points while maintaining regular deformation fields. Additionally, it is orders of magnitude faster than the standard method for this task, and is simpler to use because it requires no additional inputs beyond an MRI scan. Its superior accuracy, fast inference, and ease of use set a new standard for joint cortical and subcortical registration.
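
As a reminder of the metric quoted above, here is a minimal sketch of the Dice overlap between two binary masks. The function name and the convention of returning 1.0 for two empty masks are assumptions; the abstract does not state its scale, and "points" presumably refers to a 0 to 100 scale, whereas this sketch returns values in [0, 1].

```python
import numpy as np

def dice_score(pred, target):
    # Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks.
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumed convention: two empty masks overlap perfectly.
        return 1.0
    return 2.0 * np.logical_and(pred, target).sum() / denom

# One overlapping voxel, two predicted, one in the reference: 2 * 1 / (2 + 1).
example = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```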


Poster
P3-#1221
Reliable Evaluation of MRI Motion Correction: Dataset and Insights

Kun Wang ⋅ Tobit Klug ⋅ Stefan Ruschke ⋅ Jan Kirschke ⋅ Reinhard Heckel

Correcting motion artifacts in scientific and medical imaging is important, as they significantly impact image quality. However, evaluating deep learning-based and classical motion correction methods remains fundamentally difficult due to the lack of accessible ground-truth target data. To address this challenge, we study three evaluation approaches: real-world evaluation based on reference scans, simulated motion, and reference-free evaluation, each with its merits and shortcomings. To enable evaluation with real-world motion artifacts, we release PMoC3D, a dataset consisting of unprocessed $\textbf{P}$aired $\textbf{Mo}$tion-$\textbf{C}$orrupted $\textbf{3D}$ brain MRI data. To advance evaluation quality, we introduce MoMRISim, a feature-space metric trained for evaluating motion reconstructions. We assess each evaluation approach and find real-world evaluation together with MoMRISim, while not perfect, to be most reliable. Evaluation based on simulated motion systematically exaggerates algorithm performance, and reference-free evaluation overrates oversmoothed deep learning outputs.


Poster
P3-#1319
Distilling and Adapting: A Topology-Aware Framework for Zero-Shot Interaction Prediction in Multiplex Biological Networks

Alana Deng ⋅ Sugitha Janarthanan ⋅ Yan Sun ⋅ Zihao Jing ⋅ Pingzhao Hu

Multiplex Biological Networks (MBNs), which represent multiple interaction types between entities, are crucial for understanding complex biological systems. Yet, existing methods often inadequately model multiplexity, struggle to integrate structural and sequence information, and face difficulties in zero-shot prediction for unseen entities with no prior neighbourhood information. To address these limitations, we propose a novel framework for zero-shot interaction prediction in MBNs by leveraging context-aware representation learning and knowledge distillation. Our approach leverages domain-specific foundation models to generate enriched embeddings, introduces a topology-aware graph tokenizer to capture multiplexity and higher-order connectivity, and employs contrastive learning to align embeddings across modalities. A teacher–student distillation strategy further enables robust zero-shot generalization. Experimental results demonstrate that our framework outperforms state-of-the-art methods in interaction prediction for MBNs, providing a powerful tool for exploring various biological interactions and advancing personalized therapeutics.


Poster
P3-#1220
Physics vs Distributions: Pareto Optimal Flow Matching with Physics Constraints

Giacomo Baldan ⋅ Qiang Liu ⋅ Alberto Guardone ⋅ Nils Thuerey

Physics-constrained generative modeling aims to produce high-dimensional samples that are both physically consistent and distributionally accurate, a task that remains challenging due to often conflicting optimization objectives. Recent advances in flow matching and diffusion models have enabled efficient generative modeling, but integrating physical constraints often degrades generative fidelity or requires costly inference-time corrections. Our work is the first to recognize the trade-off between distributional and physical accuracy. Based on the insight of inherently conflicting objectives, we introduce Physics-Based Flow Matching (PBFM), a method that enforces physical constraints at training time using conflict-free gradient updates and unrolling to mitigate Jensen's gap. Our approach avoids manual loss balancing and enables simultaneous optimization of generative and physical objectives. As a consequence, physics constraints do not impede inference performance. We benchmark our method across three representative PDE benchmarks. PBFM achieves a Pareto-optimal trade-off, competitive inference speed, and generalizes to a wide range of physics-constrained generative tasks, providing a practical tool for scientific machine learning. Code and datasets are available at https://github.com/tum-pbs/PBFM.


Poster
P3-#1219
CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers

Haining Pan ⋅ James Roggeveen ⋅ Erez Berg ⋅ Juan Alvarez ⋅ Debanjan Chowdhury ⋅ Surya Ganguli ⋅ Federico Ghimenti ⋅ Juraj Hasik ⋅ Henry Hunt ⋅ Hong-Chen Jiang ⋅ Mason Kamb ⋅ Ying-Jer Kao ⋅ Ehsan Khatami ⋅ Michael Lawler ⋅ Di Luo ⋅ Titus Neupert ⋅ Xiaoliang Qi ⋅ Michael Brenner ⋅ Eun-Ah Kim

Large language models (LLMs) have demonstrated remarkable progress in coding and mathematical problem-solving; however, evaluation on advanced research-level problems in the hard sciences remains scarce. To fill this gap, we present CMT-Benchmark, a dataset of 50 original problems covering condensed matter theory (CMT) at the level of an expert researcher. The solutions to these problems involve analytical and computational approaches commonly used in quantum many-body physics and classical statistical mechanics. The dataset has been designed and verified by a worldwide panel of expert researchers through a collaborative environment. Topics in the dataset include Hartree-Fock mean-field theory, exact diagonalization methods, quantum Monte Carlo sampling, density matrix renormalization group, quantum statistical mechanics, classical statistical mechanics, and model building. We evaluate different LLMs by programmatically checking LLM-generated solutions against expert-supplied ground truth. To verify LLM performance at scale, we developed an automated machine-grading pipeline suitable for advanced physics research problems. For example, we handle non-commuting operators that are essential for quantum many-body problems by symbolic manipulation and normal ordering. Our evaluations show that frontier models struggle with all of the problems in the dataset, highlighting a gap in the physical reasoning skills of current LLMs. Notably, experts identified strategies for creating increasingly difficult problems by interacting with the LLMs and exploiting common failure modes. While the highest-performing model, GPT5, correctly solves 30% of the problems, average performance across 17 models (GPT, Gemini, Claude, DeepSeek, and Llama classes) is only 11.4 ± 2.1%. Moreover, our benchmark contains 18 problems that not a single one of the 17 models considered here can correctly solve, and 26 problems that are solved by at most one model. These currently unsolvable problems span the fields of Quantum Monte Carlo, Variational Monte Carlo, and Density Matrix Renormalization Group. Furthermore, we illustrate how incorrect answers sometimes violate fundamental symmetries or have unphysical scaling dimensions. We believe that this benchmark set provides valuable guidance for the future development of language models, aiming to achieve the goal of AI research assistants and tutors.


Poster
P3-#1218
Si-GT: Fast Interconnect Signal Integrity Analysis for Integrated Circuit Design via Graph Transformers

Yuting Hu ⋅ Tarek Mohamed ⋅ Chenhui Xu ⋅ Hua Xiang ⋅ Hussam Amrouch ⋅ Gi-Joon Nam ⋅ Jinjun Xiong

Signal integrity issues present significant challenges in modern integrated circuit (IC) design, as crosstalk-induced delay variation and transient glitches caused by capacitive coupling among interconnects can severely impact IC functional correctness. Although circuit simulators like SPICE can deliver accurate signal integrity analysis, their computational cost becomes prohibitive for large-scale designs. In this paper, we propose Si-GT, a novel transformer-based model for fast and accurate signal integrity analysis in IC interconnects. Our model elaborates three key designs: (1) virtual NET token to encode net-specific signal characteristics and serve as net-wise representation, (2) mesh pattern encoding to embed high-order mesh structures at each node while distinguishing uncoupled wire segments, and (3) intra-inter net (IIN) attention mechanism to capture structures of signal propagation path and coupling connections. To support model training and evaluation, we construct the first interconnect signal integrity dataset comprising 200k delay examples and 187k glitch examples using SPICE simulations as the golden reference. Our experiments show that our Si-GT surpasses state-of-the-art graph neural network and graph transformer baselines with substantially reduced computation compared to SPICE, offering a scalable and effective solution for interconnect signal integrity analysis in IC design verification. We release the code, model, and datasets at https://github.com/xlab-ub/Si-GT.


Poster
P3-#1217
Robust and Interpretable Adaptation of Equivariant Materials Foundation Models via Sparsity-promoting Fine-tuning

Youngwoo Cho ⋅ Seunghoon Yi ⋅ Wooil Yang ⋅ Sungmo Kang ⋅ Young-Woo Son ⋅ Jaegul Choo ⋅ Joonseok Lee ⋅ Soo Kyung Kim ⋅ Hongkee Yoon

Pre-trained materials foundation models, or machine learning interatomic potentials, leverage general physicochemical knowledge to effectively approximate potential energy surfaces. However, they often require domain-specific calibration due to physicochemical diversity as well as mismatches between practical computational settings and those used in constructing the pre-training data. To address this, we propose a sparsity-promoting fine-tuning method that selectively updates model parameters by exploiting the structural properties of E(3)-equivariant materials foundation models. On energy and force prediction tasks across molecular and crystalline benchmarks, our method matches or surpasses full fine-tuning and equivariant low-rank adaptation while updating only ~3% of parameters, and in some cases as little as ~0.5%. Beyond energy and force calibration, we further demonstrate task generalizability by applying our method to magnetic moment prediction and magnetism-aware total energy modeling. Finally, analysis of sparsity patterns reveals physically interpretable signatures, such as enhanced $d$-orbital contributions in transition metal systems. Overall, our results establish sparsity-promoting fine-tuning as a flexible and interpretable method for domain specialization of equivariant materials foundation models.


Poster
P3-#1216
MAVEN: A Mesh-Aware Volumetric Encoding Network for Simulating 3D Flexible Deformation

Zhe Feng ⋅ Shilong Tao ⋅ Haonan Sun ⋅ Shaohan Chen ⋅ Zhanxing Zhu ⋅ Yunhuai Liu

Deep learning-based approaches, particularly graph neural networks (GNNs), have gained prominence in simulating flexible deformations and contacts of solids, due to their ability to handle unstructured physical fields and nonlinear regression on graph structures. However, existing GNNs commonly represent meshes with graphs built solely from vertices and edges. These approaches tend to overlook higher-dimensional spatial features, e.g., 2D facets and 3D cells, from the original geometry. As a result, it is challenging to accurately capture boundary representations and volumetric characteristics, though this information is critically important for modeling contact interactions and internal physical quantity propagation, particularly under sparse mesh discretization. In this paper, we introduce MAVEN, a mesh-aware volumetric encoding network for simulating 3D flexible deformation, which explicitly models geometric mesh elements of higher dimension to achieve a more accurate and natural physical simulation. MAVEN establishes learnable mappings among 3D cells, 2D facets, and vertices, enabling flexible mutual transformations. Explicit geometric features are incorporated into the model to alleviate the burden of implicitly learning geometric patterns. Experimental results show that MAVEN consistently achieves state-of-the-art performance across established datasets and a novel metal stretch-bending task featuring large deformations and prolonged contacts.


Poster
P3-#617
Panda: A pretrained forecast model for chaotic dynamics

Jeffrey Lai ⋅ Anthony Bao ⋅ William Gilpin

Chaotic systems are intrinsically sensitive to small errors, challenging efforts to construct predictive data-driven models of real-world dynamical systems such as fluid flows or neuronal activity. Prior efforts comprise either specialized models trained on individual time series, or foundation models trained on vast time series databases with little underlying dynamical structure. Motivated by dynamical systems theory, we present Panda, Patched Attention for Nonlinear Dynamics. We train Panda on a novel synthetic, extensible dataset of 20,000 chaotic dynamical systems that we discover using an evolutionary algorithm. Trained purely on simulated data, Panda exhibits emergent properties: zero-shot forecasting of unseen chaotic systems preserving both short-term accuracy and distributional measures, nonlinear resonance patterns in attention heads, and effective prediction of real-world experimental time series. Despite having been trained only on low-dimensional ordinary differential equations, Panda spontaneously develops the ability to predict partial differential equations without retraining. We also demonstrate a neural scaling law for differential equations, underscoring the potential of pretrained models for probing abstract mathematical domains like nonlinear dynamics.


Poster
P3-#1215
Generalized Spherical Neural Operators: Green’s Function Formulation

Hao Tang ⋅ Hao Chen ⋅ Chao Li

Neural operators offer powerful approaches for solving parametric partial differential equations, but extending them to spherical domains remains challenging due to the need to preserve intrinsic geometry while avoiding distortions that break rotational consistency. Existing spherical operators rely on rotational equivariance but often lack the flexibility for real-world complexity. We propose a generalized operator-design framework based on a designable Green’s function and its harmonic expansion, establishing a solid operator-theoretic foundation for spherical learning. Based on this, we propose an absolute and relative position-dependent Green’s function that enables a flexible balance of equivariance and invariance for real-world modeling. The resulting operator, Green's-function Spherical Neural Operator (GSNO) with a novel spectral learning method, can adapt to non-equivariant systems while retaining spherical geometry, spectral efficiency and grid invariance. To exploit GSNO, we develop SHNet, a hierarchical architecture that combines multi-scale spectral modeling with spherical up-down sampling, enhancing global feature representation. In evaluations on diffusion MRI, shallow water dynamics, and global weather forecasting, GSNO and SHNet consistently outperform state-of-the-art methods. The theoretical and experimental results position GSNO as a principled and generalized framework for spherical operator learning, bridging rigorous theory with real-world complexity. The code is available at: https://github.com/haot2025/GSNO.


Poster
P3-#1214
End-to-End Probabilistic Framework for Learning with Hard Constraints

Utkarsh Utkarsh ⋅ Danielle Maddix ⋅ Ruijun Ma ⋅ Michael W Mahoney ⋅ Bernie Wang

We present ProbHardE2E, a probabilistic forecasting framework that incorporates hard operational/physical constraints and provides uncertainty quantification. Our methodology uses a novel differentiable probabilistic projection layer (DPPL) that can be combined with a wide range of neural network architectures. DPPL allows the model to learn the system in an end-to-end manner, compared to other approaches where constraints are satisfied either through a post-processing step or at inference. ProbHardE2E optimizes a strictly proper scoring rule, without making any distributional assumptions on the target, which enables it to obtain robust distributional estimates (in contrast to existing approaches that generally optimize likelihood-based objectives, which can be biased by their distributional assumptions and model choices); and it can incorporate a range of non-linear constraints (increasing the power of modeling and flexibility). We apply ProbHardE2E in learning partial differential equations with uncertainty estimates and to probabilistic time-series forecasting, showcasing it as a broadly applicable general framework that connects these seemingly disparate domains. Our code is available at https://github.com/amazon-science/probharde2e.
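
To make the idea of a probabilistic projection concrete, the sketch below projects a Gaussian prediction onto a linear equality constraint `A x = b`. This is an illustrative assumption, not the DPPL itself: it covers only the Gaussian, linear-constraint special case, and the function name and example values are hypothetical.

```python
import numpy as np

def project_gaussian(mu, cov, A, b):
    # Covariance-weighted projection of N(mu, cov) onto the affine set {x : A x = b}.
    gain = cov @ A.T @ np.linalg.inv(A @ cov @ A.T)
    mu_proj = mu - gain @ (A @ mu - b)      # shift the mean so it satisfies the constraint
    P = np.eye(len(mu)) - gain @ A          # projector that removes constraint-violating directions
    cov_proj = P @ cov @ P.T                # no remaining variance along the rows of A
    return mu_proj, cov_proj

# Hypothetical example: three quantities whose sum must equal 5.
mu = np.array([1.0, 2.0, 3.0])
cov = np.diag([1.0, 0.5, 2.0])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([5.0])
mu_proj, cov_proj = project_gaussian(mu, cov, A, b)
```

Because every step is a differentiable matrix operation, a layer of this form can in principle sit at the end of a network and be trained end to end, so that every sample from the projected distribution satisfies the constraint by construction.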


Poster
P3-#1213
Incomplete Data, Complete Dynamics: A Diffusion Approach

Zihan Zhou ⋅ Chenguang Wang ⋅ Hongyi Ye ⋅ Yongtao Guan ⋅ Tianshu Yu

Learning physical dynamics from data is a fundamental challenge in machine learning and scientific modeling. Real-world observational data are inherently incomplete and irregularly sampled, posing significant challenges for existing data-driven approaches. In this work, we propose a principled diffusion-based framework for learning physical systems from incomplete training samples. To this end, our method strategically partitions each such sample into observed context and unobserved query components through a carefully designed splitting strategy, then trains a conditional diffusion model to reconstruct the missing query portions given available contexts. This formulation enables accurate imputation across arbitrary observation patterns without requiring complete data supervision. Specifically, we provide theoretical analysis demonstrating that our diffusion training paradigm on incomplete data achieves asymptotic convergence to the true complete generative process under mild regularity conditions. Empirically, we show that our method significantly outperforms existing baselines on synthetic and real-world physical dynamics benchmarks, including fluid flows and weather systems, with particularly strong performance in limited and irregular observation regimes. These results demonstrate the effectiveness of our theoretically principled approach for learning and imputing partially observed dynamics.


Poster
P3-#1212
Fast Convergence of Natural Gradient Descent for Over-parameterized Physics-Informed Neural Networks

Xianliang Xu ⋅ Wang Kong ⋅ Jiaheng Mao ⋅ Zhongyi Huang ⋅ Ye Li

In the context of over-parameterization, there is a line of work demonstrating that randomly initialized (stochastic) gradient descent (GD) converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. However, the convergence rate of GD for training two-layer neural networks exhibits poor dependence on the sample size and the Gram matrix, leading to a slow training process. In this paper, we show that for training two-layer $\text{ReLU}^3$ Physics-Informed Neural Networks (PINNs), the learning rate can be improved from the smallest eigenvalue of the limiting Gram matrix to the reciprocal of the largest eigenvalue, implying that GD actually enjoys a faster convergence rate. Despite such improvements, the convergence rate is still tied to the least eigenvalue of the Gram matrix, leading to slow convergence. We then develop the positive definiteness of Gram matrices with general smooth activation functions and provide the convergence analysis of natural gradient descent (NGD) in training two-layer PINNs, demonstrating that the maximal learning rate can be $\mathcal{O}(1)$ and at this rate, the convergence rate is independent of the Gram matrix. In particular, for smooth activation functions, the convergence rate of NGD is quadratic. Numerical experiments are conducted to verify our theoretical results.
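
The contrast between the two optimizers can be seen on a toy over-parameterized linear least-squares problem, sketched below. This is not a PINN and not the paper's analysis: a linear model stands in for the network, and a Gauss-Newton-style preconditioned step stands in for NGD; the problem sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 20                       # fewer samples than parameters (over-parameterized)
X = rng.normal(size=(n, p))        # Jacobian of the linear model f(w) = X w
y = rng.normal(size=n)
w0 = np.zeros(p)
gram = X @ X.T                     # n x n Gram matrix

def residual(w):
    return X @ w - y

def gd_step(w, lr):
    # Plain gradient descent on 0.5 * ||X w - y||^2.
    return w - lr * X.T @ residual(w)

def ngd_step(w, lr=1.0):
    # Preconditioned step: the residual is rescaled by the inverse Gram matrix.
    return w - lr * X.T @ np.linalg.solve(gram, residual(w))

# GD needs lr on the order of 1 / lambda_max(gram) to stay stable.
lr_gd = 1.0 / np.linalg.eigvalsh(gram).max()
gd_res = np.linalg.norm(residual(gd_step(w0, lr_gd)))
ngd_res = np.linalg.norm(residual(ngd_step(w0)))
```

In this linear toy case the preconditioned step with learning rate 1 removes the residual in a single update regardless of the Gram spectrum, whereas the GD step contracts the residual at a rate governed by the ratio of the smallest to the largest Gram eigenvalue.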


Poster
P3-#1211
Enhancing Stability of Physics-Informed Neural Network Training Through Saddle-Point Reformulation

Dmitry Bylinkin ⋅ Mikhail Aleksandrov ⋅ Savelii Chezhegov ⋅ Aleksandr Beznosikov

Physics-informed neural networks (PINNs) have gained prominence in recent years and are now effectively used in a number of applications. However, their performance remains unstable due to the complex landscape of the loss function. To address this issue, we reformulate PINN training as a nonconvex-strongly concave saddle-point problem. After establishing the theoretical foundation for this approach, we conduct an extensive experimental study, evaluating its effectiveness across various tasks and architectures. Our results demonstrate that the proposed method outperforms the current state-of-the-art techniques.


Poster
P3-#1210
DGNet: Discrete Green Networks for Data-Efficient Learning of Spatiotemporal PDEs

Yingjie Tan ⋅ Quanming Yao ⋅ Yaqing Wang

Spatiotemporal partial differential equations (PDEs) underpin a wide range of scientific and engineering applications. Neural PDE solvers offer a promising alternative to classical numerical methods. However, existing approaches typically require large numbers of training trajectories, while high-fidelity PDE data are expensive to generate. Under limited data, their performance degrades substantially, highlighting their low data efficiency. A key reason is that PDE dynamics embody strong structural inductive biases that are not explicitly encoded in neural architectures, forcing models to learn fundamental physical structure from data. A particularly salient manifestation of this inefficiency is poor generalization to unseen source terms. In this work, we revisit Green’s function theory, a cornerstone of PDE theory, as a principled source of structural inductive bias for PDE learning. Based on this insight, we propose DGNet, a discrete Green network for data-efficient learning of spatiotemporal PDEs. The key idea is to transform the Green’s function into a graph-based discrete formulation and to embed the superposition principle into the hybrid physics–neural architecture, which reduces the burden of learning physical priors from data, thereby improving sample efficiency. Across diverse spatiotemporal PDE scenarios, DGNet consistently achieves state-of-the-art accuracy using only tens of training trajectories. Moreover, it exhibits robust zero-shot generalization to unseen source terms, serving as a stress test that highlights its data-efficient structural design.
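
The role of a discrete Green's function and of superposition can be illustrated on the 1D Poisson problem -u'' = f with zero boundary values, as sketched below. This is an illustrative assumption only: it uses a dense finite-difference matrix on a regular grid, not the graph-based formulation or the learned components of DGNet.

```python
import numpy as np

n = 99                                   # number of interior grid points
h = 1.0 / (n + 1)
x = np.linspace(h, 1.0 - h, n)

# Finite-difference operator for -d^2/dx^2 with zero Dirichlet boundaries.
L = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
G = np.linalg.inv(L)                     # discrete Green's function: column j is the response to a point source at x[j]

f1 = np.pi**2 * np.sin(np.pi * x)        # exact solution: sin(pi x)
f2 = np.ones(n)                          # exact solution: x (1 - x) / 2
u1 = G @ f1
u2 = G @ f2
u_sum = G @ (f1 + f2)                    # superposition: response to the summed source
```

Once `G` is known, the response to any new source term is a single matrix-vector product, which is the structural reason a Green's-function model can generalize to source terms it has not seen.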


Poster
P3-#1209
A Spectral-Grassmann Wasserstein metric for operator representations of dynamical systems

Thibaut Germain ⋅ Rémi Flamary ⋅ Vladimir Kostic ⋅ Karim Lounici

The geometry of dynamical systems estimated from trajectory data is a major challenge for machine learning applications. Koopman and transfer operators provide a linear representation of nonlinear dynamics through their spectral decomposition, offering a natural framework for comparison. We propose a novel approach that represents each system as a distribution over its joint operator eigenvalues and spectral projectors and defines a metric between systems leveraging optimal transport. The proposed metric is invariant to the sampling frequency of trajectories. It is also computationally efficient, supported by finite-sample convergence guarantees, and enables the computation of Fréchet means, providing interpolation between dynamical systems. Experiments on simulated and real-world datasets show that our approach consistently outperforms standard operator-based distances in machine learning applications, including dimensionality reduction and classification, and provides meaningful interpolation between dynamical systems.

Data assimilation techniques are crucial for accurately tracking complex dynamical systems by integrating observational data with numerical forecasts. Recently, score-based data assimilation methods emerged as powerful tools for high-dimensional and nonlinear data assimilation. However, these methods still incur substantial computational costs due to the need for expensive forward simulations. In this work, we propose LD-EnSF, a novel score-based data assimilation method that fully eliminates the need for full-space simulations by evolving dynamics directly in a compact latent space. Our method incorporates improved Latent Dynamics Networks (LDNets) to learn accurate surrogate dynamics and introduces a history-aware LSTM encoder to effectively process sparse and irregular observations. By operating entirely in the latent space, LD-EnSF achieves speedups of orders of magnitude over existing methods while maintaining high accuracy and robustness. We demonstrate the effectiveness of LD-EnSF on several challenging high-dimensional benchmarks with highly sparse (in both space and time) and noisy observations.


Poster
P3-#1207
PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning

Wanjia Zhao ⋅ Qinwei (Martin) Ma ⋅ Jingzhe Shi ⋅ Shirley Wu ⋅ Jiaqi Han ⋅ Yijia Xiao ⋅ Si-Yuan Chen ⋅ Xiao Luo ⋅ Ludwig Schmidt ⋅ James Y Zou

Benchmarks for competition-style reasoning have advanced evaluation in mathematics and programming, yet physics remains comparatively underexplored. Most existing physics benchmarks evaluate only final answers, which fails to capture reasoning processes, while recent stepwise methods rely on heuristic LLM-as-judge scoring or restrictive linear assumptions, limiting reliability and diagnostic validity. We introduce PRISM-Physics, a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas, explicitly encoding causal dependencies among intermediate steps to enable fine-grained, interpretable, and theoretically grounded scoring. We prove the optimality of the DAG representation and the corresponding scoring policy. Combined with a fully rule-based method that we developed for symbolic formula equivalence matching, the framework ensures consistent validation across diverse formulations without heuristic judgments. Results show that our evaluation framework is more aligned with human experts' scoring. Experiments on state-of-the-art LLMs reveal persistent reasoning failures in physics, while step-level scoring offers both diagnostic insight and rich signals for later training. By combining structural rigor, theoretical guarantees, and symbolic validation, PRISM-Physics provides a principled foundation for advancing process-level evaluation and guiding the development of models with deeper scientific reasoning capabilities.


Poster
P3-#1206
Towards a Certificate of Trust: Task-Aware OOD Detection for Scientific AI

Bogdan Raonic ⋅ Siddhartha Mishra ⋅ Samuel Lanthaler

Data-driven models are increasingly adopted in critical scientific fields like weather forecasting and fluid dynamics. These methods can fail on out-of-distribution (OOD) data, but detecting such failures in regression tasks is an open challenge. We propose a new OOD detection method based on estimating joint likelihoods using a score-based diffusion model. This approach considers not just the input but also the regression model's prediction, providing a task-aware reliability score. Across numerous scientific datasets, including PDE datasets, satellite imagery and brain tumor segmentation, we show that this likelihood strongly correlates with prediction error. Our work provides a foundational step towards building a verifiable 'certificate of trust', thereby offering a practical tool for assessing the trustworthiness of AI-based scientific predictions.


Poster
P3-#102
ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning

Zhao Jin ⋅ Zhengping Che ⋅ Tao Li ⋅ Zhen Zhao ⋅ Kun Wu ⋅ Yuheng Zhang ⋅ Yinuo Zhao ⋅ Zehui Liu ⋅ Qiang Zhang ⋅ Xiaozhu Ju ⋅ Jing Tian ⋅ Yousong Xue ⋅ Jian Tang

Robot learning increasingly relies on simulation to advance complex abilities such as dexterous manipulation and precise interaction, necessitating high-quality digital assets to bridge the sim-to-real gap. However, existing open-source articulated-object datasets for simulation are limited by insufficient visual realism and low physical fidelity, which hinders their utility for training models to master robotic tasks in the real world. To address these challenges, we introduce ArtVIP, a comprehensive open-source dataset comprising high-quality digital-twin articulated objects, accompanied by indoor-scene assets. Crafted by professional 3D modelers adhering to unified standards, ArtVIP ensures visual realism through precise geometric meshes and high-resolution textures, while physical fidelity is achieved via fine-tuned dynamic parameters. Meanwhile, the dataset pioneers embedded modular interaction behaviors within assets and pixel-level affordance annotations. Feature-map visualization and optical motion capture are employed to quantitatively demonstrate ArtVIP’s visual and physical fidelity, and its applicability is validated through imitation learning and reinforcement learning experiments. Provided in USD format with detailed production guidelines, ArtVIP is fully open-source, benefiting the research community and advancing robot learning research.


Poster
P3-#1205
Tensor learning with orthogonal, Lorentz, and symplectic symmetries

Wilson Gregory ⋅ Josué Tonelli-Cueto ⋅ Nicholas Marshall ⋅ Andrew S Lee ⋅ Soledad Villar

Tensors are a fundamental data structure in many scientific contexts, such as time series analysis, materials science, and physics, among many others. Improving our ability to produce and handle tensors is essential to efficiently address problems in these domains. In this paper, we show how to exploit the underlying symmetries of functions that map tensors to tensors. More concretely, we develop universally expressive equivariant machine learning architectures on tensors that exploit the fact that, in many cases, these tensor functions are equivariant with respect to the diagonal action of the orthogonal, Lorentz, and/or symplectic groups. We showcase our results on three problems coming from materials science, theoretical computer science, and time series analysis. For time series, we combine our method with the increasingly popular path signatures approach, which is also invariant with respect to reparameterizations. Our numerical experiments show that our equivariant models perform better than corresponding non-equivariant baselines.


Blog Track Poster
P3-#1204
Discretisation invariance

Vladimir Fanaskov ⋅ Ivan Oseledets

Discretisation invariance, a recent innovation in scientific machine learning, is a requirement that ensures an architecture can process inputs of different resolutions. In this post, we formally define this property, provide examples, generate datasets, train architectures, and discuss whether discretisation invariance is living up to its promise.


Poster
P3-#1203
Contact-guided Real2Sim from Monocular Video with Planar Scene Primitives

Zihan Wang ⋅ Jiashun Wang ⋅ Jeff Tan ⋅ Yiwen Zhao ⋅ Jessica Hodgins ⋅ Shubham Tulsiani ⋅ Deva Ramanan

We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Prior work on joint human-scene reconstruction relies on data-driven priors and joint optimization with no physics in the loop, or recovers noisy geometry with artifacts that cause motion-tracking policies with scene interactions to fail. In contrast, our key insight is to fit simulation-ready convex planar primitives to a depth-based point cloud reconstruction of the scene via a simple clustering pipeline over depth, normals, and flow. To reconstruct scene geometry that might be occluded during interactions, we use human-scene contact modeling (e.g., using human posture to reconstruct the occluded seat of a chair). Finally, we ensure that human and scene reconstructions are physically plausible by using them to drive a humanoid controller via reinforcement learning. Our approach reduces motion-tracking failure rates from 55.2% to 6.9% on human-centric video benchmarks (EMDB, PROX), while delivering 43% faster RL simulation throughput. This demonstrates CRISP's ability to generate physically valid human motion and interaction environments at scale, advancing real-to-sim applications for robotics. Code and interactive demos are available at our project website: https://crisp-real2sim.github.io/CRISP-Real2Sim


Poster
P3-#1202
Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

Moo Jin Kim ⋅ Yihuai Gao ⋅ Tsung-Yi Lin ⋅ Yen-Chen Lin ⋅ Yunhao Ge ⋅ Grace Lam ⋅ Percy Liang ⋅ Shuran Song ⋅ Ming-Yu Liu ⋅ Chelsea Finn ⋅ Jinwei Gu

Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on the robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process, harnessing the model's pretrained priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test-time planning of action trajectories with higher likelihood of success. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score in challenging real-world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model-based policies, and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function and leverage model-based planning to achieve even higher success rates in challenging tasks. We release code, models, and training data at https://research.nvidia.com/labs/dir/cosmos-policy/.


Poster
P3-#1201
OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning

Fanqi Lin ⋅ Ruiqian Nai ⋅ Yingdong Hu ⋅ Jiacheng You ⋅ Junming Zhao ⋅ Yang Gao

General-purpose robots capable of performing diverse tasks require synergistic reasoning and acting capabilities. However, recent dual-system approaches, which separate high-level reasoning from low-level acting, often suffer from challenges such as limited mutual understanding of capabilities between systems and latency issues. This paper introduces OneTwoVLA, a single unified vision-language-action model that can perform both acting (System One) and reasoning (System Two). Crucially, OneTwoVLA adaptively switches between two modes: explicitly reasoning at critical moments during task execution, and generating actions based on the most recent reasoning at other times. To further unlock OneTwoVLA's reasoning and generalization capabilities, we design a scalable pipeline for synthesizing embodied reasoning-centric vision-language data, used for co-training with robot data. We validate OneTwoVLA's effectiveness through extensive experiments, highlighting its superior performance across four key capabilities: long-horizon task planning, error detection and recovery, natural human-robot interaction, and generalizable visual grounding, enabling the model to perform long-horizon, highly dexterous manipulation tasks such as making hotpot or mixing cocktails.


Poster
P3-#1301
RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots

Soroush Nasiriany ⋅ Sep Nasiriany ⋅ Abhiram Maddukuri ⋅ Yuke Zhu

Recent advances in robot learning have accelerated progress toward generalist robots that can perform everyday tasks in human environments. Yet it remains difficult to gauge how close we are to this vision. The field lacks a reproducible, large-scale benchmark for systematic evaluation. To fill this gap, we present RoboCasa365, a comprehensive simulation benchmark for household mobile manipulation. Built on the RoboCasa platform, RoboCasa365 introduces 365 everyday tasks across 2,500 diverse kitchen environments, with over 600 hours of human demonstration data and over 1,600 hours of synthetically generated demonstration data, making it one of the most diverse and large-scale resources for studying generalist policies. RoboCasa365 is designed to support systematic evaluations for different problem settings, including multi-task learning, robot foundation model training, and lifelong learning. We conduct extensive experiments on this benchmark with state-of-the-art methods and analyze the impacts of task diversity, dataset scale, and environment variation on generalization. Our results provide new insights into what factors most strongly affect the performance of generalist robots and inform strategies for future progress in the field.


Poster
P3-#1302
Vid2World: Crafting Video Diffusion Models to Interactive World Models

Siqiao Huang ⋅ Jialong Wu ⋅ Qixing Zhou ⋅ Shangchen Miao ⋅ Mingsheng Long

World models, which predict future transitions from past observation and action sequences, have shown great promise for improving data efficiency in sequential decision-making. However, existing world models often require extensive domain-specific training and still produce low-fidelity, coarse predictions, limiting their usefulness in complex environments. In contrast, video diffusion models trained on large-scale internet data have demonstrated impressive capabilities in generating high-quality videos that capture diverse real-world dynamics. In this work, we present Vid2World, a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models. To bridge the gap, Vid2World systematically explores video diffusion causalization, reshaping both the architecture and training objective of pre-trained models to enable autoregressive generation. Additionally, it incorporates a causal action guidance mechanism to enhance action controllability in the resulting interactive world models. Extensive experiments across multiple domains, including robot manipulation, 3D game simulation, and open-world navigation, demonstrate that our method offers a scalable and effective pathway for repurposing highly capable video diffusion models into interactive world models.


Poster
P3-#1303
Flash-Mono: Feed-Forward Accelerated Gaussian Splatting Monocular SLAM

Zicheng Zhang ⋅ Ke Wu ⋅ Xiangting Meng ⋅ Keyu Liu ⋅ Jieru Zhao ⋅ Wenchao Ding

Monocular 3D Gaussian Splatting SLAM suffers from critical limitations in time efficiency, geometric accuracy, and multi-view consistency. These issues stem from the time-consuming Train-from-Scratch optimization and the lack of inter-frame scale consistency from single-frame geometry priors. We contend that a feed-forward paradigm, leveraging multi-frame context to predict Gaussian attributes directly, is crucial for addressing these challenges. We present Flash-Mono, a system composed of three core modules: a feed-forward prediction frontend, a 2D Gaussian Splatting mapping backend, and an efficient hidden-state-based loop closure module. We train a recurrent feed-forward frontend model that progressively aggregates multi-frame visual features into a hidden state via cross attention and jointly predicts camera poses and per-pixel Gaussian properties. By directly predicting Gaussian attributes, our method bypasses the burdensome per-frame optimization required in optimization-based GS-SLAM, achieving a 10x speedup while ensuring high-quality rendering. The power of our recurrent architecture extends beyond efficient prediction. The hidden states act as compact submap descriptors, facilitating efficient loop closure and global $\mathrm{Sim}(3)$ optimization to mitigate the long-standing challenge of drift. For enhanced geometric fidelity, we replace conventional 3D Gaussian ellipsoids with 2D Gaussian surfels. Extensive experiments demonstrate that Flash-Mono achieves state-of-the-art performance in both tracking and mapping quality, highlighting its potential for embodied perception and real-time reconstruction applications.


Poster
P3-#1304
BFM-Zero: A Promptable Behavioral Foundation Model for Humanoid Control Using Unsupervised Reinforcement Learning

Yitang Li ⋅ Zhengyi Luo ⋅ Tonghe Zhang ⋅ Cunxi Dai ⋅ Anssi Kanervisto ⋅ Andrea Tirinzoni ⋅ Haoyang Weng ⋅ Kris Kitani ⋅ Mateusz Guzek ⋅ Ahmed Touati ⋅ Alessandro Lazaric ⋅ Matteo Pirotta ⋅ Guanya Shi

Building Behavioral Foundation Models (BFMs) for humanoid robots has the potential to unify diverse control tasks under a single, promptable generalist policy. However, existing approaches are either exclusively deployed on simulated humanoid characters, or specialized to specific tasks such as tracking. We propose BFM-Zero, a framework that learns an effective shared latent representation that embeds motions, goals, and rewards into a common space, enabling a single policy to be prompted for multiple downstream tasks without retraining. This well-structured latent space in BFM-Zero enables versatile and robust whole-body skills on a Unitree G1 humanoid in the real world, via diverse inference methods, including zero-shot motion tracking, goal reaching, and reward inference, and few-shot optimization-based adaptation. Unlike prior on-policy reinforcement learning (RL) frameworks, BFM-Zero builds upon recent advancements in unsupervised RL and Forward-Backward (FB) models, which offer an objective-centric, explainable, and smooth latent representation of whole-body motions. We further extend BFM-Zero with critical reward shaping, domain randomization, and history-dependent asymmetric learning to bridge the sim-to-real gap. These key design choices are quantitatively ablated in simulation. A first-of-its-kind model, BFM-Zero establishes a step toward scalable, promptable behavioral foundation models for whole-body humanoid control. Webpage: https://lecar-lab.github.io/BFM-Zero/


Poster
P3-#1305
TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models

HoKyun Im ⋅ Euijin Jeong ⋅ Andrey Kolobov ⋅ Jianlong Fu ⋅ Youngwoon Lee

Vision-language-action models (VLAs) trained on large-scale robotic datasets have demonstrated strong performance on manipulation tasks, including bimanual tasks. However, because most public datasets focus on single-arm demonstrations, adapting VLAs for bimanual tasks typically requires substantial additional bimanual data and fine-tuning. To address this challenge, we introduce TwinVLA, a modular framework that composes two copies of a pretrained single-arm VLA into a coordinated bimanual VLA. Unlike monolithic cross-embodiment models trained on mixtures of single-arm and bimanual data, TwinVLA improves both data efficiency and performance by composing pretrained single-arm policies. Across diverse bimanual tasks in real-world and simulation settings, TwinVLA outperforms a comparably sized monolithic RDT-1B model without requiring any bimanual pretraining. Furthermore, it narrows the gap to the state-of-the-art model $\pi_0$, which relies on extensive proprietary bimanual data and compute. These results establish our modular composition approach as a data-efficient and scalable path toward high-performance bimanual manipulation, leveraging public single-arm data.


Poster
P3-#1306
SLAP: Shortcut Learning for Abstract Planning

Y. Isabel Liu ⋅ Bowen Li ⋅ Benjamin Eysenbach ⋅ Tom Silver

Long-horizon decision-making with sparse rewards and continuous states and actions remains a fundamental challenge in AI and robotics. Task and motion planning (TAMP) is a model-based framework that addresses this challenge by planning hierarchically with abstract actions (options). These options are manually defined, limiting the agent to behaviors that we as human engineers know how to program (pick, place, move). In this work, we propose Shortcut Learning for Abstract Planning (SLAP), a method that leverages existing TAMP options to automatically discover new ones. Our key idea is to use model-free reinforcement learning (RL) to learn shortcuts in the abstract planning graph induced by the existing options in TAMP. Without any additional assumptions or inputs, shortcut learning leads to shorter solutions than pure planning, and higher task success rates than flat and hierarchical RL. Qualitatively, SLAP discovers dynamic physical improvisations (e.g., slap, wiggle, wipe) that differ significantly from the manually defined ones. In experiments in four simulated robotic environments, we show that SLAP solves and generalizes to a wide range of tasks, reducing overall plan lengths by over 50% and consistently outperforming planning and RL baselines.


Poster
P3-#1307
UniHM: Unified Dexterous Hand Manipulation with Vision Language Model

Zhenhao Zhang ⋅ Jiaxin Liu ⋅ Ye Shi ⋅ Jingya Wang

Planning physically feasible dexterous hand manipulation is a central challenge in robotic manipulation and Embodied AI. Prior work typically relies on object-centric cues or precise hand-object interaction sequences, foregoing the rich, compositional guidance of open-vocabulary instruction. We introduce UniHM, the first framework for unified dexterous hand manipulation guided by free-form language commands. We propose a Unified Hand-Dexterous Tokenizer that maps heterogeneous dexterous-hand morphologies into a single shared codebook, improving cross-dexterous hand generalization and scalability to new morphologies. Our vision language action model is trained solely on human-object interaction data, eliminating the need for massive real-world teleoperation datasets, and demonstrates strong generalizability in producing human-like manipulation sequences from open-ended language instructions. To ensure physical realism, we introduce a physics-guided dynamic refinement module that performs segment-wise joint optimization under generative and temporal priors, yielding smooth and physically feasible manipulation sequences. Across multiple datasets and real-world evaluations, UniHM attains state-of-the-art results on both seen and unseen objects and trajectories, demonstrating strong generalization and high physical feasibility.


Poster
P3-#1308
DEAS: DEtached value learning with Action Sequence for Scalable Offline RL

Changyeon Kim ⋅ Haeone Lee ⋅ Younggyo Seo ⋅ Kimin Lee ⋅ Yuke Zhu

Offline reinforcement learning (RL) presents an attractive paradigm for training intelligent agents without expensive online interactions. However, current approaches still struggle with complex, long-horizon sequential decision making. In this work, we introduce DEtached value learning with Action Sequence (DEAS), a simple yet effective offline RL framework that leverages action sequences for value learning. These temporally extended actions provide richer information than single-step actions, enabling reduction of the effective planning horizon by considering longer sequences at once. However, directly adopting such sequences in actor-critic algorithms introduces excessive value overestimation, which we address through detached value learning that steers value estimates toward in-distribution actions that achieve high returns in the offline dataset. We demonstrate that DEAS consistently outperforms baselines on complex, long-horizon tasks from OGBench and can be applied to enhance the performance of large-scale Vision-Language-Action models that predict action sequences, significantly boosting performance in both RoboCasa Kitchen simulation tasks and real-world manipulation tasks.


Poster
P3-#1309
Compositional Diffusion with Guided search for Long-Horizon Planning

Utkarsh Mishra ⋅ David He ⋅ Yongxin Chen ⋅ Danfei Xu

Generative models have emerged as powerful tools for planning, with compositional approaches offering particular promise for modeling long-horizon task distributions by composing together local, modular generative models. This compositional paradigm spans diverse domains, from multi-step manipulation planning to panoramic image synthesis to long video generation. However, compositional generative models face a critical challenge: when local distributions are multimodal, existing composition methods average incompatible modes, producing plans that are neither locally feasible nor globally coherent. We propose Compositional Diffusion with Guided Search (CDGS), which addresses this mode averaging problem by embedding search directly within the diffusion denoising process. Our method explores diverse combinations of local modes through population-based sampling, prunes infeasible candidates using likelihood-based filtering, and enforces global consistency through iterative resampling between overlapping segments. CDGS matches oracle performance on seven robot manipulation tasks, outperforming baselines that lack compositionality or require long-horizon training data. The approach generalizes across domains, enabling coherent text-guided panoramic images and long videos through effective local-to-global message passing. More details: https://cdgsearch.github.io/


Poster
P3-#1310
OccDriver: Future Occupancy Guided Dual-branch Trajectory Planner in Autonomous Driving

Zhao Huang ⋅ Bowen Zhang ⋅ Zhongzhu Li ⋅ Di Lin

Trajectory planning for autonomous driving is challenging due to agents' behavioral uncertainty and intricate multi-agent interaction modeling. Most existing studies generate trajectories without explicitly exploiting possible scene evolution, while world models predict consequences from ego behavior, enabling more informed planning decisions. Inspired by the world model, we propose OccDriver, a novel rasterized-to-vectorized dual-branch framework for trajectory planning. This pipeline performs a coarse-to-fine trajectory decoding process: the vectorized branch first generates multimodal coarse trajectories; the rasterized branch then predicts future scene evolutions conditioned on each coarse trajectory via occupancy flow prediction; lastly, the vectorized branch leverages the intuitive future interaction evolution of each modality from the rasterized branch and produces refined trajectories. Several cross-modality (occupancy and trajectory) losses are further introduced to improve the consistency between trajectory and occupancy prediction. Additionally, we apply a contingency objective in occupancy space, considering both marginal and joint occupancy distributions in different planning scopes. Our model is assessed on the large-scale real-world nuPlan dataset and its associated planning benchmark. Experiments show that OccDriver achieves state-of-the-art closed-loop performance in both the Non-Reactive and Reactive settings.


Poster
P3-#1311
BOLT: Decision-Aligned Distillation and Budget-Aware Routing for Constrained Multimodal QA on Robots

Tengjun Ni ⋅ Xin Yuan ⋅ Shenghong Li ⋅ Kai Wu ⋅ Ren Liu ⋅ Wei Ni ⋅ Wenjie Zhang

Robotic systems can require multimodal reasoning under stringent constraints of latency, memory, and energy. Standard instruction tuning and token-level distillation fail to deliver decision quality, reliability, and interpretability under these constraints. We introduce BOLT, a decision-aligned distillation and budget-aware routing framework that treats multi-choice prediction as a decision surface to be aligned during training and selectively refined at inference. During training, BOLT introduces Option-level Decision Distillation to align student models directly on the decision surface of multi-choice answers, thereby eliminating prompt artifacts, improving calibration, and optimizing the exact output space. At inference, BOLT activates Budget-aware Test-time Augmentation, a calibrated router that uses low-cost signals such as confidence, margin, entropy, retrieval affinity, and agreement across short question decompositions to trigger high-resolution reevaluation, type-matched retrieval exemplars, or question decomposition only when their expected benefit outweighs cost. On Robo2VLM-1, a 2B BOLT student distilled from LLaVA-1.5-13B improves accuracy from 28.66 in zero-shot to 42.89 with decision distillation and to 50.50 with budgeted routing, surpassing the 13B teacher at 36.74. It lowers expected calibration error, strengthens the risk-coverage frontier, and slashes GPU memory from 26,878 MB for the teacher to 3,035 MB for the distilled student, and 3,817 MB with all augmentations enabled. By constraining outputs to valid options while exposing retrieved evidence and decomposition traces, BOLT reduces hallucination and provides transparent decision-making, enabling large-model quality on edge robots.
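
The low-cost routing signals named in the abstract above (confidence, margin, entropy) can be computed directly from a model's probabilities over the answer options. The sketch below is only an illustration under stated assumptions: the function names, the `gain_if_wrong` parameter, and the trigger rule (expected benefit approximated as the chance of being wrong times a fixed gain) are hypothetical and are not taken from the BOLT paper.

```python
import math

def routing_signals(probs):
    """Return (confidence, margin, entropy) for a distribution over answer options.

    Assumes `probs` has at least two entries that sum to 1.
    """
    ranked = sorted(probs, reverse=True)
    confidence = ranked[0]                  # probability of the top option
    margin = ranked[0] - ranked[1]          # gap between the top two options
    entropy = -sum(p * math.log(p) for p in probs if p > 0)  # in nats
    return confidence, margin, entropy

def should_refine(probs, gain_if_wrong, cost):
    """Hypothetical rule: refine only when the expected benefit exceeds the cost.

    The expected benefit is approximated as P(top option is wrong) * gain_if_wrong.
    """
    confidence, _, _ = routing_signals(probs)
    expected_benefit = (1.0 - confidence) * gain_if_wrong
    return expected_benefit > cost

uncertain = [0.7, 0.2, 0.1]
confident = [0.97, 0.02, 0.01]
print(routing_signals(uncertain))
print(should_refine(uncertain, gain_if_wrong=1.0, cost=0.2))
print(should_refine(confident, gain_if_wrong=1.0, cost=0.2))
```

With these illustrative numbers, only the uncertain prediction triggers the more expensive refinement step.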


Poster
P3-#2001
RL of Thoughts: Navigating LLM Reasoning with Inference-time Reinforcement Learning

Qianyue Hao ⋅ Sibo Li ⋅ Jian Yuan ⋅ Yong Li

Despite rapid advancements in large language models (LLMs), the token-level autoregressive nature constrains their complex reasoning capabilities. To enhance LLM reasoning, inference-time techniques, including Chain/Tree/Graph-of-Thought(s), improve performance cost-effectively by guiding reasoning through external logical structures without modifying LLMs' parameters. However, these manually predefined, task-agnostic frameworks are applied uniformly across diverse tasks, lacking adaptability. To address this, we propose RL-of-Thoughts (RLoT), where we train a lightweight navigator model with reinforcement learning (RL) to generate task-adaptive logical structures at inference time, enhancing LLM reasoning. Specifically, we design five basic logic blocks from the perspective of human cognition. During the reasoning process, the trained RL navigator dynamically selects the suitable logic blocks and combines them into task-specific logical structures according to problem characteristics. Experiments across multiple reasoning benchmarks (AIME, MATH, GPQA, etc.) with multiple LLMs (GPT, Llama, Qwen, and DeepSeek) illustrate that RLoT outperforms established inference-time techniques in most cases and improves performance by up to 13.4% in challenging situations. Remarkably, with fewer than 3K parameters, our RL navigator is able to make sub-10B LLMs comparable to 100B-scale counterparts. Moreover, the RL navigator demonstrates strong transferability: a model trained on one specific LLM-task pair can effectively generalize to unseen LLMs and tasks. Our code is open-source at https://github.com/tsinghua-fib-lab/RL-LLM-Reasoning.


Poster
P3-#1313
D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

Suhwan Choi ⋅ Jaeyoon Jung ⋅ Haebin Seong ⋅ Minchan Kim ⋅ Minyeong Kim ⋅ Yongjun Cho ⋅ Yoonshik Kim ⋅ Yu Park ⋅ Youngjae Yu ⋅ Yunsung Lee

Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments, particularly gaming, offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework that demonstrates desktop interactions can serve as an effective pretraining substrate for embodied AI tasks in robotics. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit that unifies diverse desktop interactions into a standardized format with 152× compression, (2) the Generalist-IDM that achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling, and (3) VAPT that transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations and 1K+ hours of pseudo-labeled gameplay), our 1B-parameter model achieves 96.6% success on LIBERO manipulation and 83.3% on CANVAS navigation, matching or surpassing models up to 7× larger, such as $\pi_0$ (3.3B) and OpenVLA (7B). These results demonstrate that sensorimotor primitives learned from digital interactions transfer effectively to real-world physical tasks, establishing desktop pretraining as a practical paradigm for embodied AI. All resources are publicly available at https://worv-ai.github.io/d2e.


Poster
P3-#1314
JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation

Shuang Zeng ⋅ Dekang Qi ⋅ Xinyuan Chang ⋅ Feng Xiong ⋅ Shichao Xie ⋅ Xiaolong Wu ⋅ Shiyi Liang ⋅ Mu Xu ⋅ Xing Wei

Vision-and-Language Navigation (VLN) requires an embodied agent to navigate through unseen environments, guided by natural language instructions and a continuous video stream. Recent advances in VLN have been driven by the powerful semantic understanding of Multimodal Large Language Models (MLLMs). However, these methods typically rely on explicit semantic memory, such as building textual cognitive maps or storing historical visual frames. This type of method suffers from spatial information loss, computational redundancy, and memory bloat, which impede efficient navigation. Inspired by the implicit scene representation in human navigation, analogous to the left brain's semantic understanding and the right brain's spatial cognition, we propose JanusVLN, a novel VLN framework featuring a dual implicit neural memory that models spatial-geometric and visual-semantic memory as separate, compact, and fixed-size neural representations. This framework first extends the MLLM to incorporate 3D prior knowledge from the spatial-geometric encoder, thereby enhancing the spatial reasoning capabilities of models based solely on RGB input. Then, the historical key-value (KV) caches from the spatial-geometric and visual-semantic encoders are constructed into a dual implicit memory. By retaining only the KVs of tokens in the initial and sliding window, redundant computation is avoided, enabling efficient incremental updates. Extensive experiments demonstrate that JanusVLN outperforms over 20 recent methods to achieve SOTA performance. For example, the success rate improves by 10.5-35.5 compared to methods using multiple data types as input and by 3.6-10.8 compared to methods using more RGB training data. This indicates that the proposed dual implicit neural memory, as a novel paradigm, explores promising new directions for future VLN research.
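
The retention rule mentioned in the abstract (keep only the entries from an initial segment and a sliding window) can be illustrated with a minimal Python sketch. This is not the JanusVLN implementation: the class name `InitialPlusWindowCache`, the parameters `n_initial` and `window`, and the use of plain Python objects in place of key-value tensors are all assumptions made for illustration.

```python
from collections import deque


class InitialPlusWindowCache:
    """Fixed-size memory: the first `n_initial` entries plus the latest `window` entries.

    Hypothetical illustration only; real KV caches hold per-layer tensors.
    """

    def __init__(self, n_initial, window):
        self.n_initial = n_initial
        self.initial = []                    # entries kept permanently
        self.recent = deque(maxlen=window)   # sliding window, oldest entry evicted first

    def append(self, kv):
        # Fill the permanent initial segment first, then roll the window.
        if len(self.initial) < self.n_initial:
            self.initial.append(kv)
        else:
            self.recent.append(kv)

    def entries(self):
        # Entries currently visible to attention, in temporal order.
        return self.initial + list(self.recent)

    def __len__(self):
        return len(self.initial) + len(self.recent)
```

Under these assumptions the memory never grows beyond `n_initial + window` entries, so each new frame costs a constant-size update regardless of episode length.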


Poster
P3-#1315
Learning to Grasp Anything By Playing with Random Toys

Dantong Niu ⋅ Yuvan Sharma ⋅ Baifeng Shi ⋅ Rachel Ding ⋅ Matteo Gioia ⋅ Haoru Xue ⋅ Henry Tsai ⋅ Konstantinos Kallidromitis ⋅ Anirudh Pai ⋅ S. Sastry ⋅ Trevor Darrell ⋅ Jitendra Malik ⋅ Roei Herzig

Robotic manipulation policies often struggle to generalize to novel objects, limiting their real-world utility. In contrast, cognitive science suggests that children develop generalizable dexterous manipulation skills by mastering a small set of simple toys and then applying that knowledge to more complex items. Inspired by this, we study if similar generalization capabilities can also be achieved by robots. Our results indicate robots can learn generalizable grasping using randomly assembled objects that are composed from just four shape primitives: spheres, cuboids, cylinders, and rings. We show that training on these "toys" enables robust generalization to real-world objects, yielding strong zero-shot performance. Crucially, we find the key to this generalization is an object-centric visual representation induced by our proposed detection pooling mechanism. Evaluated in both simulation and on physical robots, our model achieves a 67% real-world grasping success rate on the YCB dataset, outperforming state-of-the-art approaches that rely on substantially more in-domain data. We further study how zero-shot generalization performance scales by varying the number and diversity of training toys and the demonstrations per toy. We believe this work offers a promising path to scalable and generalizable learning in robotic manipulation.


Poster
P3-#1316
RFS: Reinforcement learning with Residual flow steering for dexterous manipulation

Entong Su ⋅ Tyler Westenbroek ⋅ Anusha Nagabandi ⋅ Abhishek Gupta

Imitation learning has emerged as an effective approach for bootstrapping sequential decision-making in robotics, achieving strong performance even in high-dimensional dexterous manipulation tasks. Recent behavior cloning methods further leverage expressive generative models, such as diffusion models and flow matching, to represent multimodal action distributions. However, policies pretrained in this manner often exhibit limited generalization and require additional fine-tuning to achieve robust performance at deployment time. Such adaptation must preserve the global exploration benefits of pretraining while enabling rapid correction of local execution errors. We propose Residual Flow Steering (RFS), a data-efficient reinforcement learning framework for adapting pretrained generative policies. RFS steers a pretrained flow-matching policy by jointly optimizing a residual action and a latent noise distribution, enabling complementary forms of exploration: local refinement through residual corrections and global exploration through latent-space modulation. This design allows efficient adaptation while retaining the expressive structure of the pretrained policy. We demonstrate the effectiveness of RFS on dexterous manipulation tasks, showing efficient fine-tuning both in simulation and in real-world settings when adapting pretrained base policies. Project website: https://weirdlabuw.github.io/rfs/


Poster
P3-#1317
HAMLET: Switch Your Vision-Language-Action Model into a History-Aware Policy

Myungkyu Koo ⋅ Daewon Choi ⋅ Taeyoung Kim ⋅ Kyungmin Lee ⋅ Changyeon Kim ⋅ Younggyo Seo ⋅ Jinwoo Shin

Inherently, robotic manipulation tasks are history-dependent: leveraging past context could be beneficial. However, most existing Vision-Language-Action models (VLAs) have been designed without considering this aspect, i.e., they rely solely on the current observation, ignoring preceding context. In this paper, we propose HAMLET, a scalable framework to adapt VLAs to attend to the historical context during action prediction. Specifically, we introduce moment tokens that compactly encode perceptual information at each timestep. Their representations are initialized with time-contrastive learning, allowing them to better capture temporally distinctive aspects. Next, we employ a lightweight memory module that integrates the moment tokens across past timesteps into memory features, which are then leveraged for action prediction. Through empirical evaluation, we show that HAMLET successfully transforms a state-of-the-art VLA into a history-aware policy, especially demonstrating significant improvements on long-horizon tasks that require historical context. In particular, on top of GR00T N1.5, HAMLET achieves an average success rate of 76.4% on history-dependent real-world tasks, surpassing the baseline performance by 47.2%. Furthermore, HAMLET pushes prior art performance from 64.1% to 66.4% on RoboCasa Kitchen (100-demo setup) and from 95.6% to 97.6% on LIBERO, highlighting its effectiveness even under generic robot-manipulation benchmarks.


Poster
P3-#1318
Rodrigues Network for Learning Robot Actions

Jialiang Zhang ⋅ Haoran Geng ⋅ Yang You ⋅ Congyue Deng ⋅ Pieter Abbeel ⋅ Jitendra Malik ⋅ Leonidas Guibas

Understanding and predicting articulated actions is important in robot learning. However, common architectures such as MLPs and Transformers lack inductive biases that reflect the underlying kinematic structure of articulated systems. To this end, we propose the Neural Rodrigues Operator, a learnable generalization of the classical forward kinematics operation, designed to inject kinematics-aware inductive bias into neural computation. Building on this operator, we design the Rodrigues Network (RodriNet), a novel neural architecture specialized for processing actions. We evaluate the expressivity of our network on two synthetic tasks on kinematic and motion prediction, showing significant improvements compared to standard backbones. We further demonstrate its effectiveness in two realistic applications: (i) imitation learning on robotic benchmarks with the Diffusion Policy, and (ii) single-image 3D hand reconstruction. Our results suggest that integrating structured kinematic priors into the network architecture improves action learning in various domains.


Poster
P3-#1320
VITA: Vision-to-Action Flow Matching Policy

Dechen Gao ⋅ Boqi Zhao ⋅ Andrew Lee ⋅ Ian Chuang ⋅ Hanchu Zhou ⋅ Hang Wang ⋅ Zhe Zhao ⋅ Junshan Zhang ⋅ Iman Soltani

Conventional flow matching and diffusion-based policies sample through iterative denoising from standard noise distributions (e.g., Gaussian), and require conditioning modules to repeatedly incorporate visual information during the generative process, incurring substantial time and memory overhead. To reduce the complexity, we develop VITA (VIsion-To-Action policy), a noise-free and conditioning-free flow matching policy learning framework that directly flows from visual representations to latent actions. Since the source of the flow is visually grounded, VITA eliminates the need for visual conditioning during generation. As expected, bridging vision and action is challenging, because actions are lower-dimensional, less structured, and sparser than visual representations; moreover, flow matching requires the source and target to have the same dimensionality. To overcome this, we introduce an action autoencoder that maps raw actions into a structured latent space aligned with visual latents, trained jointly with flow matching. To further prevent latent action space collapse during end-to-end training, we propose flow latent decoding, which anchors the latent generation process by backpropagating the action reconstruction loss through the flow matching ODE (ordinary differential equation) solving steps. We evaluate VITA on 9 simulation and 5 real-world tasks from ALOHA and Robomimic. VITA achieves 1.5x-2.0x faster inference compared to conventional methods with conditioning modules, while outperforming or matching state-of-the-art policies.


Poster
P3-#1321
AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild

Xiaolou Sun ⋅ Wufei Si ⋅ Wenhui Ni ⋅ Yuntian Li ⋅ Dongming Wu ⋅ Fei Xie ⋅ Runwei Guan ⋅ He-Yang Xu ⋅ Henghui Ding ⋅ Yuan Wu ⋅ Yutao Yue ⋅ Yongming Huang ⋅ Hui Xiong

Vision-language navigation (VLN) requires intelligent agents to navigate environments by interpreting linguistic instructions alongside visual observations, serving as a cornerstone task in Embodied AI. Current VLN research for unmanned aerial vehicles (UAVs) relies on detailed, pre-specified instructions to guide the UAV along predetermined routes. However, real-world outdoor exploration typically occurs in unknown environments where detailed navigation instructions are unavailable. Instead, only coarse-grained positional or directional guidance can be provided, requiring UAVs to autonomously navigate through continuous planning and obstacle avoidance. To bridge this gap, we propose AutoFly, an end-to-end Vision-Language-Action (VLA) model for autonomous UAV navigation. AutoFly incorporates a pseudo-depth encoder that derives depth-aware features from RGB inputs to enhance spatial reasoning, coupled with a progressive two-stage training strategy that effectively aligns visual, depth, and linguistic representations with action policies. Moreover, existing VLN datasets have fundamental limitations for real-world autonomous navigation, stemming from their heavy reliance on explicit instruction-following over autonomous decision-making and insufficient real-world data. To address these issues, we construct a novel autonomous navigation dataset that shifts the paradigm from instruction-following to autonomous behavior modeling through: (1) trajectory collection emphasizing continuous obstacle avoidance, autonomous planning, and recognition workflows; (2) comprehensive real-world data integration. Experimental results demonstrate that AutoFly achieves a 3.9% higher success rate compared to state-of-the-art VLA baselines, with consistent performance across simulated and real environments.


Poster
P3-#1322
PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model

Wenqi Liang ⋅ Gan Sun ⋅ Yao He ⋅ Jiahua Dong ⋅ Suyan Dai ⋅ Ivan Laptev ⋅ Salman Khan ⋅ Yang Cong

Vision-Language-Action models (VLAs) are emerging as powerful tools for learning generalizable visuomotor control policies. However, current VLAs are mostly trained on large-scale image–text–action data and remain limited in two key ways: (i) they struggle with pixel-level scene understanding, and (ii) they rely heavily on textual prompts, which reduces their flexibility in real-world settings. To address these challenges, we introduce PixelVLA, the first VLA model designed to support both pixel-level reasoning and multimodal prompting with text and visual inputs. Our approach is built on a new visuomotor instruction tuning framework that integrates a multiscale pixel-aware encoder with a visual prompting encoder. To train PixelVLA effectively, we further propose a two-stage automated annotation pipeline that generates Pixel-160K, a large-scale dataset with pixel-level annotations derived from existing robot data. Experiments on three standard VLA benchmarks and two VLA model variants show that PixelVLA improves manipulation success rates by $10.1\%\sim28.7\%$ over OpenVLA, while requiring only $1.5\%$ of its pretraining cost. These results demonstrate that PixelVLA can be integrated into existing VLAs to enable more accurate, efficient, and versatile robot control in complex environments. The dataset and code will be released as open source.


Poster
P3-#1323
Scaling up Memory for Robotic Control via Experience Retrieval

Ajay Sridhar ⋅ Jennifer Pan ⋅ Satvik Sharma ⋅ Chelsea Finn

Humans rely on memory to perform tasks; our goal is to endow robot policies with the same ability. Naively conditioning on long observation histories is computationally expensive and brittle under covariate shift, while indiscriminate subsampling of history leads to irrelevant or redundant information. We propose a hierarchical policy framework, where the high-level policy is trained to select and track previous task-relevant keyframes from its experience. The high-level policy uses selected keyframes and the most recent frames when generating text instructions for a low-level policy to execute. This design is compatible with existing vision-language-action (VLA) models and enables the system to efficiently reason over long-horizon dependencies. In our experiments, we fine-tune Qwen2.5-VL-7B-Instruct and $\pi_{0.5}$ as the high-level and low-level policies respectively, using demonstrations supplemented with minimal language annotations. Our approach, MemER, outperforms prior methods on three real-world long-horizon robotic manipulation tasks that require minutes of memory. Videos and code can be found at https://jen-pan.github.io/memer/.

Real-world time series data are inherently multivariate, often exhibiting complex inter-channel dependencies. Each channel is typically sampled at its own period and is prone to missing values due to various practical and operational constraints. These characteristics pose three fundamental challenges involving channel dependency, sampling asynchrony, and missingness, all of which must be addressed simultaneously to enable robust and reliable forecasting in practical settings. However, existing architectures typically address only parts of these challenges in isolation and still rely on simplifying assumptions, leaving unresolved the combined challenges of asynchronous channel sampling, test-time missing blocks, and intricate inter-channel dependencies. To bridge this gap, we propose ChannelTokenFormer, a Transformer-based forecasting framework with a flexible architecture designed to explicitly capture cross-channel interactions, accommodate channel-wise asynchronous sampling, and effectively handle missing values. Extensive experiments on public benchmark datasets reflecting practical settings, along with one private real-world industrial dataset, demonstrate the superior robustness and accuracy of ChannelTokenFormer under challenging real-world conditions.

Irregular temporal data, characterized by varying recording frequencies, differing observation durations, and missing values, presents significant challenges across fields like mobility, healthcare, and environmental science. Existing research communities often overlook or address these challenges in isolation, leading to fragmented tools and methods. To bridge this gap, we introduce a unified framework, and the first standardized dataset repository for irregular time series classification, built on a common array format to enhance interoperability. This repository comprises 34 datasets on which we benchmark 12 classifier models from diverse domains and communities. This work aims to centralize research efforts and enable a more robust evaluation of irregular temporal data analysis methods.


Poster
P3-#1326
Beyond Accuracy: Are Time Series Foundation Models Well-Calibrated?

Coen Adler ⋅ Yuxin Chang ⋅ Samar Abdi ⋅ Felix Draxler ⋅ Padhraic Smyth

The recent development of foundation models for time series data has generated considerable interest in using such models across a variety of applications. Although foundation models achieve state-of-the-art predictive performance, their calibration properties remain relatively underexplored, despite the fact that calibration can be critical for many practical applications. In this paper, we investigate the calibration-related properties of five recent time series foundation models and two competitive baselines. We perform a series of systematic evaluations assessing model calibration (i.e., over- or under-confidence), effects of varying prediction heads, and calibration under long-term autoregressive forecasting. We find that time series foundation models are consistently better calibrated than baseline models and tend not to be either systematically over- or under-confident, in contrast to the overconfidence often seen in other deep learning models.


Poster
P3-#1426
TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models

Tong Guan ⋅ Zijie Meng ⋅ Dianqi Li ⋅ Shiyu Wang ⋅ Chao-Han Huck Yang ⋅ Qingsong Wen ⋅ Zuozhu Liu ⋅ Sabato Siniscalchi ⋅ Ming Jin ⋅ Shirui Pan

Recent advances in multimodal time series learning underscore a paradigm shift from analytics centered on basic patterns toward advanced time series understanding and reasoning. However, existing multimodal time series datasets mostly remain at the level of surface alignment and question answering, without reaching the depth of genuine reasoning. The absence of well-defined tasks that genuinely require time series reasoning, along with the scarcity of high-quality data, has limited progress in building practical time series reasoning models (TSRMs). To this end, we introduce Time Series Reasoning Suite (TSR-Suite), which formalizes four atomic tasks that span three fundamental capabilities for reasoning with time series: (1) perception, acquired through scenario understanding and causality discovery; (2) extrapolation, realized via event-aware forecasting; and (3) decision-making, developed through deliberation over perception and extrapolation. TSR-Suite is the first comprehensive time series reasoning suite that supports not only thorough evaluation but also the data pipeline and training of TSRMs. It contains more than 23K samples, of which 2.3K are carefully curated through a human-guided hierarchical annotation process. Building on this foundation, we introduce TimeOmni-1, the first unified reasoning model designed to address diverse real-world problems demanding time series reasoning. The model is trained in multiple stages, integrating a mixture of task scenarios, novel reward functions, and tailored optimizations. Experiments show that TimeOmni-1 delivers strong out-of-distribution generalization across all tasks and achieves a high rate of valid responses. It significantly improves causality discovery accuracy (64.0% vs. 35.9% with GPT-4.1) and raises the valid response rate by over 6% compared to GPT-4.1 on the event-aware forecasting task.


Poster
P3-#1425
FACT: Fine-grained Across-variable Convolution for Multivariate Time Series Forecasting

Huiqiang Wang ⋅ Jieming Shi ⋅ Qing Li

Modeling the relationships among variables has become increasingly important, particularly in high-dimensional multivariate time series forecasting tasks. However, most existing methods primarily focus on capturing coarse-grained correlations between variables, overlooking a finer and more dynamic aspect: the variable interactions often manifest differently as time progresses. To address this limitation, we propose FACT, a Fine-grained Across-variable Convolution architecture for multivariate Time series forecasting that explicitly models fine-grained variable interactions from both the time and frequency domains. Technically, we introduce a depth-wise convolution block, DConvBlock, which leverages a depth-wise convolution architecture with channel-specific kernels to model dynamic variable interactions at each granularity. To further enhance efficiency, we reconfigure the original one-dimensional variables into a two-dimensional space, reducing the variable distance and the required model layers. DConvBlock then incorporates multi-dilated 2D convolutions with progressively increasing dilation rates, enabling the model to capture fine-grained and dynamic variable interactions while efficiently attaining a global receptive field. Extensive experiments on twelve benchmark datasets demonstrate that FACT not only achieves state-of-the-art forecasting accuracy but also delivers substantial efficiency gains, significantly reducing both training time and memory consumption compared to attention-based mechanisms.


Poster
P3-#1424
Perturbed Dynamic Time Warping: A Probabilistic Framework and Generalized Variants

Xiangqian Sun ⋅ Chaoqun Wang ⋅ Wei Zhang

Dynamic Time Warping (DTW) is a classical method for measuring similarity between time series, but its non-differentiability hinders integration into end-to-end learning frameworks. To address this, soft-DTW replaces the minimum operator with a smooth soft-min, enabling differentiability and efficient computation. Motivated by soft-DTW, we propose perturbed-DTW, a differentiable framework of DTW obtained by adding random perturbations to warping costs and taking the expected minimum. Under Gumbel noise, perturbed-DTW exactly recovers soft-DTW, providing a natural probabilistic interpretation of soft-DTW. We further generalize this framework by extending the Gumbel noise to the broader family of generalized extreme value (GEV) distributions, leading to a new class of soft-DTW variants. Building on this insight, we introduce nested-soft-DTW (ns-DTW), which integrates GEV perturbations into the dynamic programming formulation of perturbed-DTW. This extension induces alignments with tunable skewness, offering greater flexibility in modeling diverse alignment structures. We validate ns-DTW on barycenter computation, clustering, and classification, demonstrating its effectiveness over existing approaches.
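
The link between soft-DTW and Gumbel perturbations stated in the abstract can be checked numerically with a minimal Python sketch. It is not the authors' code. It assumes a squared-difference local cost, a soft-min of the form -γ log Σ exp(-v/γ), and perturbations that subtract γ times a mean-centered standard Gumbel variable from each cost; the function names `soft_min`, `soft_dtw`, and `perturbed_min` are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # mean of a standard Gumbel variable


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    m = min(values)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def soft_dtw(x, y, gamma):
    """Soft-DTW value between two 1-D sequences (squared-difference cost assumed)."""
    n, m = len(x), len(y)
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Classic DTW recursion with min replaced by soft_min.
            R[i][j] = cost + soft_min((R[i - 1][j], R[i][j - 1], R[i - 1][j - 1]), gamma)
    return R[n][m]


def perturbed_min(values, gamma, n_samples=50000, seed=0):
    """Monte Carlo estimate of E[min_i(v_i - gamma * G_i)] with centered Gumbel noise G_i."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        perturbed = []
        for v in values:
            # -log(E) with E ~ Exp(1) is a standard Gumbel draw; subtract its mean to center it.
            e = max(rng.expovariate(1.0), 1e-300)
            perturbed.append(v - gamma * (-math.log(e) - EULER_GAMMA))
        total += min(perturbed)
    return total / n_samples
```

Under these assumptions, `perturbed_min(values, gamma)` should agree with `soft_min(values, gamma)` up to Monte Carlo error, which is the single-step version of the identity the abstract describes. Replacing the Gumbel draw with another extreme value distribution is where a generalized variant would differ.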


Poster
P3-#1423
SmellNet: A Dataset for Sensor-Based Smell Recognition and Mixture Prediction

Dewei Feng ⋅ Wei Dai ⋅ Carol Li ⋅ Alistair Pernigo ⋅ Paul Liang

The ability of AI to sense and identify various substances based on their smell alone can have profound impacts on allergen detection (e.g., detecting peanut contamination or allergens in food), monitoring the manufacturing process, and sensing hormones that indicate emotional states, stress levels, and diseases. Despite these broad impacts, there are few standardized datasets, and therefore little progress, for training and evaluating AI systems' ability to "smell" in the real world. In this paper, we use small gas and chemical sensors to create SmellNet, a comparatively large dataset for sensor-based machine olfaction that digitizes a diverse range of smells in the natural world. SmellNet contains about 828,000 time-series data points across 50 substances, spanning nuts, spices, herbs, fruits, and vegetables, and 43 mixtures among them with fixed ingredient volumetric ratios, with 68 hours of data collected. Using SmellNet, we developed ScentFormer, a Transformer-based architecture combining temporal differencing and sliding-window augmentation for smell data. For the SmellNet-Base classification tasks, ScentFormer achieves 63.3% Top-1 accuracy with GC-MS supervision, and for the SmellNet-Mixture distribution prediction tasks, ScentFormer achieves 50.2% Top-1@0.1 on the test-seen split. ScentFormer's ability to generalize across conditions and capture transient chemical dynamics demonstrates the promise of temporal modeling in sensor-based olfactory AI. SmellNet and ScentFormer lay the groundwork for sensor-based olfactory applications across healthcare, food and beverage, environmental monitoring, manufacturing, and entertainment.


Poster
P3-#1422
MambaSL: Exploring Single-Layer Mamba for Time Series Classification

Yoo-Min Jung ⋅ Leekyung Kim

Despite recent advances in state space models (SSMs) such as Mamba across various sequence domains, research on their standalone capacity for time series classification (TSC) has remained limited. We propose MambaSL, a framework that minimally redesigns the selective SSM and projection layers of a single-layer Mamba, guided by four TSC-specific hypotheses. To address benchmarking limitations—restricted configurations, partial University of East Anglia (UEA) dataset coverage, and insufficiently reproducible setups—we re-evaluate 20 strong baselines across all 30 UEA datasets under a unified protocol. As a result, MambaSL achieves state-of-the-art performance with statistically significant average improvements, while ensuring reproducibility via public checkpoints for all evaluated models. Together with visualizations, these results demonstrate the potential of Mamba-based architectures as a TSC backbone.

Forecasting rare events in multivariate time-series data is a central challenge in machine learning, complicated by severe class imbalance, long-range dependencies, and distributional uncertainty. We introduce EVEREST, a transformer-based architecture for probabilistic rare-event forecasting that delivers calibrated predictions and tail-aware risk estimation, with auxiliary interpretability through attention-based signal attribution. EVEREST integrates four key components: (i) a learnable attention bottleneck for soft aggregation of temporal dynamics; (ii) an evidential head for estimating aleatoric and epistemic uncertainty via a Normal–Inverse–Gamma distribution; (iii) an extreme-value head that models tail risk using a Generalized Pareto Distribution; and (iv) a lightweight precursor head for early-event detection. These modules are jointly optimised with a composite loss combining focal loss, evidential negative log-likelihood, and a tail-sensitive EVT penalty, and act only at training time; deployment uses a single classification head with no inference overhead. We evaluate EVEREST on a real-world benchmark spanning a decade of space-weather data and demonstrate state-of-the-art performance, including True Skill Statistic (TSS) scores of 0.973, 0.970, and 0.966 at 24, 48, and 72-hour horizons for C-class flares. The model is compact (≈0.81M parameters), efficient to train on commodity hardware, and applicable to other high-stakes domains such as industrial monitoring, weather, and satellite diagnostics. Limitations include reliance on fixed-length inputs and exclusion of image-based modalities, motivating future extensions to streaming and multimodal forecasting.
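
For readers unfamiliar with the True Skill Statistic reported above, a minimal Python sketch of its usual definition (true positive rate minus false positive rate) follows. It is a generic illustration, not code from EVEREST; the function name and the binary 0/1 label encoding are assumptions.

```python
def true_skill_statistic(y_true, y_pred):
    """TSS = TP / (TP + FN) - FP / (FP + TN) for binary labels encoded as 0/1."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("TSS needs at least one positive and one negative example")
    return tp / (tp + fn) - fp / (fp + tn)
```

Because each term is normalized within its own class, the score does not depend on the ratio of events to non-events, which is why it is commonly used for heavily imbalanced problems such as flare forecasting.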


Poster
P3-#1420
Reasoning on Time-Series for Financial Technical Analysis

Kelvin Koa ⋅ Jan Chen ⋅ Yunshan Ma ⋅ Zheng Huanhuan ⋅ Tat-Seng Chua

While Large Language Models have been used to produce interpretable stock forecasts, they mainly focus on analyzing textual reports rather than historical price data, the analysis of which is known as Technical Analysis. This task is challenging as it switches between domains: the stock price inputs and outputs lie in the time-series domain, while the reasoning step should be in natural language. In this work, we introduce Verbal Technical Analysis (VTA), a novel framework that combines verbal and latent reasoning to produce stock time-series forecasts that are both accurate and interpretable. To reason over time-series, we convert stock price data into textual annotations and optimize the reasoning trace using an inverse Mean Squared Error (MSE) reward objective. To produce time-series outputs from textual reasoning, we condition the outputs of a time-series backbone model on the reasoning-based attributes. Experiments on stock datasets across U.S., Chinese, and European markets show that VTA achieves state-of-the-art forecasting accuracy, while the reasoning traces also perform well on evaluation metrics judged by industry experts.


Poster
P3-#1419
PhaseFormer: From Patches to Phases for Efficient and Effective Time Series Forecasting

Yiming Niu ⋅ Jinliang Deng ⋅ Yongxin Tong

Periodicity is a fundamental characteristic of time series data and has long played a central role in forecasting. Recent deep learning methods strengthen the exploitation of periodicity by treating patches as basic tokens, thereby improving predictive effectiveness. However, their efficiency remains a bottleneck due to large parameter counts and heavy computational costs. This paper provides, for the first time, a clear explanation of why patch-level processing is inherently inefficient, supported by strong evidence from real-world data. To address these limitations, we introduce a phase perspective for modeling periodicity and present an efficient yet effective solution, PhaseFormer. PhaseFormer features phase-wise prediction through compact phase embeddings and efficient cross-phase interaction enabled by a lightweight routing mechanism. Extensive experiments demonstrate that PhaseFormer consistently achieves state-of-the-art performance across the evaluated benchmark datasets with around 1k parameters. Notably, it excels on large-scale and complex datasets, where models with comparable efficiency often struggle. This work marks a significant step toward truly efficient and effective time series forecasting. Code is available at this repository: https://github.com/neumyor/PhaseFormer_TSL


Poster
P3-#1418
Adaptive Conformal Anomaly Detection with Time Series Foundation Models for Signal Monitoring

Natalia Martinez Gil ⋅ Fearghal O'Donncha ⋅ Wesley Gifford ⋅ Nianjun Zhou ⋅ Dhaval Patel ⋅ Roman Vaculin

We propose a post-hoc adaptive conformal anomaly detection method for monitoring time series that leverages predictions from pre-trained foundation models without requiring additional fine-tuning. Our method yields an anomaly score that is directly interpretable as a false alarm rate (p-value), facilitating transparent and actionable decision-making. It employs weighted quantile conformal prediction bounds and adaptively learns optimal weighting parameters from past predictions, enabling calibration under distribution shifts and stable false alarm control, while preserving out-of-sample guarantees. As a model-agnostic solution, it integrates seamlessly with foundation models and supports rapid deployment in resource-constrained environments. This approach addresses key industrial challenges such as limited data availability, lack of training expertise, and the need for immediate inference, while taking advantage of the growing accessibility of time series foundation models. Experiments on both synthetic and real-world datasets show that the proposed approach delivers strong performance, combining simplicity, interpretability, robustness, and adaptivity.
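
How a conformal score can be read as a false alarm rate may be clearer from a minimal Python sketch of a weighted conformal p-value. It is a generic textbook-style construction, not the method proposed in the abstract: the nonconformity scores are assumed to be absolute forecast residuals, the exponential-decay weights are a fixed stand-in for the adaptively learned weights, and the names `weighted_conformal_p_value` and `exponential_weights` are hypothetical.

```python
def exponential_weights(n, decay):
    """Weights for n past scores, oldest first; the most recent score gets weight 1."""
    return [decay ** (n - 1 - i) for i in range(n)]


def weighted_conformal_p_value(past_scores, new_score, weights=None):
    """Weighted share of past scores at least as extreme as the new one.

    The new observation itself is counted with weight 1, as in the usual
    (1 + count) / (n + 1) conformal p-value.
    """
    if weights is None:
        weights = [1.0] * len(past_scores)
    if len(weights) != len(past_scores):
        raise ValueError("weights and past_scores must have the same length")
    w_new = 1.0
    mass_at_least = sum(w for s, w in zip(past_scores, weights) if s >= new_score)
    return (mass_at_least + w_new) / (sum(weights) + w_new)
```

In this sketch, flagging an anomaly whenever the p-value falls below a level such as 0.01 targets roughly a 1% false alarm rate when past and new scores are exchangeable. Down-weighting older scores is one simple way to react to distribution shift.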


Poster
P3-#1417
Point-wise Anomaly Detection via Fold-bifurcation ODE

SheoYon Jhin ⋅ Noseong Park

Anomaly detection in time series is essential for applications from industrial monitoring to financial risk management. Recent methods (including forecasting error models, representation learning, augmentation, and weak-label learning) have achieved strong results for specific anomaly types such as sudden point or gradual collective anomalies. While many prior works report window-level metrics that may mask errors, several recent methods evaluate at the point level as well. Our goal is to use a stricter point-wise protocol to make masking effects explicit. We introduce FOLD (Point-wise Anomaly Detection via fold-bifurcation), a framework that reframes detection as tracking a system’s proximity to a critical transition. FOLD extracts stress signals from a forecasting model and integrates them with a fold-bifurcation-inspired ODE to produce the risk state, flagging anomalies once it crosses a threshold calibrated on normal data. This requires no anomaly labels and no additional detector training, enabling a parameter-free and efficient detection process. By modeling anomalies as stress accumulation toward a tipping point, FOLD naturally aligns with point-wise detection, providing a unifying and interpretable perspective that complements type-specific methods. Experiments on 40 benchmarks against 34 state-of-the-art baselines show that FOLD achieves competitive or superior performance, with particular strength under strict point-wise evaluation.
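
To make the idea of stress accumulating toward a tipping point concrete, here is a minimal sketch that integrates the textbook fold (saddle-node) normal form dx/dt = r + x^2 with forward Euler, where r is driven by a stress signal. The exact ODE, the stress extraction, and the threshold calibration used by FOLD are not given in the abstract, so the function name, the parameters `c`, `dt`, `threshold`, and `x_max`, and the fixed threshold are all assumptions for illustration.

```python
import math


def fold_risk_flags(stress, c=1.0, dt=0.1, threshold=0.5, x_max=5.0):
    """Integrate dx/dt = (s - c) + x**2 and flag steps where x exceeds a threshold.

    With zero stress the state rests at the stable equilibrium x = -sqrt(c).
    Once stress exceeds c the equilibria vanish and x runs away, which is
    the tipping behavior of a fold bifurcation.
    """
    x = -math.sqrt(c)  # start at the stable equilibrium for s = 0
    flags = []
    for s in stress:
        # Forward Euler step; x is capped so the runaway state stays finite.
        x = min(x + dt * ((s - c) + x * x), x_max)
        flags.append(x > threshold)
    return flags


# Five calm steps followed by a sustained burst of stress.
stress = [0.0] * 5 + [3.0] * 10
flags = fold_risk_flags(stress)
first_alarm = flags.index(True) if any(flags) else None
```

In this toy setting the risk state does not react to the first stressed step; it crosses the threshold only after stress has persisted for several steps, which is the accumulation behavior the abstract describes.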


Poster
P3-#1416
Dancing in Chains: Strategic Persuasion in Academic Rebuttal via Theory of Mind

Zhitao He ⋅ Zongwei LYU ⋅ Yi R. Fung

Although artificial intelligence (AI) has become deeply integrated into various stages of the research workflow and achieved remarkable advancements, academic rebuttal remains a significant and underexplored challenge. This is because rebuttal is a complex process of strategic communication under severe information asymmetry rather than a simple technical debate. Consequently, current approaches struggle as they largely imitate surface-level linguistics, missing the essential element of perspective-taking required for effective persuasion. In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) framework that models reviewer mental state, formulates persuasion strategy, and generates evidence-based response. To train our agent, we construct RebuttalBench, a large-scale dataset synthesized via a novel critique-and-refine approach. Our training process consists of two stages, beginning with a supervised fine-tuning phase to equip the agent with ToM-based analysis and strategic planning capabilities, followed by a reinforcement learning phase leveraging the self-reward mechanism for scalable self-improvement. For reliable and efficient automated evaluation, we further develop Rebuttal-RM, a specialized evaluator trained on over 100K samples of multi-source rebuttal data, which achieves scoring consistency with human preferences surpassing that of the powerful judge GPT-4.1. Extensive experiments show RebuttalAgent significantly outperforms the base model by an average of 18.3% on automated metrics, while also outperforming advanced proprietary models across both automated and human evaluations.


Poster
P3-#1415
How to train data-efficient LLMs

Noveen Sachdeva ⋅ Benjamin Coleman ⋅ Wang-Cheng Kang ⋅ Jianmo Ni ⋅ Lichan Hong ⋅ Ed H. Chi ⋅ James Caverlee ⋅ Julian McAuley ⋅ Derek Cheng

The training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximization of coverage and diversity-based measures in the feature space. Our first technique, AskLLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to directly assess the quality of a training example. To target coverage, we propose density sampling, which models the data distribution to select a diverse sample. Testing the effect of 22 different data curation techniques on the pre-training of T5-style models, involving hundreds of pre-training runs and post fine-tuning evaluation tasks, we find that AskLLM and density are the best methods in their respective categories. While coverage sampling techniques often recover the performance of training on the entire dataset, training on data curated via AskLLM consistently outperforms full-data training, even when we sample only 10% of the original dataset, while converging up to 70% faster.


Poster
P3-#1414
Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Mohammad Tavakoli ⋅ Alireza Salemi ⋅ Carrie Ye ⋅ Mohamed Abdalla ⋅ Hamed Zamani ⋅ J Mitchell

Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coherence, cover narrow domains, and only test simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT, a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M token context windows (with and without retrieval-augmentation) struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.5%–12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.


Poster
P3-#1413
Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents

Yiming Du ⋅ Baojun Wang ⋅ Yifan Xiang ⋅ Zhaowei Wang ⋅ Wenyu Huang ⋅ Boyang XUE ⋅ Bin Liang ⋅ Xingshan Zeng ⋅ Fei Mi ⋅ Haoli Bai ⋅ Lifeng Shang ⋅ J Pan ⋅ Yuxin Jiang ⋅ Kam-Fai Wong

Temporal reasoning over long, multi-session dialogues is a critical capability for conversational agents. As dialogue histories grow in length and accumulate noise, existing long-context models struggle to accurately identify temporally pertinent information, significantly impairing reasoning performance. To address this, we introduce Memory-T1, a framework that learns a time-aware memory selection policy using reinforcement learning (RL). It employs a coarse-to-fine strategy, first pruning the dialogue history into a candidate set with temporal and retriever filters, followed by an RL agent that selects the precise evidence. The RL training is guided by a multi-level reward function optimizing (i) accuracy, (ii) evidence grounding, and (iii) temporal consistency. This temporal consistency reward provides a dense signal by evaluating alignment at both the session level (range proximity) and the utterance level (evidence density), enabling the agent to resolve subtle chronological ambiguities. On the Time-Dialog benchmark, Memory-T1 boosts a 7B model to an overall score of 67.0%, establishing a new state of the art for open-source models and outperforming a 14B baseline by 10.2%. Ablation studies show that the temporal consistency and evidence grounding rewards jointly contribute a 15.0% performance gain. Moreover, Memory-T1 maintains robustness up to 128k tokens, where baseline models collapse, demonstrating effectiveness against noise in extensive dialogue histories.


Poster
P3-#1412
Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

Yueqi Song ⋅ Ketan Ramaneti ⋅ Zaid Sheikh ⋅ Ziru Chen ⋅ Boyu Gou ⋅ Tianbao Xie ⋅ Yiheng Xu ⋅ Danyang Zhang ⋅ Apurva Gandhi ⋅ Fan Yang ⋅ Joseph Liu ⋅ Tianyue Ou ⋅ Zhihao Yuan ⋅ Frank F Xu ⋅ Shuyan Zhou ⋅ Xingyao Wang ⋅ Xiang Yue ⋅ Tao Yu ⋅ Huan Sun ⋅ Yu Su ⋅ Graham Neubig

Public research results on large-scale supervised finetuning of AI agents remain relatively rare, since the collection of agent training data presents unique challenges. In this work, we argue that the bottleneck is not a lack of underlying data sources, but that a large variety of data is fragmented across heterogeneous formats, tools, and interfaces. To this end, we introduce the Agent Data Protocol (ADP), a light-weight representation language that serves as an "interlingua" between agent datasets in diverse formats and unified agent training pipelines downstream. The design of ADP is expressive enough to capture a large variety of tasks, including API/tool use, browsing, coding, software engineering, and general agentic workflows, while remaining simple to parse and train on without engineering at a per-dataset level. In experiments, we unified a broad collection of 13 existing agent training datasets into ADP format, and converted the standardized ADP data into training-ready formats for multiple agent frameworks. We performed supervised finetuning on the unified data and demonstrated an average performance gain of ~20% over corresponding base models, delivering state-of-the-art or near-SOTA performance on standard coding, browsing, tool use, and research benchmarks, without domain-specific tuning. All code and data are released publicly, in the hope that ADP could help lower the barrier to standardized, scalable, and reproducible agent training.


Poster
P3-#1411
Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing

Xu Wang ⋅ Chenkai Xu ⋅ Yijie Jin ⋅ Jiachun Jin ⋅ Hao Zhang ⋅ Kai Yu ⋅ Zhijie Deng

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier based on a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks for inter-block parallel decoding. In this way, the vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs to achieve rapid convergence. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than 2.5× the inference speed of LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than 50× while maintaining comparable output quality.


Poster
P3-#1410
Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models

Haolin He ⋅ Xingjian Du ⋅ Renhe Sun ⋅ Zheqi Dai ⋅ Yujia Xiao ⋅ Mingru Yang ⋅ Jiayi Zhou ⋅ Xiquan Li ⋅ Zhengxi Liu ⋅ Zining Liang ⋅ Chunyat Wu ⋅ Qianhua He ⋅ Tan Lee ⋅ Xie Chen ⋅ Wei-Long Zheng ⋅ Weiqiang Wang ⋅ Mark D. Plumbley ⋅ Jian Liu ⋅ Qiuqiang Kong

Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-training such as reinforcement learning (RL) has demonstrated promising results, multi-stage approaches such as supervised fine-tuning (SFT) followed by RL remain suboptimal. The allocation of data across multiple training stages to maximize LALM capabilities has not been fully explored, and large-scale, high-quality datasets for such research are also lacking. To address these problems, we first present AudioMCQ, a comprehensive audio multiple-choice question dataset comprising 571k samples with two kinds of chain-of-thought annotations. Second, we investigate the prevalent zero audio-contribution phenomenon in LALMs, where models derive correct answers solely from textual information without processing audio content. We propose Audio-Contribution Filtering to partition data into weak and strong audio-contribution subsets. Based on these insights, we develop two effective post-training paradigms: Weak-to-Strong (SFT on weak audio-contribution data followed by RL on strong audio-contribution data) and Mixed-to-Strong (SFT on mixed audio-contribution data followed by RL on strong audio-contribution data). We achieve first place in the DCASE 2025 Audio-Question-Answering challenge by using AudioMCQ. Additionally, leveraging our dataset with different training strategies, we achieve 78.2% on MMAU-test-mini, 75.6% on MMAU, 67.0% on MMAR, and 71.7% on MMSU, establishing new state-of-the-art performance.


Poster
P3-#1409
Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents

Heyang Gao ⋅ Zexu Sun ⋅ Erxue Min ⋅ Hengyi Cai ⋅ Shuaiqiang Wang ⋅ Dawei Yin ⋅ Xu Chen

Large Language Models (LLMs) as autonomous agents are increasingly tasked with solving complex, long-horizon problems. Aligning these agents via preference-based methods like Direct Preference Optimization (DPO) is a promising direction, yet it faces a critical granularity mismatch. Trajectory-level DPO provides stable signals but blurs where credit should be assigned within long trajectories, whereas step-level DPO offers fine-grained supervision but can be statistically noisy and data-inefficient when Monte Carlo rollouts are limited, and can struggle to fully exploit multi-step structured behaviors that only reveal their effect over several actions. To balance this trade-off, we introduce Hierarchical Preference Learning (HPL), a hierarchical framework that optimizes LLM agents by leveraging preference signals at multiple, synergistic granularities. While HPL incorporates trajectory- and step-level DPO for global and local policy stability, its core innovation lies in group-level preference optimization guided by a dual-layer curriculum. Our approach first decomposes expert trajectories into semantically coherent action groups and then generates contrasting suboptimal groups to enable preference learning at a fine-grained, sub-task level. Then, instead of treating all preference pairs equally, HPL introduces a curriculum scheduler that organizes the learning process from simple to complex. This curriculum is structured along two axes: the group length, representing sub-task complexity, and the sample difficulty, defined by the reward gap between preferred and dispreferred action groups. Experiments on three challenging agent benchmarks show that HPL outperforms existing state-of-the-art methods. Our analyses demonstrate that the hierarchical DPO loss effectively integrates preference signals across multiple granularities, while the dual-layer curriculum is crucial for enabling the agent to solve a wide range of tasks, from simple behaviors to complex multi-step sequences.


Poster
P3-#1408
ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists

Jie Ruan ⋅ Inderjeet Nair ⋅ Shuyang Cao ⋅ Amy Liu ⋅ Sheza Munir ⋅ Micah Pollens-Dempsey ⋅ Yune-Ting Chiang ⋅ Lucy Kates ⋅ Nicholas David ⋅ Sihan Chen ⋅ Ruxin Yang ⋅ Yuqian Yang ⋅ Jihyun Gump ⋅ Tessa Bialek ⋅ Vivek Sankaran ⋅ Margo Schlanger ⋅ Lu Wang

This paper introduces ExpertLongBench, an expert-level benchmark containing 11 tasks from 9 domains that reflect realistic expert workflows and applications. Beyond question answering, the application-driven tasks in ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and strict adherence to domain-specific requirements. Notably, each task in ExpertLongBench includes a rubric, designed or validated by domain experts, to specify task requirements and guide output evaluation. Furthermore, we propose CLEAR, an evaluation framework that supports accurate evaluation of long-form model outputs in our benchmark. To achieve fine-grained, expert-aligned evaluation, CLEAR derives checklists from both model outputs and references by extracting information corresponding to items in the task-specific rubric. Checklist items of model outputs are then compared with corresponding items of reference outputs to assess their correctness, enabling grounded evaluation. We benchmark 15 popular large language models (LLMs) and analyze components in CLEAR, showing that (1) existing LLMs, with the top performer Gemini-2.5-Pro achieving only a 33.4 F1 score, require significant improvement for expert-level tasks; (2) models can generate content corresponding to the required aspects, but that content is far from correct; and (3) accurate checklist extraction and comparison in CLEAR can be achieved by open-weight models for more scalable, reproducible, and low-cost usage.


Poster
P3-#1407
BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation via Lens of Dynamic Interactions

Nan Huo ⋅ Xiaohan Xu ⋅ Jinyang Li ⋅ Per Jacobsson ⋅ Shipei Lin ⋅ Bowen Qin ⋅ Binyuan Hui ⋅ Xiaolong Li ⋅ Ge Qu ⋅ Shuzheng Si ⋅ Linheng Han ⋅ Edward Alexander ⋅ Xintong Zhu ⋅ Rui Qin ⋅ Ruihan Yu ⋅ Yiyao Jin ⋅ Feige Zhou ⋅ Weihao Zhong ⋅ Yun Chen ⋅ Hongyu Liu ⋅ Chenhao Ma ⋅ Fatma Ozcan ⋅ Yannis Papakonstantinou ⋅ Reynold Cheng

Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks fall short of capturing this complexity, either by treating conversation histories as static context or by limiting evaluation to narrow, read-only (SELECT-ONLY) operations, thereby potentially failing to reflect the challenges encountered in production-grade database assistants. In this work, we introduce BIRD-INTERACT, a benchmark that restores this missing realism through: (1) a comprehensive interaction environment that couples each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from execution errors without human supervision; (2) two evaluation settings reflecting real-world interaction: a pre-defined conversational protocol (c-Interact) and a more open-ended agentic setting (a-Interact) in which the model autonomously decides when to query the user simulator or explore the DB environment; (3) a challenging task suite that covers the full CRUD spectrum for both business-intelligence and operational use cases, guarded by executable test cases. Each task features ambiguous and follow-up sub-tasks, requiring LLMs to engage in dynamic interaction. The suite is organized into two sets: a full set (BIRD-INTERACT-FULL) of 600 tasks, which unfold into up to 11,796 dynamic interactions, for a comprehensive overview of performance, and a lite set (BIRD-INTERACT-LITE) of 300 tasks with simplified databases, for detailed behavioral analysis of interactions and fast development of methods. Our empirical results highlight the difficulty of BIRD-INTERACT: the most recent flagship model GPT-5 completes only 8.67% of tasks in the c-Interact setting and 17.00% in the a-Interact setting on the full task suite. Further analysis via memory grafting and Interaction Test-time Scaling (ITS) validates the importance of effective interaction for achieving success in dynamic text-to-SQL tasks.


Poster
P3-#1406
Learning Facts at Scale with Active Reading

Jessy Lin ⋅ Vincent-Pierre Berges ⋅ Xilun Chen ⋅ Scott Yih ⋅ Gargi Ghosh ⋅ Barlas Oguz

LLMs are known to store vast amounts of knowledge in their parametric memory. However, learning and recalling facts from this memory is known to be unreliable, depending largely on the prevalence of particular facts in the training data and other factors which are poorly understood. Practitioners are lacking tools which will allow them to ensure that the models learn a given body of knowledge reliably and consistently. To this end, we propose Active Reading: a framework where we train models to study a given set of material with self-generated learning strategies. First, we demonstrate models trained with Active Reading on expert domains absorb significantly more knowledge than vanilla finetuning and other data augmentations. We train expert 8B models that achieve 66% on a Wikipedia-grounded subset of SimpleQA (+313% relative over vanilla finetuning) and 26% on FinanceBench (+160% relative over vanilla finetuning) by applying Active Reading to the source documents for each benchmark. Finally, we show that Active Reading can be utilized at pre-training scale to build more factual models. As a demonstration of this, we release WikiExpert-8B, a Wikipedia-expert model trained on 1 trillion generated tokens, which outcompetes models with hundreds of billions of parameters on factual QA.


Poster
P3-#1405
ATLAS: Constraints-Aware Multi-Agent Collaboration for Real-World Travel Planning

Jihye Choi ⋅ Jinsung Yoon ⋅ Jiefeng Chen ⋅ Somesh Jha ⋅ Tomas Pfister

While Large Language Models (LLMs) have shown remarkable advancements in reasoning and tool use, they often fail to generate optimal, grounded solutions under complex constraints. Real-world travel planning exemplifies these challenges, evaluating agents’ abilities to handle constraints that are explicit, implicit, and even evolving based on interactions with dynamic environments and user needs. In this paper, we present ATLAS, a general multi-agent framework designed to effectively handle the complex nature of constraint awareness in real-world travel planning tasks. ATLAS introduces a principled approach to address the fundamental challenges of constraint-aware planning through dedicated mechanisms for dynamic constraint management, iterative plan critique, and adaptive interleaved search. ATLAS demonstrates state-of-the-art performance on the TravelPlanner benchmark, improving the final pass rate from 23.3% to 44.4% over its best alternative. More importantly, our work is the first to demonstrate quantitative effectiveness on real-world travel planning tasks with live information search and multi-turn feedback. In this realistic setting, ATLAS showcases its superior overall planning performance, achieving an 84% final pass rate, which significantly outperforms baselines including ReAct (59%) and a monolithic agent (27%).


Poster
P3-#1404
EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models

Li Zhou ⋅ Lutong Yu ⋅ You Lyu ⋅ Yihang Lin ⋅ Zefeng Zhao ⋅ Junyi Ao ⋅ Yuhao Zhang ⋅ Wang Benyou ⋅ Haizhou Li

Speech Language Models (SLMs) have made significant progress in spoken language understanding. Yet it remains unclear whether they can fully perceive non-lexical vocal cues alongside spoken words, and respond with empathy that aligns with both emotional and contextual factors. Existing benchmarks typically evaluate linguistic, acoustic, reasoning, or dialogue abilities in isolation, overlooking the integration of these skills that is crucial for human-like, emotionally intelligent conversation. We present EchoMind, the first interrelated, multi-level benchmark that simulates the cognitive process of empathetic dialogue through sequential, context-linked tasks: spoken-content understanding, vocal-cue perception, integrated reasoning, and response generation. All tasks share identical and semantically neutral scripts that are free of explicit emotional or contextual cues, and controlled variations in vocal style are used to test the effect of delivery independent of the transcript. EchoMind is grounded in an empathy-oriented framework spanning 3 coarse and 12 fine-grained dimensions, encompassing 39 vocal attributes, and evaluated using both objective and subjective metrics. Testing 12 advanced SLMs reveals that even state-of-the-art models struggle with highly expressive vocal cues, limiting empathetic response quality. Analyses of prompt strength, speech source, and ideal vocal cue recognition reveal persistent weaknesses in instruction-following, resilience to natural speech variability, and effective use of vocal cues for empathy. These results underscore the need for SLMs that integrate linguistic content with diverse vocal cues to achieve truly empathetic conversational ability.


Poster
P3-#1403
DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

Zhiwei He ⋅ Tian Liang ⋅ Jiahao Xu ⋅ Qiuzhi Liu ⋅ Xingyu Chen ⋅ Yue Wang ⋅ Linfeng Song ⋅ Dian Yu ⋅ Zhenwen Liang ⋅ Wenxuan Wang ⋅ Zhuosheng Zhang ⋅ Rui Wang ⋅ Zhaopeng Tu ⋅ Haitao Mi ⋅ Dong Yu

Reinforcement learning (RL) with large language models shows promise in complex reasoning. However, its progress is hindered by the lack of large-scale training data that is sufficiently challenging, contamination-free and verifiable. To solve this problem, we introduce DeepMath-103K, a large-scale mathematical dataset designed with high difficulty (primarily levels 5-9), rigorous decontamination against numerous benchmarks, and verifiable answers for rule-based RL reward. It further includes three distinct R1 solutions adaptable for diverse training paradigms such as supervised fine-tuning. Spanning a wide range of mathematical topics, DeepMath-103K fosters the development of generalizable and advancing reasoning. Notably, models trained on DeepMath-103K achieve leading results on challenging mathematical benchmarks and demonstrate generalization beyond math such as biology, physics and chemistry, underscoring its broad efficacy.


Poster
P3-#1402
CaTS: Calibrated Test-Time Scaling for Efficient LLM Reasoning

Chengsong Huang ⋅ Langlin Huang ⋅ Jixuan Leng ⋅ Jiacheng Liu ⋅ Jiaxin Huang

Increasing test-time computation is a straightforward approach to enhancing the quality of responses in Large Language Models (LLMs). While Best-of-N sampling and Self-Consistency with majority voting are simple and effective, they require a fixed number of sampling responses for each query, regardless of its complexity. This could result in wasted computation for simpler questions and insufficient exploration for more challenging ones. In this work, we argue that model confidence of responses can be used for improving the efficiency of test-time scaling. Unfortunately, LLMs are known to be overconfident and provide unreliable confidence estimation. To address this limitation, we introduce Self-Calibration by distilling Self-Consistency-derived confidence into the model itself. This enables reliable confidence estimation at test time with one forward pass. We then design Calibrated Test-Time Scaling (CaTS), adapting common repeated sampling methods, such as self-consistency and Best-of-N to handle queries of various difficulty. We also show that CaTS-SC is provably better than vanilla self-consistency. Experiments on three LLMs across nine datasets demonstrate the effectiveness of our approach. Specifically, applying confidence-based Early Stopping (CaTS-ES) to Best-of-N improves MathQA accuracy from 73.7 to 83.6 with a sample budget of 16 responses, demonstrating the effectiveness of the confidence-based sampling strategy at inference time.
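
The confidence-based early stopping idea can be sketched as follows: draw responses one at a time, stop as soon as one is confident enough, and otherwise fall back to the most confident response within the budget. The `sample_fn` callable, the threshold value, and the return format are hypothetical choices for this example, and the sketch assumes each response already comes with a calibrated confidence; it is not the paper's implementation.

```python
def early_stopping_best_of_n(sample_fn, budget=16, threshold=0.9):
    """Sample up to `budget` responses and stop early on a confident one.

    `sample_fn` is a hypothetical callable returning (answer, confidence),
    where confidence is assumed to be a calibrated score in [0, 1].
    Returns the best answer, its confidence, and the number of samples used.
    """
    best_answer, best_conf, used = None, float("-inf"), 0
    for _ in range(budget):
        answer, conf = sample_fn()
        used += 1
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        if conf >= threshold:
            break  # confident enough: skip the remaining budget
    return best_answer, best_conf, used


# Stand-in for a model: a fixed stream of (answer, confidence) pairs.
stream = iter([("12", 0.30), ("15", 0.95), ("15", 0.99)])
answer, conf, used = early_stopping_best_of_n(lambda: next(stream))
```

Easy queries that yield a confident response early use few samples, while hard queries consume the full budget, which is the adaptive behavior the abstract contrasts with fixed-N sampling.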


Poster
P3-#1401
Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search

Ekaterina Fadeeva ⋅ Maiya Goloburda ⋅ Aleksandr Rubashevskii ⋅ Roman Vashurin ⋅ Artem Shelmanov ⋅ Preslav Nakov ⋅ Mrinmaya Sachan ⋅ Maxim Panov

Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring their agreement level. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also provide a theoretical lower bound on the beam set probability mass under which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.


Poster
P3-#1501
OSCAR: Online Soft Compression for RAG

Maxime Louis ⋅ Thibault Formal ⋅ Hervé Déjean ⋅ Stéphane Clinchant

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, leading to improved accuracy and relevance. However, scaling RAG pipelines remains computationally expensive as context length grows. On the one hand, hard compression methods have recently been proposed to prune the retrieved text on the fly, but with a limited compression ratio. On the other hand, soft compression methods perform a costly offline compression using a dedicated LLM, but achieve a higher compression rate. In this paper, we introduce OSCAR, a novel query-dependent online soft compression method for RAG. OSCAR bridges the gap between online hard and offline soft compression methods, bringing the best of both: OSCAR dynamically compresses retrieved information at inference time, eliminating storage overhead and enabling higher compression rates than existing methods. Our experiments demonstrate state-of-the-art performance with a 2-5x speed-up in inference and minimal, if any, accuracy loss, for LLMs ranging from 1B to 24B parameters.


Poster
P3-#1502
SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification

Rocky Klopfenstein ⋅ Yang He ⋅ Andrew Tremante ⋅ Yuepeng Wang ⋅ Nina Narodytska ⋅ Haoze Wu

Community-driven Text-to-SQL evaluation platforms play a pivotal role in tracking the state of the art of Text-to-SQL performance. The reliability of the evaluation process is critical for driving progress in the field. Current evaluation methods are largely test-based, which involves comparing the execution results of a generated SQL query and a human-labeled ground-truth on a static test database. Such an evaluation is optimistic, as two queries can coincidentally produce the same output on the test database while actually being different. In this work, we propose a new alternative evaluation pipeline, called SpotIt, where a formal bounded equivalence verification engine actively searches for a database that differentiates the generated and ground-truth SQL queries. We develop techniques to extend existing verifiers to support a richer SQL subset relevant to Text-to-SQL. A performance evaluation of ten Text-to-SQL methods on the high-profile BIRD dataset suggests that test-based methods can often overlook differences between the generated query and the ground-truth. Further analysis of the verification results reveals a more complex picture of the current Text-to-SQL evaluation.
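
The weakness of test-based evaluation can be shown with a toy case: two queries that agree on a static test database but differ on another one. The sketch below uses Python's built-in `sqlite3` module and a brute-force search over tiny one-row databases as a stand-in for a verifier; the table, queries, and search range are made up for illustration, and the enumeration is not formal bounded equivalence verification.

```python
import sqlite3

# Hypothetical ground-truth and generated queries differing only in > vs >=.
GOLD = "SELECT name FROM emp WHERE salary > 50 ORDER BY name"
PRED = "SELECT name FROM emp WHERE salary >= 50 ORDER BY name"


def run(rows, query):
    """Execute a query against a fresh in-memory database holding `rows`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (name TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO emp VALUES (?, ?)", rows)
    out = conn.execute(query).fetchall()
    conn.close()
    return out


def find_counterexample(q1, q2, salaries=range(45, 56)):
    """Search one-row databases for an instance on which the queries differ."""
    for s in salaries:
        rows = [("x", s)]
        if run(rows, q1) != run(rows, q2):
            return rows
    return None


# The static test database has no row with salary exactly 50,
# so both queries return the same result and the test passes.
test_db = [("ann", 40), ("bob", 60)]
passes_test = run(test_db, GOLD) == run(test_db, PRED)

# A differentiating database exposes the mismatch.
witness = find_counterexample(GOLD, PRED)
```

Here `passes_test` is true even though the queries are not equivalent, and `witness` is a database on which they disagree.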


Poster
P3-#1503
The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner

Zhouqi Hua ⋅ Wenwei Zhang ⋅ Chengqi Lyu ⋅ Yuzhe Gu ⋅ Songyang Gao ⋅ Kuikun Liu ⋅ Dahua Lin ⋅ Kai Chen

Length generalization, the ability to solve problems of longer sequences than those observed during training, poses a core challenge for Transformer-based large language models (LLMs). Although existing studies have predominantly focused on data-driven approaches for particular arithmetic operations or symbolic manipulation tasks, these approaches tend to be task-specific with limited performance on individual tasks. To pursue a more general solution, this paper focuses on a broader case of reasoning problems that are computable, i.e., problems that algorithms can solve and that can therefore be solved by a Turing machine, which operates over inputs of unbounded length. From this perspective, this paper proposes Turing mAchine Imitation Learning (TAIL) to improve the length generalization ability of LLMs. TAIL uses computer programs to directly synthesize chain-of-thought (CoT) data that imitate the execution process of a Turing machine, which linearly expands the reasoning steps into atomic states to alleviate shortcut pattern learning and uses an explicit memory fetch mechanism to reduce the difficulties of dynamic and long-range data access. To validate the universality and reliability of TAIL, we construct a challenging synthetic dataset covering 8 classes of algorithms and 18 tasks. Without bells and whistles, TAIL significantly improves the length generalization ability as well as the performance of Qwen2.5-7B in individual tasks using only synthetic data, surpassing previous methods and DeepSeek-R1. The experimental results reveal that the key concepts in the Turing machine, instead of the human-like thinking styles, are indispensable to TAIL's length generalization, through which the model exhibits read-and-write behaviors consistent with the properties of the Turing machine in its attention layers. This work provides a promising direction for future research in the learning of LLM reasoning from synthetic data.


Poster
P3-#1504
Are LLMs Really Not Knowledgeable? Mining the Submerged Knowledge in LLMs' Memory

Xingjian Tao ⋅ Yiwei Wang ⋅ Yujun Cai ⋅ Zhicheng YANG ⋅ Jing Tang

Large language models (LLMs) have shown promise as parametric knowledge bases, but often underperform on question answering (QA) tasks due to hallucinations and uncertainty. While prior work attributes these failures to knowledge gaps in the model’s parameters, we uncover a complementary phenomenon: LLMs frequently retain correct knowledge even when generating incorrect or "unsure" answers. By analyzing the token-level output distributions, we find that correct answers often appear among high-probability candidates, despite not being selected. Motivated by this, we propose Hits@k, a novel metric to evaluate latent knowledge retention independent of answer surface form. Our experiments reveal that LLMs possess significantly more factual knowledge than is reflected by standard QA accuracy. Building on these insights, we further examine the prevailing few-shot QA paradigm. We find that prompting strategies which allow "unsure" outputs can inadvertently suppress correct answers by discouraging low-confidence generation. We design a set of quantitative experiments to measure this suppression effect, offering practical guidance for future prompt and decoding design in knowledge-intensive tasks.
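
The abstract does not define Hits@k, so the following is only a sketch that assumes the conventional reading: a question counts as a hit when the gold answer appears among the k highest-probability candidates. The candidate lists, the gold answers, and the lowercase normalisation (a crude stand-in for surface-form independence) are all hypothetical.

```python
def hits_at_k(ranked_candidates, gold_answers, k):
    """Fraction of questions whose gold answer is among the top-k candidates."""
    hits = 0
    for candidates, gold in zip(ranked_candidates, gold_answers):
        # Assumed normalisation: ignore case and surrounding whitespace.
        top_k = [c.strip().lower() for c in candidates[:k]]
        if gold.strip().lower() in top_k:
            hits += 1
    return hits / len(gold_answers)

# Hypothetical candidates, sorted by model probability (highest first).
ranked = [
    ["Lyon", "Paris", "Nice"],     # top-1 is wrong, gold is ranked second
    ["Berlin", "Munich", "Bonn"],  # top-1 is right
    ["Milan", "Turin", "Venice"],  # gold ("Rome") is not in the list
]
gold = ["Paris", "Berlin", "Rome"]

top1 = hits_at_k(ranked, gold, k=1)  # coincides with plain accuracy
top3 = hits_at_k(ranked, gold, k=3)  # also credits knowledge ranked below top-1
```

In this toy data the gap between `top3` and `top1` comes from the first question, where the correct answer is present among the candidates but not selected.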


Poster
P3-#1505
Measuring and Mitigating Rapport Bias of Large Language Models under Multi-Agent Social Interactions

Maojia Song ⋅ Tej Deep Pala ⋅ Ruiwen Zhou ⋅ Weisheng Jin ⋅ Amir Zadeh ⋅ Chuan Li ⋅ Dorien Herremans ⋅ Soujanya Poria

Large language models (LLMs) are increasingly deployed in multi-agent systems (MAS) as components of collaborative intelligence, where peer interactions dynamically shape individual decision-making. While prior work has largely focused on conformity bias, we broaden the scope to examine how LLMs build rapport from previous interactions, resist misinformation, and integrate peer input during collaboration, which are key factors for achieving collective intelligence under complex social dynamics. We introduce KAIROS, a benchmark simulating quiz contests with peer agents of varying reliability, offering fine-grained control over conditions such as expert–novice roles, noisy crowds, and adversarial peers. LLMs receive both historical interactions and current peer responses, allowing systematic investigation into how rapport, peer action, and self-confidence influence decisions. To mitigate this vulnerability, we evaluate prompting, supervised fine-tuning, and reinforcement learning using Group Relative Policy Optimization (GRPO) across multiple models. Our results show that model size plays a central role in moderating susceptibility to social influence: larger models exhibit stronger resilience and benefit from prompting-based mitigation, whereas smaller models are more vulnerable. For the latter, carefully configured GRPO training improves both robustness and overall performance. Our code and datasets are available at: https://anonymous.4open.science/r/KAIROS-4F71


Poster
P3-#1108
P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization

RuiPeng Zhang ⋅ Zhihao Li ⋅ Haozhang Yuan ⋅ C.L.Philip Chen ⋅ Tong Zhang

Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) aims to learn directly from the corrected preferences provided by humans, thereby addressing the hallucination issue. Despite its success, this paradigm has yet to specifically target the perceptual bottleneck in attended regions or address insufficient Visual Robustness against image degradation. Furthermore, existing preference pairs are often vision-agnostic and their inherently off-policy nature limits their effectiveness in guiding model learning. To address these challenges, we propose Perceptual Processing Direct Preference Optimization (P$^2$-DPO), a novel training paradigm in which the model generates and learns from its own preference pairs, thereby directly addressing the identified visual bottlenecks while inherently avoiding the issues of vision-agnostic and off-policy data. It introduces: (1) an on-policy preference-pair construction method targeting Focus-and-Enhance perception and Visual Robustness, and (2) a well-designed Calibration Loss to precisely align visual signals with the causal generation of text. Experimental results demonstrate that with a comparable amount of training data and cost, P$^2$-DPO outperforms strong baselines that rely on costly human feedback on benchmarks. Furthermore, evaluations on Attention Region Fidelity (ARF) and image degradation scenarios validate the effectiveness of P$^2$-DPO in addressing the perceptual bottleneck in attended regions and improving Visual Robustness against degraded inputs.


Poster
P3-#1506
TIPS: Turn-level Information-Potential Reward Shaping for Search-Augmented LLMs

Yutao Xie ⋅ Nathaniel Thomas ⋅ Nick Hansen ⋅ Yang Fu ⋅ Li Li ⋅ Xiaolong Wang

Search-augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open-domain question answering (QA), but training remains a significant challenge. The optimization is often unstable due to sparse rewards and difficult credit assignment across reasoning and tool calls. To address this, we introduce Turn-Level Information Potential Reward Shaping (TIPS), a simple framework that assigns dense, turn-level rewards to each reasoning + tool-call segment based on the increased likelihood of the correct answer under a teacher model. By leveraging potential-based reward shaping, TIPS offers fine-grained and policy-invariant guidance that overcomes the limitations of outcome-only optimization. Evaluated on seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability. For instance, with a Qwen-2.5 7B Instruct model, TIPS improves the average Exact Match score by 11.8% and F1 by 13.6% relative to PPO. Our results demonstrate that turn-level information-potential reward shaping provides an effective and general solution to sparse-reward credit assignment for multi-turn LLM reasoning.
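
A minimal sketch of potential-based shaping at the turn level follows. It assumes the potential is the teacher's log-probability of the correct answer after each turn; the abstract does not give the exact form, and the probabilities below are hypothetical. The shaping term is the classic F_t = gamma * Phi(s_{t+1}) - Phi(s_t).

```python
import math

def shaped_turn_rewards(answer_probs, gamma=1.0):
    """Turn-level shaping rewards F_t = gamma * Phi(s_{t+1}) - Phi(s_t).

    answer_probs[t] is the teacher's probability of the correct answer given
    the context after t turns; Phi is its log (an assumption of this sketch).
    """
    potentials = [math.log(p) for p in answer_probs]
    return [gamma * nxt - cur for cur, nxt in zip(potentials, potentials[1:])]

# Hypothetical teacher probabilities before any turn and after each of 3 turns.
probs = [0.05, 0.20, 0.10, 0.80]
rewards = shaped_turn_rewards(probs)

# With gamma = 1 the rewards telescope to Phi(final) - Phi(initial).
total = sum(rewards)
```

The first and third turns raise the likelihood of the correct answer and get positive rewards, while the second lowers it and is penalised. Because the sum telescopes, the dense signal adds only a trajectory-level constant determined by the start and end potentials, which is the usual intuition behind the policy-invariance property of potential-based shaping.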


Poster
P3-#1507
From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

Tianqiao Liu ⋅ Xueyi Li ⋅ Hao Wang ⋅ Haoxuan Li ⋅ Zhichao Chen ⋅ Weiqi Luo ⋅ Zitao Liu

Recent advances in large language models (LLMs) have attracted significant interest in extending their capabilities to multimodal scenarios, particularly for speech-to-speech (S2S) conversational systems. However, existing multimodal models handling interleaved audio and text rely on autoregressive (AR) methods, overlooking that text depends on target-target relations whereas audio depends mainly on source-target relations. In this work, we propose Text-to-Talk (TtT), a unified audio-text framework that integrates AR text generation with non-autoregressive (NAR) audio diffusion in a single Transformer. By leveraging the any-order AR property of absorbing discrete diffusion, our approach provides a unified training objective for text and audio. To support this hybrid generation paradigm, we design a modality-aware attention mechanism that enforces causal decoding for text while allowing bidirectional modeling within audio spans, and further introduce three training strategies that reduce train-test discrepancies. During inference, TtT employs block-wise diffusion to synthesize audio in parallel while flexibly handling variable-length outputs. Comprehensive experiments on audio question answering (Audio-QA), automatic speech recognition (ASR), automated audio captioning (AAC) and S2S benchmarks show that TtT consistently surpasses strong AR and NAR baselines, with additional ablation and training-strategy analyses confirming the contribution of each component.
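
One way to picture a mask that is causal for text and bidirectional within audio spans is sketched below. The exact masking rule used by TtT is not specified in the abstract; the rule here (causal everywhere, plus full visibility inside the same audio span) and the token labels are assumptions made for illustration.

```python
def modality_aware_mask(modalities):
    """Build an attention mask from per-token (modality, span_id) labels.

    mask[i][j] is True when query position i may attend to key position j.
    """
    n = len(modalities)
    mask = [[False] * n for _ in range(n)]
    for i, (mod_i, span_i) in enumerate(modalities):
        for j, (mod_j, span_j) in enumerate(modalities):
            causal = j <= i  # every token sees itself and the past
            same_audio_span = (
                mod_i == "audio" and mod_j == "audio" and span_i == span_j
            )  # audio tokens also see later tokens of their own span
            mask[i][j] = causal or same_audio_span
    return mask

# Hypothetical sequence: two text tokens, a three-token audio span, one text token.
tokens = [("text", 0), ("text", 0), ("audio", 1), ("audio", 1), ("audio", 1), ("text", 2)]
mask = modality_aware_mask(tokens)
```

In the resulting matrix the text rows are lower-triangular, while the rows for positions 2 to 4 form a fully visible block over the audio span.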


Poster
P3-#1508
Plan and Budget: Effective and Efficient Test-Time Scaling on Reasoning Large Language Models

Junhong Lin ⋅ Xinyue Zeng ⋅ Jie Zhu ⋅ Song Wang ⋅ Julian Shun ⋅ Jun Wu ⋅ Dawei Zhou

Large Language Models (LLMs) have achieved remarkable success in complex reasoning tasks, but their inference remains computationally inefficient. We observe a common failure mode in many prevalent LLMs, overthinking, where models generate verbose and tangential reasoning traces even for simple queries. Recent work has tried to mitigate this by enforcing fixed token budgets; however, this can lead to underthinking, especially on harder problems. Through empirical analysis, we identify that this inefficiency often stems from unclear problem-solving strategies. To formalize this, we develop a theoretical model, BAM (Budget Allocation Model), which models reasoning as a sequence of sub-questions with varying uncertainty, and introduce the E3 metric to capture the trade-off between correctness and computation efficiency. Building on theoretical results from BAM, we propose Plan-and-Budget, a model-agnostic, test-time framework that decomposes complex queries into sub-questions and allocates token budgets based on estimated complexity using adaptive scheduling. Plan-and-Budget improves reasoning efficiency across a range of tasks and models, achieving up to 70% accuracy gains, 39% token reduction, and 193.8% improvement in E3. Notably, it improves the efficiency of a smaller model (DS-Qwen-32B) to match the efficiency of a larger model (DS-LLaMA-70B), demonstrating Plan-and-Budget’s ability to close performance gaps without retraining. Our code is available at https://github.com/junhongmit/P-and-B.


Poster
P3-#1509
Efficient Reasoning with Balanced Thinking

Yulin Li ⋅ Tengyao Tu ⋅ Li Ding ⋅ Junjie Wang ⋅ Huiling Zhen ⋅ Yixin Chen ⋅ Yong Li ⋅ Zhuotao Tian

Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs’ reasoning trajectories. A dynamic control function modulates this vector’s strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Code and models will be made publicly available.


Poster
P3-#1510
Physics-Informed Audio-Geometry-Grid Representation Learning for Universal Sound Source Localization

Min-Sang Baek ⋅ Gyeong-Su Kim ⋅ Donghyun Kim ⋅ Joon-Hyuk Chang

Sound source localization (SSL) is a fundamental task in spatial audio understanding, yet most deep neural network-based methods are constrained by fixed array geometries and predefined directional grids, limiting generalizability and scalability. To address these issues, we propose audio-geometry-grid representation learning (AGG-RL), a novel framework that jointly learns audio-geometry and grid representations in a shared latent space, enabling both geometry-invariant and grid-flexible SSL. Moreover, to enhance generalizability and interpretability, we introduce two physics-informed components: a learnable non-uniform discrete Fourier transform (LNuDFT), which optimizes the dense allocation of frequency bins in a non-uniform manner to emphasize informative phase regions, and a relative microphone positional encoding (rMPE), which encodes relative microphone coordinates in accordance with the nature of inter-channel time differences. Experiments on synthetic and real datasets demonstrate that AGG-RL achieves superior performance, particularly under unseen conditions. The results highlight the potential of representation learning with physics-informed design towards a universal solution for spatial acoustic scene understanding across diverse scenarios.


Poster
P3-#1511
Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs

Zishang Jiang ⋅ Jinyi Han ⋅ tingyun li ⋅ Xinyi Wang ⋅ Sihang Jiang ⋅ Zhaoqian Dai ⋅ Ma Shuguang ⋅ Fei Yu ⋅ Jiaqing Liang ⋅ Yanghua Xiao

Reinforcement Learning with Verifiable Rewards (RLVR) has become a widely adopted technique for enhancing the reasoning ability of Large Language Models (LLMs). However, the effectiveness of RLVR strongly depends on the capability of base models. This issue arises because it requires the model to have sufficient capability to perform high-quality exploration, which involves both effectiveness and diversity. Unfortunately, existing methods address this issue by imitating expert trajectories, which improves effectiveness but neglects diversity. To address this, we argue that the expert needs to provide guidance only at critical decision points rather than along the entire reasoning path. Based on this insight, we propose MENTOR: Mixed-policy Expert Navigation for Token-level Optimization of Reasoning, a framework that provides expert guidance only at critical decision points to perform effective and diverse exploration in RLVR. Extensive experiments show that MENTOR enables models to capture the essence of expert strategies rather than surface imitation, thereby performing high-quality exploration and achieving superior overall performance. Our code is available online.


Poster
P3-#1512
Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

Ziyang Ma ⋅ Ruiyang Xu ⋅ Zhenghao Xing ⋅ Yunfei Chu ⋅ Yuxuan Wang ⋅ Jinzheng He ⋅ Jin Xu ⋅ Pheng-Ann Heng ⋅ Kai Yu ⋅ Junyang Lin ⋅ Ensiong Chng ⋅ Xie Chen

Fine-grained perception of multimodal information is critical for advancing human–AI interaction. With recent progress in audio–visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and accurately describe fine-grained details remains underexplored. In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We first identify an inherent "co-growth" between the level of detail and the degree of hallucination in current OLMs. To address this, we propose Omni-Detective, an agentic data generation pipeline integrating tool-calling, to autonomously produce highly detailed yet minimally hallucinatory multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio–visual detailed perception. Under the cascade evaluation protocol, Audio-Captioner achieves the best performance on MMAU and MMAR among all open-source models, surpassing Gemini 2.5 Flash and delivering performance comparable to Gemini 2.5 Pro. On existing detailed captioning benchmarks, Omni-Captioner sets a new state-of-the-art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 testset. Given the absence of a dedicated benchmark for omni detailed perception, we design Omni-Cloze, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captioning that ensures stable, efficient, and reliable assessment.
Experimental results and analysis demonstrate the effectiveness of Omni-Detective in generating high-quality detailed captions, as well as the superiority and human preference alignment of Omni-Cloze in evaluating such detailed captions. The data pipeline, models, and benchmark are open-sourced at https://github.com/ddlBoJack/Omni-Captioner to facilitate further research on omni detailed perception.


Poster
P3-#1513
Prompt and Parameter Co-Optimization for Large Language Models

Xiaohe Bo ⋅ Rui Li ⋅ Zexu Sun ⋅ Quanyu Dai ⋅ Zeyu Zhang ⋅ Zihang Tian ⋅ Xu Chen ⋅ Zhenhua Dong

Prompt optimization and fine-tuning are two major approaches to improve the performance of Large Language Models (LLMs). They enhance the capabilities of LLMs from complementary perspectives: the former through explicit natural language, and the latter through implicit parameter updates. However, prior work has typically studied them in isolation, leaving their synergistic potential largely underexplored. To bridge this gap, in this paper, we introduce MetaTuner, a novel framework that jointly integrates prompt optimization and fine-tuning for LLM training. Specifically, we introduce two neural networks to generate prompts and parameters, respectively, while allowing them to share a common bottom encoding layer to enable knowledge sharing. Guided by the final supervised signals, our framework is optimized to discover the optimal combinations of prompts and parameters. Given that prompt learning involves discrete optimization while fine-tuning operates in a continuous parameter space, we design a supervised regularization loss to train our framework effectively. Extensive experiments across diverse benchmarks show that our method consistently outperforms the baselines. To benefit the research community, we have released our project at https://github.com/BoXiaohe/MetaTuner.


Poster
P3-#1312
SPELL: Self-Play Reinforcement Learning for Evolving Long-Context Language Models

Ziyi Yang ⋅ Weizhou Shen ⋅ Chenliang Li ⋅ Ruijun Chen ⋅ Fanqi Wan ⋅ Ming Yan ⋅ Xiaojun Quan ⋅ Fei Huang

Progress in long-context reasoning for large language models (LLMs) has lagged behind other recent advances. This gap arises not only from the intrinsic difficulty of processing long texts, but also from the scarcity of reliable human annotations and programmatically verifiable reward signals. In this paper, we propose SPELL, a multi-role self-play reinforcement learning framework that enables scalable, label-free optimization for long-context reasoning. SPELL integrates three cyclical roles—questioner, responder, and verifier—within a single model to enable continual self-improvement. The questioner generates questions from raw documents paired with reference answers; the responder learns to solve these questions based on the documents; and the verifier evaluates semantic equivalence between the responder’s output and the questioner's reference answer, producing reward signals to guide continual training. To stabilize training, we introduce an automated curriculum that gradually increases document length and a reward function that adapts question difficulty to the model’s evolving capabilities. Extensive experiments on six long-context benchmarks show that SPELL consistently improves performance across diverse LLMs and outperforms equally sized models fine-tuned on large-scale annotated data. Notably, SPELL achieves an average 7.6-point gain in pass@8 on the strong reasoning model Qwen3-30B-A3B-Thinking, raising its performance ceiling and showing promise for scaling to even more capable models. Our code is available at https://github.com/Tongyi-Zhiwen/Qwen-Doc.


Poster
P3-#1514
Learning to Summarize by Learning to Quiz: Adversarial Agentic Collaboration for Long Document Summarization

Weixuan Wang ⋅ Minghao Wu ⋅ Barry Haddow ⋅ Alexandra Birch

Long document summarization remains a significant challenge for current large language models (LLMs), as existing approaches commonly struggle with information loss, factual inconsistencies, and coherence issues when processing excessively long documents. We propose SummQ, a novel adversarial multi-agent framework that addresses these limitations through collaborative intelligence between specialized agents operating in two complementary domains: summarization and quizzing. Our approach employs summary generators and reviewers that work collaboratively to create and evaluate comprehensive summaries, while quiz generators and reviewers create comprehension questions that serve as continuous quality checks for the summarization process. This adversarial dynamic, enhanced by an examinee agent that validates whether the generated summary contains the information needed to answer the quiz questions, enables iterative refinement through multifaceted feedback mechanisms. We evaluate SummQ on three widely used long document summarization benchmarks. Experimental results demonstrate that our framework significantly outperforms existing state-of-the-art methods across ROUGE and BERTScore metrics, as well as in LLM-as-a-Judge and human evaluations. Our comprehensive analyses reveal the effectiveness of the multi-agent collaboration dynamics, the influence of different agent configurations, and the impact of the quizzing mechanism. This work establishes a new approach for long document summarization that uses adversarial agentic collaboration to improve summarization quality.


Poster
P3-#1515
Scaling Knowledge Graph Construction through Synthetic Data Generation and Distillation

Prafulla Kumar Choubey ⋅ Xin Su ⋅ Man Luo ⋅ XIANGYU PENG ⋅ Caiming Xiong ⋅ Tiep Le ⋅ Shachar Rosenman ⋅ Vasudev Lal ⋅ Phil Mui ⋅ Ricky Ho ⋅ Phillip Howard ⋅ Chien-Sheng Wu

Document-level knowledge graph (KG) construction faces a fundamental scaling challenge: existing methods either rely on expensive large language models (LLMs), making them economically nonviable for large-scale corpora, or employ smaller models that produce incomplete and inconsistent graphs. We find that this limitation stems not from model capabilities but from insufficient training on high-quality document-level KG data. To address this gap, we introduce SynthKG, a multi-step data synthesis pipeline that generates high-quality document-KG pairs through systematic chunking, decontextualization, and structured extraction using LLMs. By fine-tuning a smaller LLM on synthesized document-KG pairs, we streamline the multi-step process into a single-step KG generation approach called Distill-SynthKG. Furthermore, we repurpose existing question-answering datasets to construct KG evaluation datasets and introduce new evaluation metrics. Using KGs produced by Distill-SynthKG, we also design a novel graph-based retrieval framework for RAG. Experimental results demonstrate that Distill-SynthKG not only surpasses all baseline models in KG quality (including models up to eight times larger) but also consistently improves in retrieval and question-answering tasks. Additionally, our proposed graph retrieval framework outperforms all KG-retrieval methods across multiple benchmark datasets.


Poster
P3-#1516
On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation

Wenbo Shang ⋅ Yuxi Sun ⋅ Jing Ma ⋅ Xin Huang

Humor is a common yet intricate form of human language in daily life. Humor generation, especially in multi-modal scenarios, is a challenging task for large language models (LLMs). It is typically framed as funny caption generation for images, which requires visual understanding, humor reasoning, creative imagination, and so on. Existing LLM-based approaches rely on reasoning chains or self-improvement, which suffer from limited creativity and interpretability. To address these bottlenecks, we develop a novel LLM-based humor generation mechanism based on a fundamental humor theory, GTVH. To produce funny and script-opposite captions, we introduce a humor-theory-driven multi-role LLM collaboration framework augmented with humor retrieval (HOMER). The framework consists of three LLM-based roles: (1) conflicting-script extractor that grounds humor in key script oppositions, forming the basis of caption generation; (2) retrieval-augmented hierarchical imaginator that identifies key humor targets and expands their creative space through diverse associations structured as imagination trees; and (3) caption generator that produces funny and diverse captions conditioned on the obtained knowledge. Extensive experiments on two New Yorker Cartoon benchmarking datasets show that HOMER outperforms state-of-the-art baselines and powerful LLM reasoning strategies on multi-modal humor captioning.


Poster
P3-#1517
EntropyLong: Effective Long-Context Training via Predictive Uncertainty

jia junlong ⋅ Ziyang Chen ⋅ Xing W ⋅ Chaochen Gao ⋅ Zijia Lin ⋅ Songlin Hu ⋅ Binghui Guo

Training long-context language models to capture long-range dependencies requires specialized data construction. Current approaches, such as generic text concatenation or heuristic-based variants, frequently fail to guarantee genuine long-range dependencies. We propose EntropyLong, a novel data construction method that leverages predictive uncertainty to verify dependency quality. Our approach identifies high-entropy positions in documents, retrieves semantically relevant contexts from large corpora, and verifies their utility by assessing whether they reduce prediction entropy. This model-in-the-loop verification ensures each dependency represents measurable information gain rather than spurious correlation. We construct training samples with long-range dependencies by combining original documents with these verified contextual supplements. Using FineWeb-Edu and Cosmopedia, we generate a dataset of 128K-length sequences with verified dependencies. Models trained on this data demonstrate significant improvements on RULER benchmarks, particularly in tasks requiring distant information. Following instruction fine-tuning, our models also achieve substantial gains on LongBench-v2, demonstrating enhanced long-context understanding. Extensive ablation studies further validate the necessity and effectiveness of entropy-based verification for long-context training.
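
The entropy-reduction check described above can be sketched in a few lines. The distributions and the `min_reduction` threshold are hypothetical; the abstract states only that a retrieved context is verified by whether it reduces prediction entropy, not how large the reduction must be.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def context_is_useful(probs_without, probs_with, min_reduction=0.1):
    """Keep a retrieved context only if it lowers prediction entropy enough."""
    return entropy(probs_without) - entropy(probs_with) >= min_reduction

# Hypothetical next-token distributions at one high-entropy position.
without_context = [0.25, 0.25, 0.25, 0.25]    # model is maximally unsure
with_good_context = [0.85, 0.05, 0.05, 0.05]  # retrieved text resolves it
with_bad_context = [0.26, 0.24, 0.25, 0.25]   # retrieved text barely helps

keep_good = context_is_useful(without_context, with_good_context)
keep_bad = context_is_useful(without_context, with_bad_context)
```

Only the first retrieved context would be kept under this rule. In a real pipeline the two distributions would come from running the language model with and without the candidate context prepended.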


Poster
P3-#1518
Knowing When to Quit: Probabilistic Early Exits for Speech Separation Networks

Kenny Olsen ⋅ Mads Østergaard ⋅ Karl Ulbæk ⋅ Søren Føns Nielsen ⋅ Rasmus Malik Høegh Lindrup ⋅ Bjørn Jensen ⋅ Morten Mørup

In recent years, deep learning-based single-channel speech separation has improved considerably, in large part driven by increasingly compute- and parameter-efficient neural network architectures. Most such architectures are, however, designed with a fixed compute and parameter budget and consequently cannot scale to varying compute demands or resources, which limits their use in embedded and heterogeneous devices such as mobile phones and hearables. To enable such use-cases we design a neural network architecture for speech separation and enhancement capable of early-exit, and we propose an uncertainty-aware probabilistic framework to jointly model the clean speech signal and error variance which we use to derive probabilistic early-exit conditions in terms of desired signal-to-noise ratios. We evaluate our methods on both speech separation and enhancement tasks where we demonstrate that early-exit capabilities can be introduced without compromising reconstruction, and that when trained on variable-length audio our early-exit conditions are well-calibrated and lead to considerable compute savings when used to dynamically scale compute at test time while remaining directly interpretable.


Poster
P3-#1519
Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction

Xiang Li ⋅ Jiabao Gao ⋅ Sipei Lin ⋅ Xuan Zhou ⋅ Chi Zhang ⋅ Bo Cheng ⋅ Jiale Han ⋅ Wang Benyou

The pursuit of human-like conversational agents has long been guided by the Turing test. For modern speech-to-speech (S2S) systems, a critical yet unanswered question is whether they can converse like humans. To tackle this, we conduct the first Turing test for S2S systems, collecting 2,968 human judgments on dialogues between 9 state-of-the-art S2S systems and 28 human participants. Our results deliver a clear finding: no existing evaluated S2S system passes the test, revealing a significant gap in human-likeness. To diagnose this failure, we develop a fine-grained taxonomy of 18 human-likeness dimensions and crowd-annotate our collected dialogues accordingly. Our analysis shows that the bottleneck is not semantic understanding but stems from paralinguistic features, emotional expressivity, and conversational persona. Furthermore, we find that off-the-shelf AI models perform unreliably as Turing test judges. In response, we propose an interpretable model that leverages the fine-grained human-likeness ratings and delivers accurate and transparent human-vs-machine discrimination, offering a powerful tool for automatic human-likeness evaluation. Our work establishes the first human-likeness evaluation for S2S systems and moves beyond binary outcomes to enable detailed diagnostic insights, paving the way for human-like improvements in conversational AI systems.


Poster
P3-#1520
Influence-Preserving Proxies for Gradient-Based Data Selection in LLM FineTuning

Sirui Chen ⋅ Yunzhe Qi ⋅ Mengting Ai ⋅ Yifan Sun ⋅ Ruizhong Qiu ⋅ Jiaru Zou ⋅ Jingrui He

Supervised fine-tuning (SFT) relies critically on selecting training data that most benefits the model's downstream performance. Gradient-based data selection methods such as TracIn and Influence Functions leverage influence to identify useful samples, but their computational cost scales poorly, making them impractical for multi-billion-parameter large language models (LLMs). A common alternative is to use off-the-shelf smaller models as proxies, but they remain suboptimal since their learning dynamics are unclear, their sizes cannot be flexibly adjusted, and they cannot be further aligned with the target model in terms of gradient-based influence estimation. To address these challenges, we introduce IProX, a two-stage framework that derives influence-preserving proxies directly from the target model. It first applies a low-rank compression stage to preserve influence information of the target model, and then an aligning stage to align both model gradients and logits, thereby constructing proxies that flexibly control computational cost while retaining the target model’s influence. Experimental results across diverse LLM families and evaluation tasks show that IProX consistently outperforms off-the-shelf proxies and baseline methods. On Qwen3-4B, a 1.5B proxy constructed with IProX achieves stronger performance than the larger 1.7B off-the-shelf proxy. Notably, on Llama3.2, IProX achieves better performance than baselines while reducing computational cost by more than half relative to the full 3B model. These results show that IProX provides effective influence-preserving proxies, making gradient-based data selection more scalable for LLMs.


Poster
P3-#1521
Incentive-Aligned Multi-Source LLM Summaries

Yanchen Jiang ⋅ Zhe Feng ⋅ Aranyak Mehta

Large language models (LLMs) are increasingly used in modern search and answer systems to synthesize multiple, sometimes conflicting, texts into a single response, yet current pipelines offer weak incentives for sources to be accurate and are vulnerable to adversarial content. We introduce Truthful Text Summarization (TTS), an incentive-aligned framework that improves factual robustness without ground-truth labels. TTS (i) decomposes a draft synthesis into atomic claims, (ii) elicits each source’s stance on every claim, (iii) scores sources with an adapted multi-task peer-prediction mechanism that rewards informative agreement, and (iv) filters unreliable sources before re-summarizing. We establish formal guarantees that align a source’s incentives with informative honesty, making truthful reporting the utility-maximizing strategy. Experiments show that TTS improves factual accuracy and robustness while preserving fluency, aligning exposure with informative corroboration and disincentivizing manipulation.


Poster
P3-#1522
Dynamic Early Exit in Reasoning Models

Chenxu Yang ⋅ Qingyi Si ⋅ Yongjie Duan ⋅ Zheliang Zhu ⋅ Chenyu Zhu ⋅ Qiaowei Li ⋅ Minghui Chen ⋅ Zheng Lin ⋅ Weipinng Wang

Recent advances in large reasoning language models (LRMs) rely on test-time scaling, which extends long chain-of-thought (CoT) generation to solve complex tasks. However, overthinking in long CoT not only reduces problem-solving efficiency, but also risks accuracy loss due to extremely detailed or redundant reasoning steps. We propose a simple yet effective method that allows LLMs to self-truncate CoT sequences by early exit during generation. Instead of relying on fixed heuristics, the proposed method monitors model behavior at potential reasoning transition points and dynamically terminates the next reasoning chain's generation when the model exhibits high confidence in a trial answer. Our method requires no additional training and can be seamlessly integrated into existing o1-like reasoning LLMs. Experiments on 10 reasoning benchmarks (e.g., GSM8K, MATH-500, AMC, GPQA, AIME and LiveCodeBench) show that the proposed method is consistently effective on 11 cutting-edge reasoning LLMs of varying series and sizes, reducing the length of CoT sequences by an average of 19.1% to 80.1% while improving accuracy by 0.3% to 5.0%.
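
The early-exit rule described in this abstract can be pictured with a minimal Python sketch. It assumes that, at each reasoning transition point, a trial answer has been decoded and its per-token probabilities are available. The function names, the geometric-mean confidence measure, and the 0.95 threshold are hypothetical choices for illustration; the abstract does not specify how confidence is computed.

```python
import math
from typing import List, Sequence


def trial_answer_confidence(token_probs: Sequence[float]) -> float:
    """Geometric mean of a trial answer's token probabilities (assumed measure)."""
    if not token_probs:
        return 0.0
    log_sum = sum(math.log(max(p, 1e-12)) for p in token_probs)
    return math.exp(log_sum / len(token_probs))


def first_exit_point(
    trial_probs_per_point: List[Sequence[float]],
    threshold: float = 0.95,  # hypothetical threshold, not taken from the paper
) -> int:
    """Index of the first transition point whose trial answer is confident enough.

    Returns -1 when no transition point reaches the threshold, meaning the
    chain of thought would run to its natural end.
    """
    for index, probs in enumerate(trial_probs_per_point):
        if trial_answer_confidence(probs) >= threshold:
            return index
    return -1
```

For example, `first_exit_point([[0.5, 0.6], [0.99, 0.98]])` stops at the second transition point, because only that trial answer clears the threshold.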


Poster
P3-#1523
GUI-Shift: Enhancing VLM-Based GUI Agents through Self-supervised Reinforcement Learning

Longxi Gao ⋅ Li Zhang ⋅ Pengzhi Gao ⋅ WEI LIU ⋅ Jian Luan ⋅ Mengwei Xu

Training effective Vision-Language Models (VLMs) for GUI agents typically depends on large-scale annotated datasets, whose collection is both labor-intensive and error-prone. We introduce K-step GUI Transition, a self-supervised inverse dynamics task in which VLMs learn GUI dynamics by predicting the initial action that causes a transition between two GUI states. This approach eliminates the need for natural language instructions and enables scalable dataset construction from existing GUI trajectories or automated exploration. Building on this task, we propose GUI-Shift, a reinforcement learning (RL) framework that combines rule-based optimization with data filtering to improve VLM performance. We conduct extensive experiments using multiple VLM backbones across five benchmarks, spanning GUI task automation (AndroidControl, GUI Odyssey, AndroidWorld) and GUI grounding (ScreenSpot-v2, ScreenSpot-Pro). Our results show that training on GUI-Shift generalizes well to both GUI automation and grounding tasks, yielding up to an 11.2% increase in GUI automation accuracy. This study underscores the potential of self-supervised RL to leverage unlabeled GUI trajectories and offers a scalable alternative to training with annotated samples. GUI-Shift will be open-sourced at: https://github.com/UbiquitousLearning/GUI-Shift.


Poster
P3-#1524
Search Arena: Analyzing Search-Augmented LLMs

Mihran Miroyan ⋅ Tsung-Han Wu ⋅ Logan King ⋅ Tianle Li ⋅ Jiayi Pan ⋅ Xinyan Hu ⋅ Wei-Lin Chiang ⋅ Anastasios Angelopoulos ⋅ trevor darrell ⋅ Narges Norouzi ⋅ Joseph E Gonzalez

Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations and types of cited sources, even when the cited content does not directly support the associated claims, uncovering a gap between perceived and actual credibility. To assess cross-setting performance, we conduct cross-arena analyses by testing search-augmented LLMs in a general purpose chat environment and conventional LLMs in search-heavy settings. We find that web search does not degrade and may even improve performance in non-search settings; however, the quality in search settings is significantly affected if solely relying on the model's parametric knowledge. We open-sourced the dataset to support future research.


Poster
P3-#1525
Continuous Audio Language Models

Simon Rouard ⋅ Manu Orsini ⋅ Axel Roebel ⋅ Neil Zeghidour ⋅ Alexandre Défossez

Audio Language Models (ALMs) have emerged as the dominant paradigm for speech and music generation by representing audio as sequences of discrete tokens. Yet, unlike text tokens, which are invertible, audio tokens are extracted from lossy codecs with a limited bitrate. As a consequence, increasing audio quality requires generating more tokens, which imposes a trade-off between fidelity and computational cost. We address this issue by studying Continuous Audio Language Models (CALM). These models instantiate a large Transformer backbone that produces a contextual embedding at every timestep. This sequential information then conditions an MLP that generates the next continuous frame of an audio VAE through consistency modeling. By avoiding lossy compression, CALM achieves higher quality at lower computational cost than its discrete counterparts. Experiments on speech and music demonstrate improved efficiency and fidelity over state-of-the-art discrete audio language models, facilitating lightweight, high-quality audio generation. Samples are available at iclr-continuous-audio-language-models.github.io. Finally, we release Pocket TTS, an open-source 100M-parameter text-to-speech model that can run faster than real time on a laptop CPU: github.com/kyutai-labs/pocket-tts.


Poster
P3-#1526
SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents

Gyuhyeon Seo ⋅ Jungwoo Yang ⋅ Junseong Pyo ⋅ Nalim Kim ⋅ Jonggeun Lee ⋅ Yohan Jo

We introduce SimuHome, a high-fidelity smart home simulator and a benchmark of 600 episodes for LLM-based smart home agents. Existing smart home benchmarks treat the home as a static system, neither simulating how device operations affect environmental variables over time nor supporting workflow scheduling of device commands. SimuHome is grounded in the Matter protocol, the industry standard that defines how real smart home devices communicate and operate. Agents interact with devices through SimuHome's APIs and observe how their actions continuously affect environmental variables such as temperature and humidity. Our benchmark covers state inquiry, implicit user intent inference, explicit device control, and workflow scheduling, each with both feasible and infeasible requests. For workflow scheduling, the simulator accelerates time so that scheduled workflows can be evaluated immediately. An evaluation of 18 agents reveals that workflow scheduling is the hardest category, with failures persisting across alternative agent frameworks and fine-tuning. These findings suggest that SimuHome's time-accelerated simulation could serve as an environment for agents to pre-validate their actions before committing them to the real world.


Poster
P3-#1626
DRBench: A Realistic Benchmark for Enterprise Deep Research

Amirhossein Abaskohi ⋅ Tianyi Chen ⋅ Miguel Muñoz-Mármol ⋅ Curtis Fox ⋅ Amrutha Varshini Ramesh ⋅ Étienne Marcotte ⋅ Xing Han Lu ⋅ Nicolas Chapados ⋅ Spandana Gella ⋅ Christopher Pal ⋅ Alexandre Drouin ⋅ Issam Laradji

We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. Unlike prior benchmarks that focus on simple questions or web-only queries, DRBench evaluates agents on multi-step queries (for example, "What changes should we make to our product roadmap to ensure compliance with this standard?") that require identifying supporting facts from both the public web and a private company knowledge base. Each task is grounded in realistic user personas and enterprise context, spanning a heterogeneous search space that includes productivity software, cloud file systems, emails, chat conversations, and the open web. Tasks are generated through a carefully designed synthesis pipeline with human-in-the-loop verification, and agents are evaluated on their ability to recall relevant insights, maintain factual accuracy, and produce coherent, well-structured reports. We release 100 deep research tasks across 10 domains, such as Sales, Cybersecurity, and Compliance. We demonstrate the effectiveness of DRBench by evaluating diverse DR agents across open- and closed-source models (such as GPT, Llama, and Qwen) and DR strategies, highlighting their strengths, weaknesses, and the critical path for advancing enterprise deep research. Code and data are available at https://github.com/ServiceNow/drbench.


Poster
P3-#1625
EditBench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits

Wayne Chi ⋅ Valerie Chen ⋅ Ryan Shar ⋅ Aditya Mittal ⋅ Jenny Liang ⋅ Wei-Lin Chiang ⋅ Anastasios Angelopoulos ⋅ Ion Stoica ⋅ Graham Neubig ⋅ Ameet Talwalkar ⋅ Chris Donahue

Instructed code editing, where LLMs directly modify a developer's existing code based on a user instruction, is becoming a widely used interaction mode in AI coding assistants. However, few benchmarks directly evaluate this capability and current datasets often rely on artificial sources. We introduce EditBench, a benchmark for evaluating LLM code editing capabilities grounded in real-world usage, i.e., user instructions and code contexts collected in the wild. EditBench comprises 545 problems, multiple natural and programming languages, and a diverse set of real-world use cases, ranging from resolving errors to adding features. EditBench introduces context-dependent problems that require the model to understand code context, highlighted code, and cursor position in addition to the user instruction. We evaluate 40 diverse LLMs and observe that EditBench is a challenging set of problems where only 3 models score over 60%. We find that model performance varies across different categories of user instructions. Further, we find that varying levels of contextual information greatly affect task success rate, with performance varying up to 11%, indicating the importance of evaluating with realistic context.


Poster
P3-#1624
Segment-Level Attribution for Selective Learning of Long Reasoning Traces

Siyuan Wang ⋅ Yanchen Liu ⋅ Xiang Ren

Large Reasoning Models (LRMs) achieve strong reasoning performance by generating long chains of thought (CoTs), yet only a small fraction of these traces meaningfully contributes to answer prediction, while the majority contains repetitive or truncated content. Such output redundancy is further propagated after supervised finetuning (SFT), as models learn to imitate verbose but uninformative patterns, which can degrade performance. To this end, we incorporate integrated gradient attribution to quantify each token's influence on final answers and aggregate them into two segment-level metrics: (1) attribution strength measures the overall attribution magnitude; and (2) direction consistency captures whether tokens' attributions within a segment are uniformly positive or negative (high consistency), or a mixture of both (moderate consistency). Based on these two metrics, we propose a segment-level selective learning framework to identify important segments with high attribution strength but moderate consistency that indicate reflective rather than shallow reasoning. The framework then applies selective SFT on these important segments while masking loss for unimportant ones. Experiments across multiple models and datasets show that our approach improves accuracy and output efficiency, enabling more effective learning from long reasoning traces.
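
One plausible way to read the two segment-level metrics is sketched below in Python. The exact formulas are not given in the abstract, so both are assumptions: strength is taken as the mean absolute token attribution, and consistency as the ratio of the absolute net attribution to the total absolute attribution (1.0 when all tokens share a sign, near 0.0 when signs cancel). The function name `segment_metrics` is hypothetical.

```python
from typing import Sequence, Tuple


def segment_metrics(attributions: Sequence[float]) -> Tuple[float, float]:
    """Return (attribution strength, direction consistency) for one segment.

    Both definitions are illustrative assumptions, not the paper's formulas:
      strength    = mean of |a_i| over the segment's tokens
      consistency = |sum of a_i| / sum of |a_i|
    """
    total_magnitude = sum(abs(a) for a in attributions)
    if total_magnitude == 0.0:
        # Empty segment or all-zero attributions: nothing to measure.
        return 0.0, 0.0
    strength = total_magnitude / len(attributions)
    consistency = abs(sum(attributions)) / total_magnitude
    return strength, consistency
```

Under these assumed definitions, a segment with attributions `[2.0, -1.0, 1.0]` has consistency 0.5, which would count as a mixture of positive and negative contributions.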


Poster
P3-#1623
Demystifying and Enhancing the Efficiency of Large Language Model Based Search Agents

Tiannuo Yang ⋅ Zebin Yao ⋅ Bowen Jin ⋅ Lixiao Cui ⋅ Yusen Li ⋅ Gang Wang ⋅ Xiaoguang Liu ⋅ Willie Neiswanger

Large Language Model (LLM)-based search agents have shown remarkable capabilities in solving complex tasks by dynamically decomposing problems and addressing them through interleaved reasoning and retrieval. However, this interleaved paradigm introduces substantial efficiency bottlenecks. First, we observe that both highly accurate and overly approximate retrieval methods degrade system efficiency: exact search incurs significant retrieval overhead, while coarse retrieval requires additional reasoning steps during generation. Second, we identify inefficiencies in system design, including improper scheduling and frequent retrieval-induced stalls, which lead to cascading latency—where even minor delays in retrieval amplify end-to-end inference time. To address these challenges, we introduce SearchAgent-X, a high-efficiency inference framework for LLM-based search agents. SearchAgent-X leverages high-recall approximate retrieval and incorporates two key techniques: priority-aware scheduling and non-stall retrieval. Extensive experiments demonstrate that SearchAgent-X consistently outperforms state-of-the-art systems such as vLLM and HNSW-based retrieval across diverse tasks, achieving up to 3.4× higher throughput and 5× lower latency, without compromising generation quality. Code is available at https://github.com/tiannuo-yang/SearchAgent-X.


Poster
P3-#1622
LLMs are Single-threaded Reasoners: Demystifying the Working Mechanism of Soft Thinking

Junhong Wu ⋅ Jinliang Lu ⋅ Zixuan Ren ⋅ Gangqiang Hu ⋅ Zhi Wu ⋅ Dai Dai ⋅ hua wu

Human cognition naturally engages with abstract and fluid concepts, whereas existing reasoning models often rely on generating discrete tokens, potentially constraining their expressive capabilities. Recent advancements aim to address this limitation by enabling large language models (LLMs) to generate soft, abstract tokens, thus facilitating reasoning within a continuous concept space. In this paper, we investigate the Soft Thinking capabilities of various LLMs through a systematic analysis of their internal behavior using a suite of probing techniques. Contrary to the prevailing belief that Soft Thinking supports parallel exploration of diverse reasoning paths, our findings reveal that LLMs behave as single-threaded reasoners: they predominantly rely on the token with the highest probability in the soft input to predict the next step. This behavior induces a greedy feedback loop that suppresses alternative reasoning paths and undermines the benefits of transmitting richer information via Soft Tokens. To address this Greedy Pitfall, we propose Stochastic Soft Thinking, which introduces stochasticity to break free from the greedy tendency. Our experiments demonstrate that incorporating randomness, particularly with the Gumbel-Softmax trick, can alleviate the limitations of vanilla approaches and unleash the potential of Soft Thinking, resulting in superior performance across eight reasoning benchmarks.
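
For readers unfamiliar with the Gumbel-Softmax trick mentioned in this abstract, here is a minimal standard-library Python sketch of the generic technique: Gumbel noise is added to the logits before a temperature-scaled softmax, so the dominant entry of the resulting soft distribution varies from draw to draw. How the paper integrates this into Soft Thinking is not described in the abstract; the function name and defaults are illustrative.

```python
import math
import random
from typing import List, Optional


def gumbel_softmax(
    logits: List[float],
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Sample a soft probability vector using the Gumbel-Softmax trick."""
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

Lower temperatures push each sample toward a one-hot vector, while higher temperatures keep it closer to uniform.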


Poster
P3-#1621
Diversity-Incentivized Exploration for Versatile Reasoning

Zican Hu ⋅ Shilin Zhang ⋅ Yafu Li ⋅ Jianhao (Elliott) Yan ⋅ Xuyang Hu ⋅ Leyang Cui ⋅ Xiaoye Qu ⋅ Chunlin Chen ⋅ Yu Cheng ⋅ Zhi Wang

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a crucial paradigm for incentivizing reasoning capabilities in Large Language Models (LLMs). Due to vast state-action spaces and reward sparsity in reasoning tasks, existing methods often struggle with deficient exploration and poor sample efficiency. In this paper, we propose DIVER (Diversity-Incentivized Exploration for VersatilE Reasoning), an innovative framework that highlights the pivotal role of global sequence-level diversity to incentivize deep exploration for versatile reasoning. We first conduct a primary empirical study to reveal a strong positive correlation between global diversity and reasoning capacity. Building on this insight, we introduce global diversity incentives as an intrinsic reward to promote deep exploration in a semantically structured space. Incorporating the intrinsic reward, we develop a potential-based reward shaping mechanism to preserve optimal policy invariance and design simple heuristics to mitigate possible reward hacking. Experimental results show that DIVER outperforms competitive RLVR baselines with various exploration strategies on both in-domain and out-of-domain tasks, excelling in both Pass@1 and Pass@k evaluations. Our code is available at https://github.com/NJU-RL/DIVER.
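
The policy-invariance property of potential-based reward shaping can be illustrated with a short Python sketch of the classic form r'_t = r_t + gamma * phi(s_{t+1}) - phi(s_t). The `potential` function here is a hypothetical placeholder; the diversity-based potential used by DIVER is not specified in the abstract.

```python
from typing import Callable, List, Sequence, TypeVar

State = TypeVar("State")


def shaped_rewards(
    states: Sequence[State],
    rewards: Sequence[float],
    potential: Callable[[State], float],  # hypothetical potential function phi
    gamma: float = 1.0,
) -> List[float]:
    """Apply potential-based shaping to one trajectory.

    `states` holds s_0..s_T and `rewards` holds r_0..r_{T-1}, so there must be
    exactly one more state than rewards.
    """
    if len(states) != len(rewards) + 1:
        raise ValueError("expected len(states) == len(rewards) + 1")
    return [
        reward + gamma * potential(states[t + 1]) - potential(states[t])
        for t, reward in enumerate(rewards)
    ]
```

With gamma = 1, the shaping terms telescope: the shaped return differs from the original return only by phi(s_T) - phi(s_0), which does not depend on the actions taken between a fixed start state and a fixed end state.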


Poster
P3-#1620
SciNav: A General Agent Framework for Scientific Coding Tasks

Tianshu Zhang ⋅ Huan Sun

Autonomous science agents, built on large language models (LLMs), are increasingly being investigated to generate hypotheses, design experiments, and produce reports. Prior science agents primarily focus on open-ended scientific problems, where such outputs (hypotheses, experiments, or analyses) are inherently subjective and thus difficult to evaluate rigorously. In contrast, existing scientific coding benchmarks provide tasks with clearly defined, executable outputs that enable objective assessment. However, current agent-based approaches to these benchmarks remain engineering-driven pipelines, lacking principled framework design. This mismatch exposes a gap: the absence of end-to-end, principled science agent frameworks for scientific coding tasks. We address this gap by focusing on scientific coding tasks, where evaluation can be made rigorously, and introducing an agent framework SciNav (Scientific Navigator) that enables more effective solution exploration. Our framework is designed to operate efficiently under constrained search budgets, moving beyond reliance on pre-defined success metrics and prolonged search cycles. Inspired by findings that comparative judgments often reveal finer-grained quality differences and therefore provide greater discriminative power than absolute scoring, our framework leverages pairwise relative judgments within a tree search process to select top-K promising solution branches, prune low-potential ones, and progressively narrow down the solution candidates on the selected branches guided by relative comparisons. We demonstrate our agent's effectiveness across different types of tasks on two benchmarks. Experiments show that SciNav significantly outperforms direct prompting and prior agents like OpenHands and Self-Debug across different base models, task types, and difficulty levels, and exceeds different frontier comparators such as random selection and LLM absolute scoring. These results confirm the strength of our agent design and highlight the effectiveness of relative judgment–guided top-K search for high-quality scientific coding, marking a step toward more practical science agents.


Poster
P3-#1619
HardcoreLogic: Challenging Large Reasoning Models with Long-tail Logic Puzzle Games

Jingcong Liang ⋅ Shijun Wan ⋅ Xuehai Wu ⋅ Yitong Li ⋅ Qianglong Chen ⋅ Duyu Tang ⋅ Siyuan Wang ⋅ zhongyu wei

Large Reasoning Models (LRMs) have demonstrated impressive performance on complex tasks, including logical puzzle games that require deriving solutions satisfying all constraints. However, whether they can flexibly apply appropriate rules to varying conditions, particularly when faced with non-canonical game variants, remains an open question. Existing corpora focus on popular puzzles like 9x9 Sudoku, risking overfitting to canonical formats and memorization of solution patterns, which can mask deficiencies in understanding novel rules or adapting strategies to new variants. To address this, we introduce HardcoreLogic, a challenging benchmark of over 5,000 puzzles across 10 games, designed to test the robustness of LRMs on the "long-tail" of logical games. HardcoreLogic systematically transforms canonical puzzles through three dimensions: Increased Complexity (IC), Uncommon Elements (UE), and Unsolvable Puzzles (UP), reducing reliance on shortcut memorization. Evaluations on a diverse set of LRMs reveal significant performance drops, even for models achieving top scores on existing benchmarks, indicating heavy reliance on memorized stereotypes. While increased complexity is the dominant source of difficulty, models also struggle with subtle rule variations that do not necessarily increase puzzle difficulty. Our systematic error analysis on solvable and unsolvable puzzles further highlights gaps in genuine reasoning. Overall, HardcoreLogic exposes the limitations of current LRMs and establishes a benchmark for advancing high-level logical reasoning.


Poster
P3-#1618
RESTRAIN: From Spurious Votes to Signals — Self-Training RL with Self-Penalization

Zhaoning Yu ⋅ Zhaolun Su ⋅ Leitian Tao ⋅ Haozhu Wang ⋅ Aashu Singh ⋅ Hanchao Yu ⋅ Jianyu Wang ⋅ Hongyang Gao ⋅ Weizhe Yuan ⋅ Jason E Weston ⋅ Ping Yu ⋅ Jing Xu

Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at high costs in labeled data while faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce REinforcement learning with Self-resTRAINt training (RESTRAIN), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model’s entire answer distribution: penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. This self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it boosts Pass@1 by up to +140.7% on AIME25, +36.2% on MMLU STEM, and +19.6% on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results demonstrate that RESTRAIN establishes a scalable path toward stronger reasoning without gold labels.


Poster
P3-#1617
UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice

Sitong Cheng ⋅ Bianweizhen ⋅ Xinsheng Wang ⋅ Ruibin Yuan ⋅ Jianyi Chen ⋅ Shunshun Yin ⋅ Yike Guo ⋅ Wei Xue

The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key challenges: the scarcity of paired speech data that retains expressive styles, the complexity of multi-stage processing pipelines, and the limited transfer of translation capabilities from large language models (LLMs). In this work, we address these challenges by introducing UniSS, a novel single-stage framework for expressive S2ST. Our approach features carefully designed speech semantic and style modeling, enabling seamless integration with existing text-based LLM frameworks to develop a unified text-speech language model. To transfer translation capabilities from text to speech, we propose a cross-modal chain-of-thought prompting process that progressively aligns audio semantics with text and ensures style preservation in the decoded results. Furthermore, we construct and release a large-scale, high-quality expressive S2ST dataset, UniST, comprising 44.8k hours of data. Experimental results show that UniSS significantly outperforms previous methods in translation fidelity and speech quality while preserving voice, emotion, and duration consistency. Our work establishes a simpler and more effective paradigm for building the next generation of expressive S2ST systems. Audio samples are available at https://cmots.github.io/uniss-demo/.


Poster
P3-#1616
Expanding Reasoning Potential in Foundation Model by Learning Diverse Chains of Thought Patterns

Xuemiao Zhang ⋅ Can Ren ⋅ Chengying Tu ⋅ Rongxiang Weng ⋅ Shuo Wang ⋅ Hongfei Yan ⋅ Jingang Wang ⋅ Xunliang Cai

Recent progress in large reasoning models for challenging mathematical reasoning has been driven by reinforcement learning (RL). Incorporating long chain-of-thought (CoT) data during mid-training has also been shown to substantially improve reasoning depth. However, current approaches often utilize CoT data indiscriminately, leaving open the critical question of which data types most effectively enhance model reasoning capabilities. In this paper, we define the foundation model's reasoning potential for the first time as the inverse of the number of independent attempts required to correctly answer the question, which is strongly correlated with the final model performance. We then propose utilizing diverse data enriched with high-value reasoning patterns to expand the reasoning potential. Specifically, we abstract atomic reasoning patterns from CoT sequences, characterized by commonality and inductive capabilities, and use them to construct a core reference set enriched with valuable reasoning patterns. Furthermore, we propose a dual-granularity algorithm involving chains of reasoning patterns and token entropy, efficiently selecting high-value CoT data (CoTP) from the data pool that aligns with the core set, thereby training models to master reasoning effectively. Only 10B-token CoTP data enables the 85A6B Mixture-of-Experts (MoE) model to improve by 9.58% on the challenging AIME 2024 and 2025, and to raise the upper bound of downstream RL performance by 7.81%.
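
The definition of reasoning potential given in this abstract (the inverse of the number of independent attempts needed to answer correctly) can be made concrete with a small Python sketch. It works from a single observed sequence of attempt outcomes; returning 0.0 when no attempt succeeds and the function name itself are assumptions for illustration, and the paper may estimate the quantity differently.

```python
from typing import Sequence


def reasoning_potential(attempt_outcomes: Sequence[bool]) -> float:
    """Inverse of the number of attempts up to and including the first success.

    `attempt_outcomes[i]` is True when the (i + 1)-th independent attempt was
    correct. Returns 0.0 if no attempt succeeded (an assumed convention).
    """
    for attempts_used, correct in enumerate(attempt_outcomes, start=1):
        if correct:
            return 1.0 / attempts_used
    return 0.0
```

A question solved on the first try gets a potential of 1.0, while one that needs three tries gets 1/3.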


Poster
P3-#1615
Learning from Synthetic Data Improves Multi-hop Reasoning

Anmol Kabra ⋅ Yilun Yin ⋅ Albert Gong ⋅ Kamilė Stankevičiūtė ⋅ Dongyoung Go ⋅ Johann Lee ⋅ Katie Luo ⋅ Carla Gomes ⋅ Kilian Weinberger

Reinforcement Learning (RL) has been shown to significantly boost reasoning capabilities of large language models (LLMs) in math, coding, and multi-hop reasoning tasks. However, RL fine-tuning requires abundant high-quality verifiable data, often sourced from human annotations, generated from frontier LLMs, or scored by LLM-based verifiers. All three have considerable limitations: human-annotated datasets are small and expensive to curate, LLM-generated data is hallucination-prone and costly, and LLM-based verifiers are inaccurate and slow. In this work, we investigate a cheaper alternative: RL fine-tuning on rule-generated synthetic data for multi-hop reasoning tasks. We discover that LLMs fine-tuned on synthetic data perform significantly better on popular real-world question-answering benchmarks, despite the synthetic data containing only fictional knowledge. On stratifying performance by question difficulty, we find that synthetic data teaches LLMs to compose knowledge, a fundamental and generalizable reasoning skill. Our work highlights rule-generated synthetic reasoning data as a free and scalable resource to improve LLM reasoning capabilities.


Poster
P3-#1614
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Pierre-Carl Langlais ⋅ Pavel Chizhov ⋅ Catherine Arnett ⋅ Carlos Hinostroza ⋅ Mattia Nee ⋅ Eliot Jones ⋅ Irène Girard ⋅ David Mach ⋅ Anastasia Stasenko ⋅ Ivan Yamshchikov

Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. These datasets often contain trillions of tokens, including large portions of copyrighted or proprietary content, which raises questions about the legal use of such models. This underscores the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training. The data assembled in Common Corpus are either uncopyrighted or under permissive licenses and amount to about two trillion tokens. The dataset contains a wide variety of languages, ranging from the high-resource European languages to some low-resource languages rarely represented in pre-training datasets. In addition, it includes a large amount of code data. The diversity of data sources in terms of covered domains and time periods opens up paths for both research and entrepreneurial needs in diverse areas of knowledge. In this paper, we present the detailed provenance of the assembled data and the details of dataset filtering and curation. We train two small language models on Common Corpus and find that they perform comparably to other models of their size, indicating that our dataset is suitable for multilingual pretraining. Common Corpus represents a key contribution to the ecosystem for open science research on Large Language Models.


Blog Track Poster
P3-#1613
Content Promotion as a Strategic Game: How to Design Agentic Publishers for the Evolving Search Ecosystem in the GenAI Era?

Tommy Mordo ⋅ Sagie Dekel ⋅ Tomer Kordonsky ⋅ Omer Madmon ⋅ Moshe Tennenholtz ⋅ Oren Kurland

With the rise of LLMs, publishers now operate in a dual world where traditional search and chat-like systems coexist. We propose a unified, game-theoretic view of this environment and highlight different tools, such as Multi-Agent Reinforcement Learning, that support the development of competitive content-optimization agents.


Poster
P3-#1612
Building spatial world models from sparse transitional episodic memories

Zizhan He ⋅ Maxime Daigle ⋅ Pouya Bashivan

Many animals possess a remarkable capacity to rapidly construct flexible cognitive maps of their environments. These maps are crucial for ethologically relevant behaviors such as navigation, exploration, and planning. Existing computational models typically require long sequential trajectories to build accurate maps, but neuroscience evidence suggests maps can also arise from integrating disjoint experiences governed by consistent spatial rules. We introduce the Episodic Spatial World Model (ESWM), a novel framework that constructs spatial maps from sparse, disjoint episodic memories. Across environments of varying complexity, ESWM predicts unobserved transitions from minimal experience, and the geometry of its latent space aligns with that of the environment. Because it operates on episodic memories that can be independently stored and updated, ESWM is inherently adaptive, enabling rapid adjustment to environmental changes. Furthermore, we demonstrate that ESWM readily enables near-optimal strategies for exploring novel environments and navigating between arbitrary points, all without the need for additional training. Our work demonstrates how neuroscience-inspired principles of episodic memory can advance the development of more flexible and generalizable world models.


Poster
P3-#1611
Readout Representation: Redefining Neural Codes by Input Recovery

Shunsuke Onoo ⋅ Yoshihiro Nagano ⋅ Yukiyasu Kamitani

Sensory representation is typically understood through a hierarchical-causal framework where progressively abstract features are extracted sequentially. However, this causal view fails to explain misrepresentation, a phenomenon better handled by an informational and teleological view based on decodable content and downstream functions. This creates a tension: how does a system that abstracts away details preserve the fine-grained information needed for downstream functions? We propose readout representation to resolve this, defining representation by the information recoverable from features, rather than their causal origin. Empirically, we show that inputs can be accurately reconstructed even from heavily perturbed mid-level features, demonstrating that a single input corresponds to a broad, redundant region of feature space, challenging the causal mapping perspective. To quantify this property, we introduce representation size, a metric linked to model robustness and representational redundancy. Our framework offers a new lens for analyzing how both biological and artificial neural systems learn complex features while maintaining robust, information-rich representations of the world.


Poster
P3-#1610
Mixture of Cognitive Reasoners: Modular Reasoning with Brain-Like Specialization

Badr AlKhamissi ⋅ C. De Sabbata ⋅ Greta Tuckute ⋅ Zeming Chen ⋅ Martin Schrimpf ⋅ Antoine Bosselut

Human cognitive behavior arises from the interaction of specialized brain networks dedicated to distinct functions, such as language, logic, and social reasoning. Inspired by this organization, we propose Mixture of Cognitive Reasoners (MiCRo): a modular, transformer-based architecture post-trained with a curriculum that induces functional specialization across experts. Concretely, we partition the layers of a pretrained language model into four expert modules aligned with well-studied cognitive networks in the human brain. MiCRo offers three key advantages over standard language models. (1) The specialized experts are interpretable and causally meaningful: ablating a module causes substantial drops on benchmarks requiring its specialized domain. (2) MiCRo's behavior can be dynamically steered at inference time by routing tokens to particular experts (e.g., favoring social over logical reasoning), enabling fine-grained control over outputs. (3) MiCRo outperforms or matches comparable baselines on both machine-learning reasoning benchmarks (e.g., GSM8K, BBH) and alignment to human behavior (CogBench), while maintaining interpretability. Taken together, cognitively grounded functional specialization yields models that are both more human-like and more human-interpretable.


Poster
P3-#1609
Difference Predictive Coding for Training Spiking Neural Networks

Ville Karlsson ⋅ Nicklas Fianda ⋅ Joni-Kristian Kamarainen

Predictive coding networks (PCNs) offer a local-learning alternative to backpropagation in which layers communicate residual errors, aligning well with biological computation and neuromorphic hardware. In this work we introduce Difference Predictive Coding (DiffPC), a spike-native PC formulation for spiking neural networks. DiffPC replaces dense floating-point messages with sparse ternary spikes, provides spike-compatible target and error updates, and employs adaptive threshold schedules for event-driven operation. We validate DiffPC on fully connected and convolutional architectures, demonstrating competitive performance on MNIST (99.3%) and Fashion-MNIST (89.6%), and outperforming a backpropagation baseline on CIFAR-10. Crucially, this performance is achieved with high communication sparsity, reducing data movement by over two orders of magnitude compared to standard predictive coding. DiffPC thus establishes a faithful, hardware-aligned framework for communication-efficient training on neuromorphic platforms.


Poster
P3-#1608
Discovering heterogeneous synaptic plasticity rules via large-scale neural evolution

Ziyuan Ye ⋅ Beichen Huang ⋅ Yujie Wu ⋅ Guozhang Chen ⋅ Jibin Wu

Synaptic plasticity is a fundamental substrate for learning and memory, where different synapse types exhibit distinct plasticity mechanisms. However, how functional behaviors emerge from heterogeneous synaptic plasticity mechanisms remains poorly understood. Here, we introduce a computational framework that harnesses Darwinian evolutionary principles to discover biologically plausible, heterogeneous synaptic plasticity rules within a biologically realistic model of the mouse primary visual cortex. Specifically, we parameterize several key factors related to synaptic plasticity, including presynaptic and postsynaptic spikes, their associated eligibility traces, and neuromodulatory signals. By integrating these factors via a truncated Taylor expansion, we construct a large-scale search space of candidate plasticity rules, with each rule containing over 2.6k optimizable parameters. Each rule is subsequently evaluated on both cross-domain visual task performance and biological validity. Leveraging a multi-objective evolutionary algorithm, we effectively navigate this high-dimensional search space to identify plasticity rules that are both biologically plausible and yield high task performance. We uncover diverse families of high-performing plasticity rules that achieve similar behavioral outcomes despite markedly different mathematical formulations, suggesting that real-world synaptic learning mechanisms may exhibit computational degeneracy. We further show that these biologically plausible rules are not only robust across network scales but also enable few-shot learning, offering a computational explanation for the emergence of innate ability.


Poster
P3-#1607
Only Brains Align with Brains: Cross-Region Alignment Patterns Expose Limits of Normative Models

Larissa Höfling ⋅ Matthias Tangemann ⋅ Lotta Piefke ⋅ Susanne Keller ⋅ Matthias Bethge ⋅ Katrin Franke

Neuroscientists and computer vision researchers use model–brain alignment benchmarks to compare artificial and biological vision systems. These benchmarks rank models according to alignment measures such as the similarity of representational geometry or the predictivity of neural responses from model activations. However, recent works have raised a number of problems with these rankings, most critically their lack of discriminative power, raising the conceptual question of what it means for a model to be "brain-aligned". Here we introduce alignment patterns - characteristic functional relationship profiles of each brain region to all others - and propose that models should reproduce these patterns to qualify as brain-aligned. First, we apply a standard benchmarking pipeline to a broad spectrum of vision models on the BOLD Moments video fMRI dataset across visual regions of interest (ROIs). We find diverse models appear equivalent in their brain alignment, reflecting the lack of discriminative power of conventional alignment benchmarks. Conventional alignment evaluation is a pointwise similarity test: it assesses whether a model is aligned to an individual ROI. It is therefore sensitive to the specific invariances and scaling properties of the chosen metric. In contrast, alignment pattern analysis (APA) is a second-order structural consistency test: a model aligned to a given ROI should reproduce that ROI’s characteristic cross-region alignment profile. Applying this test, we find that, while these patterns are highly stable across brains of different subjects, even top-ranked models often fail to capture them. Notably, models that appear effectively equivalent in alignment diverge sharply under the relational criterion, demonstrating the added discriminative value of APA. Finally, we argue for a clearer distinction between the criteria a model must meet to serve as a tool versus as a computational model. Conventional alignment measures may be sufficient for identifying neurally predictive models, but claims about computational or algorithmic similarity may require a stronger basis of evidence, including the reproducibility of relational alignment patterns.
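
The contrast between a pointwise test and a second-order pattern test can be illustrated with a small sketch. Everything here is an assumption made for illustration: the abstract does not say which alignment measure or comparison statistic is used, so Pearson correlation stands in, and the ROI names, scores, and function names (`pearson`, `pattern_consistency`) are hypothetical toy values rather than results from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def pattern_consistency(model_to_roi, roi_profile):
    """Correlate a model's alignment to the *other* ROIs with the target
    ROI's own cross-region profile (assumed statistic: Pearson r)."""
    others = sorted(roi_profile)
    model_profile = [model_to_roi[roi] for roi in others]
    brain_profile = [roi_profile[roi] for roi in others]
    return pearson(model_profile, brain_profile)


# Toy numbers (hypothetical): how strongly V1 aligns with other regions.
v1_profile = {"V2": 0.8, "V4": 0.5, "IT": 0.2}

# Two hypothetical models with identical pointwise alignment to V1 ...
model_a = {"V1": 0.6, "V2": 0.55, "V4": 0.4, "IT": 0.2}
model_b = {"V1": 0.6, "V2": 0.2, "V4": 0.4, "IT": 0.55}

# ... but opposite cross-region patterns.
score_a = pattern_consistency(model_a, v1_profile)
score_b = pattern_consistency(model_b, v1_profile)
print(f"pointwise: {model_a['V1']} vs {model_b['V1']}")
print(f"pattern:   {score_a:.3f} vs {score_b:.3f}")
```

In this toy setting the two models are indistinguishable by their V1 score alone, while the pattern score separates them, which is the kind of added discriminative value the abstract attributes to APA.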


Poster
P3-#1606
Low-Pass Filtering Improves Behavioral Alignment of Vision Models

Max Wolff ⋅ Thomas Klein ⋅ Evgenia Rusak ⋅ Felix Wichmann ⋅ Wieland Brendel

Despite their impressive performance on computer vision benchmarks, Deep Neural Networks (DNNs) still fall short of adequately modeling human visual behavior, as measured by error consistency and shape bias. Recent work hypothesized that behavioral alignment can be drastically improved through generative - rather than discriminative - classifiers, with far-reaching implications for models of human vision. Here, we instead show that the increased alignment of generative models can be largely explained by a seemingly innocuous resizing operation in the generative model which effectively acts as a low-pass filter. In a series of controlled experiments, we show that removing high-frequency spatial information from discriminative models like CLIP drastically increases their behavioral alignment. Simply blurring images at test-time - rather than training on blurred images - achieves a new state-of-the-art score on the model-vs-human benchmark, halving the current alignment gap between DNNs and human observers. Furthermore, low-pass filters are likely optimal, which we demonstrate by directly optimizing filters for alignment. To contextualize the performance of optimal filters, we compute the frontier of all possible Pareto-optimal solutions to the benchmark, which was formerly unknown. We explain our findings by observing that the frequency spectrum of optimal Gaussian filters roughly matches the spectrum of band-pass filters implemented by the human visual system. We show that the contrast sensitivity function, describing the inverse of the contrast threshold required for humans to detect a sinusoidal grating as a function of spatiotemporal frequency, is approximated well by Gaussian filters of a specific width.
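
As a rough illustration of what "blurring images at test-time" means, the following sketch applies a separable Gaussian low-pass filter to a grayscale image stored as a list of rows. It is a minimal sketch, not the authors' pipeline: the kernel width (`sigma`), the truncation at three standard deviations, and the edge-replication padding are all assumptions, and the function names are hypothetical.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Normalized 1-D Gaussian kernel, truncated at about 3 sigma (assumed)."""
    if radius is None:
        radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(k * k) / (2 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, replicating edge pixels."""
    radius = len(kernel) // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xi = min(max(x + k - radius, 0), width - 1)  # clamp to border
                acc += w * row[xi]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred


def transpose(image):
    return [list(col) for col in zip(*image)]


def gaussian_blur(image, sigma):
    """Separable 2-D Gaussian blur: filter along rows, then along columns."""
    kernel = gaussian_kernel(sigma)
    horizontal = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))


# A checkerboard is almost pure high spatial frequency; blurring flattens it.
board = [[float((x + y) % 2) for x in range(8)] for y in range(8)]
smoothed = gaussian_blur(board, sigma=1.0)
values = [v for row in smoothed for v in row]
print("contrast before:", 1.0, "after:", max(values) - min(values))
```

In a test-time setting, the blurred image would simply be passed to an unchanged classifier in place of the original; no retraining is implied by this sketch.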

Generalization in EEG-based motor imagery (MI) brain-computer interfaces (BCIs) is hampered by cross-subject and cross-session variability. Although large-scale EEG pretraining has advanced representation learning, its practical deployment is hindered by the need for costly fine-tuning to overcome significant domain shifts. Test-time adaptation (TTA) methods that adapt models during inference offer a promising solution. However, existing EEG-TTA methods either rely on gradient-based fine-tuning (suffering from high computational cost and catastrophic forgetting) or data alignment strategies (failing to capture shifts in temporal predictive embeddings). To address these limitations, we propose BTTA-DG, a novel Bayesian Test-Time Adaptation framework that performs efficient, gradient-free adaptation by modeling the distribution of temporal predictive embeddings. Our approach first employs a lightweight SincAdaptNet with learnable filters to extract task-specific frequency bands. We then introduce a novel Dirichlet feature projection that maps temporal embeddings onto a compact and interpretable parameter space, effectively capturing the concentration of time-varying predictive evidence. Adaptation is achieved via a GMM-driven Bayesian inference mechanism, which models the historical distribution of these Dirichlet parameters and fuses this evidence with the model's prior predictions to calibrate outputs for the target domain. Extensive experiments show that BTTA-DG significantly outperforms previous EEG-TTA methods, achieving state-of-the-art accuracy while running at real-time speed. Furthermore, visualizations confirm the physiological interpretability of our learned filters and the robust class separability of our Dirichlet feature space.


Poster
P3-#1604
Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embedding Space

Felipe Toro-Hernández ⋅ Jesuino Vieira Filho ⋅ Rodrigo Cabral-Carvalho

Semantic representations can be framed as a structured, dynamic knowledge space through which humans navigate to retrieve and manipulate meaning. To investigate how humans traverse this geometry, we introduce a framework that represents concept production as navigation through embedding space. Using different transformer text embedding models, we construct participant-specific semantic trajectories based on cumulative embeddings and extract geometric and dynamical metrics, including distance to next, distance to centroid, entropy, velocity, and acceleration. These measures capture both scalar and directional aspects of semantic navigation, providing a computationally grounded view of semantic representation search as movement in a geometric space. We evaluate the framework on four datasets across different languages, spanning different property generation tasks: Neurodegenerative, Swear verbal fluency, Property listing task in Italian, and in German. Across these contexts, our approach distinguishes between clinical groups and concept types, offering a mathematical framework that requires minimal human intervention compared to typical labor-intensive linguistic pre-processing methods. Comparison with a non-cumulative approach reveals that cumulative embeddings work best for longer trajectories, whereas shorter ones may provide too little context, favoring the non-cumulative alternative. Critically, different embedding models yielded similar results, highlighting similarities between different learned representations despite different training pipelines. By framing semantic navigation as a structured trajectory through embedding space, our work bridges cognitive modeling with learned representation, establishing a pipeline for quantifying semantic representation dynamics with applications in clinical research, cross-linguistic analysis, and the assessment of artificial cognition. https://github.com/jesuinovieira/semtraj-iclr2026
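
Several of the trajectory metrics named in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions: "cumulative embedding" is taken here to mean the running mean of the embeddings produced so far, distances are Euclidean, and the 2-D vectors stand in for real transformer embeddings. The function names are hypothetical and are not taken from the linked repository.

```python
import math


def cumulative_embeddings(vectors):
    """Running mean of the first t vectors, t = 1..T (assumed definition)."""
    running = [0.0] * len(vectors[0])
    trajectory = []
    for t, vec in enumerate(vectors, start=1):
        running = [s + x for s, x in zip(running, vec)]
        trajectory.append([s / t for s in running])
    return trajectory


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def differences(points):
    """First differences between consecutive points."""
    return [[b_i - a_i for a_i, b_i in zip(a, b)]
            for a, b in zip(points, points[1:])]


def trajectory_metrics(points):
    """Geometric and dynamical summaries of one trajectory."""
    centroid = [sum(dim) / len(points) for dim in zip(*points)]
    velocity = differences(points)        # directional, per step
    acceleration = differences(velocity)  # change in velocity
    origin = [0.0] * len(points[0])
    return {
        "distance_to_next": [euclidean(a, b)
                             for a, b in zip(points, points[1:])],
        "distance_to_centroid": [euclidean(p, centroid) for p in points],
        "velocity": velocity,
        "acceleration_magnitude": [euclidean(a, origin) for a in acceleration],
    }


# Toy 2-D "embeddings" for three produced words (hypothetical values).
word_vectors = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]]
path = cumulative_embeddings(word_vectors)
metrics = trajectory_metrics(path)
print(path)
print(metrics["distance_to_next"], metrics["distance_to_centroid"])
```

Velocity is kept as a vector while the distances are scalars, mirroring the abstract's distinction between directional and scalar aspects of navigation. Entropy is omitted because the abstract does not say over which distribution it is computed.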


Poster
P3-#1603
Biologically Plausible Learning via Bidirectional Spike-Based Distillation

Yifei Wang ⋅ Zhangyanxun ⋅ Changze Lv ⋅ Yiyang Lu ⋅ Jingwen Xu ⋅ Xiaohua Wang ⋅ Di Yu ⋅ Xin Du ⋅ Xuanjing Huang ⋅ Xiaoqing Zheng

Developing biologically plausible learning algorithms that can achieve performance comparable to error backpropagation remains a longstanding challenge. Existing approaches often compromise biological plausibility by entirely avoiding the use of spikes for error propagation or relying on both positive and negative learning signals, while the question of how spikes can represent negative values remains unresolved. To address these limitations, we introduce Bidirectional Spike-based Distillation (BSD), a novel learning algorithm that jointly trains a feedforward and a backward spiking network. We formulate learning as a transformation between two spiking representations (i.e., stimulus encoding and concept encoding) so that the feedforward network implements perception and decision-making by mapping stimuli to actions, while the backward network supports memory recall by reconstructing stimuli from concept representations. Extensive experiments on diverse benchmarks, including image recognition, image generation, and sequential regression, show that BSD achieves performance comparable to networks trained with classical error backpropagation. These findings represent a significant step toward biologically grounded, spike-driven learning in neural networks.


Poster
P3-#1602
Spike-based Digital Brain: a novel fundamental model for brain activity analysis

Shaolong Wei ⋅ Qiyu Sun ⋅ Mingliang Wang ⋅ Liang Sun ⋅ Weiping Ding ⋅ Jiashuang Huang

Modeling the temporal dynamics of the human brain remains a core challenge in computational neuroscience and artificial intelligence. Traditional methods often ignore the biological spike characteristics of brain activity and find it difficult to reveal the dynamic dependencies and causal interactions between brain regions, limiting their effectiveness in brain function research and clinical applications. To address this issue, we propose a Spike-based Digital Brain (Spike-DB), a novel fundamental model that introduces the spike computing paradigm into brain time series modeling. Spike-DB encodes fMRI signals as spike trains and learns the temporal driving relationships between anchor and target regions to achieve high-precision prediction of brain activity and reveal underlying causal dependencies and dynamic relationship characteristics. Based on Spike-DB, we further conducted downstream tasks including brain disease classification, abnormal brain region identification, and effective connectivity inference. Experimental results on real-world epilepsy datasets and the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset show that Spike-DB outperforms existing mainstream methods in both prediction accuracy and downstream tasks, demonstrating its broad potential in clinical applications and brain science research. Our code is available at https://github.com/UAIBC-Brain/Spike-DB.


Poster
P3-#1601
A cross-species neural foundation model for end-to-end speech decoding

Yizi Zhang ⋅ Linyang He ⋅ Chaofei Fan ⋅ Tingkai Liu ⋅ Han Yu ⋅ Trung Le ⋅ Jingyuan Li ⋅ Scott Linderman ⋅ Lea Duncker ⋅ Francis Willett ⋅ Nima Mesgarani ⋅ Liam Paninski

Speech brain–computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), preventing joint optimization of all stages simultaneously. Here, we introduce an end-to-end BraIn-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder, whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state-of-the-art (SOTA) on the Brain-to-Text ’24 and ’25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT reduces the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%. Notably, we find that small-scale audio-LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.
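
For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) between a decoded sentence and its reference, divided by the number of reference words. The sketch below computes it with a standard dynamic program and also works out the relative reduction implied by the two WER figures quoted in the abstract; the function names and example sentences are illustrative, and benchmark-specific text normalization is not modeled.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old: float, new: float) -> float:
    """Fraction by which `new` improves on `old`."""
    return (old - new) / old


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
print(f"{relative_reduction(24.69, 10.22):.1%}")
```

Going from 24.69% to 10.22% WER is a drop of 14.47 percentage points, or roughly a 59% relative reduction.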


Poster
P3-#1701
PredNext: Explicit Cross-View Temporal Prediction for Unsupervised Learning in Spiking Neural Networks

Yiting Dong ⋅ Jianhao Ding ⋅ Zijie Xu ⋅ Tong Bu ⋅ Zhaofei Yu ⋅ Tiejun Huang

Spiking Neural Networks (SNNs), with their temporal processing capabilities and biologically plausible dynamics, offer a natural platform for unsupervised representation learning. However, current unsupervised SNNs predominantly employ shallow architectures or localized plasticity rules, limiting their ability to model long-range temporal dependencies and maintain temporal feature consistency. This results in semantically unstable representations, thereby impeding the development of deep unsupervised SNNs for large-scale temporal video data. We propose PredNext, which explicitly models temporal relationships through cross-view future Step Prediction and Clip Prediction. This plug-and-play module seamlessly integrates with diverse self-supervised objectives. We first establish standard benchmarks for SNN self-supervised learning on UCF101, HMDB51, and MiniKinetics, which are substantially larger than conventional DVS datasets. PredNext delivers significant performance improvements across different tasks and self-supervised methods. PredNext achieves performance comparable to ImageNet-pretrained supervised weights through unsupervised training solely on UCF101. Additional experiments demonstrate that PredNext, distinct from forced consistency constraints, substantially improves temporal feature consistency while enhancing network generalization capabilities. This work provides an effective foundation for unsupervised deep SNNs on large-scale temporal video data.


Poster
P3-#1702
Functional MRI Time Series Generation via Wavelet-Based Image Transform and Spectral Flow Matching for Brain Disorder Identification

Hwa Hui Tew ⋅ Junn Yong Loo ⋅ Leong Yu ⋅ Julia Lau ⋅ Ding Fan ⋅ Hernando Ombao ⋅ Raphaël Phan ⋅ Chee Tan ⋅ Chee-Ming Ting

Functional Magnetic Resonance Imaging (fMRI) provides non-invasive access to dynamic brain activity by measuring blood oxygen level-dependent (BOLD) signals over time. However, the resource-intensive nature of fMRI acquisition limits the availability of high-fidelity samples required for data-driven brain analysis models. While modern generative models can synthesize fMRI data, they often struggle to replicate the inherent non-stationarity, intricate spatiotemporal dynamics, and physiological variations of raw BOLD signals. To address these challenges, we propose Dual-Spectral Flow Matching (DSFM), a novel fMRI generative framework that cascades a dual frequency representation of BOLD signals with spectral flow matching. Specifically, our framework first converts BOLD signals into a wavelet decomposition map via a discrete wavelet transform (DWT) to capture globalized transient and multi-scale variations, and then projects this map into the discrete cosine transform (DCT) space across brain regions and time to exploit localized energy compaction of low-frequency dominant BOLD coefficients. Subsequently, a spectral flow matching model is trained to generate class-conditioned cosine-frequency representations. The generated samples are reconstructed through inverse DCT and inverse DWT operations to recover physiologically plausible time-domain BOLD signals. This dual-transform approach imposes structured frequency priors and preserves key physiological brain dynamics. Ultimately, we demonstrate the efficacy of our approach through improved downstream fMRI-based brain network classification.
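
The forward and inverse transform chain described above (DWT, then DCT, then inverse DCT and inverse DWT) can be sketched on a single 1-D signal. This is a deliberately simplified illustration, not DSFM itself: it assumes a single-level Haar wavelet and an orthonormal DCT-II applied per sub-band, whereas the abstract describes a wavelet decomposition map transformed across brain regions and time, and the generative flow-matching step is omitted entirely. All function names are hypothetical.

```python
import math


def haar_dwt(signal):
    """Single-level Haar DWT of an even-length signal."""
    s = math.sqrt(2.0)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / s for a, b in pairs]  # low-pass (coarse) band
    detail = [(a - b) / s for a, b in pairs]  # high-pass (transient) band
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    s = math.sqrt(2.0)
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) / s, (a - d) / s])
    return signal


def _scale(k, n):
    # Orthonormal DCT-II scaling factor.
    return math.sqrt((1.0 if k == 0 else 2.0) / n)


def dct(x):
    """Orthonormal DCT-II."""
    n = len(x)
    return [_scale(k, n) * sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, v in enumerate(x))
            for k in range(n)]


def idct(coeffs):
    """Inverse of the orthonormal DCT-II (a DCT-III)."""
    n = len(coeffs)
    return [sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
                for k, c in enumerate(coeffs))
            for i in range(n)]


def to_dual_spectral(signal):
    """DWT first, then a DCT of each wavelet band."""
    approx, detail = haar_dwt(signal)
    return dct(approx), dct(detail)


def from_dual_spectral(approx_coeffs, detail_coeffs):
    """Inverse DCT of each band, then inverse DWT."""
    return haar_idwt(idct(approx_coeffs), idct(detail_coeffs))


# Toy "BOLD" time series (hypothetical values).
bold = [1.0, 2.0, 4.0, 3.0, 0.0, -1.0, 2.0, 5.0]
coeffs = to_dual_spectral(bold)
recovered = from_dual_spectral(*coeffs)
print(max(abs(a - b) for a, b in zip(bold, recovered)))
```

Because both transforms are orthonormal here, the chain is exactly invertible up to floating-point error, so any sample produced in the coefficient space maps back to a unique time-domain signal.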


Poster
P3-#1703
EgoBrain: Synergizing Minds and Eyes For Human Action Understanding

Nie Lin ⋅ Yansen Wang ⋅ Dongqi Han ⋅ Wei-Bang Jiang ⋅ Jingyuan Li ⋅ Ryosuke Furuta ⋅ Yoichi Sato ⋅ Dongsheng Li

The integration of brain-computer interfaces (BCIs), in particular electroencephalography (EEG), with artificial intelligence (AI) has shown tremendous promise in decoding human cognition and behavior from neural signals. In particular, the rise of multimodal AI models has brought new possibilities that had never been imagined before. Here, we present EgoBrain, the world's first large-scale, temporally aligned multimodal dataset that synchronizes first-person (egocentric) vision and EEG of the human brain over extended periods of time, establishing a new paradigm for human-centered behavior analysis. This dataset comprises 61 hours of synchronized 32-channel EEG recordings and first-person video from 40 participants engaged in 29 categories of daily activities. We then developed a multimodal learning framework to fuse EEG and vision for action understanding, validated across both cross-subject and cross-environment challenges, achieving an action recognition accuracy of 66.70%. EgoBrain paves the way toward a unified framework for multimodal and egocentric brain–computer interfaces, bridging neural signals and first-person perception. Our dataset and code are publicly available at: https://huggingface.co/datasets/ut-vision/EgoBrain and https://github.com/ut-vision/EgoBrain.


Poster
P3-#1704
PSDNorm: Temporal Normalization for Deep Learning in Sleep Staging

Theo Gnassounou ⋅ Antoine Collas ⋅ Rémi Flamary ⋅ Alexandre Gramfort

Distribution shift poses a significant challenge in machine learning, particularly in biomedical applications using data collected across different subjects, institutions, and recording devices, such as sleep data. While existing normalization layers, BatchNorm, LayerNorm and InstanceNorm, help mitigate distribution shifts, when applied over the time dimension they ignore the dependencies and auto-correlation inherent to the vector coefficients they normalize. In this paper, we propose PSDNorm, which leverages Monge mapping and temporal context to normalize feature maps in deep learning models for signals. Evaluations with architectures based on U-Net or transformer backbones, trained on 10K subjects across 10 datasets, show that PSDNorm achieves state-of-the-art performance on unseen left-out datasets while being 4 times more data-efficient than BatchNorm.


Poster
P3-#1705
Moving Beyond Diffusion: Hierarchy-to-Hierarchy Autoregression for fMRI-to-Image Reconstruction

Xu Zhang ⋅ Ruijie Quan ⋅ Wenguan Wang ⋅ Yi Yang

Reconstructing visual stimuli from fMRI signals is a central challenge bridging machine learning and neuroscience. Recent diffusion-based methods typically map fMRI activity to a single neural embedding, using it as static guidance throughout the entire generation process. However, this fixed guidance collapses hierarchical neural information and is misaligned with the stage-dependent demands of image reconstruction. In response, we propose MindHier, a coarse-to-fine fMRI-to-image reconstruction framework built on scale-wise autoregressive modeling. MindHier introduces three components: a Hierarchical fMRI Encoder to extract multi-level neural embeddings, a Hierarchy-to-Hierarchy Alignment scheme to enforce layer-wise correspondence with CLIP features, and a Scale-Aware Coarse-to-Fine Neural Guidance strategy to inject these embeddings into autoregression at matching scales. These designs make MindHier an efficient and cognitively aligned alternative to diffusion-based methods by enabling a hierarchical reconstruction process that synthesizes global semantics before refining local details, akin to human visual perception. Extensive experiments on the NSD dataset show that MindHier achieves superior semantic fidelity, 4.67× faster inference, and more deterministic results than the diffusion-based baselines.


Poster
P3-#1706
Brain-Semantoks: Learning Semantic Tokens of Brain Dynamics with a Self-Distilled Foundation Model

Sam Gijsen ⋅ Marc-Andre Schulz ⋅ Kerstin Ritter

The development of foundation models for functional magnetic resonance imaging (fMRI) time series holds significant promise for predicting phenotypes related to disease and cognition. Current models, however, are often trained using a mask-and-reconstruct objective on small brain regions. This focus on low-level information leads to representations that are sensitive to noise and temporal fluctuations, necessitating extensive fine-tuning for downstream tasks. We introduce Brain-Semantoks, a self-supervised framework designed specifically to learn abstract representations of brain dynamics. Its architecture is built on two core innovations: a semantic tokenizer that aggregates noisy regional signals into robust tokens representing functional networks, and a self-distillation objective that enforces representational stability across time. We show that this objective is stabilized through a novel training curriculum, ensuring the model robustly learns meaningful features from low signal-to-noise time series. We demonstrate that learned representations enable strong performance on a variety of downstream tasks even when only using a linear probe. Furthermore, we provide comprehensive scaling analyses indicating more unlabeled data reliably results in out-of-distribution performance gains without domain adaptation.


Poster
P3-#1707
Seeing Through the Brain: New Insights from Decoding Visual Stimuli with fMRI

Zheng Huang ⋅ Enpei Zhang ⋅ Weikang Qiu ⋅ Yinghao Cai ⋅ Carl Yang ⋅ Elynn Chen ⋅ Xiang Zhang ⋅ Rex Ying ⋅ Dawei Zhou ⋅ Yujun Yan

Understanding how the brain encodes visual information is a central challenge in neuroscience and machine learning. A promising approach is to reconstruct visual stimuli (essentially images) from functional Magnetic Resonance Imaging (fMRI) signals. This involves two stages: transforming fMRI signals into a latent space and then using a pre-trained generative model to reconstruct images. The reconstruction quality depends on how similar the latent space is to the structure of neural activity and how well the generative model produces images from that space. Yet, it remains unclear which type of latent space best supports this transformation and how it should be organized to represent visual stimuli effectively. We present two key findings. First, fMRI signals are more similar to the text space of a language model than to either a vision-based space or a joint text–image space. Second, text representations and the generative model should be adapted to capture the compositional nature of visual stimuli, including objects, their detailed attributes, and relationships. Building on these insights, we propose PRISM, a model that Projects fMRI sIgnals into a Structured text space as an interMediate representation for visual stimuli reconstruction. It includes an object-centric diffusion module that generates images by composing individual objects to reduce object detection errors, and an attribute/relationship search module that automatically identifies key attributes and relationships that best align with the neural activity. Extensive experiments on real-world datasets demonstrate that our framework outperforms existing methods, achieving up to a 6% reduction in perceptual loss. These results highlight the importance of using structured text as an intermediate space to bridge fMRI signals and image reconstruction. Code is available at https://github.com/GraphmindDartmouth/PRISM.


Poster
P3-#1708
Online Pseudo-Zeroth-Order Training of Neuromorphic Spiking Neural Networks

Mingqing Xiao ⋅ Qingyan Meng ⋅ Zongpeng Zhang ⋅ Di He ⋅ Dongsheng Li ⋅ Zhouchen Lin

Brain-inspired neuromorphic computing with spiking neural networks (SNNs) is a promising energy-efficient computational approach. However, successfully training deep SNNs in a more biologically plausible and neuromorphic-hardware-friendly way is still challenging. Most recent methods leverage spatial and temporal backpropagation (BP), which does not adhere to neuromorphic properties. Despite the efforts of some online training methods, tackling spatial credit assignment with alternatives whose performance is competitive with spatial BP remains a significant problem. In this work, we propose a novel method, online pseudo-zeroth-order (OPZO) training. Our method only requires a single forward propagation with noise injection and direct top-down signals for spatial credit assignment, avoiding spatial BP's problem of symmetric weights and separate phases for layer-by-layer forward-backward propagation. OPZO solves the large variance problem of zeroth-order methods by the pseudo-zeroth-order formulation and momentum feedback connections, while having more guarantees than random feedback. Combined with online training, OPZO can pave the way to on-chip SNN training. Experiments on neuromorphic and static datasets with both fully connected and convolutional networks demonstrate the effectiveness of OPZO with competitive performance compared with spatial BP, as well as estimated low training costs.


Poster
P3-#1709
Interact-RAG: Reason and Interact with the Corpus, Beyond Black-Box Retrieval

Yulong Hui ⋅ Chao Chen ⋅ Zhihang Fu ⋅ Yihao Liu ⋅ Jieping Ye ⋅ Huanchen Zhang

Retrieval-Augmented Generation (RAG) has significantly enhanced LLMs by incorporating external information. However, prevailing agentic RAG approaches are constrained by a critical limitation: they treat the retrieval process as a black-box querying operation. This confines agents' actions to query issuing, hindering their ability to tackle complex information-seeking tasks. To address this, we introduce Interact-RAG, a new paradigm that elevates the LLM agent from a passive query issuer into an active manipulator of the retrieval process. We dismantle the black box with a Corpus Interaction Engine, equipping the agent with a set of action primitives for fine-grained control over information retrieval. To further empower the agent on the entire RAG pipeline, we first develop a reasoning-enhanced workflow, which enables both zero-shot execution and the synthesis of interaction trajectories. We then leverage this synthetic data to train a fully autonomous end-to-end agent via Supervised Fine-Tuning (SFT), followed by refinement with Reinforcement Learning (RL). Extensive experiments across six benchmarks demonstrate that Interact-RAG significantly outperforms other advanced methods, validating the efficacy of our reasoning-interaction strategy.


Poster
P3-#1710
Web-CogReasoner: Towards Multimodal Knowledge-Induced Cognitive Reasoning for Web Agents

Yuhan Guo ⋅ cong guo ⋅ Aiwen Sun ⋅ Hongliang He ⋅ Xinyu Yang ⋅ Yue Lu ⋅ Yingji Zhang ⋅ Xuntao Guo ⋅ Dong Zhang ⋅ Jianzhuang Liu ⋅ Jiang Duan ⋅ Yijia Xiao ⋅ Liangjian Wen ⋅ Hai-Ming Xu ⋅ Yong Dai

Multimodal large-scale models have significantly advanced the development of web agents, enabling them to perceive and interact with the digital environment in a manner analogous to human cognition. In this paper, we argue that web agents must first acquire sufficient knowledge to engage in cognitive reasoning effectively. Therefore, we decompose a web agent's capabilities into two essential stages: knowledge content learning and cognitive processes. To formalize this, we propose the Web-CogKnowledge Framework, which categorizes knowledge into Factual, Conceptual, and Procedural domains. In this framework, knowledge content learning corresponds to the agent's processes of Memorizing and Understanding, which rely on the former two types of knowledge, respectively, representing the "what" of learning. Conversely, cognitive processes correspond to Exploring, grounded in Procedural knowledge, defining the "how" of reasoning and action. To facilitate knowledge acquisition, we construct the Web-CogDataset, a structured resource curated from 14 real-world websites, designed to systematically instill the core knowledge necessary for a web agent. This dataset serves as the agent's conceptual grounding (the "nouns" upon which comprehension is built) as well as the basis for learning how to reason and act. Building on this foundation, we operationalize these processes through a novel knowledge-driven Chain-of-Thought (CoT) reasoning framework, developing and training our proposed multimodal web agent, the Web-CogReasoner. Extensive experimentation reveals its significant superiority over existing models, particularly in its capacity for generalization to unseen tasks where its structured knowledge proves decisive. To facilitate rigorous and systematic evaluation, we introduce the Web-CogBench, a comprehensive evaluation suite designed to assess and compare agent performance across the delineated knowledge domains and cognitive capabilities.
Our code and data are open sourced at https://github.com/Gnonymous/Web-CogReasoner.


Poster
P3-#1711
NAIPv2: Debiased Pairwise Learning for Efficient Paper Quality Estimation

Penghai Zhao ⋅ Jinyu Tian ⋅ Qinghua Xing ⋅ Xin Zhang ⋅ Zheng Li ⋅ Jianjun Qian ⋅ Ming-Ming Cheng ⋅ Xiang Li

The ability to estimate the quality of scientific papers is central to how both humans and AI systems will advance scientific knowledge in the future. However, existing LLM-based estimation methods suffer from high inference cost, whereas the faster direct score regression approach is limited by scale inconsistencies. We present NAIPv2, a debiased and efficient framework for paper quality estimation. NAIPv2 employs pairwise learning within domain-year groups to reduce inconsistencies in reviewer ratings and introduces the Review Tendency Signal (RTS) as a probabilistic integration of reviewer scores and confidences. To support training and evaluation, we further construct NAIDv2, a large-scale dataset of 24,276 ICLR submissions enriched with metadata and detailed structured content. Trained on pairwise comparisons but enabling efficient pointwise prediction at deployment, NAIPv2 achieves state-of-the-art performance (78.2% AUC, 0.432 Spearman), while maintaining scalable, linear-time efficiency at inference. Notably, on unseen NeurIPS submissions, it further demonstrates strong generalization, with predicted scores increasing consistently across decision categories from Rejected to Oral. These findings establish NAIPv2 as a debiased and scalable framework for automated paper quality estimation, marking a step toward future scientific intelligence systems.
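
The combination of pairwise training and pointwise prediction described above can be illustrated with a minimal sketch. It assumes a logistic (Bradley-Terry style) pairwise loss and a hypothetical linear scorer over made-up features; the abstract does not specify the loss or model NAIPv2 actually uses.

```python
import math

def score(weights, features):
    """Pointwise scorer: one scalar per paper (a hypothetical linear model)."""
    return sum(w * f for w, f in zip(weights, features))

def pairwise_loss(weights, better, worse):
    """Logistic pairwise loss: -log sigmoid(s_better - s_worse)."""
    margin = score(weights, better) - score(weights, worse)
    return math.log1p(math.exp(-margin))

def sgd_step(weights, better, worse, lr=0.1):
    """One gradient step on the pairwise loss for a single pair."""
    margin = score(weights, better) - score(weights, worse)
    coeff = -1.0 / (1.0 + math.exp(margin))  # derivative of the loss w.r.t. the margin
    return [w - lr * coeff * (b - c) for w, b, c in zip(weights, better, worse)]

# Toy (better, worse) pairs from a single group; features are made up.
pairs = [([1.0, 0.2], [0.1, 0.9]), ([0.8, 0.1], [0.3, 0.7])]
weights = [0.0, 0.0]
initial = sum(pairwise_loss(weights, b, w) for b, w in pairs)
for _ in range(200):
    for b, w in pairs:
        weights = sgd_step(weights, b, w)
final = sum(pairwise_loss(weights, b, w) for b, w in pairs)

# At deployment each paper is scored on its own, one scorer call per paper.
ranked = sorted([[0.1, 0.9], [1.0, 0.2]], key=lambda f: score(weights, f), reverse=True)
```

Because the loss depends only on score differences within a pair, a constant offset in ratings across a group cancels out, while inference still needs just one scorer call per paper.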


Poster
P3-#1712
HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models

Zhaolu Kang ⋅ Junhao Gong ⋅ Jiaxu Yan ⋅ Wanke Xia ⋅ Yian Wang ⋅ Zhuo Cheng ⋅ Wenhao Cao ⋅ Ziwen Wang ⋅ ZhiYuan Feng ⋅ Huaxuan Ding ⋅ Siqi He ⋅ Shannan Yan ⋅ Xiaomin He ⋅ Junzhe Chen ⋅ Chaoya Jiang ⋅ Wei Ye ⋅ Kaidong Yu ⋅ Xuelong Li

Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for HSS scenarios, in which multiple domain experts and automated agents collaborate to generate and iteratively refine each sample. HSSBench contains over 13,000 meticulously designed samples, covering six key categories. We benchmark more than 20 mainstream MLLMs on HSSBench and demonstrate that it poses significant challenges even for state-of-the-art models. We hope that this benchmark will inspire further research into enhancing the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to internalize and connect knowledge across fields.


Poster
P3-#1713
ProofOptimizer: Training Language Models to Simplify Proofs without Human Demonstrations

Alex Gu ⋅ Bartosz Piotrowski ⋅ Fabian Gloeckle ⋅ Kaiyu Yang ⋅ Aram Markosyan

Neural theorem proving has advanced rapidly in the past year, reaching IMO gold-medalist capabilities and producing formal proofs that span thousands of lines. Although such proofs are mechanically verified by formal systems like Lean, their excessive length renders them difficult for humans to comprehend and limits their usefulness for mathematical insight. Proof simplification is therefore a critical bottleneck. Yet, training data for this task is scarce, and existing methods—mainly agentic scaffolding with off-the-shelf LLMs—struggle with the extremely long proofs generated by RL-trained provers. We introduce ProofOptimizer, the first language model trained to simplify Lean proofs without requiring additional human supervision. ProofOptimizer is trained via expert iteration and reinforcement learning, using Lean to verify simplifications and provide training signal. At inference time, it operates within an iterative proof-shortening workflow, progressively reducing proof length. Experiments show that ProofOptimizer substantially compresses proofs generated by state-of-the-art RL-trained provers on standard benchmarks, reducing proof length by 87% on miniF2F, 57% on PutnamBench, and 50% on Seed-Prover's IMO 2025 proofs. Beyond conciseness, the simplified proofs check faster in Lean and further improve downstream prover performance when reused as training data for supervised finetuning.


Poster
P3-#1714
BridgeDrive: Diffusion Bridge Policy for Closed-Loop Trajectory Planning in Autonomous Driving

Shu Liu ⋅ Wenlin Chen ⋅ Weihao Li ⋅ Zheng Wang ⋅ Lijin Yang ⋅ Jianing Huang ⋅ YipinZhang ⋅ Zhongzhan Huang ⋅ Ze Cheng ⋅ Hao Yang

Diffusion-based planners have shown strong potential for autonomous driving by capturing multi-modal driving behaviors. A key challenge is how to effectively guide these models for safe and reactive planning in closed-loop settings, where the ego vehicle's actions influence future states. Recent work leverages typical expert driving behaviors (i.e., anchors) to guide diffusion planners but relies on a truncated diffusion schedule that introduces an asymmetry between the forward and denoising processes, diverging from the core principles of diffusion models. To address this, we introduce BridgeDrive, a novel anchor-guided diffusion bridge policy for closed-loop trajectory planning. Our approach formulates planning as a diffusion bridge that directly transforms coarse anchor trajectories into refined, context-aware plans, ensuring theoretical consistency between the forward and reverse processes. BridgeDrive is compatible with efficient ODE solvers, enabling real-time deployment. We achieve state-of-the-art performance on the Bench2Drive closed-loop evaluation benchmark, improving the success rate by 7.72% and 2.45% over prior art on the PDM-Lite and LEAD datasets, respectively. Project page: https://github.com/shuliu-ethz/BridgeDrive.


Poster
P3-#1715
Dual-Scale World Memory for LLM Agents towards Hard-Exploration Problems

Minsoo Kim ⋅ seung-won hwang

LLM-based agents have seen promising advances, yet are still limited in hard-exploration tasks which require agents to perform sustained exploration under sparse feedback. We present GLoW, a novel approach leveraging a dual-scale textual world memory, maintaining a trajectory frontier of high-value discoveries at the global scale, while learning from local trial-and-error in exploration through a Multi-path Advantage Reflection mechanism which infers advantage-based progress signals to guide exploration. To evaluate our framework for hard-exploration, we tackle the Jericho benchmark suite of text-based games, where GLoW achieves a new state-of-the-art performance for LLM-based approaches. Compared to state-of-the-art RL-based methods, our approach achieves comparable performance while requiring 100-800× fewer environment interactions. When scaled to stronger LLMs, GLoW surpasses all prior methods on 4 out of 6 difficult and extreme Jericho games.


Poster
P3-#1716
When MLLMs Meet Compression Distortion: A Coding Paradigm Tailored to MLLMs

Jinming Liu ⋅ Zhaoyang Jia ⋅ Jiahao Li ⋅ Bin Li ⋅ Xin Jin ⋅ Wenjun Zeng ⋅ Yan Lu

The increasing deployment of powerful Multimodal Large Language Models (MLLMs), typically hosted on cloud platforms, urgently requires effective compression techniques to efficiently transmit signal inputs (e.g., images, videos) from edge devices with minimal bandwidth usage. However, conventional image codecs are optimized for fidelity to serve the Human Visual System (HVS) and are ill-suited for MLLMs, in which diverse downstream tasks are jointly considered. In this paper, we first systematically analyze the impact of compression artifacts on several mainstream MLLMs. We find that compression distortion unevenly impacts different-level image features, leading to varying effects on MLLMs' downstream tasks depending on their feature-level reliance. Motivated by this discovery, we propose an image Codec TAilored to MLLMs (CoTAM) designed to adaptively protect multi-level features and suit different demands of downstream tasks. The encoder leverages CLIP's shallow-layer attention to generate an importance map for bit allocation, preserving critical semantic regions. Concurrently, the decoder integrates a lightweight adapter with a multi-level loss function to ensure the faithful reconstruction of both low-level details and high-level semantic context for robust synthesis of cross-level features. Extensive experiments validate that our method achieves up to 35.99% bitrate saving while maintaining the same performance on the MLLM tasks, outperforming previous SOTA neural codecs. The code is released at https://github.com/jmliu206/CoTAM.


Poster
P3-#1717
More Than What Was Chosen: LLM-based Explainable Recommendation Beyond Noisy User Preferences

Chung Park ⋅ Hyeongjun Yun ⋅ Taesan Kim ⋅ Junui Hong ⋅ Dongjoon Hong ⋅ Mira Myong ⋅ Jihoon Oh ⋅ MinCheol Cho ⋅ Kijung Park ⋅ Min Choi ⋅ Jihwan Seok ⋅ Jaegul Choo

Recommender systems traditionally rely on the principle of Revealed Preference (RP), which assumes that observed user behaviors faithfully reflect underlying interests. While effective at scale, this assumption is fragile in practice, as real-world choices are often noisy and inconsistent. Thus, even LLM-based recommendation models (LLM-Rec) equipped with advanced reasoning capabilities may fail to capture genuine user preferences and often produce rationales of limited persuasiveness. To address this issue, we introduce the concept of Coherent Preference (CP), which complements RP by favoring items that are logically and causally coherent with user interaction history. Building on this perspective, we propose Conflict-Aware Direct Preference Optimization (C-APO), an LLM-Rec framework that jointly optimizes RP and CP while adaptively reconciling their agreement and conflict, delivering robust recommendation performance and logically consistent rationales. We construct a unified ordering approach that combines the RP signal, based on chosen versus unobserved items, with the CP signal, which ranks items by their logical consistency with past interaction history. In this unified preference ordering, we dynamically adjust the influence of each signal depending on whether RP and CP agree or conflict, allowing the model to better capture user intent and generate more plausible recommendations. On the Amazon Review dataset, our approach consistently outperforms approximately 20 state-of-the-art baseline models in both recommendation performance and rationale quality, achieving a 1.65× relative improvement in click-through rate during deployment, thereby demonstrating its practical utility. The code and dataset are available at https://github.com/cpark88/C-APO.


Poster
P3-#1718
Evaluating Text Creativity across Diverse Domains: a Dataset and Large Language Model Evaluator

Qian Cao ⋅ Xiting Wang ⋅ Yuzhuo Yuan ⋅ Yahui Liu ⋅ Fang Luo ⋅ Ruihua Song

Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations heavily rely on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, in this paper, we propose a novel pairwise-comparison framework for assessing textual creativity, leveraging shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments. Experimental results underscore the indispensable significance of integrating both human-generated and synthetic data in training highly robust evaluators, and showcase the practical utility of CrEval in boosting the creativity of LLMs. We will release all data, code, and models publicly to support further research.


Poster
P3-#1719
Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing

Justin Lin ⋅ Eliot Jones ⋅ Donovan Jasper ⋅ Ethan Ho ⋅ Anna Wu ⋅ Arnold Yang ⋅ Neil Perry ⋅ Andy Zou ⋅ Matt Fredrikson ⋅ Zico Kolter ⋅ Percy Liang ⋅ Dan Boneh ⋅ Daniel Ho

We present the first comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment. We evaluate ten cybersecurity professionals alongside six existing AI agents and ARTEMIS, our new agent scaffold, on a large university network consisting of ~8,000 hosts across 12 subnets. ARTEMIS is a multi-agent framework featuring dynamic prompt generation, arbitrary sub-agents, and automatic vulnerability triaging. In our comparative study, ARTEMIS placed second overall, discovering 9 valid vulnerabilities with an 82% valid submission rate and outperforming 9 of 10 human participants. While existing scaffolds such as Codex and CyAgent underperformed relative to most human participants, ARTEMIS demonstrated technical sophistication and submission quality comparable to the strongest participants. AI agents offer advantages in systematic enumeration, parallel exploitation, and cost: certain ARTEMIS variants cost $18/hour versus $60/hour for professional penetration testers. We also identify key capability gaps: AI agents exhibit higher false-positive rates and struggle with GUI-based tasks.


Poster
P3-#1720
Sci2Pol: Evaluating and Fine-tuning LLMs on Scientific-to-Policy Brief Generation

Weimin Wu ⋅ Alexander Furnas ⋅ Eddie Yang ⋅ Gefei Liu ⋅ Akhil Pandey Akella ⋅ Xuefeng Song ⋅ Dashun Wang ⋅ Han Liu

We propose Sci2Pol-Bench and Sci2Pol-Corpus, the first benchmark and training dataset for evaluating and fine-tuning large language models (LLMs) on policy brief generation from a scientific paper. We build Sci2Pol-Bench on a five-stage taxonomy to mirror the human writing process: (i) Autocompletion, (ii) Understanding, (iii) Summarization, (iv) Generation, and (v) Verification. It features 18 tasks in multiple-choice and open-ended formats. Specifically, for the Generation stage, we show that BERTScore and ROUGE scores fail to capture the quality of brief writing, and introduce a new LLM-based evaluation metric aligned with expert judgement. Using this benchmark, we evaluate 13 leading open-source and commercial LLMs to uncover key limitations. To improve LLM performance on brief writing, we curate the Sci2Pol-Corpus for fine-tuning. We start by linking each cited scientific paper to its corresponding policy document, drawn from 5.6 million policy records. This produces 140,000 candidate pairs. We then employ an LLM-as-a-judge to filter high-quality examples, followed by in-context polishing using three expert-written samples as references. This process yields a final set of 639 new pairs. Finally, we fine-tune three models on Sci2Pol-Corpus: LLaMA-3.1-8B, Gemma-12B, and Gemma-27B. Fine-tuning leads to consistent performance improvements across Sci2Pol-Bench. Notably, after fine-tuning, Gemma-27B surpasses the much larger GPT-4o and DeepSeek-V3 (671B). These results demonstrate the effectiveness of our corpus in bridging the gap between science and policy.


Poster
P3-#1721
Music Flamingo: Scaling Music Understanding in Audio Language Models

Sreyan Ghosh ⋅ Arushi Goel ⋅ Lasha Koroshinadze ⋅ Sang-gil Lee ⋅ Zhifeng Kong ⋅ Joao Santos ⋅ Ramani Duraiswami ⋅ Dinesh Manocha ⋅ Wei Ping ⋅ Mohammad Shoeybi ⋅ Bryan Catanzaro

We introduce Music Flamingo, a novel large audio–language model, designed to advance music (including song) understanding in foundational audio models. While audio–language research has progressed rapidly, music remains challenging due to its dynamic, layered, and information-dense nature. Progress has been further limited by the difficulty of scaling open audio understanding models, primarily because of the scarcity of high-quality music data and annotations. As a result, prior models are restricted to producing short, high-level captions, answering only surface-level questions, and showing limited generalization across diverse musical cultures. To address these challenges, we curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline that yields rich captions and question–answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We fine-tune an enhanced Audio Flamingo 3 backbone on MF-Skills and further strengthen multiple skills relevant to music understanding. To improve the model's reasoning abilities, we introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards. Music Flamingo achieves state-of-the-art results across 10+ benchmarks for music understanding and reasoning, establishing itself as a generalist and musically intelligent audio–language model. Beyond strong empirical results, Music Flamingo sets a new standard for advanced music understanding by demonstrating how models can move from surface-level recognition towards layered, human-like perception of songs. We believe this work provides both a benchmark and a foundation for the community to build the next generation of models that engage with music as meaningfully as humans do. Demo: https://musicflamingo.github.io


Poster
P3-#1723
SyncTrack: Rhythmic Stability and Synchronization in Multi-Track Music Generation

Hongrui Wang ⋅ Fan Zhang ⋅ Zhiyuan Yu ⋅ Ziya Zhou ⋅ Xi Chen ⋅ Can Yang ⋅ Yang Wang

Multi-track music generation has garnered significant research interest due to its precise mixing and remixing capabilities. However, existing models often overlook essential attributes such as rhythmic stability and synchronization, leading to a focus on differences between tracks rather than their inherent properties. In this paper, we introduce SyncTrack, a synchronous multi-track waveform music generation model designed to capture the unique characteristics of multi-track music. SyncTrack features a novel architecture that includes track-shared modules to establish a common rhythm across all tracks and track-specific modules to accommodate diverse timbres and pitch ranges. Each track-shared module employs two cross-track attention mechanisms to synchronize rhythmic information, while each track-specific module utilizes learnable instrument priors to better represent timbre and other unique features. Additionally, we enhance the evaluation of multi-track music quality by introducing rhythmic consistency through three novel metrics: Inner-track Rhythmic Stability (IRS), Cross-track Beat Synchronization (CBS), and Cross-track Beat Dispersion (CBD). Experiments demonstrate that SyncTrack significantly improves the multi-track music quality by enhancing rhythmic consistency.


Poster
P3-#1724
HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space

Ke Li ⋅ Zheng Yang ⋅ Zhongbin Zhou ⋅ Xuefeng ⋅ Zhonglin Jiang ⋅ Wenxiao Wang

Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared to dense LLMs. However, their large parameter counts result in prohibitive memory requirements, limiting practical deployment. While existing pruning methods primarily focus on expert-level pruning, this coarse granularity often leads to substantial accuracy degradation. In this work, we introduce HEAPr, a novel pruning algorithm that decomposes experts into smaller, indivisible atomic experts, enabling more precise and flexible atomic expert pruning. To measure the importance of each atomic expert, we leverage second-order information based on principles similar to the Optimal Brain Surgeon theory. To address the computational and storage challenges posed by second-order information, HEAPr exploits the inherent properties of atomic experts to transform the second-order information from expert parameters into that of atomic expert parameters, and further simplifies it to the second-order information of atomic expert outputs. This approach reduces the space complexity from $\mathcal{O}(d^4)$, where $d$ is the model’s dimensionality, to $\mathcal{O}(d^2)$. HEAPr requires only two forward passes and one backward pass on a small calibration set to compute the importance of atomic experts. Extensive experiments on MoE models, including the DeepSeek MoE and Qwen MoE families, demonstrate that HEAPr outperforms existing expert-level pruning methods across a wide range of pruning ratios and benchmarks. Specifically, HEAPr achieves nearly lossless compression at pruning ratios of 20% to 25% in most models, while also reducing FLOPs by nearly 20%. The code can be found at https://github.com/LLIKKE/HEAPr.
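
One way to picture the decomposition into atomic experts is sketched below. It assumes that an expert is a gated feed-forward block and that an atomic expert is a single hidden unit (its gate row, up row, and down-projection column). This reading is an assumption for illustration, and the paper's exact definition may differ. Under it, the expert output is exactly the sum of the atomic outputs, so pruning one unit removes only that unit's contribution.

```python
import math

def silu(v):
    return v / (1.0 + math.exp(-v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def atomic_output(x, gate_row, up_row, down_col):
    """Contribution of one hidden unit: down_col * silu(gate_row . x) * (up_row . x)."""
    h = silu(dot(gate_row, x)) * dot(up_row, x)
    return [h * d for d in down_col]

def expert_output(x, gate, up, down_cols, keep=None):
    """Sum atomic outputs over the kept hidden units (all units if keep is None)."""
    units = range(len(gate)) if keep is None else keep
    out = [0.0] * len(down_cols[0])
    for j in units:
        contrib = atomic_output(x, gate[j], up[j], down_cols[j])
        out = [o + c for o, c in zip(out, contrib)]
    return out

# Toy expert with model width d = 2 and 3 hidden units (all values are made up).
gate = [[0.5, -1.0], [1.0, 0.0], [0.2, 0.3]]
up = [[1.0, 1.0], [0.0, 2.0], [-1.0, 1.5]]
down_cols = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # column j of the down projection
x = [1.0, 2.0]

full = expert_output(x, gate, up, down_cols)
pruned = expert_output(x, gate, up, down_cols, keep=[0, 1])
removed = atomic_output(x, gate[2], up[2], down_cols[2])
```

On the complexity figures quoted in the abstract: a full Hessian over p parameters has p^2 entries, so an expert with on the order of d^2 parameters would need on the order of d^4 entries, which is why a formulation over outputs is attractive.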


Poster
P3-#1824
Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

Yusong Wu ⋅ Stephen Brade ⋅ Teng Ma ⋅ Tia-Jane Fowler ⋅ Enning Yang ⋅ Berker Banar ⋅ Aaron Courville ⋅ Natasha Jaques ⋅ Anna Huang

Most applications of generative AI involve a sequential interaction in which a person inputs a prompt and waits for a response, and where reaction time and adaptivity are not important factors. In contrast, live jamming is a collaborative interaction that requires real-time coordination and adaptation without access to the other player’s future moves, while preserving diversity to sustain a creative flow. Reinforcement learning post-training enables effective adaptation through on-policy interaction, yet it often reduces output diversity by exploiting coherence-based rewards. This collapse, known as "reward hacking", affects many RL post-training pipelines, but is especially harmful in live jamming, where musical creativity relies on dynamic variation and mutual responsiveness. In this paper, we propose a novel adversarial training method on policy-generated trajectories to mitigate reward hacking in RL post-training for melody-to-chord accompaniment. A co-evolving discriminator separates policy trajectories from the data distribution, while the policy maximizes the discriminator output in addition to coherence rewards to prevent collapse to trivial outputs. We evaluate accompaniment quality and output diversity in simulation with both fixed test melodies and learned melody agents, and we conduct a user study with the model deployed in a real-time interactive system with expert musicians. Quantitative evaluation and user feedback demonstrate improved output diversity, harmonic coherence, adaptation speed and user agency. Our results demonstrate a simple yet effective method to mitigate reward hacking in RL post-training of generative sequence models.


Poster
P3-#1823
USTBench: Benchmarking and Dissecting Spatiotemporal Reasoning Capabilities of LLMs as Urban Agents

Siqi Lai ⋅ Yansong Ning ⋅ Zirui Yuan ⋅ Zhixi Chen ⋅ Hao Liu

Large language models (LLMs) have shown emerging potential in spatiotemporal reasoning, making them promising candidates for building urban agents that support diverse urban downstream applications. Despite these benefits, existing studies primarily focus on evaluating urban LLM agents on outcome-level metrics (e.g., prediction accuracy, traffic efficiency), offering limited insight into their underlying reasoning processes. As a result, the strengths and limitations of urban LLM agents in spatiotemporal reasoning remain poorly understood. To this end, we introduce USTBench, the first benchmark to evaluate LLMs’ spatiotemporal reasoning abilities as urban agents across four decomposed dimensions: spatiotemporal understanding, forecasting, planning, and reflection. Specifically, USTBench supports five diverse urban decision-making and four spatiotemporal prediction tasks, all running within our constructed interactive city environment UAgentEnv. The benchmark includes 62,466 structured QA pairs for process-based evaluation and standardized end-to-end task assessments, enabling fine-grained diagnostics and broad task-level comparison across diverse urban scenarios. Through extensive evaluation of fourteen leading LLMs, we reveal that although LLMs show promising potential across various urban downstream tasks, they still struggle in long-horizon planning and reflective adaptation in dynamic urban contexts. Notably, recent advanced reasoning models (e.g., DeepSeek-R1) trained on general logic or mathematical problems do not consistently outperform non-reasoning LLMs. This discrepancy highlights the need for domain-specialized adaptation methods to enhance urban spatiotemporal reasoning. Overall, USTBench provides a foundation to build more adaptive and effective LLM-based urban agents and broad smart city applications. Our project is available at https://github.com/usail-hkust/USTBench.


Poster
P3-#1822
Aria: an Agent for Retrieval and Iterative Auto-Formalization via Dependency Graph

Wang Hanyu ⋅ Ruohan Xie ⋅ Wang Yutong ⋅ Guoxiong Gao ⋅ XintaoYu ⋅ Bin Dong

Accurate auto-formalization of theorem statements is essential for advancing automated discovery and verification of research-level mathematics, yet remains a major bottleneck for LLMs due to hallucinations, semantic mismatches, and their inability to synthesize new definitions. To tackle these issues, we present Aria (Agent for Retrieval and Iterative Autoformalization), a system for conjecture-level formalization in Lean that emulates human expert reasoning via a two-phase Graph-of-Thought process: recursively decomposing statements into a dependency graph and then constructing formalizations from grounded concepts. To ensure semantic correctness, we introduce AriaScorer, a checker that retrieves definitions from Mathlib for term-level grounding, enabling rigorous and reliable verification. We evaluate Aria on diverse benchmarks. On ProofNet, it achieves 91.6% compilation success rate and 68.5% final accuracy, surpassing previous methods. On FATE-X, a suite of challenging algebra problems from research literature, it outperforms the best baseline with 44.0% vs. 24.0% final accuracy. On a dataset of homological conjectures, Aria reaches 42.9% final accuracy while all other models score 0%.


Poster
P3-#1821
Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning

Xingrui Zhuo ⋅ Jiapu Wang ⋅ Gongqing Wu ⋅ Zhongyuan Wang ⋅ Jichen Zhang ⋅ Shirui Pan ⋅ Xindong Wu

Inductive Knowledge Graph Reasoning (KGR) aims to discover facts in open-domain KGs containing unknown entities and relations, which poses a challenge for KGR models in comprehending uncertain KG components. Existing studies have proposed Knowledge Graph Foundation Models (KGFMs) that learn structural invariances across KGs to handle this uncertainty. Recently, Large Language Models (LLMs) have demonstrated strong capabilities for open-domain knowledge reasoning. As a result, the latest research has focused on LLM-based KGFMs that integrate LLM knowledge with KG context for inductive KGR. However, the intrinsic knowledge of LLMs may be overshadowed by sparse KG context, leading to LLM knowledge distortion, which can cause irreversible damage to model reasoning. Moreover, existing LLM-based KGR methods still struggle to fully constrain generative hallucinations in LLMs, severely limiting the credibility of reasoning results. To address these limitations, we propose a Knowledge Reasoning Language Model (KRLM) that achieves unified coordination between LLM knowledge and KG context throughout the KGR process. Specifically, we design a Knowledge Reasoning Language (KRL) instruction format and a KRL tokenizer to align LLM knowledge with KG representations. Then, we propose a KRL attention layer that coordinates intrinsic LLM knowledge with additional KG context through a dynamic knowledge memory mechanism. Finally, a structure-aware next-entity predictor is proposed, which strictly constrains the reasoning results within a trustworthy knowledge domain. Extensive experimental results on 25 real-world inductive KGR datasets demonstrate the significant superiority of the proposed KRLM in both zero-shot reasoning and fine-tuning scenarios.


Poster
P3-#1820
LAMDA: A Longitudinal Android Malware Benchmark for Concept Drift Analysis

Md Ahsanul Haque ⋅ Ismail Hossain ⋅ Mahmuduzzaman Kamol ⋅ Jahangir Alam ⋅ Suresh Kumar Amalapuram ⋅ Sajedul Talukder ⋅ Mohammad Saidur Rahman

Machine learning (ML)-based malware detection systems often fail to account for the dynamic nature of real-world training and test data distributions. In practice, these distributions evolve due to frequent changes in the Android ecosystem, adversarial development of new malware families, and the continuous emergence of both benign and malicious applications. Prior studies have shown that such concept drift (distributional shifts in benign and malicious samples) leads to significant degradation in detection performance over time. Despite the practical importance of this issue, existing datasets are often outdated and limited in temporal scope, diversity of malware families, and sample scale, making them insufficient for the systematic evaluation of concept drift in malware detection. To address this gap, we present LAMDA, the largest and most temporally diverse Android malware benchmark to date, designed specifically for concept drift analysis. LAMDA spans 12 years (2013–2025, excluding 2015), includes over 1 million samples (approximately 37% labeled as malware), and covers 1,380 malware families and 150,000 singleton samples, reflecting the natural distribution and evolution of real-world Android applications. We empirically demonstrate LAMDA's utility by quantifying the performance degradation of standard ML models over time and analyzing feature stability across years. As the most comprehensive Android malware dataset to date, LAMDA enables in-depth research into temporal drift, generalization, explainability, and evolving detection challenges.


Poster
P3-#1819
floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL

Bhavya Agrawalla ⋅ Michal Nauman ⋅ Khush Agrawal ⋅ Aviral Kumar

A hallmark of modern large-scale machine learning techniques is the use of training objectives that provide dense supervision to intermediate computations, such as teacher forcing the next token in language models or denoising step-by-step in diffusion models. This enables models to learn complex functions in a generalizable manner. Motivated by this observation, we investigate the benefits of iterative computation for temporal difference (TD) methods in reinforcement learning (RL). Typically, they represent value functions in a monolithic fashion, without iterative compute. We introduce floq (flow-matching Q-functions), an approach that parameterizes the Q-function using a velocity field and trains it with techniques from flow-matching, typically used in generative modeling. This velocity field underneath the flow is trained using a TD-learning objective, which bootstraps from values produced by a target velocity field, computed by running multiple steps of numerical integration. Crucially, floq allows for more fine-grained control and scaling of the Q-function capacity than monolithic architectures, by appropriately setting the number of integration steps. Across a suite of challenging offline RL benchmarks and online fine-tuning tasks, floq improves performance by nearly 1.8x. floq scales capacity far better than standard TD-learning architectures, highlighting the potential of iterative computation for value learning.
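The role of the integration step count can be illustrated with a toy sketch. It assumes a scalar value, plain Euler integration over t in [0, 1], and a hand-written `toy_velocity` standing in for a learned, state-action-conditioned network; none of these details come from the abstract, which only states that values are computed by running multiple steps of numerical integration.

```python
def integrate_velocity(velocity, z0, num_steps):
    """Euler-integrate dz/dt = velocity(z, t) from t = 0 to t = 1.

    The returned z would be read as the value estimate; num_steps sets
    how much iterative compute is spent on it.
    """
    z = z0
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h
        z = z + h * velocity(z, t)
    return z


def toy_velocity(z, t):
    # Hypothetical stand-in for a learned velocity network.
    return 2.0 - z  # pulls z toward 2.0


coarse = integrate_velocity(toy_velocity, 0.0, 1)   # a single Euler step
fine = integrate_velocity(toy_velocity, 0.0, 16)    # sixteen Euler steps
```

In this toy, the single-step result overshoots the true ODE solution, while the 16-step result tracks it more closely, which is the sense in which more integration steps buy a more expressive computation.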

Accurate prediction of the need for invasive mechanical ventilation (IMV) in intensive care unit (ICU) patients is crucial for timely interventions and resource allocation. However, variability in patient populations, clinical practices, and electronic health record (EHR) systems across institutions introduces domain shifts that degrade the generalization performance of predictive models during deployment. Test-Time Training (TTT) has emerged as a promising approach to mitigate such shifts by adapting models dynamically during inference without requiring labeled target-domain data. In this work, we introduce Adaptive Test-Time Training (AdaTTT), an enhanced TTT framework tailored for EHR-based IMV prediction in ICU settings. We begin by deriving information-theoretic bounds on the test-time prediction error and demonstrate that it is constrained by the uncertainty between the main and auxiliary tasks. To enhance their alignment, we introduce a self-supervised learning framework with two pretext tasks, reconstruction and masked feature modeling, optimized through a dynamic masking strategy that emphasizes features critical to the main task. Additionally, to improve robustness against domain shifts, we incorporate prototype learning and employ Partial Optimal Transport (POT) for flexible, partial feature alignment while maintaining clinically meaningful patient representations. Experiments across multi-center ICU cohorts demonstrate competitive classification performance on different test-time adaptation benchmarks.


Poster
P3-#1817
AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?

Guibin Zhang ⋅ Junhao Wang ⋅ Junjie Chen ⋅ Wangchunshu Zhou ⋅ Kun Wang ⋅ Shuicheng YAN

Large Language Model (LLM)-based agentic systems, often comprising multiple models, complex tool invocations, and orchestration protocols, substantially outperform monolithic agents. Yet this very sophistication amplifies their fragility, making them more prone to system failure. Pinpointing the specific agent or step responsible for an error within long execution traces defines the task of \textbf{agentic system failure attribution}. Current state-of-the-art reasoning LLMs, however, remain strikingly inadequate for this challenge, with accuracy generally below $10\%$. To address this gap, we propose AgenTracer, the first automated framework for annotating failed multi-agent trajectories via counterfactual replay and programmed fault injection, producing the curated dataset TracerTraj. Leveraging this resource, we develop AgenTracer-8B, a lightweight failure tracer trained with multi-granular reinforcement learning, capable of efficiently diagnosing errors in verbose multi-agent interactions. On the Who\&When benchmark, AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to $18.18\%$, setting a new standard in LLM agentic failure attribution. More importantly, AgenTracer-8B delivers actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS with $4.8\sim14.2\%$ performance gains, empowering self-correcting and self-evolving agentic AI.


Poster
P3-#1816
From Language to Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance

Zhe Li ⋅ Yangyang Wei ⋅ Boan Zhu ⋅ Yibo Peng ⋅ Tao Huang ⋅ Pengwei Wang ⋅ Zhongyuan Wang ⋅ Cheng Chi ⋅ Chang Xu ⋅ Shanghang Zhang

Natural language offers a natural interface for humanoid robots, but existing text-to-motion pipelines remain cumbersome and unreliable. They typically decode human motion, retarget it to robot morphology, and then track it with a physics-based controller. However, this multi-stage process is prone to cumulative errors, introduces high latency, and yields weak coupling between semantics and control. These limitations call for a more direct pathway from language to action, one that eliminates fragile intermediate stages. Therefore, we present RoboGhost, a retargeting-free framework that directly conditions humanoid policies on language-grounded motion latents. By bypassing explicit motion decoding and retargeting, RoboGhost enables a diffusion-based policy to denoise executable actions directly from noise, preserving semantic intent and supporting fast, reactive control. A hybrid causal transformer–diffusion design further ensures long-horizon consistency while maintaining stability and diversity, yielding rich latent representations for precise humanoid behavior. Extensive experiments demonstrate that RoboGhost substantially reduces deployment latency, improves success rates and tracking accuracy, and produces smooth, semantically aligned locomotion on real humanoids. Beyond text, the framework naturally extends to other modalities such as images, audio, and music, providing a general foundation for vision–language–action humanoid systems.


Poster
P3-#1815
An Ensemble Framework for Unbiased Language Model Watermarking

Yihan Wu ⋅ Ruibo Chen ⋅ Georgios Milis ⋅ Heng Huang

As large language models become increasingly capable and widely deployed, verifying the provenance of machine-generated content is critical to ensuring trust, safety, and accountability. Watermarking techniques have emerged as a promising solution by embedding imperceptible statistical signals into the generation process. Among them, unbiased watermarking is particularly attractive due to its theoretical guarantee of preserving the language model's output distribution, thereby avoiding degradation in fluency or detectability through distributional shifts. However, existing unbiased watermarking schemes often suffer from weak detection power and limited robustness, especially under short text lengths or distributional perturbations. In this work, we propose ENS, a novel ensemble framework that enhances the detectability and robustness of logits-based unbiased watermarks while strictly preserving their unbiasedness. ENS sequentially composes multiple independent watermark instances, each governed by a distinct key, to amplify the watermark signal. We theoretically prove that the ensemble construction remains unbiased in expectation and demonstrate how it improves the signal-to-noise ratio for statistical detectors. Empirical evaluations on multiple LLM families show that ENS substantially reduces the number of tokens needed for reliable detection and increases resistance to smoothing and paraphrasing attacks without compromising generation quality.


Poster
P3-#1814
Pushing Test-Time Scaling Limits of Deep Search with Asymmetric Verification

Weihao Zeng ⋅ Keqing He ⋅ Chuqiao Kuang ⋅ Xiaoguang Li ⋅ Junxian He

Test-time compute can be scaled both sequentially and in parallel. Sequential scaling involves lengthening the generation process, while parallel scaling involves verifying and selecting among multiple candidate outputs. Combining these two strategies has led to the most powerful AI systems, such as Grok 4 Heavy, GPT-5 Pro, and Gemini-2.5 Pro Deep Think. A key observation is that, in certain contexts (e.g., solving Sudoku puzzles), verifying responses can be substantially easier than generating them. This property, referred to as \emph{asymmetric verification}, highlights the strong potential of test-time scaling. In this work, we study both sequential and parallel test-time scaling of deep search agents, motivated by the intuition that verification in this setting is often much easier than generation. In experiments, we first show that sequential scaling methods, such as budget forcing, can be effective initially but eventually degrade performance when over-applied in agentic search. Due to asymmetric verification, however, we are able to achieve substantial improvements by allocating only a modest amount of compute to the verifier. We conduct experiments with flagship open-source models, including GLM-4.5, K2, Qwen3-2507 and Tongyi-DeepResearch, and extend them to their ``Heavy'' variants through test-time scaling. These deep research agents achieve improvements of up to 20 absolute points on benchmarks such as BrowseComp. Remarkably, as an open-source alternative, GLM-4.5 Heavy reaches accuracy of {\bf 54.0\%} on BrowseComp, {\bf 66.0\%} on GAIA, and {\bf 68.0\%} on xbench-DeepSearch, placing it on par with the best proprietary choices such as OpenAI Deep Research and o3. Tongyi-DeepResearch Heavy pushes performance even further, attaining {\bf 69.0\%} accuracy on BrowseComp.


Poster
P3-#1813
YuE: Scaling Open Foundation Models for Long-Form Music Generation

Ruibin Yuan ⋅ Hanfeng Lin ⋅ Shuyue Guo ⋅ Ge Zhang ⋅ Jiahao Pan ⋅ Yongyi Zang ⋅ Haohe Liu ⋅ Yiming Liang ⋅ Wenye Ma ⋅ Xingjian Du ⋅ Xeron Du ⋅ Zhen Ye ⋅ Tianyu Zheng ⋅ Zhengxuan Jiang ⋅ Yinghao MA ⋅ Minghao Liu ⋅ Zeyue Tian ⋅ Ziya Zhou ⋅ Liumeng Xue ⋅ Xingwei Qu ⋅ Yizhi Li ⋅ Shangda Wu ⋅ Tianhao Shen ⋅ Ziyang Ma ⋅ Jun Zhan ⋅ Chunhui Wang ⋅ Yatian Wang ⋅ Xiaowei Chi ⋅ Xinyue Zhang ⋅ Zhenzhu Yang ⋅ Xiangzhou Wang ⋅ Shansong Liu ⋅ Lingrui Mei ⋅ Peng Li ⋅ JUNJIE WANG ⋅ Jianwei Yu ⋅ Guojian Pang ⋅ Xu Li ⋅ Zihao Wang ⋅ Xiaohuan Zhou ⋅ Lijun Yu ⋅ Emmanouil Benetos ⋅ Yong Chen ⋅ Chenghua Lin ⋅ Xie Chen ⋅ Gus Xia ⋅ Zhaoxiang Zhang ⋅ Chao Zhang ⋅ Wenhu Chen ⋅ Xinyu Zhou ⋅ Xipeng Qiu ⋅ Roger Dannenberg ⋅ JIAHENG LIU ⋅ Jian Yang ⋅ Wenhao Huang ⋅ Wei Xue ⋅ Xu Tan ⋅ Yike Guo

We tackle the task of long-form music generation, particularly the challenging \textbf{lyrics-to-song} problem, by introducing \textbf{YuE (乐)}, a family of open-source music generation foundation models. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through \textbf{track-decoupled next-token prediction} to overcome dense mixture signals, and \textbf{structural progressive conditioning} for long-context lyrical alignment. In addition, we redesign the \textbf{in-context learning} technique for music generation, enabling bidirectional content creation, style cloning, and improving musicality. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some of the proprietary systems in musicality and vocal agility (as of 2025-01). We strongly encourage readers to \textbf{listen to our demo}\footnote{\url{https://map-yue.github.io/}}.

In autonomous driving, multi-agent collaborative perception enhances sensing capabilities by enabling agents to share perceptual data. A key challenge lies in handling heterogeneous features from agents equipped with different sensing modalities or model architectures, which complicates data fusion. Existing approaches often require retraining encoders or designing interpreter modules for pairwise feature alignment, but these solutions are not scalable in practice. To address this, we propose GT-Space, a flexible and scalable collaborative perception framework for heterogeneous agents. GT-Space constructs a common feature space from ground-truth labels, providing a unified reference for feature alignment. With this shared space, agents only need a single adapter module to project their features, eliminating the need for pairwise interactions with other agents. Furthermore, we design a fusion network trained with contrastive losses across diverse modality combinations. Extensive experiments on simulation datasets (OPV2V and V2XSet) and a real-world dataset (RCooper) demonstrate that GT-Space consistently outperforms baselines in detection accuracy while delivering robust performance. Our code will be released at https://github.com/KingScar/GT-Space.


Poster
P3-#1811
Variational Reasoning for Language Models

Xiangxin Zhou ⋅ Zichen Liu ⋅ Haonan Wang ⋅ Chao Du ⋅ Min Lin ⋅ Chongxuan Li ⋅ Liang Wang ⋅ Tianyu Pang

We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models.
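For orientation, the standard single-trace ELBO that such a framework starts from can be written as follows. The symbols are assumptions chosen for illustration (question $x$, answer $y$, thinking trace $z$, model $\pi_\theta$, variational posterior $q_\phi$); the abstract does not fix a notation, and the multi-trace and forward-KL objectives it mentions are not reproduced here.

```latex
\log \pi_\theta(y \mid x)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[ \log \pi_\theta(y \mid x, z) \right]
  \;-\;
  \mathrm{KL}\!\left( q_\phi(z \mid x, y) \,\middle\|\, \pi_\theta(z \mid x) \right)
```

The gap between the two sides is $\mathrm{KL}\left( q_\phi(z \mid x, y) \,\|\, \pi_\theta(z \mid x, y) \right)$, so the bound is tight exactly when the variational posterior matches the true posterior over traces.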


Poster
P3-#1810
Astraea: A Token-wise Acceleration Framework for Video Diffusion Transformers

Haosong Liu ⋅ Yuge Cheng ⋅ Wenxuan Miao ⋅ Zihan Liu ⋅ Aiyue Chen ⋅ Jing Lin ⋅ Yiwu Yao ⋅ Chen Chen ⋅ Jingwen Leng ⋅ Minyi Guo ⋅ Yu Feng

Video diffusion transformers (vDiTs) have made tremendous progress in text-to-video generation, but their high computational demands pose a major challenge for practical deployment. While existing studies propose acceleration methods to reduce workload at various granularities, they often rely on heuristics, limiting their applicability. We introduce Astraea, a framework that searches for near-optimal configurations for vDiT-based video generation with a performance target. At its core, Astraea proposes a lightweight token selection mechanism and a memory-efficient, GPU-parallel sparse attention strategy, enabling linear reductions in execution time with minimal impact on generation quality. Meanwhile, to determine optimal token reduction for different timesteps, we further design a search framework that leverages a classic evolutionary algorithm to automatically determine the distribution of the token budget effectively. Together, Astraea achieves up to 2.4x inference speedup on a single GPU with great scalability (up to 13.2x speedup on 8 GPUs) while retaining better video quality compared to the state-of-the-art methods (<0.5% loss on the VBench score compared to the baseline vDiT models).

Modern optimization faces a fundamental challenge: local gradient-based methods provide no global information about the landscape of the objective function $L$, often leading to suboptimal convergence and sensitivity to initialization. We introduce a novel optimization framework that leverages resurgence theory from complex analysis to extract global structural information from divergent asymptotic series. Our key insight is that the factorially divergent perturbative expansions of parameter space partition functions encode precise information about all critical objective function values in the landscape through their Borel transform singularities. The algorithm works by computing the statistical mechanical partition function $Z(g) = \int e^{-L(\theta)/g} d\theta$ for small coupling $g\ll 1$, extracting its asymptotic series coefficients, and identifying Borel plane singularities that correspond one-to-one with critical objective function values. These target values provide global guidance to local optimizers, enabling principled learning rate adaptation and escape from suboptimal regions. Unlike heuristic adaptive methods, targets are theoretically grounded in the geometry of the optimization landscape.
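As a reminder of the textbook objects involved, the Borel transform of a formal series and its Laplace-type resummation can be written as below. The coefficient name $a_n$ and the Borel variable $t$ are notation assumed here, and any normalization prefactor of $Z(g)$ is ignored; the abstract itself gives only the definition of $Z(g)$.

```latex
Z(g) \sim \sum_{n \ge 0} a_n \, g^{n} \quad (g \to 0^{+}),
\qquad
\mathcal{B}[Z](t) = \sum_{n \ge 0} \frac{a_n}{n!} \, t^{n},
\qquad
Z(g) = \frac{1}{g} \int_{0}^{\infty} e^{-t/g} \, \mathcal{B}[Z](t) \, dt .
```

The last identity holds term by term because $\int_{0}^{\infty} e^{-t/g} \, t^{n}/n! \, dt = g^{n+1}$. When $a_n$ grows factorially, the series for $Z(g)$ diverges while $\mathcal{B}[Z](t)$ has a finite radius of convergence, and the locations of its singularities in the $t$-plane are the quantities the abstract relates to critical values of $L$.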


Poster
P3-#1808
Product of Experts for Visual Generation

Yunzhi Zhang ⋅ Carson Murtuza-Lanier ⋅ Zizhang Li ⋅ Yilun Du ⋅ Jiajun Wu

Modern neural models capture rich priors and have complementary knowledge over shared data domains, e.g., images and videos. Integrating diverse knowledge from multiple sources, including visual generative models, visual language models, and sources with human-crafted knowledge such as graphics engines and physics simulators, remains under-explored. We propose a probabilistic framework that combines information from these heterogeneous models, where expert models jointly shape a product distribution over outputs. To sample from this product distribution for controllable image/video synthesis tasks, we introduce an annealed MCMC sampler in combination with SMC-style resampling to enable efficient inference-time model composition. Our framework empirically yields better controllability than monolithic methods and additionally provides flexible user interfaces for specifying visual generation goals.


Poster
P3-#1807
Gauge-invariant representation holonomy

Vasileios Sevetlidis ⋅ George Pavlidis

Deep networks learn internal representations whose geometry—how features bend, rotate, and evolve—affects both generalization and robustness. Existing similarity measures such as CKA or SVCCA capture pointwise overlap between activation sets, but miss how representations change along input paths. Two models may appear nearly identical under these metrics yet respond very differently to perturbations or adversarial stress. We introduce representation holonomy, a gauge-invariant statistic that measures this path dependence. Conceptually, holonomy quantifies the “twist” accumulated when features are parallel-transported around a small loop in input space: flat representations yield zero holonomy, while nonzero values reveal hidden curvature. Our estimator fixes gauge through global whitening, aligns neighborhoods using shared subspaces and rotation-only Procrustes, and embeds the result back to the full feature space. We prove invariance to orthogonal (and affine, post-whitening) transformations, establish a linear null for affine layers, and show that holonomy vanishes at small radii. Empirically, holonomy scales with loop radius and depth, separates models that appear similar under CKA, and correlates with adversarial and corruption robustness across training regimes. It also tracks training dynamics as features form and stabilize. Together, these results position representation holonomy as a practical and scalable diagnostic for probing the geometric structure of learned representations beyond pointwise similarity.


Poster
P3-#1806
Zebra-CoT: A Dataset for Interleaved Vision-Language Reasoning

Ang Li ⋅ Charles L. Wang ⋅ Deqing Fu ⋅ Kaiyu Yue ⋅ Zikui Cai ⋅ Wang Zhu ⋅ Ollie Liu ⋅ Peng Guo ⋅ Willie Neiswanger ⋅ Furong Huang ⋅ Tom Goldstein ⋅ Micah Goldblum

Humans often rely on visual aids, such as diagrams or sketches, when tackling complex problems. Teaching multimodal models to adopt similar strategies, a process known as Visual Chain of Thought (visual CoT), is much more difficult. The main challenges are: (1) weak performance of off-the-shelf visual CoT, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce Zebra-CoT, a diverse large-scale interleaved text-image reasoning dataset with 182,384 reasoning traces across 18 domains with over 50 distinct tasks. This dataset is specifically designed to train models to natively perform visual CoT. We emphasize four categories of tasks where sketching or visual reasoning is especially natural, spanning (a) scientific questions such as geometry, physics, and algorithms; (b) 2D visual reasoning tasks like visual search and jigsaw puzzles; (c) 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; and (d) visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on Zebra-CoT yields a +12\% improvement in our test-set accuracy and up to +13\% performance gains on standard VLM benchmarks. Similarly, fine-tuning Bagel-7B produces models capable of generating high-quality interleaved visual reasoning chains, underscoring Zebra-CoT's effectiveness in advancing multimodal reasoning.


Poster
P3-#1805
Distillation of Large Language Models via Concrete Score Matching

Yeongmin Kim ⋅ Donghyeok Shin ⋅ Mina Kang ⋅ Byeonghu Na ⋅ Il-chul Moon

Large language models (LLMs) deliver remarkable performance but are costly to deploy, motivating knowledge distillation (KD) for efficient inference. Existing KD objectives typically match student and teacher probabilities via softmax, which blurs valuable logit information. While direct logit distillation (DLD) mitigates softmax smoothing, it fails to account for logit shift invariance, thereby restricting the solution space. We propose Concrete Score Distillation (CSD), a discrete score-matching objective that overcomes both softmax-induced smoothing and restrictions on the optimal solution set. We resolve the training instability and quadratic complexity of discrete score-matching in autoregressive LLMs, and the resulting CSD objective aligns relative logit differences across all vocabulary pairs between student and teacher with flexible weighting. We provide both mode-seeking and mode-covering instances within our framework and evaluate CSD on task-agnostic instruction-following, task-specific, and general chat capability distillation using GPT-2-1.5B, OpenLLaMA-7B, Gemma-7B-IT, Qwen2.5-7B-IT, and Gemma2-9B-IT teachers. Experiments show that CSD consistently surpasses recent KD objectives, achieves favorable fidelity–diversity trade-offs, and yields complementary gains when combined with on-policy techniques, demonstrating its scalability and effectiveness for LLM distillation. Code: https://github.com/aailab-kaist/CSD.
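The logit shift invariance mentioned above can be checked on a three-token toy vocabulary. This is only an illustrative sketch: `pairwise_difference_loss` uses uniform weights over all pairs and is an assumed stand-in, not the CSD objective from the paper, and the function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def direct_logit_mse(student, teacher):
    """Mean squared error taken directly on raw logits."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def pairwise_difference_loss(student, teacher):
    """Mean squared mismatch of logit differences over all vocabulary pairs."""
    n = len(student)
    total = 0.0
    for i in range(n):
        for j in range(n):
            diff_s = student[i] - student[j]
            diff_t = teacher[i] - teacher[j]
            total += (diff_s - diff_t) ** 2
    return total / (n * n)


teacher = [2.0, 0.5, -1.0]
student = [z + 3.0 for z in teacher]  # same distribution, logits shifted by 3

same_distribution = all(
    abs(p - q) < 1e-12 for p, q in zip(softmax(student), softmax(teacher))
)
mse = direct_logit_mse(student, teacher)
pairwise = pairwise_difference_loss(student, teacher)
```

Here the student already reproduces the teacher's distribution, yet the direct logit loss is nonzero because of the constant shift, while the loss on pairwise differences vanishes. The double loop also makes the quadratic cost in vocabulary size visible.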


Poster
P3-#1804
PairFlow: Closed-Form Source-Target Coupling for Few-Step Generation in Discrete Flow Models

Mingue Park ⋅ Jisung Hwang ⋅ Seungwoo Yoo ⋅ Kyeongmin Yeo ⋅ Minhyuk Sung

We introduce $\texttt{PairFlow}$, a lightweight preprocessing step for training Discrete Flow Models (DFMs) to achieve few-step sampling without requiring a pretrained teacher. DFMs have recently emerged as a new class of generative models for discrete data, offering strong performance. However, they suffer from slow sampling due to their iterative nature. Existing acceleration methods largely depend on finetuning, which introduces substantial additional training overhead. $\texttt{PairFlow}$ addresses this issue with a lightweight preprocessing step. Inspired by ReFlow and its extension to DFMs, we train DFMs from coupled samples of source and target distributions, without requiring any pretrained teacher. At the core of our approach is a closed-form inversion for DFMs, which allows efficient construction of paired source–target samples. Despite its extremely low cost, taking only up to 1.7\% of the compute needed for full model training, $\texttt{PairFlow}$ matches or even surpasses the performance of two-stage training involving finetuning. Furthermore, models trained with our framework provide stronger base models for subsequent distillation, yielding further acceleration after finetuning. Experiments on molecular data as well as binary and RGB images demonstrate the broad applicability and effectiveness of our approach.


Poster
P3-#1803
Speculative Speculative Decoding

Tanishq Kumar ⋅ Tri Dao ⋅ Avner May

Autoregressive decoding is bottlenecked by its sequential nature. Speculative decoding has become a standard way to accelerate inference by using a fast draft model to predict upcoming tokens from a slower target model, and then verifying them in parallel with a single target model forward pass. However, speculative decoding itself relies on a sequential dependence between speculation and verification. We introduce speculative speculative decoding (SSD) to parallelize these operations. While a verification is ongoing, the draft model predicts likely verification outcomes and prepares speculations pre-emptively for them. If the actual verification outcome is then in the predicted set, a speculation can be returned immediately, eliminating drafting overhead entirely. We identify three key challenges presented by speculative speculative decoding, and suggest principled methods to solve each. The result is Saguaro, an optimized SSD algorithm. Our implementation is up to 2x faster than optimized speculative decoding baselines and up to 5x faster than autoregressive decoding with open source inference engines.
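The draft-then-verify loop that SSD builds on can be sketched for the greedy case. This illustrates ordinary speculative decoding only, not SSD or Saguaro; `target_next` and `draft_next` are hypothetical toy next-token functions over integer tokens, and the verification loop stands in for what would be a single parallel forward pass of the target model.

```python
def speculative_step(target_next, draft_next, prefix, k):
    """One round of greedy speculative decoding; returns the tokens to append."""
    # 1) Draft k tokens sequentially with the cheap model.
    drafted = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2) Verify: compare each drafted token with the target's greedy choice.
    accepted = []
    ctx = list(prefix)
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) All drafts accepted: the target pass also yields one extra token.
    accepted.append(target_next(ctx))
    return accepted


def target_next(ctx):
    return (ctx[-1] + 1) % 10  # toy target: count upward modulo 10


def draft_next(ctx):
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10  # wrong only after a 3


partial = speculative_step(target_next, draft_next, [1], 4)
full = speculative_step(target_next, draft_next, [5], 2)
```

Step 2 cannot start until step 1 finishes, and the next round of drafting cannot start until step 2 returns; that sequential dependence is what the abstract proposes to overlap.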


Poster
P3-#1802
OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

Mingxin Huang ⋅ Yongxin Shi ⋅ Dezhi Peng ⋅ Songxuan Lai ⋅ Zecheng Xie ⋅ Lianwen Jin

Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across various visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the absence of a dedicated and systematic benchmark. To address this gap, we propose OCR-Reasoning, a novel benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. Specifically, OCR-Reasoning comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Unlike existing text-rich image understanding benchmarks that only provide a final answer, this benchmark additionally provides a detailed step-by-step reasoning process. This dual annotation enables the evaluation of both the models' final answers and their reasoning processes, thereby offering a holistic assessment of text-rich reasoning capabilities. By leveraging this benchmark, we conducted a comprehensive evaluation of the latest MLLMs. Our results demonstrate that even the most advanced MLLMs exhibit substantial difficulties in text-rich image reasoning tasks, with none achieving an accuracy above 50\% on our benchmark, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.


Poster
P3-#1801
GraphPlanner: Graph Memory-Augmented Agentic Routing for Multi-Agent LLMs

Tao Feng ⋅ Haozhen Zhang ⋅ Zijie Lei ⋅ Peixuan Han ⋅ Jiaxuan You

LLM routing has achieved promising results in integrating the strengths of diverse models while balancing efficiency and performance. However, to support more realistic and challenging applications, routing must extend into agentic LLM settings, where task planning, multi-round cooperation among heterogeneous agents, and memory utilization are indispensable. To address this gap, we propose GraphPlanner, a heterogeneous graph memory-augmented agentic router for multi-agent LLMs that generates routing workflows for each query and supports both inductive and transductive inference. GraphPlanner formulates workflow generation as a Markov Decision Process (MDP), where at each step it selects both the LLM backbone and the agent role (Planner, Executor, Summarizer). By leveraging a heterogeneous graph, denoted as GARNet, to capture interaction memories among queries, agents, and responses, GraphPlanner integrates historical memory and workflow memory into richer state representations. The entire pipeline is optimized with reinforcement learning, jointly improving task-specific performance and computational efficiency. We evaluate GraphPlanner across 14 diverse LLM tasks and demonstrate that: (1) GraphPlanner outperforms strong single- and multi-round routers, improving accuracy by up to 9.3% while reducing GPU cost from 186.26 GiB to 1.04 GiB; (2) GraphPlanner generalizes robustly to unseen tasks and LLMs, exhibiting strong zero-shot capabilities; and (3) GraphPlanner effectively leverages historical memories, supporting both inductive and transductive inference for more adaptive routing. Our code for GraphPlanner is released at https://github.com/ulab-uiuc/GraphPlanner.


Poster
P3-#1901
Instance-wise Adaptive Scheduling via Derivative-Free Meta-Learning

Hefang Qing ⋅ Miao Zhang ⋅ Yaoxin Wu ⋅ Weinan Huang ⋅ Jianhao Yang ⋅ Wen Song ⋅ Gang Wang

Deep Reinforcement Learning has achieved remarkable progress in solving NP-hard scheduling problems. However, existing methods primarily focus on optimizing average performance over training instances, overlooking the core objective of solving each individual instance with high quality. While several instance-wise adaptation mechanisms have been proposed, they are test-time approaches only and cannot share knowledge across different adaptation tasks. Moreover, they largely rely on gradient-based optimization, which could be ineffective in dealing with combinatorial optimization problems. We address the above issues by proposing an instance-wise meta-learning framework. It trains a meta model to acquire a generalizable initialization that effectively guides per-instance adaptation during inference, and overcomes the limitations of gradient-based methods by leveraging a derivative-free optimization scheme that is fully GPU parallelizable. Experimental results on representative scheduling problems demonstrate that our method consistently outperforms existing learning-based scheduling methods and instance-wise adaptation mechanisms under various task sizes and distributions.


Poster
P3-#1902
Sharp asymptotic theory for Q-learning with \texttt{LD2Z} learning rate and its generalization

Soham Bonnerjee ⋅ Zhipeng Lou ⋅ Wei Biao Wu

Despite the sustained popularity of Q-learning as a practical tool for policy determination, a majority of relevant theoretical literature deals with either constant ($\eta_t\equiv \eta$) or polynomially decaying ($\eta_t = \eta t^{-\alpha}$) learning schedules. However, it is well known that these choices suffer from either persistent bias or prohibitively slow convergence. In contrast, the recently proposed linear decay to zero (\texttt{LD2Z}: $\eta_t=\eta(1-t/n)$) schedule has shown appreciable empirical performance, but its theoretical and statistical properties remain largely unexplored, especially in the Q-learning setting. We address this gap in the literature by first considering a general class of power-law decay to zero (\texttt{PD2Z}-$\nu$: $\eta_t=\eta(1-t/n)^{\nu}$). Proceeding step-by-step, we present a sharp non-asymptotic error bound for Q-learning with \texttt{PD2Z}-$\nu$ schedule, which then is used to derive a central limit theory for a new \textit{tail} Polyak-Ruppert averaging estimator. Finally, we also provide a novel time-uniform Gaussian approximation (also known as \textit{strong invariance principle}) for the partial sum process of Q-learning iterates, which facilitates bootstrap-based inference. All our theoretical results are complemented by extensive numerical experiments. Beyond being new theoretical and statistical contributions to the Q-learning literature, our results definitively establish that \texttt{LD2Z} and in general \texttt{PD2Z}-$\nu$ achieve a best-of-both-worlds property: they inherit the rapid decay from initialization (characteristic of constant step-sizes) while retaining the asymptotic convergence guarantees (characteristic of polynomially decaying schedules). This dual advantage explains the empirical success of \texttt{LD2Z} while providing practical guidelines for inference through our results.
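The schedules named in the abstract are simple enough to write out directly from their formulas. The function names and argument order below are illustrative choices; only the formulas come from the abstract.

```python
def constant_schedule(eta, t):
    """eta_t = eta for every step t."""
    return eta


def polynomial_schedule(eta, t, alpha):
    """eta_t = eta * t**(-alpha), defined for t >= 1."""
    return eta * t ** (-alpha)


def pd2z_schedule(eta, t, n, nu):
    """Power-law decay to zero: eta_t = eta * (1 - t/n)**nu for 0 <= t <= n."""
    return eta * (1.0 - t / n) ** nu


def ld2z_schedule(eta, t, n):
    """Linear decay to zero, i.e. PD2Z with nu = 1."""
    return pd2z_schedule(eta, t, n, 1.0)


start = ld2z_schedule(0.5, 0, 10)          # full step size at t = 0
halfway = ld2z_schedule(0.5, 5, 10)        # half the step size at t = n/2
end = ld2z_schedule(0.5, 10, 10)           # exactly zero at t = n
quadratic = pd2z_schedule(0.5, 5, 10, 2.0)
```

Unlike the constant and polynomial schedules, the decay-to-zero family depends on the horizon `n`, so the total number of iterations has to be fixed in advance.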


Poster
P3-#1903
PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs

Tengxuan Liu ⋅ Shiyao Li ⋅ Jiayi Yang ⋅ Tianchen Zhao ⋅ Feng Zhou ⋅ Xiaohui Song ⋅ Guohao Dai ⋅ Shengen Yan ⋅ Huazhong Yang ⋅ Yu Wang

Recently, significant progress has been made in developing reasoning-capable Large Language Models (LLMs) through long Chain-of-Thought (CoT) techniques. However, this long-CoT reasoning process imposes substantial memory overhead due to the large Key-Value (KV) Cache. Post-training KV Cache quantization has emerged as a promising compression technique and has been extensively studied in short-context scenarios. However, directly applying existing methods to long-CoT LLMs causes significant performance degradation for two reasons: (1) Large cumulative error: Existing methods fail to adequately leverage available memory, and they directly quantize the KV Cache during each decoding step, leading to large cumulative quantization error. (2) Short-context calibration: Due to Rotary Positional Embedding (RoPE), the use of short-context data during calibration fails to account for the distribution of less frequent channels in the Key Cache, resulting in performance loss. We propose Progressive Mixed-Precision KV Cache Quantization (PM-KVQ) for long-CoT LLMs to address these issues in two ways: (1) To reduce cumulative error, we design a progressive quantization strategy to gradually lower the bit-width of KV Cache in each block. Then, we propose block-wise memory allocation to assign a higher bit-width to more sensitive transformer blocks. (2) To increase the calibration length without additional overhead, we propose a new calibration strategy that applies positional interpolation to short calibration data to approximate the distribution of long-context data. Extensive experiments on 7B–70B long-CoT LLMs show that PM-KVQ improves reasoning benchmark performance by up to 8% over SOTA baselines under the same memory budget and achieves 2.73–5.18$\times$ throughput over the original 16-bit LLMs. Our code is available at https://github.com/thu-nics/PM-KVQ.
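To give a feel for why bit-width matters here, the sketch below applies plain uniform quantize-then-dequantize to a random stand-in for a cache tensor at several bit-widths. It is not the PM-KVQ algorithm: the helper `fake_quantize`, the per-tensor min/max scaling, and the random data are all assumptions made for illustration, and the abstract does not say when or how the bit-width is lowered.

```python
import numpy as np

def fake_quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Uniform asymmetric quantization to `bits` bits, then dequantization.

    Hypothetical helper: per-tensor min/max scaling, chosen only for illustration.
    """
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    q = np.round((x - lo) / scale)   # integer codes in [0, levels]
    return q * scale + lo            # back to floating point

rng = np.random.default_rng(0)
kv = rng.standard_normal(1024)       # stand-in for a slice of a KV cache

# Mean absolute reconstruction error at decreasing bit-widths.
errors = {bits: float(np.abs(kv - fake_quantize(kv, bits)).mean())
          for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

The reconstruction error grows as the bit-width shrinks. This is the cost that a schedule which lowers precision only gradually would aim to postpone.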


Poster
P3-#1905
Trion: FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of LLMs

Ionut-Vlad Modoranu ⋅ Mher Safaryan ⋅ Erik Schultheis ⋅ Maksim Riabinin ⋅ Artem Chumachenko ⋅ Dan Alistarh

Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to improve running time and reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD) or QR-decomposition. Applying these techniques individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple, two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces by using a predefined orthogonal matrix of the Discrete Cosine Transform (DCT). We dynamically select columns from the DCT matrix based on their alignment with the gradient of each layer. The effective projection matrices are obtained via a simple \texttt{matmul} with the DCT matrix in $O(n^3)$ time, followed by a lightweight sorting step to identify the most relevant basis vectors. For large layers, DCT can be computed via \texttt{Makhoul}'s $N$-point algorithm based on Fast Fourier Transform (FFT) in $O(n^2 \log(n))$ time, yielding speed-ups for low-end GPUs. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, obtaining an approach with rank-independent running time that matches the performance of costly SVD/QR-based methods while achieving faster runtime and reduced memory usage by up to $25\%$ across different model sizes. Our code is available at https://github.com/IST-DASLab/Trion.
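The two-step procedure described above (one matmul with a fixed DCT matrix, then a sort) can be sketched in a few lines of NumPy. This is a hedged illustration, not the authors' implementation: the alignment score (column norms of the gradient's DCT coefficients), the function names, and the synthetic gradient are assumptions, and the FFT-based fast path is omitted.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix; row k is the k-th cosine basis vector."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0, :] = np.sqrt(1.0 / n)   # rescale the constant row for orthonormality
    return C

def select_dct_projection(grad: np.ndarray, C: np.ndarray, rank: int) -> np.ndarray:
    """Pick the `rank` DCT basis vectors best aligned with `grad` (assumed score)."""
    coeffs = grad @ C.T                       # step 1: one matmul with the fixed basis
    scores = np.linalg.norm(coeffs, axis=0)   # energy of grad along each basis vector
    top = np.argsort(scores)[::-1][:rank]     # step 2: sort and keep the top `rank`
    return C.T[:, top]                        # (n, rank) projection matrix

rng = np.random.default_rng(0)
m, n, r = 8, 16, 4
C = dct_matrix(n)                             # computed once, reused for every step

# Synthetic gradient lying exactly in the span of four DCT basis vectors.
grad = rng.standard_normal((m, r)) @ C[[1, 3, 5, 7], :]

Q = select_dct_projection(grad, C, r)
low_rank = grad @ Q                           # (m, r) compressed gradient
recon = low_rank @ Q.T                        # back-projection to (m, n)
print("reconstruction error:", float(np.abs(recon - grad).max()))
```

Because the basis is fixed, only the selected column indices need to be stored per layer rather than a full projection matrix.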


Poster
P3-#1906
Distilling the Thought, Watermarking the Answer: A Principle Semantic Guided Watermark for Reasoning Large Language Models

Shuliang Liu ⋅ Xingyu Li ⋅ Hongyi Liu ⋅ Dong Fang ⋅ Bingchen Duan ⋅ Zheng Qi ⋅ Lingfeng Su ⋅ Xuming Hu

Reasoning Large Language Models (RLLMs) excelling in complex tasks present unique challenges for digital watermarking, as existing methods often disrupt logical coherence or incur high computational costs. Token-based watermarking techniques can corrupt the reasoning flow by applying pseudo-random biases, while semantic-aware approaches improve quality but introduce significant latency or require auxiliary models. This paper introduces ReasonMark, a novel watermarking framework specifically designed for reasoning-intensive LLMs. Our approach decouples generation into an undisturbed Thinking Phase and a watermarked Answering Phase. We propose a Criticality Score to identify semantically pivotal tokens from the reasoning trace, which are distilled into a Principal Semantic Vector (PSV). The PSV then guides a semantically-adaptive mechanism that modulates watermark strength based on token-PSV alignment, ensuring robustness without compromising logical integrity. Extensive experiments show ReasonMark surpasses state-of-the-art methods by reducing text Perplexity by 0.35, increasing translation BLEU score by 0.164, and raising mathematical accuracy by 0.67 points. These advancements are achieved alongside a 0.34% higher watermark detection AUC and stronger robustness to attacks, all with a negligible increase in latency. This work enables the traceable and trustworthy deployment of reasoning LLMs in real-world applications.


Poster
P3-#1907
Self-Improving Loops for Visual Robotic Planning

Calvin Luo ⋅ Zilai Zeng ⋅ Mingxi Jia ⋅ Yilun Du ⋅ Chen Sun

Video generative models trained on expert demonstrations have been utilized as performant text-conditioned visual planners for solving robotic tasks. However, generalization to unseen tasks remains a challenge. Whereas improved generalization may be facilitated by leveraging learned prior knowledge from additional pre-collected offline data sources, such as web-scale video datasets, in the era of experience we aim to design agents that can continuously improve in an online manner from self-collected behaviors. In this work we thus propose the Self-Improving Loops for Visual Robotic Planning (SILVR), where an in-domain video model iteratively updates itself on self-produced trajectories, and steadily improves its performance for a specified task of interest. We apply SILVR to a diverse suite of MetaWorld tasks, as well as two manipulation tasks on a real robot arm, and find that performance improvements continuously emerge over multiple iterations for novel tasks unseen during initial in-domain video model training. We demonstrate that SILVR is robust in the absence of human-provided ground-truth reward functions or expert-quality demonstrations, and is preferable to alternate approaches that utilize online experience in terms of performance and sample efficiency.


Poster
P3-#1908
Positional Encoding Field

Yunpeng Bai ⋅ Haoxiang Li ⋅ Qixing Huang

Diffusion Transformers (DiTs) have emerged as the dominant architecture for visual generation, powering state-of-the-art image and video models. By representing images as patch tokens with positional encodings (PEs), DiTs combine Transformer scalability with spatial and temporal inductive biases. In this work, we revisit how DiTs organize visual content and discover that patch tokens exhibit a surprising degree of independence: even when PEs are perturbed, DiTs still produce globally coherent outputs, indicating that spatial coherence is primarily governed by PEs. Motivated by this finding, we introduce the Positional Encoding Field (PE-Field), which extends positional encodings from the 2D plane to a structured 3D field. PE-Field incorporates depth-aware encodings for volumetric reasoning and hierarchical encodings for fine-grained sub-patch control, enabling DiTs to model geometry directly in 3D space. Our PE-Field–augmented DiT achieves state-of-the-art performance on single-image novel view synthesis and generalizes to controllable spatial image editing.


Poster
P3-#1909
ZeroGR: A Generalizable and Scalable Framework for Zero-Shot Generative Retrieval

Weiwei Sun ⋅ Keyi Kong ⋅ xinyu ma ⋅ Shuaiqiang Wang ⋅ Dawei Yin ⋅ Maarten de Rijke ⋅ Zhaochun Ren ⋅ Yiming Yang

Generative retrieval (GR) reformulates information retrieval (IR) by framing it as the generation of document identifiers (docids), thereby enabling end-to-end optimization and seamless integration with generative language models (LMs). Despite notable progress under supervised training, GR still struggles to generalize to zero-shot IR scenarios, which are prevalent in real-world applications. To tackle this challenge, we propose ZeroGR, a zero-shot generative retrieval framework that uses natural language instructions to extend GR across a wide range of IR tasks. Specifically, ZeroGR is composed of three key components: (i) an LM-based docid generator that unifies heterogeneous documents (e.g., text, tables, code) into semantically meaningful docids; (ii) an instruction-tuned query generator that generates diverse types of queries from natural language task descriptions to enhance corpus indexing; and (iii) a reverse annealing decoding strategy to balance precision and recall during docid generation. Furthermore, we introduce OpenInstIR, the most diverse open-source instructed retrieval dataset. We investigate the impact of instruction fine-tuning scale and find that performance consistently improves as the number of IR tasks encountered during training increases. Extensive experiments on the BEIR and MAIR benchmarks demonstrate that ZeroGR achieves competitive performance across a wide range of retrieval tasks, establishing a new state-of-the-art among GR methods. Our code is available at https://github.com/sunnweiwei/ZeroGR.


Poster
P3-#1911
Decoupled MeanFlow: Turning Flow Models into Flow Maps for Accelerated Sampling

Kyungmin Lee ⋅ Sihyun Yu ⋅ Jinwoo Shin

Denoising generative models, such as diffusion and flow-based models, produce high-quality samples but require many denoising steps due to discretization error. Flow maps, which estimate the average velocity between timesteps, mitigate this error and enable faster sampling. However, their training typically demands architectural changes that limit compatibility with pretrained flow models. We introduce \emph{Decoupled MeanFlow}, a simple decoding strategy that converts flow models into flow map models without architectural modifications. Our method conditions the final blocks of diffusion transformers on the subsequent timestep, allowing pretrained flow models to be directly repurposed as flow maps. Combined with enhanced training techniques, this design enables high-quality generation in as few as 1–4 steps. Notably, we find that training flow models and subsequently converting them is more efficient and effective than training flow maps from scratch. On ImageNet 256$\times$256 and 512$\times$512, our models attain 1-step FID of 2.16 and 2.12, respectively, surpassing prior art by a large margin. Furthermore, we achieve FID of 1.51 and 1.68 when increasing the steps to 4, which nearly matches the performance of flow models while delivering over 100$\times$ faster inference.
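The difference between stepping with an instantaneous velocity and stepping with an average velocity can be seen on a toy problem. The sketch below uses the scalar ODE $dx/dt = -x$, whose average velocity over $[t, r]$ is known in closed form; it is an illustrative assumption and stands in for the learned networks, which are not reproduced here.

```python
import math

def velocity(x: float, t: float) -> float:
    """Instantaneous velocity of the toy ODE dx/dt = -x."""
    return -x

def average_velocity(x: float, t: float, r: float) -> float:
    """Exact average velocity between t and r for this toy ODE."""
    dt = r - t
    return x * (math.exp(-dt) - 1.0) / dt

def euler_sample(x0: float, steps: int) -> float:
    """Integrate from t=0 to t=1 with Euler steps on the instantaneous velocity."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity(x, k * dt)
    return x

def flow_map_sample(x0: float, steps: int) -> float:
    """Integrate from t=0 to t=1 by jumping with the average velocity."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * average_velocity(x, t, t + dt)
    return x

exact = math.exp(-1.0)
print("exact        :", exact)
print("Euler, 1 step:", euler_sample(1.0, 1))
print("Euler, 4 step:", euler_sample(1.0, 4))
print("flow map, 1  :", flow_map_sample(1.0, 1))
```

On this toy problem a single average-velocity jump lands on the exact solution, while Euler integration carries discretization error that shrinks only as the step count grows.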


Poster
P3-#1912
Uncertainty as Feature Gaps: Epistemic Uncertainty Quantification of LLMs in Contextual Question-Answering

Yavuz Faruk Bakman ⋅ Sungmin Kang ⋅ Zhiqi Huang ⋅ Duygu Nur Yaldiz ⋅ Catarina Belém ⋅ Chenyang Zhu ⋅ Anoop Kumar ⋅ Alfy Samuel ⋅ Daben Liu ⋅ Salman Avestimehr ⋅ Sai Karimireddy

Uncertainty Quantification (UQ) research has primarily focused on closed-book factual question answering (QA), while contextual QA remains unexplored, despite its importance in real-world applications. In this work, we focus on UQ for the contextual QA task and propose a theoretically grounded approach to quantify \emph{epistemic uncertainty}. We begin by introducing a task-agnostic, token-level uncertainty measure defined as the cross-entropy between the predictive distribution of the given model and the unknown true distribution. By decomposing this measure, we isolate the epistemic component and approximate the true distribution by a perfectly prompted, idealized model. We then derive an upper bound for epistemic uncertainty and show that it can be interpreted as semantic feature gaps in the given model’s hidden representations relative to the ideal model. We further apply this generic framework to the contextual QA task and hypothesize that three features approximate this gap: \emph{context-reliance} (using the provided context rather than parametric knowledge), \emph{context comprehension} (extracting relevant information from context), and \emph{honesty} (avoiding intentional lies). Using a top-down interpretability approach, we extract these features by using only a small number of labeled samples and ensemble them to form a robust uncertainty score. Experiments on multiple QA benchmarks in both in-distribution and out-of-distribution settings show that our method substantially outperforms state-of-the-art unsupervised (sampling-free and sampling-based) and supervised UQ methods, achieving up to a 13-point PRR improvement while incurring a negligible inference overhead. The code is available at https://github.com/Ybakman/Feature-Gaps.

Recent advances in Multimodal Large Language Models (MLLMs) have significantly advanced video understanding tasks, yet challenges remain in efficiently compressing visual tokens while preserving spatiotemporal interactions. Existing methods, such as the LLaVA family, utilize simplistic pooling or interpolation techniques that overlook the intricate dynamics of visual tokens. To bridge this gap, we propose ST-GridPool, a novel training-free visual token enhancement method designed specifically for Video LLMs. Our approach integrates Pyramid Temporal Gridding (PTG), which captures multi-grained spatiotemporal interactions through hierarchical temporal gridding, and Norm-based Spatial Pooling (NSP), which preserves high-information visual regions by leveraging the correlation between token norms and semantic richness. Extensive experiments on various benchmarks demonstrate that ST-GridPool consistently enhances performance of Video LLMs without requiring costly retraining. Our method offers an efficient and plug-and-play solution for improving visual token representations. Our code is available at https://github.com/bingjunluo/ST-GridPool.


Poster
P3-#1914
VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery

Nonghai Zhang ⋅ Zeyu Zhang ⋅ Jiazi Wang ⋅ Yang Zhao ⋅ Hao Tang

Vision-Language Models (VLMs) have achieved significant progress in multimodal understanding tasks, demonstrating strong capabilities particularly in general tasks such as image captioning and visual reasoning. However, when dealing with specialized cultural heritage domains like 3D vase artifacts, existing models face severe data scarcity and insufficient domain knowledge. Due to the lack of targeted training data, current VLMs struggle to effectively handle such culturally significant specialized tasks. To address these challenges, we propose the VaseVQA-3D dataset, which serves as the first 3D visual question answering dataset for ancient Greek pottery analysis, collecting 664 ancient Greek vase 3D models with corresponding question-answer data and establishing a complete data construction pipeline. We further develop the VaseVLM model, enhancing model performance in vase artifact analysis through domain-adaptive training. Experimental results validate the effectiveness of our approach: our VaseVLM-7B-RL achieves 12.8\% improvement in R@1 accuracy and 6.6\% improvement in lexical similarity compared to the strongest baselines on the VaseVQA-3D dataset. These gains significantly improve the recognition and understanding of 3D vase artifacts and provide new technical pathways for digital heritage preservation research.


Poster
P3-#1915
Reducing Class-Wise Performance Disparity via Margin Regularization

Beier Zhu ⋅ Kesen Zhao ⋅ Jiequan Cui ⋅ Qianru Sun ⋅ Yuan Zhou ⋅ Xun Yang ⋅ Hanwang Zhang

Deep neural networks often exhibit substantial disparities in class-wise accuracy, even when trained on class-balanced data, posing concerns for reliable deployment. While prior efforts have explored empirical remedies, a theoretical understanding of such performance disparities in classification remains limited. In this work, we present Margin Regularization for performance disparity Reduction ($MR^2$), a theoretically principled regularization for classification that dynamically adjusts margins in both the logit and representation spaces. Our analysis establishes a novel margin-based, class-sensitive generalization bound that reveals how per-class feature variability contributes to error, motivating the use of larger margins for "hard" classes. Guided by this insight, $MR^2$ optimizes per-class logit margins proportional to feature spread and penalizes excessive representation margins to enhance intra-class compactness. Experiments on seven datasets (including ImageNet) and diverse pre-trained backbones (MAE, MoCov2, CLIP) demonstrate that $MR^2$ not only improves overall accuracy but also significantly boosts "hard" class performance without trading off "easy" classes, thus reducing the performance disparities. Code is available at https://github.com/BeierZhu/MR2.


Poster
P3-#1916
DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains

Tian Liang ⋅ Wenxiang Jiao ⋅ Zhiwei He ⋅ Jiahao Xu ⋅ Haitao Mi ⋅ Dong Yu

Large Reasoning Models (LRMs) have demonstrated impressive capabilities but suffer from cognitive inefficiencies like "overthinking" simple problems and "underthinking" complex ones. While existing methods that use supervised fine-tuning (SFT) or reinforcement learning (RL) with token-length rewards can improve efficiency, they often do so at the cost of accuracy. This paper introduces DeepCompress, a novel framework that simultaneously enhances both the accuracy and efficiency of LRMs. We challenge the prevailing approach of consistently favoring shorter reasoning paths, showing that longer responses can contain a broader range of correct solutions for difficult problems. DeepCompress employs an adaptive length reward mechanism that dynamically classifies problems as "Simple" or "Hard" in real time based on the model's evolving capability. It encourages shorter, more efficient reasoning for "Simple" problems while promoting longer, more exploratory thought chains for "Hard" problems. This dual-reward strategy enables the model to autonomously adjust its Chain-of-Thought (CoT) length, compressing reasoning for well-mastered problems and extending it for those it finds challenging. Experimental results on challenging mathematical benchmarks show that DeepCompress consistently outperforms baseline methods, achieving superior accuracy while significantly improving token efficiency.


Poster
P3-#1917
Language Confusion Gate: Language-Aware Decoding Through Model Self-Distillation

Collin Zhang ⋅ Fei Huang ⋅ Chenhan Yuan ⋅ Junyang Lin

Large language models (LLMs) often experience language confusion, which is the unintended mixing of languages during text generation. Current solutions to this problem either necessitate model retraining or cannot differentiate between harmful confusion and acceptable code-switching. This paper introduces the \textbf{Language Confusion Gate} (LCG), a lightweight, plug-in solution that filters tokens during decoding without altering the base LLM. The LCG is trained using norm-adjusted self-distillation to predict appropriate language families and apply masking only when needed. Our method is based on the findings that language confusion is infrequent, correct-language tokens are usually among the top predictions, and output token embedding norms are larger for high-resource languages, which biases sampling. When evaluated across various models, including Qwen3, GPT-OSS, Gemma3, and Llama3.1, LCG decreases language confusion significantly, often by an order of magnitude, without negatively impacting task performance.


Poster
P3-#1918
Empowering Multi-Robot Cooperation via Sequential World Models

Zijie Zhao ⋅ Honglei Guo ⋅ Shengqian Chen ⋅ Kaixuan Xu ⋅ Bo Jiang ⋅ Yuanheng Zhu ⋅ Dongbin Zhao

Model-based reinforcement learning (MBRL) has achieved remarkable success in robotics due to its high sample efficiency and planning capability. However, extending MBRL to physical multi-robot cooperation remains challenging due to the complexity of joint dynamics. To address this challenge, we propose the Sequential World Model (SeqWM), a novel framework that integrates the sequential paradigm into multi-robot MBRL. SeqWM employs independent, autoregressive agent-wise world models to represent joint dynamics, where each agent generates its future trajectory and plans its actions based on the predictions of its predecessors. This design lowers modeling complexity and enables the emergence of advanced cooperative behaviors through explicit intention sharing. Experiments on Bi-DexHands and Multi-Quadruped demonstrate that SeqWM outperforms existing state-of-the-art model-based and model-free baselines in both overall performance and sample efficiency, while exhibiting advanced cooperative behaviors such as predictive adaptation, temporal alignment, and role division. Furthermore, SeqWM has been successfully deployed on physical quadruped robots, validating its effectiveness in real-world multi-robot systems. Demos and code are available at: https://github.com/zhaozijie2022/seqwm


Poster
P3-#1919
YoNoSplat: You Only Need One Model for Feedforward 3D Gaussian Splatting

Botao Ye ⋅ Boqi Chen ⋅ Haofei Xu ⋅ Daniel Barath ⋅ Marc Pollefeys

Fast and flexible 3D scene reconstruction from unstructured image collections remains a significant challenge. We present YoNoSplat, a feedforward model that reconstructs high-quality 3D Gaussian Splatting representations from an arbitrary number of images. Our model is highly versatile, operating effectively with both posed and unposed, calibrated and uncalibrated inputs. YoNoSplat predicts local Gaussians and camera poses for each view, which are aggregated into a global representation using either predicted or provided poses. To overcome the inherent difficulty of jointly learning 3D Gaussians and camera parameters, we introduce a novel mixing training strategy. This approach mitigates the entanglement between the two tasks by initially using ground-truth poses to aggregate local Gaussians and gradually transitioning to a mix of predicted and ground-truth poses, which prevents both training instability and exposure bias. We further resolve the scale ambiguity problem by a novel pairwise camera-distance normalization scheme and by embedding camera intrinsics into the network. Moreover, YoNoSplat also predicts intrinsic parameters, making it feasible for uncalibrated inputs. YoNoSplat demonstrates exceptional efficiency, reconstructing a scene from 100 views (at 280×518 resolution) in just 2.69 seconds on an NVIDIA GH200 GPU. It achieves state-of-the-art performance on standard benchmarks in both pose-free and pose-dependent settings. Our project page is at https://botaoye.github.io/yonosplat/.


Poster
P3-#1920
Distributional Machine Unlearning via Selective Data Removal

Youssef Allouah ⋅ Rachid Guerraoui ⋅ Sanmi Koyejo

Machine learning systems increasingly face requirements to remove entire domains of information—such as toxic language or biases—rather than individual user data. This task presents a dilemma: full removal of the unwanted domain data is computationally expensive, while random partial removal is statistically inefficient. We find that a domain's statistical influence is often concentrated in a small subset of its data samples, suggesting a path between ineffective partial removal and unnecessary complete removal. We formalize this as distributional unlearning: a framework to select a small subset that balances forgetting an unwanted distribution while preserving a desired one. Using Kullback-Leibler divergence constraints, we derive the exact removal-preservation Pareto frontier for Gaussian distributions and prove that models trained on the edited data achieve corresponding log-loss bounds. We propose a distance-based selection algorithm and show it is quadratically more sample-efficient than random removal in the challenging low-divergence regime. Experiments across synthetic, text, and image datasets (Jigsaw, CIFAR-10, SMS spam) show our method requires 15–82% less deletion than full removal for strong unlearning effects, e.g., halving initial forget set accuracy. Ultimately, by showing a small forget set often suffices, our framework lays the foundations for more scalable and rigorous subpopulation unlearning.


Poster
P3-#1921
Teach2Eval: An Interaction-Driven LLMs Evaluation Method via Teaching Effectiveness

Yuhang Zhou ⋅ Xutian Chen ⋅ Yixin Cao ⋅ Yuchen Ni ⋅ Yu He ⋅ Siyu Tian ⋅ Xiang Liu ⋅ Yunwen Chen ⋅ Guangnan Ye ⋅ Xipeng Qiu ⋅ Hongfeng Chai

Recent progress in large language models (LLMs) has outpaced the development of effective evaluation methods. Evaluating LLMs with static, task-specific benchmarks is increasingly fragile due to contamination and saturation, and it fails to capture interactive reasoning. We introduce Teach2Eval, which reframes evaluation as teaching: a candidate model guides weaker students, and the students’ gains constitute the score. This interaction yields robustness to contamination and exposes orthogonal abilities with fine-grained metrics across Application, Judgment, Guidance, and Reflection. The framework scales automatically by exploiting natural error distributions from weak students, requiring neither bespoke rubrics nor human graders. Across 33 LLMs and 60 datasets, Teach2Eval achieves a Spearman correlation above 0.97 with human-preference leaderboards (e.g., Chatbot Arena/LiveBench), surpassing direct baselines, while offering actionable training signals (capability hierarchies, early overfitting) at low cost. We open-source our code and data at https://github.com/zhiqix/Teach2Eval.
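Agreement between an evaluation score and a leaderboard is typically measured with Spearman rank correlation, which compares orderings rather than raw values. The sketch below computes it for tie-free data; the function `spearman` and the score values are hypothetical and are not taken from the paper.

```python
import numpy as np

def spearman(a, b) -> float:
    """Spearman rank correlation for data without ties.

    Ranks are obtained by a double argsort, then correlated with Pearson's r.
    """
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Hypothetical scores for five models (illustrative numbers only).
teaching_scores = [0.90, 0.70, 0.80, 0.40, 0.20]
leaderboard_ratings = [1300, 1200, 1250, 1100, 1000]

rho = spearman(teaching_scores, leaderboard_ratings)
print("Spearman rho:", rho)
```

A value near 1 means the two methods order the models almost identically, even if their raw scores are on different scales.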


Poster
P3-#1922
PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework

Sixiang Chen ⋅ Jianyu Lai ⋅ Jialin Gao ⋅ Tian Ye ⋅ Haoyu Chen ⋅ Hengyu Shi ⋅ Shitong Shao ⋅ Yunlong Lin ⋅ Song Fei ⋅ Zhaohu Xing ⋅ Yeying Jin ⋅ Junfeng Luo ⋅ Xiaoming Wei ⋅ Lei Zhu

Generating aesthetic posters is more challenging than simple design images: it requires not only precise text rendering but also the seamless integration of abstract artistic content, striking layouts, and overall stylistic harmony. To address this, we propose PosterCraft, a unified framework that abandons prior modular pipelines and rigid, predefined layouts, allowing the model to freely explore coherent, visually compelling compositions. PosterCraft employs a carefully designed, cascaded workflow to optimize the generation of high-aesthetic posters: (i) large-scale text-rendering optimization on our newly introduced Text-Render-2M dataset; (ii) region-aware supervised finetuning on HQ-Poster-100K; (iii) aesthetic-text reinforcement learning via best-of-n preference optimization; and (iv) joint vision–language feedback refinement. Each stage is supported by a fully automated data-construction pipeline tailored to its specific needs, enabling robust training without complex architectural modifications. Across multiple experiments, PosterCraft significantly outperforms open-source baselines in rendering accuracy, layout coherence, and overall visual appeal, approaching the quality of SOTA commercial systems.


Poster
P3-#1923
Beyond Speedup - Utilizing KV Cache for Sampling and Reasoning

Zeyu XING ⋅ Xing Li ⋅ Huiling Zhen ⋅ Mingxuan Yuan ⋅ Sinno Jialin Pan

KV caches, typically used only to speed up autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightweight representation, eliminating the need to recompute or store full hidden states. Despite being weaker than dedicated embeddings, KV-derived representations are shown to be sufficient for two key applications: (i) Chain-of-Embedding, where they achieve competitive or superior performance on Llama-3.1-8B-Instruct and Qwen2-7B-Instruct; and (ii) Fast/Slow Thinking Switching, where they enable adaptive reasoning on Qwen3-8B and DeepSeek-R1-Distill-Qwen-14B, reducing token generation by up to $5.7\times$ with minimal accuracy loss. Our findings establish KV caches as a free, effective substrate for sampling and reasoning, opening new directions for representation reuse in LLM inference.


Poster
P3-#1924
Conditionally Whitened Generative Models for Probabilistic Time Series Forecasting

Yanfeng Yang ⋅ Siwei Chen ⋅ Pingping Hu ⋅ Zhaotong Shen ⋅ Yingjie Zhang ⋅ Zhuoran Sun ⋅ Shuai Li ⋅ Ziqi Chen ⋅ Kenji Fukumizu

Probabilistic forecasting of multivariate time series is challenging due to non-stationarity, inter-variable dependencies, and distribution shifts. While recent diffusion and flow matching models have shown promise, they often ignore informative priors such as conditional means and covariances. In this work, we propose Conditionally Whitened Generative Models (CW-Gen), a framework that incorporates prior information through conditional whitening. Theoretically, we establish sufficient conditions under which replacing the traditional terminal distribution of diffusion models, namely the standard multivariate normal, with a multivariate normal distribution parameterized by estimators of the conditional mean and covariance improves sample quality. Guided by this analysis, we design a novel Joint Mean-Covariance Estimator (JMCE) that simultaneously learns the conditional mean and sliding-window covariance. Building on JMCE, we introduce Conditionally Whitened Diffusion Models (CW-Diff) and extend them to Conditionally Whitened Flow Matching (CW-Flow). Experiments on five real-world datasets with six state-of-the-art generative models demonstrate that CW-Gen consistently enhances predictive performance, capturing non-stationary dynamics and inter-variable correlations more effectively than prior-free approaches. Empirical results further demonstrate that CW-Gen can effectively mitigate the effects of distribution shift.
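Whitening with respect to a mean and covariance can be illustrated with a standard Cholesky-based transform: data distributed as $\mathcal{N}(\mu, \Sigma)$ is mapped to approximately $\mathcal{N}(0, I)$ and back. This is a minimal sketch under the assumption that $\mu$ and $\Sigma$ are already available; in the abstract they come from a learned estimator (JMCE), which is not reproduced here, and the function names are illustrative.

```python
import numpy as np

def whiten(x: np.ndarray, mean: np.ndarray, cov: np.ndarray) -> np.ndarray:
    """Map rows of x to z = L^{-1} (x - mean), where cov = L L^T (Cholesky)."""
    L = np.linalg.cholesky(cov)
    return np.linalg.solve(L, (x - mean).T).T

def unwhiten(z: np.ndarray, mean: np.ndarray, cov: np.ndarray) -> np.ndarray:
    """Inverse map: x = L z + mean, applied row-wise."""
    L = np.linalg.cholesky(cov)
    return z @ L.T + mean

rng = np.random.default_rng(0)
mean = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])

x = rng.multivariate_normal(mean, cov, size=20000)  # correlated, shifted samples
z = whiten(x, mean, cov)                            # approximately N(0, I)
x_back = unwhiten(z, mean, cov)

print("whitened mean:", np.round(z.mean(axis=0), 3))
print("whitened cov :\n", np.round(np.cov(z, rowvar=False), 3))
```

A generative model trained on the whitened variable only has to capture what the mean and covariance estimates leave unexplained.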


Poster
P3-#2023
WorldEdit: Towards Open-World Image Editing with a Knowledge-Informed Benchmark

Wang Lin ⋅ Feng Wang ⋅ Majun Zhang ⋅ Wentao Hu ⋅ Tao Jin ⋅ Zhou Zhao ⋅ Fei Wu ⋅ Jingyuan Chen ⋅ Sucheng Ren ⋅ Alan Yuille

Recent advances in image editing models have demonstrated remarkable capabilities in executing explicit instructions, such as attribute manipulation, style transfer, and pose synthesis. However, these models often face challenges when dealing with implicit editing instructions, which describe the cause of a visual change without explicitly detailing the resulting outcome. These limitations arise because existing models rely on uniform editing strategies that are not equipped to handle the complex world knowledge and reasoning required for implicit instructions. To address this gap, we introduce WorldEdit, a dataset specifically designed to enable world-driven image editing. WorldEdit consists of high-quality editing samples, guided by paraphrased instructions that align with real-world causal logic. Furthermore, we provide WorldEdit-Test for evaluating existing models' performance on causal editing scenarios. With WorldEdit, we use a two-stage training framework for fine-tuning models like Bagel, integrating with a causal verification reward. Our results show that the proposed dataset and methods significantly narrow the gap with GPT-4o and Nano-Banana, demonstrating competitive performance not only in instruction following but also in knowledge plausibility, where many open-source systems typically struggle.


Poster
P3-#2022
Multiple Token Divergence: Measuring and Steering In-Context Computation Density

Vincent Herrmann ⋅ Eric Alcaide ⋅ Michael Wand ⋅ Jürgen Schmidhuber

Measuring the in-context computational effort of language models is a key challenge, as metrics like next-token loss fail to capture reasoning complexity. Prior methods based on latent state compressibility can be invasive and unstable. We propose Multiple Token Divergence (MTD), a simple measure of computational effort defined as the KL divergence between a model's full output distribution and that of a shallow, auxiliary prediction head. MTD can be computed directly from pre-trained models with multiple prediction heads, requiring no additional training. Building on this, we introduce Divergence Steering, a novel decoding method to control the computational character of generated text. We empirically show that MTD is more effective than prior methods at distinguishing complex tasks from simple ones. On mathematical reasoning benchmarks, MTD correlates positively with problem difficulty. Lower MTD is associated with more accurate reasoning. MTD provides a practical, lightweight tool for analyzing and steering the computational dynamics of language models.
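
To make the definition above concrete, here is a small sketch that computes a KL divergence between two next-token distributions. The function name `kl_divergence`, the toy probabilities, and the direction of the divergence (full model relative to the shallow head) are assumptions for illustration; the abstract does not state these details.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions given as equal-length lists."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            # Clamp q to avoid division by zero when the shallow head assigns no mass.
            total += pi * math.log(pi / max(qi, eps))
    return total

# Made-up next-token distributions over a 4-token vocabulary.
full_model = [0.70, 0.20, 0.05, 0.05]
shallow_head = [0.40, 0.30, 0.20, 0.10]
mtd_like_score = kl_divergence(full_model, shallow_head)
```

A score near zero would mean the shallow head already predicts what the full model predicts; a larger score means the deeper computation changed the prediction more.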


Poster
P3-#2021
Robustness in Text-Attributed Graph Learning: Insights, Trade-offs, and New Defenses

Runlin Lei ⋅ Lu Yi ⋅ Mingguo He ⋅ Pengyu Qiu ⋅ Zhewei Wei ⋅ Yongchao Liu ⋅ Chuntao Hong

While Graph Neural Networks (GNNs) and Large Language Models (LLMs) are powerful approaches for learning on Text-Attributed Graphs (TAGs), a comprehensive understanding of their robustness remains elusive. Current evaluations are fragmented, failing to systematically investigate the distinct effects of textual and structural perturbations across diverse models and attack scenarios. To address these limitations, we introduce a unified and comprehensive framework to evaluate robustness in TAG learning. Our framework evaluates classical GNNs, robust GNNs (RGNNs), and GraphLLMs across ten datasets from four domains, under diverse text-based, structure-based, and hybrid perturbations in both poisoning and evasion scenarios. Our extensive analysis reveals multiple findings, among which three are particularly noteworthy: 1) models have inherent robustness trade-offs between text and structure, 2) the performance of GNNs and RGNNs depends heavily on the text encoder and attack type, and 3) GraphLLMs are particularly vulnerable to training data corruption. To overcome the identified trade-offs, we introduce SFT-auto, a novel framework that delivers superior and balanced robustness against both textual and structural attacks within a single model. Our work establishes a foundation for future research on TAG security and offers practical solutions for robust TAG learning in adversarial environments. Our code is available at: https://github.com/Leirunlin/TGRB.


Poster
P3-#2020
ReLi3D: Relightable Multi-view 3D Reconstruction with Disentangled Illumination

Jan-Niklas Dihlmann ⋅ Mark Boss ⋅ Simon Donné ⋅ Andreas Engelhardt ⋅ Hendrik Lensch ⋅ Varun Jampani

Reconstructing 3D assets from images has long required separate pipelines for geometry reconstruction, material estimation, and illumination recovery, each with distinct limitations and computational overhead. We present MIDR-3D, the first unified end-to-end pipeline that simultaneously reconstructs complete 3D geometry, spatially-varying physically-based materials, and environment illumination from sparse multi-view images in under one second. Our key insight is that multi-view constraints can dramatically improve material and illumination disentanglement, a problem that remains fundamentally ill-posed for single-image methods. Key to our approach is the fusion of the multi-view input via a transformer cross-conditioning architecture, followed by a novel unified two-path prediction strategy. The first path predicts the object’s structure and appearance, while the second path predicts the environment illumination from the image background or object reflections. This, combined with a differentiable Monte Carlo multiple importance sampling renderer, creates an optimal illumination disentanglement training pipeline. Further, with our mixed-domain training protocol, which combines synthetic PBR datasets with real-world RGB captures, we establish generalizable results across geometry, material accuracy, and illumination quality. By unifying previously separate reconstruction tasks into a single feed-forward pass, we enable near-instantaneous generation of complete, relightable 3D assets.


Poster
P3-#2019
NAB: Neural Adaptive Binning for Sparse-View CT reconstruction

Wangduo Xie ⋅ Matthew Blaschko

Computed Tomography (CT) plays a vital role in inspecting the internal structures of industrial objects. Furthermore, achieving high-quality CT reconstruction from sparse views is essential for reducing production costs. While classic implicit neural networks have shown promising results for sparse reconstruction, they are unable to leverage shape priors of objects. Motivated by the observation that numerous industrial objects exhibit rectangular structures, we propose a novel Neural Adaptive Binning (NAB) method that effectively integrates rectangular priors into the reconstruction process. Specifically, our approach first maps coordinate space into a binned vector space. This mapping relies on an innovative binning mechanism based on differences between shifted hyperbolic tangent functions, with our extension enabling rotations around the input-plane normal vector. The resulting representations are then processed by a neural network to predict CT attenuation coefficients. This design enables end-to-end optimization of the encoding parameters (position, size, steepness, and rotation) via gradient flow from the projection data, thus enhancing reconstruction accuracy. By adjusting the smoothness of the binning function, NAB can generalize to objects with more complex geometries. This research provides a new perspective on integrating shape priors into neural network-based reconstruction. Extensive experiments demonstrate that NAB achieves superior performance on two industrial datasets. It also remains robust on medical datasets when the binning function is extended to a more general expression. The code is available at https://github.com/Wangduo-Xie/NABCTreconstruction.
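
The binning mechanism mentioned above can be illustrated in one dimension: the difference of two shifted tanh curves gives a smooth, differentiable indicator of an interval. This is only a sketch under assumptions; the function name `soft_bin`, the 0.5 scaling, and the parameter names are hypothetical, and the rotation extension from the abstract is omitted.

```python
import math

def soft_bin(x, lo, hi, steepness):
    """Smooth indicator of lo <= x <= hi built from two shifted tanh curves.

    Larger steepness gives sharper edges; smaller steepness gives a smoother bump.
    """
    rising = math.tanh(steepness * (x - lo))   # switches from -1 to +1 near lo
    falling = math.tanh(steepness * (x - hi))  # switches from -1 to +1 near hi
    return 0.5 * (rising - falling)

inside = soft_bin(0.5, lo=0.0, hi=1.0, steepness=50.0)
outside = soft_bin(2.0, lo=0.0, hi=1.0, steepness=50.0)
edge = soft_bin(0.0, lo=0.0, hi=1.0, steepness=50.0)
```

Because the output is differentiable in `lo`, `hi`, and `steepness`, bin position, size, and sharpness could in principle be tuned by gradient descent, which matches the end-to-end optimization the abstract describes.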


Poster
P3-#2018
GIT-BO: High-Dimensional Bayesian Optimization with Tabular Foundation Models

Rosen Yu ⋅ Cyril Picard ⋅ Faez Ahmed

Bayesian optimization (BO) struggles in high dimensions, where Gaussian-process surrogates demand heavy retraining and brittle assumptions, slowing progress on real engineering and design problems. We introduce GIT-BO, a Gradient-Informed BO framework that couples TabPFN v2, a tabular foundation model that performs zero-shot Bayesian inference in context, with an active-subspace mechanism computed from the model’s own predictive-mean gradients. This aligns exploration to an intrinsic low-dimensional subspace via a Fisher-information estimate and selects queries with a UCB acquisition, requiring no online retraining. Across 60 problem variants spanning 20 benchmarks—nine scalable synthetic families and ten real-world tasks (e.g., power systems, Rover, MOPTA08, Mazda)—up to 500 dimensions, GIT-BO delivers a stronger performance–time trade-off than state-of-the-art GP-based methods (SAASBO, TuRBO, Vanilla BO, BAxUS), ranking highest in performance and with runtime advantages that grow with dimensionality. Limitations include memory footprint and dependence on the capacity of the underlying TFM.


Poster
P3-#2017
GenSR: Symbolic regression based on equation generative space

Qian Li ⋅ Yuxiao Hu ⋅ Juncheng Liu ⋅ Yuntian Chen

Symbolic Regression (SR) tries to reveal the hidden equations behind observed data. However, most methods search within a discrete equation space, where the structural modifications of equations rarely align with their numerical behavior, leaving fitting error feedback too noisy to guide exploration. To address this challenge, we propose GenSR, a generative latent space–based SR framework following the "map construction $\rightarrow$ coarse localization $\rightarrow$ fine search" paradigm. Specifically, GenSR first pretrains a dual-branch Conditional Variational Autoencoder (CVAE) to reparameterize symbolic equations into a generative latent space with symbolic continuity and local numerical smoothness. This space can be regarded as a well-structured "map" of the equation space, providing directional signals for search. At inference, the CVAE coarsely localizes the input data to promising regions in the latent space. Then, a modified CMA-ES refines the candidate region, leveraging smooth latent gradients. From a Bayesian perspective, GenSR reframes the SR task as maximizing the conditional distribution $p({\rm Equ.}|{\rm Num.})$, with CVAE training achieving this objective through the Evidence Lower Bound (ELBO). This new perspective provides a theoretical guarantee for the effectiveness of GenSR. Extensive experiments show that GenSR jointly optimizes predictive accuracy, expression simplicity, and computational efficiency, while remaining robust under noise.


Poster
P3-#2016
Unleashing Guidance Without Classifiers for Human-Object Interaction Animation

Ziyin Wang ⋅ Sirui Xu ⋅ chuan guo ⋅ Bing Zhou ⋅ Jiangshan Gong ⋅ Jian Wang ⋅ Yu-Xiong Wang ⋅ Liang-Yan Gui

Generating realistic human-object interaction (HOI) animations remains challenging because it requires jointly modeling dynamic human actions and diverse object geometries. Prior diffusion-based approaches often rely on handcrafted contact priors or human-imposed kinematic constraints to improve contact quality. We propose a data-driven alternative in which guidance emerges from the denoising pace itself, reducing dependence on manually designed priors. Building on diffusion forcing, we factor the representation into modality-specific components and assign individualized noise levels with asynchronous denoising schedules. In this paradigm, cleaner components guide noisier ones through cross-attention, yielding guidance without auxiliary classifiers. We find that this data-driven guidance is inherently contact-aware, and can be further enhanced when training is augmented with a broad spectrum of synthetic object geometries, encouraging invariance of contact semantics to geometric diversity. Extensive experiments show that pace-induced guidance more effectively mirrors the benefits of contact priors than conventional classifier-free guidance, while achieving higher contact fidelity, more realistic HOI generation, and stronger generalization to unseen objects and tasks.


Poster
P3-#2015
DanceTogether: Generating Interactive Multi-Person Video without Identity Drifting

Junhao Chen ⋅ Mingjin Chen ⋅ Jianjin Xu ⋅ Xiang Li ⋅ Junting Dong ⋅ Mingze Sun ⋅ puhua jiang ⋅ Hongxiang Li ⋅ Yuhang Yang ⋅ Hao Zhao ⋅ Xiao-Xiao Long ⋅ Ruqi Huang

Controllable video generation (CVG) has advanced rapidly, yet current systems falter when more than one actor must move, interact, and exchange positions under noisy control signals. We address this gap with DanceTogether, the first end-to-end diffusion framework that turns a single reference image plus independent pose-mask streams into long, photorealistic videos while strictly preserving every identity. A novel MaskPoseAdapter binds “who” and “how” at every denoising step by fusing robust tracking masks with semantically rich but noisy pose heat maps, eliminating the identity drift and appearance bleeding that plague frame-wise pipelines. To train and evaluate at scale, we introduce (i) PairFS-4K, 26 h of dual-skater footage with more than 7 000 distinct IDs, (ii) HumanRob-300, a one-hour humanoid-robot interaction set for rapid cross-domain transfer, and (iii) TogetherVideoBench, a three-track benchmark centred on the DanceTogEval-100 test suite covering dance, boxing, wrestling, yoga, and figure skating. On TogetherVideoBench, DanceTogether outperforms prior art by a significant margin. Moreover, we show that a one-hour fine-tune yields convincing human-robot videos, underscoring broad generalisation to embodied-AI and HRI tasks. Extensive ablations confirm that persistent identity-action binding is critical to these gains. Together, our model, datasets, and benchmark lift CVG from single-subject choreography to compositionally controllable, multi-actor interaction, opening new avenues for digital production, simulation, and embodied intelligence.

In observational settings where treatment and outcome are confounded by unobserved factors but an observed mediator satisfies front-door conditions, estimating heterogeneous treatment effects remains underdeveloped. We introduce two debiased learners for heterogeneous front-door effects: FD-DR-Learner and FD-R-Learner. Both methods are constructed to be robust to nuisance estimation error, and we show they achieve fast quasi-oracle rates even when nuisance functions converge as slowly as $n^{-1/4}$. We provide error analyses that clarify their behavior under overlap and nuisance misspecification. In synthetic experiments varying sample size, nuisance noise, and overlap severity, both learners consistently outperform a plug-in baseline, with FD-R showing stronger stability under weak overlap. In a real-world case study using FARS data on primary seat-belt laws, the methods deliver reliable personalized effect estimates and interpretable heterogeneity patterns. Overall, the proposed learners offer practical and sample-efficient tools for heterogeneous causal estimation under front-door identification.


Poster
P3-#2013
Unlocking the Power of Multi-Agent LLM for Reasoning: From Lazy Agents to Deliberation

Zhiwei Zhang ⋅ Xiaomin Li ⋅ Yudi Lin ⋅ Hui Liu ⋅ Ramraj Chandradevan ⋅ Linlin Wu ⋅ Minhua Lin ⋅ Fali Wang ⋅ Xianfeng Tang ⋅ Qi He ⋅ Suhang Wang

Large Language Models (LLMs) trained with reinforcement learning and verifiable rewards have achieved strong results on complex reasoning tasks. Recent work extends this paradigm to a multi-agent setting, where a meta-thinking agent proposes plans and monitors progress while a reasoning agent executes subtasks through sequential conversational turns. Despite promising performance, we identify a critical limitation: lazy agent behavior, in which one agent dominates while the other contributes little, undermining collaboration and collapsing the setup to an ineffective single agent. In this paper, we first provide a theoretical analysis showing why lazy behavior naturally arises in multi-agent reasoning. We then introduce a stable and efficient method for measuring causal influence, helping mitigate this issue. Finally, as collaboration intensifies, the reasoning agent risks getting lost in multi-turn interactions and trapped by previous noisy responses. To counter this, we propose a verifiable reward mechanism that encourages deliberation by allowing the reasoning agent to discard noisy outputs, consolidate instructions, and restart its reasoning process when necessary. Extensive experiments demonstrate that our framework alleviates lazy agent behavior and unlocks the full potential of the multi-agent framework for complex reasoning tasks.


Poster
P3-#2024
Machine Unlearning under Retain–Forget Entanglement

JINGPU CHENG ⋅ Ping Liu ⋅ Qianxiao Li ⋅ CHI ZHANG

Forgetting a subset in machine unlearning is rarely an isolated task. Often, retained samples that are closely related to the forget set can be unintentionally affected, particularly when they share correlated features from pretraining or exhibit strong semantic similarities. To address this challenge, we propose a novel two-phase optimization framework specifically designed to handle such retain–forget entanglements. In the first phase, an augmented Lagrangian method increases the loss on the forget set while preserving accuracy on less-related retained samples. The second phase applies a gradient projection step, regularized by the Wasserstein-2 distance, to mitigate performance degradation on semantically related retained samples without compromising the unlearning objective. We validate our approach through comprehensive experiments on multiple unlearning tasks, standard benchmark datasets, and diverse neural architectures, demonstrating that it achieves effective and reliable unlearning while outperforming existing baselines in both accuracy retention and removal fidelity.


Poster
P3-#2012
Do 3D Large Language Models Really Understand 3D Spatial Relationships?

Xianzheng Ma ⋅ Tao Sun ⋅ Shuai Chen ⋅ Yash Bhalgat ⋅ Jindong Gu ⋅ Angel Chang ⋅ Iro Armeni ⋅ Iro Laina ⋅ Songyou Peng ⋅ Victor Prisacariu

Recent 3D Large-Language Models (3D-LLMs) claim to understand 3D worlds, especially spatial relationships among objects. Yet, we find that simply fine-tuning a language model on text-only question-answer pairs can perform comparably to or even surpass these methods on the SQA3D benchmark without using any 3D input. This indicates that the SQA3D benchmark may not be able to detect whether a model exploits textual shortcuts rather than engaging in 3D-aware reasoning. To address this issue, we introduce Real-3DQA, a more rigorous evaluation benchmark that filters out easy-to-guess questions and introduces a structured taxonomy to assess various aspects of 3D reasoning. Experiments on Real-3DQA confirm that existing 3D-LLMs struggle with spatial relationships once simple cues are removed. We further propose a 3D-reweighted training objective that leverages negative samples via explicit 3D-relation alignment, substantially enhancing 3D-LLMs’ performance in spatial reasoning tasks. Our findings underscore the need for robust benchmarks and tailored training strategies to advance genuine 3D vision-language understanding.


Poster
P3-#2011
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

Akshit Sinha ⋅ Arvindh Arun ⋅ Shashwat Goel ⋅ Steffen Staab ⋅ Jonas Geiping

Does continued scaling of large language models (LLMs) yield diminishing returns? In this work, we show that short-task benchmarks may give an illusion of slowing progress, as even marginal gains in single-step accuracy can compound into exponential improvements in the length of tasks a model can successfully complete. Then, we argue that failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason. We therefore propose isolating execution capability by explicitly providing the knowledge and plan needed to solve a long-horizon task. First, we find that larger models can correctly execute significantly more turns even when small models have near-perfect single-turn accuracy. We then observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations: curiously, we observe a self-conditioning effect, in which models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning is not reduced by scaling model size alone. However, we find that thinking mitigates self-conditioning and also enables execution of much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of tasks they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and highlight the massive benefits of scaling model size and sequential test-time compute for long-horizon tasks.
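
The compounding claim above can be checked with a short calculation. The sketch below assumes, purely for illustration, that every step succeeds independently with the same probability; the function name `horizon_length` and the 50% success threshold are hypothetical choices, not taken from the abstract. Under that assumption a task of H steps succeeds with probability p^H, so the longest task with success probability at least s has H = floor(ln s / ln p) steps.

```python
import math

def horizon_length(step_accuracy, target_success=0.5):
    """Longest task length (in steps) completed with probability >= target_success,
    assuming each step succeeds independently with probability step_accuracy."""
    if not 0.0 < step_accuracy < 1.0:
        raise ValueError("step_accuracy must be strictly between 0 and 1")
    if not 0.0 < target_success < 1.0:
        raise ValueError("target_success must be strictly between 0 and 1")
    return math.floor(math.log(target_success) / math.log(step_accuracy))

# Half a percentage point of single-step accuracy roughly doubles the horizon.
h_99 = horizon_length(0.99)
h_995 = horizon_length(0.995)
```

Note that the self-conditioning effect reported in the abstract violates the independence assumption, so this toy model should be read as an upper-level intuition rather than a prediction.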


Poster
P3-#2010
Graph-of-Agents: A Graph-based Framework for Multi-Agent LLM Collaboration

Sukwon Yun ⋅ Jie Peng ⋅ Pingzhi Li ⋅ Wendong Fan ⋅ Jie Chen ⋅ James Y Zou ⋅ Guohao Li ⋅ Tianlong Chen

With an ever-growing zoo of LLMs and benchmarks, the need to orchestrate multiple models for improved task performance has never been more pressing. While frameworks like Mixture-of-Agents (MoA) attempt to coordinate LLMs, they often fall short in terms of (1) selecting relevant agents, (2) facilitating effective intra-agent communication, and (3) integrating responses efficiently. In this work, we propose Graph-of-Agents (GoA), a new graph-based framework for modeling multi-agent LLM communication. Our approach begins with node sampling, selecting only the most relevant agents by leveraging model cards that summarize each model’s domain, task specialization, and other characteristics. Next, we construct edges between the selected agents by evaluating their responses against one another to determine relevance ordering. Directed message passing is then performed from highly relevant agents to less relevant ones to enhance their responses, followed by reverse message passing to refine the original responses of the more relevant agents. Finally, the updated responses are aggregated via graph-based pooling (e.g., max or mean pooling) to produce a single, unified answer. We evaluate GoA on diverse multi-domain benchmarks (MMLU, MMLU-Pro, GPQA) and domain-specific benchmarks (MATH, HumanEval, MedMCQA), with an agent pool of 6 LLMs spanning multiple domains. Surprisingly, GoA achieves superior performance using only 3 selected agents, outperforming recent multi-agent LLM baselines that utilize all 6 agents simultaneously. By adopting a graph structure, GoA offers both scalability and effectiveness through structured message passing, positioning it as a strong candidate for navigating the challenges of the ever-growing LLM zoo.


Poster
P3-#2009
LH-DECEPTION: Simulating and Understanding LLM Deceptive Behaviors in Long-Horizon Interactions

Yang Xu ⋅ Xuanming Zhang ⋅ Min-Hsuan Yeh ⋅ Jwala Dhamala ⋅ Ousmane Dia ⋅ Rahul Gupta ⋅ Sharon Li

Deception is a pervasive feature of human communication and an emerging concern in large language models (LLMs). While recent studies document instances of LLM deception, most evaluations remain confined to single-turn prompts and fail to capture the long-horizon interactions in which deceptive strategies typically unfold. We introduce a new simulation framework, LH-Deception, for a systematic, empirical quantification of deception in LLMs under extended sequences of interdependent tasks and dynamic contextual pressures. LH-Deception is designed as a multi-agent system: a performer agent tasked with completing tasks and a supervisor agent that evaluates progress, provides feedback, and maintains evolving states of trust. An independent deception auditor then reviews full trajectories to identify when and how deception occurs. We conduct extensive experiments across 11 frontier models, spanning both closed-source and open-source systems, and find that deception is model-dependent, increases with event pressure, and consistently erodes supervisor trust. Qualitative analyses further reveal emergent, long-horizon phenomena, such as "chains of deception", which are invisible to static, single-turn evaluations. Our findings provide a foundation for evaluating future LLMs in real-world, trust-sensitive contexts.


Poster
P3-#2008
Multi-agent Coordination via Flow Matching

Dongsu Lee ⋅ Daehee Lee ⋅ Amy Zhang

This work presents MAC-Flow, a simple yet expressive framework for multi-agent coordination. We argue that requirements of effective coordination are twofold: (i) a rich representation of the diverse joint behaviors present in offline data and (ii) the ability to act efficiently in real time. However, prior approaches often sacrifice one for the other, i.e., denoising diffusion-based solutions capture complex coordination but are computationally slow, while Gaussian policy-based solutions are fast but brittle in handling multi-agent interaction. MAC-Flow addresses this trade-off by first learning a flow-based representation of joint behaviors, and then distilling it into decentralized one-step policies that preserve coordination while enabling fast execution. Across four different benchmarks, including $12$ environments and $34$ datasets, MAC-Flow alleviates the trade-off between performance and computational cost, specifically achieving about $14.5\times$ faster inference compared to diffusion-based MARL methods, while maintaining good performance. At the same time, its inference speed is similar to that of prior Gaussian policy-based offline MARL methods.


Poster
P3-#2007
MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task

Yuchen Yan ⋅ Yongliang Shen ⋅ Yang Liu ⋅ Jin Jiang ⋅ Xin Xu ⋅ Mengdi Zhang ⋅ Jian Shao ⋅ Yueting Zhuang

Mathematical reasoning represents a critical frontier in advancing large language models (LLMs). While step-by-step approaches have emerged as the dominant paradigm for mathematical problem-solving in LLMs, the quality of reasoning steps in training data fundamentally constrains the performance of the models. Recent studies have demonstrated that more detailed intermediate steps can enhance model performance, yet existing methods for step expansion either require more powerful external models or incur substantial computational costs. In this paper, we introduce MathFimer, a novel framework for mathematical reasoning step expansion inspired by the "Fill-in-the-middle" task from code reasoning. By decomposing solution chains into prefix-suffix pairs and training models to reconstruct missing intermediate steps, we develop a specialized model, MathFimer-7B, on our carefully curated NuminaMath-FIM dataset. We then apply these models to enhance existing mathematical reasoning datasets by inserting detailed intermediate steps into their solution chains, creating MathFimer-expanded versions. Through comprehensive experiments on multiple mathematical reasoning datasets, including MathInstruct and MetaMathQA, we demonstrate that models trained on MathFimer-expanded data consistently outperform their counterparts trained on original data across various benchmarks such as GSM8K and MATH. Our approach offers a practical, scalable solution for enhancing mathematical reasoning capabilities in LLMs without relying on powerful external models or expensive inference procedures.


Poster
P3-#2006
Benefits and Limitations of Communication in Multi-Agent Reasoning

Michael Rizvi-Martel ⋅ Satwik Bhattamishra ⋅ Neil Rathi ⋅ Guillaume Rabusseau ⋅ Michael Hahn

Chain-of-thought prompting has popularized step-by-step reasoning in large language models, yet model performance still degrades as problem complexity and context length grow. By decomposing difficult tasks with long contexts into shorter, manageable ones, recent multi-agent paradigms offer a promising near-term solution to this problem. However, the fundamental capacities of such systems are poorly understood. In this work, we propose a theoretical framework to analyze the expressivity of multi-agent systems. We apply our framework to three algorithmic families: state tracking, recall, and $k$-hop reasoning. We derive bounds on (i) the number of agents required to solve the task exactly, (ii) the quantity and structure of inter-agent communication, and (iii) the achievable speedups as problem size and context scale. Our results identify regimes where communication is provably beneficial, delineate tradeoffs between agent count and bandwidth, and expose intrinsic limitations when either resource is constrained. We complement our theoretical analysis with a set of experiments on pretrained LLMs using controlled synthetic benchmarks. Empirical outcomes confirm the tradeoffs between key quantities predicted by our theory. Collectively, our analysis offers principled guidance for designing scalable multi-agent reasoning systems.


Poster
P3-#2005
Hybrid Reinforcement: when reward is sparse, better to be dense

Leitian Tao ⋅ Ilia Kulikov ⋅ Swarnadeep Saha ⋅ Tianlu Wang ⋅ Jing Xu ⋅ Sharon Li ⋅ Jason E Weston ⋅ Ping Yu

Post-training for reasoning in large language models has increasingly relied on verifiable rewards: deterministic checkers that provide $0$–$1$ correctness signals. While reliable, such binary feedback is brittle—many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates sparse verifier signals with dense reward model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms reward model-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
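
One way to picture the stratified normalization described above is the sketch below: reward-model scores are min-max normalized separately within the verifier-correct and verifier-incorrect groups, then squeezed into a narrow band around the verifier label. The function name `stratified_rewards`, the `band` parameter, and the exact mapping are hypothetical; the abstract does not give HERO's formula, and variance-aware weighting is not shown.

```python
def stratified_rewards(verifier_labels, rm_scores, band=0.1):
    """Combine 0/1 verifier labels with dense reward-model scores.

    Scores are rescaled within each verifier group to [label - band, label + band],
    so with band < 0.5 every verified-correct response outranks every incorrect one.
    """
    if len(verifier_labels) != len(rm_scores):
        raise ValueError("labels and scores must have the same length")
    rewards = [0.0] * len(rm_scores)
    for label in (0, 1):
        idx = [i for i, v in enumerate(verifier_labels) if v == label]
        if not idx:
            continue
        lo = min(rm_scores[i] for i in idx)
        hi = max(rm_scores[i] for i in idx)
        span = hi - lo
        for i in idx:
            # Position within the group in [0, 1]; a constant group maps to the middle.
            unit = (rm_scores[i] - lo) / span if span > 0 else 0.5
            rewards[i] = label + band * (2.0 * unit - 1.0)
    return rewards

# Made-up rollouts: two verified correct, two verified incorrect.
rewards = stratified_rewards([1, 1, 0, 0], [2.0, 4.0, 5.0, 1.0])
```

In this toy example the incorrect response with the highest raw reward-model score (5.0) still ends up below both correct responses, which is the correctness-preserving property the abstract attributes to bounding scores within verifier-defined groups.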


Blog Track Poster
P3-#2004
AI Fundamentals: Valuing AI Agents & Data Assets

Qingyun Sun ⋅ Zhenheng Tang ⋅ Huacan Wang

Large Language Model (LLM) agents now read the world through managed-context pipelines, write to it via tool-calling APIs, and continuously re-wire themselves with fresh experience. Stakeholders therefore need a Generally Accepted Accounting Principles (GAAP) compatible method to price both (i) the agent's labour-like output and (ii) the data traces that fuel learning. We formalise a single unifying metric, Agent Economic Value (AEV), and demonstrate that it is measurable today. We then extend the template to reinforcement-learning regimes in which grounded rewards equal cash flows. Lastly, we propose a financial settlement layer, which transforms the agent from a passive software user into an active economic participant.